Observability Engineer - Infrastructure & Application Monitoring

Job Summary:

We are seeking a highly technical and proactive Observability Engineer to enhance monitoring, logging, and alerting capabilities across enterprise systems, applications, and cloud infrastructure. This role is central to improving system reliability, availability, and performance by providing visibility into complex distributed environments, particularly in financial or other highly regulated industries.

The ideal candidate has experience implementing observability tools and best practices in cloud-native and hybrid environments, with a strong foundation in infrastructure monitoring, APM (Application Performance Monitoring), and log analytics.

Key Responsibilities:

  • Design, implement, and maintain observability frameworks across systems, services, and infrastructure.
  • Deploy and configure monitoring tools (e.g., Prometheus, Grafana, Datadog, New Relic, AppDynamics, ELK) to ensure visibility into application and system health.
  • Work closely with DevOps, infrastructure, and application teams to define SLOs, SLAs, and alerts for proactive incident management.
  • Develop dashboards, custom metrics, and synthetic monitoring for real-time performance insights.
  • Support root cause analysis and post-incident reviews with actionable monitoring data.
  • Automate alerting and anomaly detection to reduce mean time to detect (MTTD) and mean time to resolve (MTTR).
  • Establish observability standards and best practices across development and operations teams.
  • Support the integration of monitoring into CI/CD pipelines and deployment workflows.

Qualifications:

  • Bachelor's degree in Computer Science, Information Technology, or a related field.
  • 3-5 years of experience in systems monitoring, observability, or SRE within enterprise or cloud environments.
  • Hands-on experience with observability platforms such as Grafana, Prometheus, ELK, Datadog, or AppDynamics.
  • Solid understanding of cloud services (AWS, Azure, or GCP), container orchestration (Kubernetes), and microservice architectures.
  • Proficiency in scripting (Python, Bash, etc.) and infrastructure-as-code (Terraform, Ansible) is a plus.
  • Familiarity with log aggregation, APM tools, distributed tracing, and alerting configurations.
  • Strong analytical and troubleshooting skills, with the ability to interpret system metrics and logs effectively.
  • Excellent communication and collaboration skills in cross-functional DevOps/SRE teams.