Job Openings G17 - DevOps Engineer

Role Overview:

You own day-2 service & runtime operations (availability, latency, incident response, release health, capacity, cost & compliance optimisation) for Litmus & Sentinel atop a managed EKS + IaC foundation. You turn operational signals (latency, error budgets, drift, saturation) into continuous improvement. You partner closely with the platform (EKS / Terraform) team, security, and data science to ensure resiliency and regulated data handling while reducing toil and configuration drift.

Job Responsibilities:

  • Design & own the service observability usage model: ensure all service metrics, logs, and traces flow into Elastic Cloud (authoritative); maintain dashboards & SLOs; evaluate pragmatic use of CloudWatch and AWS Managed Prometheus / Grafana for supplemental or fallback views.
  • Build proactive, noise-reduced alerting and incident response playbooks; drive post-incident RCA & remediation tracking (closure SLA).
  • Optimize service performance (profiling, caching layers, autoscaling heuristics, concurrency tuning) to meet latency & throughput targets.
  • Implement secure supply chain & runtime controls (image scanning, SBOM consumption, secrets management, TLS / mTLS) leveraging shared platform tooling.
  • Curate operational runbooks, golden dashboards, and reliability & production readiness checklists.
  • Integrate model / guardrail service telemetry (latency, queue depth, GPU/CPU utilization) into unified Elastic Cloud views.
  • Support compliance & audit evidence collection (access logs, config lineage, change histories) via automated evidence capture fed into Elastic.
  • Introduce configuration drift detection & policy-as-code guardrails (OPA / Kyverno) at the workload / namespace layer to enforce baseline controls.
  • Mentor engineers on production readiness, observability patterns, and operational excellence; evolve on-call playbooks.
  • Participate in (and improve) an equitable on-call rotation focusing on sustainable alert volumes & burnout prevention.

Qualifications:

  • 4+ years (or equivalent impact) in SRE / Production Ops / Platform / Reliability for SaaS or high-throughput services.
  • Working knowledge of AWS & Kubernetes (deployment, troubleshooting, networking concepts) sufficient to collaborate effectively with platform owners (not necessarily owning cluster upgrade orchestration).
  • Familiarity with Infrastructure as Code & GitOps (Terraform, Argo, etc.) to consume modules, review changes, and enforce policy.
  • Observability implementation & usage (metrics, logs, traces, profiling) with Elastic Cloud; understanding of Prometheus / OpenTelemetry concepts.
  • Proven on-call & incident management experience (triage, MTTR reduction, RCA authorship).
  • Scripting / automation in Python, Bash, or Go for ops tooling.
  • Security & compliance awareness: vulnerability management, image scanning, supply chain controls.
  • Clear, concise communication of operational risk & trade-offs to technical + non-technical stakeholders.

Preferred / Bonus Qualifications:

  • Progressive delivery (Argo Rollouts / Flagger) or service mesh (Istio / Linkerd) integration experience.
  • Cost optimisation track record (e.g. 20% reduction without SLO regression) & sustainability metrics (kWh / request).
  • Policy-as-code (OPA / Kyverno) & compliance-as-code implementations.
  • Disaster recovery game day facilitation / chaos engineering tooling.
  • Familiarity with data / audit frameworks in public sector contexts.