Job Openings XTN-E810728 | SR. DEVOPS ENGINEER

About the Role
We are looking for a Senior DevOps Engineer to own and evolve the infrastructure that powers Anervea's AI platform and supporting services. You will be responsible for designing, deploying, and maintaining a reliable, secure, and cost-efficient cloud environment on AWS, with Docker-based workloads at its core.
This is a hands-on role for someone who enjoys the full lifecycle of infrastructure work — from provisioning and automation to monitoring, incident response, and continuous improvement. You will partner closely with engineering, AI/ML, and product teams to ship reliably and scale gracefully.

Benefits

  • Health Insurance/HMO
  • Unlimited MadMax Coffee
  • Diverse learning & growth opportunities
  • Accessible Cloud HR platform (Sprout)
  • Above-standard leaves

Tech Stack You’ll Support
You will be deploying, scaling, and maintaining infrastructure for the following stack:

  • Mobile applications built with Flutter.
  • Web frontend built with React.
  • Backend services built with Python (FastAPI).
  • AI/ML workloads including LLM inference, RAG pipelines, and supporting services.
  • Cloud platform: AWS (primary).
  • Containerization with Docker; orchestration via ECS or EKS.
  • Databases and storage: PostgreSQL/RDS, S3, vector stores, and caching layers (Redis).

Key Responsibilities

AWS Infrastructure
  • Design, provision, and maintain AWS infrastructure across services including EC2, ECS/EKS, S3, RDS, VPC, IAM, CloudFront, Route 53, ELB/ALB, Lambda, and CloudWatch.
  • Implement Infrastructure as Code (IaC) using Terraform, CloudFormation, or AWS CDK to ensure repeatable and version-controlled environments.
  • Manage networking (VPCs, subnets, security groups, NAT gateways, VPN/peering) with a strong focus on security and least-privilege access.
  • Optimize AWS spend through right-sizing, reserved instances/savings plans, autoscaling, and continuous cost monitoring.

Docker & Containerization
  • Build, optimize, and maintain Docker images for backend services, AI/ML workloads, and supporting tooling.
  • Manage container orchestration on Amazon ECS (Fargate/EC2) or EKS (Kubernetes), including service definitions, task scaling, and rolling deployments.
  • Maintain private container registries (ECR) with tagging, scanning, and lifecycle policies.
  • Troubleshoot container-level issues across networking, storage, and runtime performance.

CI/CD & Automation
  • Build and maintain CI/CD pipelines using GitHub Actions, GitLab CI, Jenkins, or AWS CodePipeline / CodeBuild.
  • Automate testing, container builds, image scanning, and zero-downtime deployments to staging and production environments.
  • Standardize deployment workflows across services so engineers can ship safely and quickly.

Monitoring, Reliability & Incident Response
  • Implement and maintain observability across the stack using CloudWatch, Prometheus, Grafana, ELK/OpenSearch, Datadog, or similar.
  • Define and track SLOs/SLAs, set up actionable alerting, and reduce noise in on-call workflows.
  • Lead incident response, conduct root-cause analyses, and drive postmortems with clear follow-up actions.
  • Plan and test backup, disaster recovery, and high-availability strategies.

Security & Compliance
  • Apply security best practices across IAM, secrets management (AWS Secrets Manager / Parameter Store), encryption at rest and in transit, and network security.
  • Manage SSL/TLS certificates, WAF rules, and DDoS protection.
  • Support compliance, audit, and data-protection requirements relevant to healthcare/AI workloads (e.g., data residency, access logging).

Generative AI Infrastructure
  • Support and maintain infrastructure for LLM and generative AI workloads on AWS, including Amazon Bedrock, SageMaker, and self-hosted model endpoints.
  • Help manage GPU-based compute (EC2 P/G instances, Inferentia) for model inference and training workloads, with a focus on cost and throughput optimization.
  • Set up and operate vector databases (OpenSearch, pgvector, Pinecone) and supporting RAG infrastructure.
  • Build secure, observable pipelines for model deployment, versioning, and rollback.
  • Monitor token usage, inference latency, and model-serving costs; implement guardrails and rate limiting where needed.
  • Manage API gateways and authentication for AI endpoints exposed to internal and external consumers.

Collaboration
  • Work closely with backend, frontend, and AI/ML engineers to understand workload requirements and provide reliable infrastructure.
  • Document infrastructure, runbooks, and operational procedures so the team can operate confidently.
  • Mentor engineers on DevOps practices, deployment hygiene, and cloud-cost awareness.

Required Qualifications

  • 3-6 years of professional experience in DevOps, SRE, Cloud, or Platform Engineering roles.
  • Strong hands-on experience with AWS across compute, networking, storage, and managed services.
  • Solid expertise in Docker — image building, multi-stage builds, and container runtime troubleshooting.
  • Production experience with at least one container orchestrator (ECS, EKS, or self-managed Kubernetes).
  • Proficiency in at least one IaC tool (Terraform preferred; CloudFormation or AWS CDK acceptable).
  • Strong scripting skills in Bash and Python (or Go) for automation and tooling.
  • Working knowledge of Linux administration, networking fundamentals (DNS, TCP/IP, HTTP/S, load balancing), and Git-based workflows.
  • Comfort deploying and operating applications across mobile (Flutter), web (React), and Python backend (FastAPI) stacks — you don’t need to write the code, but you should understand how to build, ship, and troubleshoot it.
  • Hands-on experience setting up and maintaining CI/CD pipelines.
  • Experience with monitoring, logging, and alerting tools in production environments.
  • Exposure to or genuine interest in operating generative AI / LLM workloads in production (Bedrock, SageMaker, self-hosted inference, or similar) — prior experience preferred but not required.
Nice to Have

  • AWS certifications (Solutions Architect, DevOps Engineer, SysOps Administrator).
  • Experience operating Kubernetes (EKS) at scale, including Helm and GitOps tools (ArgoCD, Flux).
  • Hands-on experience with AI/ML infrastructure at scale: GPU instance optimization, model serving frameworks (vLLM, Triton, TGI), fine-tuning pipelines, or production RAG systems.
  • Experience with service meshes (Istio, App Mesh) and API gateways.
  • Familiarity with multi-account AWS architectures (AWS Organizations, Control Tower, Landing Zone).
  • Experience with healthcare or regulated-data environments (HIPAA, GDPR, SOC 2).
  • Familiarity with NGINX, reverse proxies, and domain/SSL management workflows.

Additional relevant knowledge or experience related to the above requirements will be considered an advantage.