XTN-E810728 | SR. DEVOPS ENGINEER
About the Role
We are looking for a DevOps Engineer to own and evolve the infrastructure that powers Anervea's AI platform and supporting services. You will be responsible for designing, deploying, and maintaining a reliable, secure, and cost-efficient cloud environment on AWS, with Docker-based workloads at its core.
This is a hands-on role for someone who enjoys the full lifecycle of infrastructure work — from provisioning and automation to monitoring, incident response, and continuous improvement. You will partner closely with engineering, AI/ML, and product teams to ship reliably and scale gracefully.
Benefits
- Health insurance/HMO
- Unlimited MadMax Coffee
- Diverse learning and growth opportunities
- Accessible cloud HR platform (Sprout)
- Above-standard leave
Tech Stack You’ll Support
You will be deploying, scaling, and maintaining infrastructure for the following stack:
- Mobile applications built with Flutter.
- Web frontend built with React.
- Backend services built with Python (FastAPI).
- AI/ML workloads including LLM inference, RAG pipelines, and supporting services.
- Cloud platform: AWS (primary).
- Containerization with Docker; orchestration via ECS or EKS.
- Databases and storage: PostgreSQL/RDS, S3, vector stores, and caching layers (Redis).
Key Responsibilities
AWS Infrastructure
- Design, provision, and maintain AWS infrastructure across services including EC2, ECS/EKS, S3, RDS, VPC, IAM, CloudFront, Route 53, ELB/ALB, Lambda, and CloudWatch.
- Implement Infrastructure as Code (IaC) using Terraform, CloudFormation, or AWS CDK to ensure repeatable and version-controlled environments.
- Manage networking (VPCs, subnets, security groups, NAT gateways, VPN/peering) with a strong focus on security and least-privilege access.
- Optimize AWS spend through right-sizing, reserved instances/savings plans, autoscaling, and continuous cost monitoring.
Docker & Containerization
- Build, optimize, and maintain Docker images for backend services, AI/ML workloads, and supporting tooling.
- Manage container orchestration on Amazon ECS (Fargate/EC2) or EKS (Kubernetes), including service definitions, task scaling, and rolling deployments.
- Maintain private container registries (ECR) with tagging, scanning, and lifecycle policies.
- Troubleshoot container-level issues across networking, storage, and runtime performance.
CI/CD & Automation
- Build and maintain CI/CD pipelines using GitHub Actions, GitLab CI, Jenkins, or AWS CodePipeline / CodeBuild.
- Automate testing, container builds, image scanning, and zero-downtime deployments to staging and production environments.
- Standardize deployment workflows across services so engineers can ship safely and quickly.
Monitoring, Reliability & Incident Response
- Implement and maintain observability across the stack using CloudWatch, Prometheus, Grafana, ELK/OpenSearch, Datadog, or similar.
- Define and track SLOs/SLAs, set up actionable alerting, and reduce noise in on-call workflows.
- Lead incident response, conduct root-cause analyses, and drive postmortems with clear follow-up actions.
- Plan and test backup, disaster recovery, and high-availability strategies.
Security & Compliance
- Apply security best practices across IAM, secrets management (AWS Secrets Manager / Parameter Store), encryption at rest and in transit, and network security.
- Manage SSL/TLS certificates, WAF rules, and DDoS protection.
- Support compliance, audit, and data-protection requirements relevant to healthcare/AI workloads (e.g., data residency, access logging).
Generative AI Infrastructure
- Support and maintain infrastructure for LLM and generative AI workloads on AWS, including Amazon Bedrock, SageMaker, and self-hosted model endpoints.
- Help manage GPU-based compute (EC2 P/G instances, Inferentia) for model inference and training workloads, with a focus on cost and throughput optimization.
- Set up and operate vector databases (OpenSearch, pgvector, Pinecone) and supporting RAG infrastructure.
- Build secure, observable pipelines for model deployment, versioning, and rollback.
- Monitor token usage, inference latency, and model-serving costs; implement guardrails and rate limiting where needed.
- Manage API gateways and authentication for AI endpoints exposed to internal and external consumers.
Collaboration
- Work closely with backend, frontend, and AI/ML engineers to understand workload requirements and provide reliable infrastructure.
- Document infrastructure, runbooks, and operational procedures so the team can operate confidently.
- Mentor engineers on DevOps practices, deployment hygiene, and cloud-cost awareness.
Required Qualifications
- 3-6 years of professional experience in DevOps, SRE, Cloud, or Platform Engineering roles.
- Strong hands-on experience with AWS across compute, networking, storage, and managed services.
- Solid expertise in Docker — image building, multi-stage builds, and container runtime troubleshooting.
- Production experience with at least one container orchestrator (ECS, EKS, or Kubernetes).
- Proficiency in at least one IaC tool (Terraform preferred; CloudFormation or AWS CDK acceptable).
- Strong scripting skills in Bash and Python (or Go) for automation and tooling.
- Working knowledge of Linux administration, networking fundamentals (DNS, TCP/IP, HTTP/S, load balancing), and Git-based workflows.
- Comfort deploying and operating applications across mobile (Flutter), web (React), and Python backend (FastAPI) stacks — you don’t need to write the code, but you should understand how to build, ship, and troubleshoot it.
- Hands-on experience setting up and maintaining CI/CD pipelines.
- Experience with monitoring, logging, and alerting tools in production environments.
- Exposure to or genuine interest in operating generative AI / LLM workloads in production (Bedrock, SageMaker, self-hosted inference, or similar) — prior experience preferred but not required.
Nice to Have
- AWS certifications (Solutions Architect, DevOps Engineer, SysOps Administrator).
- Experience operating Kubernetes (EKS) at scale, including Helm and GitOps tools (ArgoCD, Flux).
- Hands-on experience with AI/ML infrastructure at scale: GPU instance optimization, model serving frameworks (vLLM, Triton, TGI), fine-tuning pipelines, or production RAG systems.
- Experience with service meshes (Istio, App Mesh) and API gateways.
- Familiarity with multi-account AWS architectures (AWS Organizations, Control Tower, Landing Zone).
- Experience with healthcare or regulated-data environments (HIPAA, GDPR, SOC 2).
- Familiarity with NGINX, reverse proxies, and domain/SSL management workflows.
Additional knowledge or experience relevant to the above requirements will be considered an advantage.