XTN-E810728 | SR. DEVOPS ENGINEER

About the Role
We are looking for a Senior DevOps Engineer to own and evolve the infrastructure that powers Anervea's AI platform and supporting services. You will be responsible for designing, deploying, and maintaining a reliable, secure, and cost-efficient cloud environment on AWS, with Docker-based workloads at its core.
This is a hands-on role for someone who enjoys the full lifecycle of infrastructure work — from provisioning and automation to monitoring, incident response, and continuous improvement. You will partner closely with engineering, AI/ML, and product teams to ship reliably and scale gracefully.

Benefits
• Health insurance/HMO
• Unlimited MadMax Coffee
• Diverse learning and growth opportunities
• Accessible cloud HR platform (Sprout)
• Above-standard leave

Tech Stack You’ll Support
You will be deploying, scaling, and maintaining infrastructure for the following stack:

• Mobile applications built with Flutter.
• Web frontend built with React.
• Backend services built with Python (FastAPI).
• AI/ML workloads including LLM inference, RAG pipelines, and supporting services.
• Cloud platform: AWS (primary).
• Containerization with Docker; orchestration via ECS or EKS.
• Databases and storage: PostgreSQL/RDS, S3, vector stores, and caching layers (Redis).

Key Responsibilities
AWS Infrastructure
• Design, provision, and maintain AWS infrastructure across services including EC2, ECS/EKS, S3, RDS, VPC, IAM, CloudFront, Route 53, ELB/ALB, Lambda, and CloudWatch.
• Implement Infrastructure as Code (IaC) using Terraform, CloudFormation, or AWS CDK to ensure repeatable and version-controlled environments.
• Manage networking (VPCs, subnets, security groups, NAT gateways, VPN/peering) with a strong focus on security and least-privilege access.
• Optimize AWS spend through right-sizing, reserved instances/savings plans, autoscaling, and continuous cost monitoring.
Docker & Containerization
• Build, optimize, and maintain Docker images for backend services, AI/ML workloads, and supporting tooling.
• Manage container orchestration on Amazon ECS (Fargate/EC2) or EKS (Kubernetes), including service definitions, task scaling, and rolling deployments.
• Maintain private container registries (ECR) with tagging, scanning, and lifecycle policies.
• Troubleshoot container-level issues across networking, storage, and runtime performance.
CI/CD & Automation
• Build and maintain CI/CD pipelines using GitHub Actions, GitLab CI, Jenkins, or AWS CodePipeline / CodeBuild.
• Automate testing, container builds, image scanning, and zero-downtime deployments to staging and production environments.
• Standardize deployment workflows across services so engineers can ship safely and quickly.
Monitoring, Reliability & Incident Response
• Implement and maintain observability across the stack using CloudWatch, Prometheus, Grafana, ELK/OpenSearch, Datadog, or similar.
• Define and track SLOs/SLAs, set up actionable alerting, and reduce noise in on-call workflows.
• Lead incident response, conduct root-cause analyses, and drive postmortems with clear follow-up actions.
• Plan and test backup, disaster recovery, and high-availability strategies.
Security & Compliance
• Apply security best practices across IAM, secrets management (AWS Secrets Manager / Parameter Store), encryption at rest and in transit, and network security.
• Manage SSL/TLS certificates, WAF rules, and DDoS protection.
• Support compliance, audit, and data-protection requirements relevant to healthcare/AI workloads (e.g., data residency, access logging).
Generative AI Infrastructure
• Support and maintain infrastructure for LLM and generative AI workloads on AWS, including Amazon Bedrock, SageMaker, and self-hosted model endpoints.
• Help manage GPU-based compute (EC2 P/G instances, Inferentia) for model inference and training workloads, with a focus on cost and throughput optimization.
• Set up and operate vector databases (OpenSearch, pgvector, Pinecone) and supporting RAG infrastructure.
• Build secure, observable pipelines for model deployment, versioning, and rollback.
• Monitor token usage, inference latency, and model-serving costs; implement guardrails and rate limiting where needed.
• Manage API gateways and authentication for AI endpoints exposed to internal and external consumers.
Collaboration
• Work closely with backend, frontend, and AI/ML engineers to understand workload requirements and provide reliable infrastructure.
• Document infrastructure, runbooks, and operational procedures so the team can operate confidently.
• Mentor engineers on DevOps practices, deployment hygiene, and cloud-cost awareness.

Required Qualifications

• 3-6 years of professional experience in DevOps, SRE, Cloud, or Platform Engineering roles.
• Strong hands-on experience with AWS across compute, networking, storage, and managed services.
• Solid expertise in Docker, including image building, multi-stage builds, and container runtime troubleshooting.
• Production experience with at least one container orchestrator (ECS, EKS, or Kubernetes).
• Proficiency in at least one IaC tool (Terraform preferred; CloudFormation or AWS CDK acceptable).
• Strong scripting skills in Bash and Python (or Go) for automation and tooling.
• Working knowledge of Linux administration, networking fundamentals (DNS, TCP/IP, HTTP/S, load balancing), and Git-based workflows.
• Comfort deploying and operating applications across mobile (Flutter), web (React), and Python backend (FastAPI) stacks. You don’t need to write the code, but you should understand how to build, ship, and troubleshoot it.
• Hands-on experience setting up and maintaining CI/CD pipelines.
• Experience with monitoring, logging, and alerting tools in production environments.
• Exposure to or genuine interest in operating generative AI / LLM workloads in production (Bedrock, SageMaker, self-hosted inference, or similar); prior experience is preferred but not required.

Nice to Have

• AWS certifications (Solutions Architect, DevOps Engineer, SysOps Administrator).
• Experience operating Kubernetes (EKS) at scale, including Helm and GitOps tools (ArgoCD, Flux).
• Hands-on experience with AI/ML infrastructure at scale: GPU instance optimization, model serving frameworks (vLLM, Triton, TGI), fine-tuning pipelines, or production RAG systems.
• Experience with service meshes (Istio, App Mesh) and API gateways.
• Familiarity with multi-account AWS architectures (AWS Organizations, Control Tower, Landing Zone).
• Experience with healthcare or regulated-data environments (HIPAA, GDPR, SOC 2).
• Familiarity with NGINX, reverse proxies, and domain/SSL management workflows.

Additional relevant knowledge or experience related to the above requirements will be considered an advantage.
