Site Reliability Engineer (SRE) - AI & Platform Focus
Your Mission
DevOps & SRE Culture: Participate in building a DevOps/SRE culture and enable the transition to modern infrastructure management, including AI-augmented workflows.
Infrastructure for AI: Design, deploy, and manage specialized infrastructure for AI services, including LiteLLM proxies, Langfuse for observability, and vector databases.
AI-Driven IaC: Design and implement AI agents to automate Infrastructure-as-Code (IaC) tasks, from automated PR reviews to self-healing infrastructure scripts.
API Management: Own the lifecycle and reliability of our API Gateway via Gravitee, ensuring secure, high-performance traffic management and rate-limiting.
Performance & Resilience: Drive adoption of self-healing patterns and chaos engineering. Implement resiliency patterns (circuit breakers, bulkheads) specifically for LLM API dependencies to manage latency and costs.
Reliability & SLOs: Influence standards for service-level objectives (SLOs) and automate the tracking of error budgets for both traditional microservices and AI-powered features.
Expert Support: Provide level-3 support, conduct post-mortems, and apply analytics to past incidents to predict and prevent future outages.
We Need You To Have
Kubernetes Mastery: Strong experience with K8s and Helm for deploying complex, multi-tier software systems.
AI Infrastructure (LLMOps): Hands-on experience with, or strong interest in, managing AI-specific tools such as LiteLLM (load balancing/caching) and Langfuse (tracing/evaluations).
Automation & Agents: Ability to design agentic workflows that assist in infrastructure management and automate repetitive DevOps tasks.
API Management: Practical experience with Gravitee or similar API gateway solutions (definition, security, and monitoring).
Systems & Networks: Deep knowledge of Linux/Unix, scripting (Shell, Python), and network protocols.
CI/CD & Cloud: Proven track record of building CI/CD pipelines and managing virtualized/containerized environments (Docker).
Data: Knowledge of general SQL/NoSQL database concepts and operations.
Soft Skills: Effective communication in English, proactive problem-solving, and the ability to write high-quality technical documentation.
Our Architecture and Technology Stack
Cloud & Orchestration: AWS (EKS), Kubernetes, Helm, Docker.
AI & Observability: LiteLLM, Langfuse, Datadog, Prometheus, Grafana, Fluentd.
API Gateway: Gravitee.io.
Deployment & GitOps: ArgoCD, GitHub Actions.
Infrastructure as Code: Terraform, Ansible, SaltStack.
OS & Tools: Ubuntu, Debian, Jira, Confluence.
It Would Be Great If You Have
AI Research: Knowledge of how to optimize inference costs and latency for Large Language Models.
Blockchain: Knowledge of Web3 and Blockchain technology.
Autonomy: Ability to work effectively across multiple organizational and geographic boundaries with high autonomy.
Agility: A "quick learner" mindset to keep up with the fast-evolving AI and Cloud-Native ecosystem.