Site Reliability Engineer (SRE) - AI & Platform Focus

Your Mission

DevOps & SRE Culture: Participate in building a DevOps/SRE culture and enable the transition to modern infrastructure management, including AI-augmented workflows.

Infrastructure for AI: Design, deploy, and manage specialized infrastructure for AI services, including the orchestration of LiteLLM proxies, Langfuse for observability, and vector databases.

AI-Driven IaC: Design and implement AI agents to automate Infrastructure-as-Code (IaC) tasks, from automated PR reviews to self-healing infrastructure scripts.

API Management: Own the lifecycle and reliability of our API Gateway via Gravitee, ensuring secure, high-performance traffic management and rate-limiting.

Performance & Resilience: Drive adoption of self-healing patterns and chaos engineering. Implement resiliency patterns (circuit breakers, bulkheads) specifically for LLM API dependencies to manage latency and costs.
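
To illustrate the kind of pattern meant here, the following is a minimal circuit-breaker sketch in Python. It is not our production implementation. The class name, thresholds, and injectable clock are all assumptions made for the example. The wrapped function stands in for any LLM API call.

```python
import time


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on (and paying for) a bad upstream.
                raise RuntimeError("circuit open: call skipped")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail immediately, which bounds both latency and spend when an upstream model provider is degraded.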

Reliability & SLOs: Influence standards for service-level objectives (SLOs) and automate the tracking of error budgets for both traditional microservices and AI-powered features.
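
As a rough sketch of what tracking an error budget involves, the arithmetic can be written in a few lines of Python. The function name and the request-count inputs are assumptions for illustration. In practice the counts would come from a metrics backend.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget left in a window.

    slo_target is the availability objective as a fraction, e.g. 0.999.
    The result is 1.0 when nothing is spent and negative once the budget
    is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic, so no budget has been consumed
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Example: a 99.9% SLO over 1,000,000 requests allows about 1,000 failures,
# so 250 failures leaves roughly three quarters of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

The same calculation applies to an AI-powered feature once a "failed request" is defined for it, for example a timeout or an error returned by the model provider.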

Expert Support: Provide level-3 support, conduct post-mortems, and analyze past incidents to predict and prevent future outages.

We Need You To Have

Kubernetes Mastery: Strong experience with K8s and Helm for deploying complex, multi-tier software systems.

AI Infrastructure (LLMOps): Hands-on experience or strong interest in managing AI-specific tools like LiteLLM (load balancing/caching) and Langfuse (tracing/evaluations).

Automation & Agents: Ability to design agentic workflows that assist in infrastructure management and automate repetitive DevOps tasks.

API Management: Practical experience with Gravitee or similar API gateway solutions (definition, security, and monitoring).

Systems & Networks: Deep knowledge of Linux/Unix, scripting (Shell, Python), and network protocols.

CI/CD & Cloud: Proven track record of building CI/CD pipelines and managing virtualized and containerized environments (Docker).

Data: Knowledge of general SQL/NoSQL database concepts and operations.

Soft Skills: Effective communication in English, proactive problem-solving, and the ability to write high-quality technical documentation.

Our Architecture and Technology Stack

Cloud & Orchestration: AWS (EKS), Kubernetes, Helm, Docker.

AI & Observability: LiteLLM, Langfuse, Datadog, Prometheus, Grafana, Fluentd.

API Gateway: Gravitee.io.

Deployment & GitOps: ArgoCD, GitHub Actions.

Infrastructure as Code: Terraform, Ansible, SaltStack.

OS & Tools: Ubuntu, Debian, Jira, Confluence.

It Would Be Great If You Have

AI Research: Knowledge of how to optimize inference cost and latency for Large Language Models.

Blockchain: Knowledge of Web3 and Blockchain technology.

Autonomy: Ability to work effectively across multiple organizational and geographic boundaries with high autonomy.

Agility: A "quick learner" mindset to keep up with the fast-evolving AI and Cloud-Native ecosystem.