Site Reliability Engineer
Company Description
Aqilea is an IT and engineering consulting partner that helps companies get more out of their technology and operations. With teams in Stockholm and Bangalore, we work closely with our clients to build solutions that fit their needs - from software development, AI and infrastructure engineering to industrial automation and embedded systems.
We combine strong technical expertise with a practical, business-focused approach to help organizations modernize, improve security, and scale with confidence. Above all, we focus on long-term partnerships built on trust, quality, and real results.
With us, you have real opportunities to advance your career and take on significant responsibility.
Site Reliability Engineer (SRE)
We are looking for a highly motivated and experienced Site Reliability Engineer (SRE) to join our growing engineering team in Bangalore. The ideal candidate will be responsible for ensuring the reliability, scalability, performance, and operational excellence of our cloud-native ecommerce platforms and applications.
Key Responsibilities
- Collaborate with cross-functional product teams to ensure high availability, stability, and reliability of production systems.
- Monitor, troubleshoot, and resolve complex production incidents and performance issues.
- Perform root cause analysis (RCA) and implement preventive measures to avoid recurring issues.
- Build and enhance monitoring, alerting, and observability solutions using tools such as Grafana, Splunk, and related platforms.
- Define and maintain SRE metrics including SLIs, SLOs, and Error Budgets.
- Automate operational workflows, deployment processes, and housekeeping activities to improve efficiency and reduce manual effort.
- Develop and maintain CI/CD pipelines using GitHub Actions and DevOps best practices.
- Provision and manage infrastructure using Infrastructure as Code (IaC) tools such as Terraform and Ansible.
- Support and manage cloud-native environments on Azure and GCP platforms.
- Work with Kubernetes platforms including AKS and GKE for container orchestration and deployment management.
- Collaborate with development teams to improve application reliability, scalability, and performance.
- Participate in on-call rotations and provide technical support for business-critical production incidents.
- Continuously evaluate and recommend improvements in tools, processes, and operational practices.
Required Skills & Qualifications
- 5–10 years of experience in Site Reliability Engineering, DevOps, Production Support, or related areas.
- Strong experience in production operations, incident management, and troubleshooting distributed systems.
- Hands-on experience with Kubernetes platforms such as AKS and GKE.
- Strong understanding of cloud platforms including Azure and GCP.
- Experience with CI/CD pipeline development and automation using GitHub Actions.
- Proficiency in Infrastructure as Code (Terraform/Ansible).
- Experience with monitoring and logging tools such as Grafana, Splunk, Prometheus, or similar.
- Good understanding of microservices architecture and API-driven systems.
- Experience with ITSM processes and tools such as ServiceNow.
- Familiarity with SRE principles including SLI, SLO, and Error Budget concepts.
- Strong scripting or programming skills in at least one language such as Python, Go, Java, Ruby, or C#.
- Excellent analytical, troubleshooting, and problem-solving skills.
Start: Immediate, or within 15 days