Senior Site Reliability Engineer

Pune, Maharashtra, India

About the job

Job Title: Senior Site Reliability Engineer (SRE) / DevOps Engineer

Location: Pune - In Office

Experience: 10+ years

On-Call Rotation Required (24/7 Production Support)

About the Role

We are seeking a Senior Site Reliability Engineer (SRE) / DevOps Engineer who will be responsible for ensuring the reliability, scalability, security, and performance of our production systems across multi-cloud environments (AWS, GCP, Azure). This role combines strong DevOps automation expertise with true SRE ownership — including on-call participation, incident management, root cause analysis, reliability engineering, and proactive system improvements. The ideal candidate balances incident response and firefighting with long-term engineering improvements that reduce toil, improve SLAs, and strengthen system resilience.

Key Responsibilities

1. Incident Response & On-Call Ownership

Participate in 24/7 on-call rotation for production systems

Rapidly diagnose, mitigate, and resolve high-severity incidents

Lead Root Cause Analysis (RCA) and post-mortem documentation

Implement corrective and preventive measures to avoid recurrence

Maintain SLAs/SLOs and reduce Mean Time to Recovery (MTTR)

2. Reliability Engineering & System Hardening

Design and implement reliability improvements to increase availability and reduce system fragility

Engineer solutions to eliminate repetitive operational work (toil reduction)

Improve redundancy, failover strategies, and disaster recovery planning

Track and improve SRE metrics (availability, latency, error rates, capacity)

3. Infrastructure & Cloud Engineering (Multi-Cloud)

Manage and optimize infrastructure across:

AWS (EC2, S3, RDS, IAM, VPC, CloudWatch)

Google Cloud Platform (GCP) (Compute Engine, Cloud Storage, Cloud SQL, IAM, VPC); GCP experience is a plus

Microsoft Azure (Virtual Machines, Networking, Storage, Azure Monitor)

Administer and optimize Kubernetes clusters

Manage Helm deployments and containerized workloads

Implement Infrastructure as Code (Terraform preferred)

4. Monitoring, Observability & Performance Optimization

Design symptom-based alerting (user-impact driven monitoring)

Implement observability using:

Prometheus

Grafana

Datadog

AWS CloudWatch

Azure Monitor

Analyze system bottlenecks and optimize performance

Improve logging and distributed tracing practices

5. AI & Cloud-Native Workloads (Good to Have)

Support deployment of AI services on Azure (Azure AI Services, AI Foundry)

Assist in infrastructure for RAG (Retrieval-Augmented Generation) workloads

Ensure scalability and reliability of AI/ML systems in production

6. Security & Compliance

Apply cloud security best practices (IAM, network segmentation, secrets management)

Collaborate on vulnerability remediation

Support compliance requirements where applicable

Required Technical Skills

Core Engineering

Strong scripting/programming skills (Python, Bash; Go is a plus)

Deep understanding of Linux systems and networking fundamentals

Experience working in production environments with high uptime requirements

Cloud & Infrastructure

Hands-on experience with at least one major cloud platform (AWS/GCP/Azure)

Kubernetes and container orchestration experience

Infrastructure as Code (Terraform preferred)

Git-based workflows (GitHub / GitLab / Azure Repos)

Monitoring & Observability

Experience with Prometheus, Grafana, Datadog, or similar tools

Understanding of SLIs, SLOs, SLAs

Preferred Qualifications

Experience managing AI/ML workloads in cloud environments

Familiarity with distributed systems architecture

Exposure to OpenSearch / ELK stack

Experience reducing operational toil through automation

Basic knowledge of C# (.NET environments) is a plus

What We're Looking For

Ownership mindset, not just task execution

Calm under pressure during incidents

Strong debugging and analytical thinking skills

Ability to balance immediate incident response with long-term engineering improvements

Collaborative approach with development teams
