Job Title: Senior Site Reliability Engineer (SRE) / DevOps Engineer
Location: Pune - In Office
Experience: 4–8 Years
On-Call Rotation Required (24/7 Production Support)
About the Role
We are seeking a Senior Site Reliability Engineer (SRE) / DevOps Engineer who will be responsible for ensuring the reliability, scalability, security, and performance of our production systems across multi-cloud environments (AWS, GCP, Azure). This role combines strong DevOps automation expertise with true SRE ownership — including on-call participation, incident management, root cause analysis, reliability engineering, and proactive system improvements. The ideal candidate balances incident response and firefighting with long-term engineering improvements that reduce toil, improve SLAs, and strengthen system resilience.
Key Responsibilities
1. Incident Response & On-Call Ownership
Participate in 24/7 on-call rotation for production systems
Rapidly diagnose, mitigate, and resolve high-severity incidents
Lead Root Cause Analysis (RCA) and post-mortem documentation
Implement corrective and preventive measures to avoid recurrence
Maintain SLAs/SLOs and reduce Mean Time to Recovery (MTTR)
2. Reliability Engineering & System Hardening
Design and implement reliability improvements to increase availability and reduce system fragility
Engineer solutions to eliminate repetitive operational work (toil reduction)
Improve redundancy, failover strategies, and disaster recovery planning
Track and improve SRE metrics (availability, latency, error rates, capacity)
3. Infrastructure & Cloud Engineering (Multi-Cloud)
Manage and optimize infrastructure across:
AWS (EC2, S3, RDS, IAM, VPC, CloudWatch)
Google Cloud Platform (GCP) (Compute Engine, Cloud Storage, Cloud SQL, IAM, VPC) (GCP experience is a plus)
Microsoft Azure (Virtual Machines, Networking, Storage, Azure Monitor)
Administer and optimize Kubernetes clusters
Manage Helm deployments and containerized workloads
Implement Infrastructure as Code (Terraform preferred)
4. Monitoring, Observability & Performance Optimization
Design symptom-based alerting (user-impact driven monitoring)
Implement observability using:
Prometheus
Grafana
Datadog
AWS CloudWatch
Azure Monitor
Analyze system bottlenecks and optimize performance
Improve logging and distributed tracing practices
5. AI & Cloud-Native Workloads (Good to Have)
Support deployment of AI services on Azure (Azure AI Services, AI Foundry)
Assist in infrastructure for RAG (Retrieval-Augmented Generation) workloads
Ensure scalability and reliability of AI/ML systems in production
6. Security & Compliance
Apply cloud security best practices (IAM, network segmentation, secrets management)
Collaborate on vulnerability remediation
Support compliance requirements where applicable
Required Technical Skills
Core Engineering
Strong scripting/programming skills (Python, Bash; Go is a plus)
Deep understanding of Linux systems and networking fundamentals
Experience working in production environments with high uptime requirements
Cloud & Infrastructure
Hands-on experience with at least one major cloud platform (AWS/GCP/Azure)
Kubernetes and container orchestration experience
Infrastructure as Code (Terraform preferred)
Git-based workflows (GitHub / GitLab / Azure Repos)
Monitoring & Observability
Experience with Prometheus, Grafana, Datadog, or similar tools
Understanding of SLIs, SLOs, and SLAs
Preferred Qualifications
Experience managing AI/ML workloads in cloud environments
Familiarity with distributed systems architecture
Exposure to OpenSearch / ELK stack
Experience reducing operational toil through automation
Basic knowledge of C# (.NET environments) is a plus
What We're Looking For
Ownership mindset, not just task execution
Calm under pressure during incidents
Strong debugging and analytical thinking skills
Ability to balance immediate incident response with long-term engineering improvements
Collaborative approach with development teams