Manager - Site Reliability Engineer
About the job
Overview
A dynamic and skilled Manager - SRE is required to drive reliability, performance, and operational excellence across critical systems and services. This role involves working closely with engineering teams to build scalable infrastructure, streamline processes, and ensure seamless service delivery. The ideal candidate will have strong troubleshooting skills, deep technical understanding, and the leadership ability to guide SRE practices.
Key Responsibilities:
- Lead and manage a team of Site Reliability Engineers, providing guidance, mentorship, and support.
- Collaborate with cross-functional teams to define and implement strategies for improving system reliability, scalability, and performance.
- Monitor and analyze system performance metrics, identifying areas for improvement and implementing proactive solutions.
- Troubleshoot and resolve complex technical issues, ensuring minimal impact on system availability.
- Implement and maintain monitoring, alerting, and incident response systems.
- Develop and maintain documentation for system configurations, processes, and procedures.
- Stay up to date with industry trends and emerging technologies, recommending and implementing innovative solutions.
Job Requirements:
- Previous experience in a similar role, managing a team of Site Reliability Engineers.
- Strong knowledge of Kubernetes.
- Proficiency in scripting and automation using languages like Python, Bash, or PowerShell.
- Experience with monitoring and logging tools, such as Prometheus, Grafana, or the ELK stack.
- Excellent problem-solving and troubleshooting skills.
- Strong communication and leadership abilities.