Senior Site Reliability Engineer
Job Brief:
We are seeking an experienced Senior Site Reliability Engineer (SRE) to join our team and play a critical role in ensuring the reliability, scalability, and performance of our systems. The ideal candidate will have a strong background in infrastructure management and automation, and a passion for improving the reliability of mission-critical systems.
Responsibilities:
- Design, implement, and maintain highly available and scalable infrastructure solutions.
- Develop and maintain automated deployment, monitoring, and alerting systems to ensure system reliability and performance.
- Collaborate with development teams to design, implement, and maintain CI/CD pipelines for automated testing and deployment.
- Lead incident response and resolution efforts, including root cause analysis and post-incident reviews.
- Implement and enforce best practices for system security, including access controls, data encryption, and vulnerability management.
- Proactively identify performance bottlenecks and optimization opportunities in the infrastructure and application stack.
- Participate in capacity planning and resource allocation to ensure scalability and cost efficiency.
- Mentor junior engineers and provide technical guidance on best practices for reliability engineering.
Requirements:
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 5+ years of experience in site reliability engineering, systems administration, or a related field.
- Strong expertise in cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
- Proficiency in infrastructure as code (IaC) tools such as Terraform, CloudFormation, or Ansible.
- Experience with container orchestration systems such as Kubernetes or Docker Swarm.
- Deep understanding of Linux/Unix systems administration and networking concepts.
- Familiarity with monitoring and observability tools such as Prometheus, Grafana, ELK Stack, or Datadog.
- Knowledge of scripting languages such as Python, Bash, or PowerShell.
- Strong problem-solving skills and ability to troubleshoot complex issues in production environments.
- Excellent communication and collaboration skills, with the ability to work effectively in a team environment.
Preferred Qualifications:
- A cloud platform certification such as AWS Certified Solutions Architect, Google Cloud Professional Cloud Architect, or Microsoft Certified: Azure Solutions Architect Expert.
- Experience with service mesh technologies such as Istio or Linkerd.
- Knowledge of distributed systems design principles and microservices architecture.
- Familiarity with agile methodologies and DevOps practices.
- Contributions to open-source projects or participation in the SRE community.