Senior Site Reliability Engineer (SRE)

Job Brief:

We are seeking an experienced Senior Site Reliability Engineer (SRE) to join our team and play a critical role in ensuring the reliability, scalability, and performance of our systems. The ideal candidate will have a strong background in infrastructure management and automation, and a passion for improving the reliability of mission-critical systems.

Responsibilities:

  • Design, implement, and maintain highly available and scalable infrastructure solutions.
  • Develop and maintain automated deployment, monitoring, and alerting systems to ensure system reliability and performance.
  • Collaborate with development teams to design, implement, and maintain CI/CD pipelines for automated testing and deployment.
  • Lead incident response and resolution efforts, including root cause analysis and post-incident reviews.
  • Implement and enforce best practices for system security, including access controls, data encryption, and vulnerability management.
  • Proactively identify performance bottlenecks and optimization opportunities in the infrastructure and application stack.
  • Participate in capacity planning and resource allocation to ensure scalability and cost efficiency.
  • Mentor junior engineers and provide technical guidance on best practices for reliability engineering.

Requirements:

  • Bachelor's degree in Computer Science, Information Technology, or a related field.
  • 5+ years of experience in site reliability engineering, systems administration, or a related field.
  • Strong expertise in cloud computing platforms such as AWS, Azure, or Google Cloud Platform.
  • Proficiency in infrastructure as code (IaC) tools such as Terraform, CloudFormation, or Ansible.
  • Experience with container orchestration systems such as Kubernetes or Docker Swarm.
  • Deep understanding of Linux/Unix systems administration and networking concepts.
  • Familiarity with monitoring and observability tools such as Prometheus, Grafana, ELK Stack, or Datadog.
  • Knowledge of scripting languages such as Python, Bash, or PowerShell.
  • Strong problem-solving skills and ability to troubleshoot complex issues in production environments.
  • Excellent communication and collaboration skills, with the ability to work effectively in a team environment.

Preferred Qualifications:

  • A cloud platform certification such as AWS Certified Solutions Architect, Google Cloud Certified Professional Cloud Architect, or Azure Solutions Architect.
  • Experience with service mesh technologies such as Istio or Linkerd.
  • Knowledge of distributed systems design principles and microservices architecture.
  • Familiarity with agile methodologies and DevOps practices.
  • Contributions to open-source projects or participation in the SRE community.