Site Reliability Engineer (DevOps, Linux)

Cyberjaya, Selangor, Malaysia

Our esteemed client, an established MNC, is searching for a Site Reliability Engineer:

Job Responsibilities:

Oversee observability, capacity planning, issue analysis, and troubleshooting for large-scale, cloud-native applications in a microservices architecture.
Debug and automate routine tasks across operating systems, networks, databases, and application servers, leveraging programming skills beyond basic scripting.
Apply DevOps processes and programming knowledge in at least one of the following languages: Java, Python, or Go.
Use scripting, infrastructure-as-code, and configuration management tools such as Shell, Terraform, Ansible, Chef, or Puppet for automation and infrastructure management.
Possess deep expertise in Unix/Linux systems, virtual machines, containers, container management systems, enterprise cloud platforms, and data structures.
Manage the full lifecycle of services, from launch through deployment, operation, and optimization, ensuring reliability and a seamless user experience.
Monitor and enhance service reliability by measuring availability, latency, and system health while implementing sustainable incident response strategies.
Gather and analyze metrics to optimize performance and troubleshoot issues at every priority level (P0/P1/P2/P3).
Contribute to system design recommendations and platform management, and balance feature development speed with reliability based on service level objectives.
Continuously measure and optimize system performance, anticipating and addressing potential user needs while driving innovation and improvements.

Job Requirements:

Bachelor's degree or higher in Computer Science, Electronics & Communication, or a related field.
Minimum 2 years' experience in a related field.
Strong understanding of SRE principles and DevOps processes.
Exposure to data-driven decision-making and trend analysis.
Experience designing automation frameworks using SaltStack, Spinnaker, or StackStorm.
Experience managing large-scale big data clusters and optimizing data processing efficiency.
Knowledge of Chaos Engineering principles for system resilience testing.
Expertise in large-scale container management platforms with auto-scaling and intelligent scheduling.
Experience in big data analysis, data science, or large-scale data development.
Understanding of SIEM (Security Information and Event Management), threat modeling, and vulnerability detection.
Hands-on experience in cloud services network design, policy creation, and performance tuning.
Proficiency in database consistency checks, slow query optimization, and middleware performance tuning for RDBMS, NoSQL, and distributed caches.

Additional Information:

For interested parties, kindly click on "APPLY NOW" or send in your resume in MS Word format to

*We regret that only shortlisted candidates will be notified*

TSTAR Recruit Pte Ltd | EA Licence No.: 22C1039 | Co. Reg. No.: 202207088Z | EA Registration No.: R1767370 (SIA KAI SING)
