Lead Site Reliability Engineer
Job Description:
My client is seeking a Site Reliability Engineer (SRE) to join their growing infrastructure team in Asia. As an SRE, you will play a critical role in ensuring the stability, scalability, and resilience of our production systems. You'll help scale our services, define and enforce reliability standards, and continuously improve our engineering infrastructure. You are eager to contribute to the digital asset and blockchain space, drive innovation at the intersection of finance and technology, and advance our mission of engineering a more open and accessible financial system through institutional-grade infrastructure and services.
What You'll Do:
- Ensure the reliability, uptime, and performance of critical infrastructure and services across our digital asset platform.
- Develop and maintain Infrastructure as Code (IaC) using Terraform, supporting reproducibility and automation of our environments.
- Maintain and operate Kubernetes-based containerized deployments, ensuring efficient scaling and fault tolerance.
- Manage and optimize our AWS cloud infrastructure, ensuring cost-effective and secure operations.
- Build, configure, and enhance monitoring and alerting pipelines using Datadog, enabling proactive detection and resolution of issues.
- Own and improve the CI/CD pipeline using Jenkins, JFrog, Flux, and GitHub Actions, ensuring fast, safe, and auditable deployments.
- Collaborate with security, platform, and product engineering teams to ensure high availability, disaster recovery, and incident response capabilities.
- Participate in an on-call rotation, incident retrospectives, and operational reviews to continually improve service delivery.
What We're Looking For:
- 5+ years of experience in an SRE, DevOps, or similar role in a high-availability production environment.
- Strong experience with Terraform and Kubernetes in managing and scaling infrastructure.
- Proficiency with AWS services including EC2, EKS, S3, RDS, and IAM.
- Hands-on experience with CI/CD tools like Jenkins, JFrog Artifactory, Flux, and GitHub workflows.
- Solid understanding of system monitoring, alerting, and performance management using tools like Datadog.
- Knowledge of scripting languages (e.g., Python, Bash) for automation tasks.
- Familiarity with GitOps practices and version-controlled infrastructure.
- Strong communication and collaboration skills for working with cross-functional teams.
Bonus Points:
- Experience working in a regulated financial environment or crypto-native infrastructure.
- Background in systems security, performance tuning, or cost optimization.
- Contributions to open-source projects or community involvement in the DevOps/SRE space.
Required Skills:
Reliability