Site Reliability Engineer (SRE)

About the Job

Qiscus is an Agentic Customer Engagement Platform that helps businesses deliver excellent customer experience (CX) through scalable and reliable conversations. As our platform grows, reliability, availability, and automation become critical.

At Qiscus, a Site Reliability Engineer (SRE) is responsible for ensuring our systems are stable, observable, secure, and scalable by combining practices from infrastructure engineering, DevOps, and system reliability.

What You Will Do

  • Build, maintain, and optimize reliable and scalable infrastructure for Qiscus products.
  • Develop automation tools and scripts to reduce operational toil and human error.
  • Design and improve troubleshooting, incident response, and maintenance procedures.
  • Implement and maintain monitoring, logging, and alerting systems.
  • Investigate, troubleshoot, and perform root cause analysis (RCA) for production issues.
  • Ensure high availability, performance, and security across multiple environments.
  • Manage deployment, configuration, and infrastructure for staging and production.
  • Collaborate with engineering teams during releases and incidents.
  • Participate in the on-call rotation and incident handling.

What You Will Bring to the Role

  • Proven experience as a System Administrator, Infrastructure Engineer, System Engineer, Network Engineer, DevOps Engineer, SRE, or in a similar role.
  • Strong foundation in Linux & Networking (CLI, Permissions, Services, TCP/UDP, DNS, HTTP/TLS, Load Balancing, Basic Firewall, Log Analysis, and Troubleshooting).
  • Experience with Cloud Infrastructure (AWS preferred): EC2, VPC, Security Groups, RDS, and other AWS services, including cost optimization and resource sizing.
  • Hands-on experience with Docker & Kubernetes (Deployment, Scaling, Logs, Events, Helm, HPA, PDB, Ingress); Cluster Autoscaler or Karpenter is a plus.
  • Familiarity with Observability & Monitoring Tools (Prometheus, Grafana, Promtail) and basic tracing concepts.
  • Strong CI/CD & Automation skills, including Bash scripting and troubleshooting pipelines (GitHub Actions and Jenkins). Infrastructure as Code (Terraform and Ansible) and GitOps (ArgoCD and Flux) are a plus.
  • Experience with real-time systems, security practices (Kubernetes and OS hardening, CIS Benchmarks), and workflow automation (n8n and Slack integrations) is a plus.
  • Strong problem-solving skills, good communication, and willingness to join the on-call rotation.