Senior Site Reliability Engineer
About the Role
We're seeking a Senior Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure operate with exceptional reliability, performance, and security. You'll define and maintain SLOs, build incident response systems, manage capacity across our distributed GPU network, and implement secure rollout and rollback mechanisms.
Requirements
- Experience in site reliability engineering, including working with SLOs and SLAs for production systems
- Experience with capacity planning and resource management for distributed systems
- Experience with incident response, on-call rotations, and post-mortem processes
- Experience with deployment systems (e.g., canary deployments, feature flags, automated rollbacks)
- Experience with observability tooling and practices (e.g., Prometheus, Grafana, and the ELK stack for metrics, logging, tracing, and alerting)
- Experience with infrastructure security (e.g., network segmentation, workload isolation, security hardening)
- Experience with secrets management and key management systems (KMS)
- Experience with compliance frameworks (e.g., SOC 2, ISO 27001)
- Experience debugging distributed systems
- Experience with infrastructure-as-code, configuration management, and CI/CD pipelines
Bonus Skills
- Experience operating GPU infrastructure, AI/ML platforms, or compute marketplaces at scale
- Knowledge of multi-tenancy security patterns, container security, and runtime security tools
- Experience with chaos engineering, fault injection, and resilience testing
- Experience building and operating systems with uptime SLAs of 99.9% or higher