Head of AI Engineering
About the Job
We're building the AI foundation for a company operating at billion-user scale. As Head of AI Engineering, you own everything that lets 200+ Data Scientists and MLEs ship models to production fast, reliably, and cost-efficiently.
If training is slow, serving is down, GPUs are idle, or models take 6 months to ship — that's on you. If we 10x ML iteration speed and cut infra costs 50%, that's also on you.
This is a 0-to-1 build: we have the mandate, budget, and headcount. We need you to set the vision and hire the team.
What You'll Own
1. ML Platform & MLOps – The Paved Road
- Training platform: distributed PyTorch/TF, hyperparameter tuning, experiment tracking, spot-instance management
- Feature platform: real-time + batch feature store, online/offline consistency, lineage
- Model lifecycle: registry, CI/CD, canary deploys, rollbacks, A/B infra, shadow testing
- Monitoring: data drift, model decay, latency, cost attribution, SLOs for 99.99% uptime
- Self-serve tooling that makes the right thing the easy thing for 200+ DS/MLEs
2. GenAI Infrastructure
- LLM platform: model gateway, prompt management, eval harness, RAG infrastructure
- Inference optimization: vLLM, TensorRT-LLM, quantization, KV-cache tuning, multi-LoRA serving
- GPU orchestration: Kubernetes, Ray, large-scale scheduling, utilization >80%
- Guardrails & safety: red-teaming infra, content moderation, responsible AI tooling
- Build vs buy decisions: when to use OpenAI/Anthropic, when to fine-tune, and when to train from scratch
3. Core Serving & Real-time ML
- Low-latency inference: 10ms p99 at 1M+ QPS across regions
- Online learning & real-time features: Flink, Kafka, Feast/Tecton patterns
- Edge ML and on-device optimization where needed
- Cost efficiency: Your team's optimizations save eight figures annually
4. Team & Strategy
- Hire and lead a 30-60 person org: ML Platform, Serving, Feature Store, GenAI Infra, SRE for ML
- Set 3-year technical vision for AI infra. Review major RFCs and architecture decisions
- Partner with Head of Data Science to unblock model launches and define interfaces
- Partner with Data Eng on shared infrastructure: compute, storage, metadata
- Represent AI Engineering with execs, board, and externally. Own the narrative
What You'll Bring
Must-haves:
- 12+ YOE in software/infra engineering, with 5+ YOE leading ML Platform, AI Infra, or similar orgs of 25+ people
- You've personally scaled ML systems to 100k+ QPS and 1000+ GPUs. You know what breaks
- Deep expertise across the stack: Kubernetes, Python, C++/Go, distributed systems, CUDA
- Strong technical judgment: You've made build vs buy calls on feature stores, training platforms, LLM infra
- Still hands-on: Can dive deep on a p99 latency spike, review a design doc, debug NCCL hangs
- Track record hiring and growing Staff+ ICs and managers. People follow you
Nice-to-haves:
- Experience supporting ranking/recsys, ads, or large-scale consumer ML
- Shipped GenAI to prod: RAG, agents, fine-tuning at scale with real users
- Open source cred: Kubeflow, Ray, vLLM, etc. Or strong publications in ML systems
- Experience with multi-region, active-active ML serving
What Success Looks Like – Year 1
- Velocity: P50 time from notebook to prod experiment < 2 weeks for 80% of use cases
- Reliability: Model serving SLO 99.99%. No SEV-0s caused by platform
- Efficiency: GPU utilization >75%. ML infra cost per prediction down 30%
- Adoption: 90% of DS/MLEs use the paved road vs building one-off solutions
- Team: Hired 4+ Staff+ ICs and 2+ managers. Team eNPS >80
Why This Role
- Greenfield: No legacy. You set the standards every AI company in SEA will copy in 5 years
- Scale: Problems you solve here don't exist at 99% of companies. 1 ms saved = millions of dollars
- Leverage: Your platform multiplies 200 people. Best ROI in the company