Head of AI Engineering

About the job

We're building the AI foundation for a company operating at billion-user scale. As Head of AI Engineering, you own everything that lets 200+ Data Scientists and MLEs ship models to production fast, reliably, and cost-efficiently.

If training is slow, serving is down, GPUs are idle, or models take 6 months to ship — that's on you. If we 10x ML iteration speed and cut infra costs 50%, that's also on you.

This is a 0→1 build: we have the mandate, budget, and headcount. We need you to set the vision and hire the team.

What You'll Own

1. ML Platform & MLOps – The Paved Road

  • Training platform: distributed PyTorch/TF, hyperparameter tuning, experiment tracking, spot instance management
  • Feature platform: real-time + batch feature store, online/offline consistency, lineage
  • Model lifecycle: registry, CI/CD, canary deploys, rollbacks, A/B infra, shadow testing
  • Monitoring: data drift, model decay, latency, cost attribution, SLOs for 99.99% uptime
  • Self-serve tooling that makes the right thing the easy thing for 200+ DS/MLEs

2. GenAI Infrastructure

  • LLM platform: model gateway, prompt management, eval harness, RAG infrastructure
  • Inference optimization: vLLM, TensorRT-LLM, quantization, KV-cache tuning, multi-LoRA serving
  • GPU orchestration: Kubernetes, Ray, large-scale scheduling, utilization >80%
  • Guardrails & safety: red-teaming infra, content moderation, responsible AI tooling
  • Build vs buy decisions: When to use OpenAI/Anthropic vs fine-tune vs train from scratch

3. Core Serving & Real-time ML

  • Low-latency inference: 10ms p99 at 1M+ QPS across regions
  • Online learning & real-time features: Flink, Kafka, Feast/Tecton patterns
  • Edge ML and on-device optimization where needed
  • Cost efficiency: Your team's optimizations save eight figures annually

4. Team & Strategy

  • Hire and lead a 30-60 person org: ML Platform, Serving, Feature Store, GenAI Infra, SRE for ML
  • Set 3-year technical vision for AI infra. Review major RFCs and architecture decisions
  • Partner with Head of Data Science to unblock model launches and define interfaces
  • Partner with Data Eng on shared infrastructure: compute, storage, metadata
  • Represent AI Engineering with execs, board, and externally. Own the narrative

What You'll Bring

Must-haves:

  • 12+ YOE in software/infra engineering, with 5+ YOE leading ML Platform, AI Infra, or similar orgs of 25+ people
  • You've personally scaled ML systems to 100k+ QPS and 1000+ GPUs. You know what breaks
  • Deep expertise across the stack: Kubernetes, Python, C++/Go, distributed systems, CUDA
  • Strong technical judgment: You've made build vs buy calls on feature stores, training platforms, LLM infra
  • Still hands-on: Can dive deep on a p99 latency spike, review a design doc, debug NCCL hangs
  • Track record hiring and growing Staff+ ICs and managers. People follow you

Nice-to-haves:

  • Experience supporting ranking/recsys, ads, or large-scale consumer ML
  • Shipped GenAI to prod: RAG, agents, fine-tuning at scale with real users
  • Open source cred: Kubeflow, Ray, vLLM, etc. Or strong publications in ML systems
  • Experience with multi-region, active-active ML serving

What Success Looks Like – Year 1

  1. Velocity: P50 time from notebook → prod experiment < 2 weeks for 80% of use cases
  2. Reliability: Model serving SLO 99.99%. No SEV-0s caused by platform
  3. Efficiency: GPU utilization >75%. ML infra cost per prediction down 30%
  4. Adoption: 90% of DS/MLEs use the paved road vs building one-off solutions
  5. Team: Hired 4+ Staff+ ICs and 2+ managers. Team eNPS >80

Why This Role

  • Greenfield: No legacy. You set the standards every AI company in SEA will copy in 5 years
  • Scale: Problems you solve here don't exist at 99% of companies. 1ms saved = $millions
  • Leverage: Your platform multiplies 200 people. Best ROI in the company