Head of AI Engineering

Singapore

About the job

We're building the AI foundation for a company operating at billion-user scale. As Head of AI Engineering, you own everything that lets 200+ Data Scientists and MLEs ship models to production fast, reliably, and cost-efficiently.

If training is slow, serving is down, GPUs are idle, or models take 6 months to ship — that's on you. If we 10x ML iteration speed and cut infra costs 50%, that's also on you.

This is a 0-to-1 build: we have the mandate, budget, and headcount. We need you to set the vision and hire the team.

What You'll Own

1. ML Platform & MLOps – The Paved Road

Training platform: distributed PyTorch/TF, hyperparameter tuning, experiment tracking, spot management
Feature platform: real-time + batch feature store, online/offline consistency, lineage
Model lifecycle: registry, CI/CD, canary deploys, rollbacks, A/B infra, shadow testing
Monitoring: data drift, model decay, latency, cost attribution, SLOs for 99.99% uptime
Self-serve tooling that makes the right thing the easy thing for 200+ DS/MLEs

2. GenAI Infrastructure

LLM platform: model gateway, prompt management, eval harness, RAG infrastructure
Inference optimization: vLLM, TensorRT-LLM, quantization, KV-cache tuning, multi-LoRA serving
GPU orchestration: Kubernetes, Ray, large-scale scheduling, utilization >80%
Guardrails & safety: red-teaming infra, content moderation, responsible AI tooling
Build vs buy decisions: When to use OpenAI/Anthropic vs fine-tune vs train from scratch

3. Core Serving & Real-time ML

Low-latency inference: 10ms p99 at 1M+ QPS across regions
Online learning & real-time features: Flink, Kafka, Feast/Tecton patterns
Edge ML and on-device optimization where needed
Cost efficiency: Your team's optimizations save eight figures annually

4. Team & Strategy

Hire and lead a 30-60 person org: ML Platform, Serving, Feature Store, GenAI Infra, SRE for ML
Set 3-year technical vision for AI infra. Review major RFCs and architecture decisions
Partner with Head of Data Science to unblock model launches and define interfaces
Partner with Data Eng on shared infrastructure: compute, storage, metadata
Represent AI Engineering with execs, board, and externally. Own the narrative

What You'll Bring

Must-haves:

12+ YOE in software/infra engineering, with 5+ YOE leading ML Platform, AI Infra, or similar orgs of 25+ people
You've personally scaled ML systems to 100k+ QPS and 1000+ GPUs. You know what breaks
Deep expertise across the stack: Kubernetes, Python, C++/Go, distributed systems, CUDA
Strong technical judgment: You've made build vs buy calls on feature stores, training platforms, LLM infra
Still hands-on: Can dive deep on a p99 latency spike, review a design doc, debug NCCL hangs
Track record hiring and growing Staff+ ICs and managers. People follow you

Nice-to-haves:

Experience supporting ranking/recsys, ads, or large-scale consumer ML
Shipped GenAI to prod: RAG, agents, fine-tuning at scale with real users
Open source cred: Kubeflow, Ray, vLLM, etc. Or strong publications in ML systems
Experience with multi-region, active-active ML serving

What Success Looks Like – Year 1

Velocity: P50 time from notebook to prod experiment < 2 weeks for 80% of use cases
Reliability: Model serving SLO 99.99%. No SEV-0s caused by platform
Efficiency: GPU utilization >75%. ML infra cost per prediction down 30%
Adoption: 90% of DS/MLEs use the paved road vs building one-off solutions
Team: Hired 4+ Staff+ ICs and 2+ managers. Team eNPS >80

Why This Role

Greenfield: No legacy. You set the standards every AI company in SEA will copy in 5 years
Scale: Problems you solve here don't exist at 99% of companies. 1ms saved = $millions
Leverage: Your platform multiplies 200 people. Best ROI in the company
