Senior ML Infrastructure Engineer (US Company)

Ho Chi Minh, Vietnam

About the job

Our client is seeking a Senior ML Infrastructure Engineer to build the data and model pipelines that power its AI development lifecycle. This role focuses on creating reliable, automated processes for model training, evaluation, and deployment, enabling ML Engineers and Data Scientists to move from experiment to production with speed and confidence.

You will work closely with the platform team (handling Terraform, clusters, observability, and CI/CD) to ensure seamless integration between process-level automation and infrastructure reliability.

Responsibilities:

Design and implement reproducible ML pipelines for data ingestion, feature engineering, model training, evaluation, and deployment.
Develop internal tooling and workflows for experiment tracking, model registry, and automated evaluation.
Develop automated model evaluation and validation pipelines that integrate statistical tests, fairness checks, and performance regression checks.
Optimize training workflows for distributed GPUs and cloud resource efficiency.
Collaborate with ML Engineers to deploy and monitor models in production using modern CI/CD practices, and to abstract complex infrastructure into easy-to-use templates or SDKs that improve team productivity.
Collaborate with Data Scientists to standardize experiment setup, metadata, and artifact management.
Establish observability standards for ML systems (logging, metrics, tracing, alerts).
Partner with the US-based platform and data engineering teams to align pipeline needs with platform capabilities (e.g., storage, compute orchestration, secrets management).

Qualifications:

Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
3+ years of experience in software engineering or ML infrastructure roles.
Proficiency in Python, SQL, and Shell scripting.
Hands-on experience with cloud services (AWS/GCP/Azure) and container orchestration (Docker, Kubernetes).
Experience building data or model pipelines using tools like Airflow, Kubeflow, or MLflow.
Familiarity with model deployment frameworks (TensorFlow Serving, Triton, vLLM, FastAPI).
Experience implementing CI/CD for ML systems and infrastructure as code (e.g., Terraform).
Fluent English communication skills for collaborating with US-based teams.

Preferred Qualifications:

Experience with GPU workload optimization and distributed training.
Experience setting up feature stores, vector databases, or retrieval infrastructure for LLMs.
Familiarity with model monitoring and evaluation in production environments.
Strong understanding of security and compliance in ML infrastructure.