HPC Service Delivery Manager
Job Title: Service Delivery Manager – HPC Managed Services
Location: Singapore
Working Hours: 08:30 – 18:00, Monday to Friday (excluding Public Holidays)
Role Overview
The Service Delivery Manager (SDM) governs the end-to-end delivery, quality, and continual improvement of Managed Services for the Authority's HPC (High-Performance Computing) environment. This role ensures operational excellence, risk management, compliance with contractual SLAs, and effective coordination between technical specialists, OEM partners, and the Authority.
The SDM acts as the primary operational interface to the customer, ensuring that ITIL-based service management processes and evidence-driven reporting are consistently applied across all HPC service towers—from OS and Scheduler Operations to Filesystem, Network, Data Lake, and Cybersecurity functions.
Key Responsibilities
1. Governance and Service Oversight
- Lead and oversee daily service delivery across all HPC operational domains, including compute, interconnect, storage, data lake, and supporting virtualisation platforms.
- Govern ITSM adherence across Incident, Problem, Change, Release, Access, and Knowledge Management processes.
- Ensure all service artefacts, records, and evidence are maintained for audit readiness and KPI reporting.
- Serve as escalation owner for operational, service quality, and performance issues.
2. Service Management and Compliance
- Maintain ITIL-aligned operational processes, SLAs, and SLOs; ensure service metrics are tracked and reported monthly.
- Chair operational governance meetings (daily, weekly, monthly) with the Authority and internal teams.
- Drive continuous improvement and risk remediation plans, linking incidents and problems to measurable root causes and corrective actions.
- Coordinate with Security and Compliance representatives to validate vulnerability closure, patch cadence, and control hygiene.
3. Stakeholder and Customer Engagement
- Act as the primary customer point of contact for service performance, escalations, and audit reviews.
- Manage communications during P1/P2 incidents and planned maintenance.
- Align service delivery goals with customer business outcomes and Authority policies.
- Support service catalogue maintenance and user-facing communication consistency.
4. Operational Performance and Reporting
- Ensure all service elements under scope (compute, storage, scheduler, data lake, IAM, monitoring, cybersecurity) meet defined performance and stability objectives.
- Review and validate operational evidence—health checks, configuration drift reports, backup logs, SLI dashboards, and utilisation trends.
- Produce monthly service and KPI dashboards with analysis, risk register updates, and continual improvement actions.
- Oversee QA of staging and production change records with associated evidence (test results, rollback verification, and post-implementation reviews).
5. Leadership and Team Coordination
- Mentor and guide technical leads across multiple HPC service towers; coordinate cross-domain incident response and problem resolution.
- Maintain an escalation matrix for both technical and management layers, ensuring compliance with agreed response timelines.
- Drive a culture of accountability, evidence-based operations, and service excellence within the managed service team.
Required Qualifications and Competencies
Education & Certification
- Bachelor's degree in Computer Science, Information Systems, or equivalent technical discipline.
- ITIL 4 Foundation certification or higher (Practitioner or Intermediate preferred).
Experience
- Minimum 5 years of experience in IT Service Management or delivery leadership, ideally in large-scale managed services.
- Proven experience managing HPC, research computing, or mission-critical scientific computing environments.
- Demonstrated success in customer-facing governance and operational reporting roles.
- Strong understanding of HPC system components: OS/runtime environments, schedulers, parallel file systems, high-speed networks (InfiniBand/Ethernet), virtualisation, and observability tooling.
Technical and Management Skills
- Familiarity with ITSM tools, runbook governance, and evidence retention practices.
- Competency in capacity management, incident command, and continual improvement cycles.
- Excellent analytical, documentation, and communication skills.
- Ability to coordinate multi-vendor, multi-domain operations with minimal supervision.
Key Attributes
- Evidence-driven and methodical in governance and risk assessment.
- Comfortable navigating both technical and executive discussions.
- Client-centred mindset with a proactive focus on service resilience and improvement.
- Strong leadership under operational pressure and tight SLAs.