Principal Engineer, System Software Platform Engineering

ABOUT CLIENT

Our client is a leading technology company specializing in graphics processing units (GPUs) and artificial intelligence (AI).

JOB DESCRIPTION

Create and manage a platform for AI that provides services for multiple users, handles identity and policy management, configures quotas, and controls costs. Additionally, this platform should offer easy paths for teams to work on AI projects.
Oversee the deployment of AI models at scale, including routing, autoscaling, and implementing safety measures to ensure reliability and observability.
Manage GPU resources in a Kubernetes environment, including device plugins, feature discovery, and scheduling strategies.
Take charge of the entire lifecycle of GPUs, ensuring that driver, firmware, and runtime updates are implemented safely and consistently.
Implement virtualization strategies for GPU resources, such as vGPU and PCIe passthrough, while defining policies for resource placement, isolation, and preemption.
Establish secure traffic and networking protocols, including gateways, service mesh, and authentication/authorization measures.
Enhance observability and operational efficiency through monitoring tools for GPUs, response protocols for incidents, and optimization of costs.
Develop reusable templates, integrate SDKs and CLIs, and implement infrastructure-as-code standards for the platform.
Influence the platform's direction by creating design documents, mentoring engineers, and aligning platform development with the needs of AI products.

JOB REQUIREMENTS

Requires a minimum of 15 years' experience in building and operating large-scale distributed systems or platform infrastructure, with a strong track record of shipping production services.
Proficiency in one or more of Python, Go, Java, or C++, with a deep understanding of concurrency, networking, and systems design.
Expertise in containers, orchestration, Kubernetes, cloud networking, storage, IAM, and infrastructure-as-code.
Practical experience with GPU platforms, including Kubernetes GPU operations, scheduling, isolation, preemption, and utilization tuning.
Background in virtualization, including deploying and operating vGPU, PCIe pass-through, and mediated devices in production.
Experience in Site Reliability Engineering (SRE) or equivalent, including SLOs, incident management, performance tuning, resource management, and financial oversight.
Strong security mindset, with experience in TLS/mTLS, RBAC, secrets, policy-as-code, and secure multi-tenant architectures.
In-depth experience in GPU operations, including MIG partitioning, MPS sharing, NUMA/topology awareness, DCGM telemetry, and GPUDirect RDMA/Storage.
Exposure to inference platforms, including serving runtimes, caching/batching, autoscaling patterns, and continuous delivery.
Exposure to agentic platforms, including workflow engines, tool orchestration, policy/guardrails for tool access, and data boundaries.
Experience in traffic/data plane technologies, such as gRPC, HTTP, Protobuf, service mesh, API gateways, CDN/caching, and global traffic management.
Proficiency with tools such as Terraform, Helm, GitOps, Prometheus, Grafana, OpenTelemetry, and policy engines; bare-metal provisioning experience is a plus.

WHAT'S ON OFFER

CONTACT

PEGASI – IT Recruitment Consultancy | Email: recruit@pegasi.com.vn | Tel: +84 28 3622 8666
We are PEGASI – IT Recruitment Consultancy in Vietnam. If you are looking for a new opportunity in your career, please visit our website at www.pegasi.com.vn. Thank you!

Job Summary

Company Type: Product
Technical Skills: DevOps, Backend, AI
Location: Ho Chi Minh, Ha Noi - Viet Nam
Working Policy: Onsite
Salary: Negotiable
Job ID: J01969
Status: Closed
