Principal Engineer, System Software Platform Engineering

ABOUT CLIENT

Our client is a leading technology company specializing in graphics processing units (GPUs) and artificial intelligence (AI).

JOB DESCRIPTION

Build and operate a multi-tenant AI platform that handles identity and policy management, quota configuration, and cost controls, and that gives teams easy paths for working on AI projects.
Oversee the deployment of AI models at scale, including routing, autoscaling, and implementing safety measures to ensure reliability and observability.
Manage GPU resources in a Kubernetes environment, including device plugins, feature discovery, and scheduling strategies.
Take charge of the entire lifecycle of GPUs, ensuring that driver, firmware, and runtime updates are implemented safely and consistently.
Implement virtualization strategies for GPU resources, such as vGPU and PCIe passthrough, while defining policies for resource placement, isolation, and preemption.
Establish secure traffic and networking protocols, including gateways, service mesh, and authentication/authorization measures.
Enhance observability and operational efficiency through monitoring tools for GPUs, response protocols for incidents, and optimization of costs.
Develop reusable templates, integrate SDKs and CLIs, and implement infrastructure-as-code standards for the platform.
Influence the platform's direction by creating design documents, mentoring engineers, and aligning platform development with the needs of AI products.

JOB REQUIREMENTS

Requires a minimum of 15 years' experience in building and operating large-scale distributed systems or platform infrastructure, with a strong track record of shipping production services.
Proficiency in one or more of Python, Go, Java, or C++, with a deep understanding of concurrency, networking, and systems design.
Expertise in containers, orchestration, Kubernetes, cloud networking, storage, IAM, and infrastructure-as-code.
Practical experience with GPU platforms, including Kubernetes GPU operations, scheduling, isolation, preemption, and utilization tuning.
Background in virtualization, including deploying and operating vGPU, PCIe pass-through, and mediated devices in production.
Experience in Site Reliability Engineering (SRE) or equivalent, including SLOs, incident management, performance tuning, resource management, and financial oversight.
Strong security mindset, with experience in TLS/mTLS, RBAC, secrets, policy-as-code, and secure multi-tenant architectures.
In-depth experience in GPU operations, including MIG partitioning, MPS sharing, NUMA/topology awareness, DCGM telemetry, and GPUDirect RDMA/Storage.
Exposure to inference platforms, including serving runtimes, caching/batching, autoscaling patterns, and continuous delivery.
Exposure to agentic platforms, including workflow engines, tool orchestration, policy/guardrails for tool access, and data boundaries.
Experience in traffic/data plane technologies, such as gRPC, HTTP, Protobuf, service mesh, API gateways, CDN/caching, and global traffic management.
Proficiency with tools such as Terraform, Helm, GitOps, Prometheus, Grafana, OpenTelemetry, and policy engines; bare-metal provisioning experience is a plus.

CONTACT

PEGASI – IT Recruitment Consultancy | Email: recruit@pegasi.com.vn | Tel: +84 28 3622 8666
We are PEGASI – IT Recruitment Consultancy in Vietnam. If you are looking for a new opportunity in your career, please visit our website at www.pegasi.com.vn. Thank you!

Job Summary

Company Type: Product
Technical Skills: DevOps, Backend, AI
Location: Ho Chi Minh, Ha Noi - Viet Nam
Working Policy: Onsite
Job ID: J01969
Status: Closed
