Principal Engineer, System Software Platform Engineering

JOB DESCRIPTION

Build and operate the platform for AI: multi-tenant services, identity/policy, configuration, quotas, cost controls, and paved paths for teams.
Lead inference platforms at scale: model-serving routing, autoscaling, rollout safety (canary/A-B), reliability, and end-to-end observability.
Operate GPUs in Kubernetes: manage Our Client's device plugins, GPU Feature Discovery, time-slicing, MPS, and MIG partitioning; implement topology-aware scheduling and bin-packing.
Lead GPU lifecycle: driver/firmware/runtime (CUDA, cuDNN, NCCL) updates via GPU Operator; ensure kernel/RHEL/Ubuntu compatibility and safe rollouts.
Enable virtualization strategies: vGPU (e.g., on vSphere/KVM), PCIe passthrough, mediated devices, and pool-based GPU sharing; define placement, isolation, and preemption policies.
Build secure traffic and networking: API gateways, service mesh, rate limiting, authN/authZ, multi-region routing, and DR/failover.
Improve observability and operations: metrics, tracing, and logging for GPUs (via DCGM), plus runbooks, incident response, performance tuning, and cost optimization.
Establish platform blueprints: reusable templates, SDKs/CLIs, golden CI/CD pipelines, and infrastructure-as-code standards.
Lead through influence: write design docs, conduct reviews, mentor engineers, and shape platform roadmaps aligned to AI product needs.

JOB REQUIREMENTS

15+ years building/operating large-scale distributed systems or platform infrastructure; strong record of shipping production services.
Proficiency in one or more of Python/Go/Java/C++; deep understanding of concurrency, networking, and systems design.
Containers/orchestration/Kubernetes expertise, cloud networking/storage/IAM, and infrastructure-as-code.
Practical GPU platform experience: Kubernetes GPU operations (device plugin, GPU Operator, feature discovery), scheduling/bin-packing, isolation, preemption, utilization tuning.
Virtualization background: deploying and operating vGPU, PCIe passthrough, and/or mediated devices in production.
SRE or equivalent experience: SLOs/error budgets, incident management, performance tuning, resource management, and financial oversight.
Security-first mentality: TLS/mTLS, RBAC, secrets, policy-as-code, and secure multi-tenant architectures.
Ways to stand out from the crowd:
Deep GPU ops: MIG partitioning, MPS sharing, NUMA/topology awareness, DCGM telemetry, GPUDirect RDMA/Storage.
Inference platform exposure: serving runtimes, caching/batching, autoscaling patterns, continuous delivery (agnostic to specific stacks).
Agentic platform exposure: workflow engines, tool orchestration, policy/guardrails for tool access and data boundaries.
Traffic/data plane: gRPC/HTTP/Protobuf performance, service mesh, API gateways, CDN/caching, global traffic management.
Tooling: Terraform/Helm/GitOps, Prometheus/Grafana/OpenTelemetry, policy engines; bare-metal provisioning experience is a plus.

CONTACT

PEGASI – IT Recruitment Consultancy | Email: recruit@pegasi.com.vn | Tel: +84 28 3622 8666
We are PEGASI, an IT recruitment consultancy in Vietnam. If you are looking for a new opportunity in your career, please visit our website at www.pegasi.com.vn. Thank you!

JOB SUMMARY

Company Type: Computer Hardware
Technical Skills: DevOps, Backend, AI
Location: Ho Chi Minh, Ha Noi - Viet Nam
Working Policy: Onsite
Salary: Negotiable
Job ID: J01969
Status: Active
