Principal Engineer, System Software Platform Engineering

ABOUT CLIENT

Our client is a leading technology company specializing in graphics processing units (GPUs) and artificial intelligence (AI).

JOB DESCRIPTION

Create and manage a multi-tenant AI platform that handles identity and policy management, configures quotas, and controls costs. The platform should also offer easy paths for teams to work on AI projects.
Oversee the deployment of AI models at scale, including routing, autoscaling, and implementing safety measures to ensure reliability and observability.
Manage GPU resources in a Kubernetes environment, including device plugins, feature discovery, and scheduling strategies, among other responsibilities.
Take charge of the entire lifecycle of GPUs, ensuring that driver, firmware, and runtime updates are implemented safely and consistently.
Implement virtualization strategies for GPU resources, such as vGPU and PCIe passthrough, while defining policies for resource placement, isolation, and preemption.
Establish secure traffic and networking protocols, including gateways, service mesh, and authentication/authorization measures.
Enhance observability and operational efficiency through monitoring tools for GPUs, response protocols for incidents, and optimization of costs.
Develop reusable templates, integrate SDKs and CLIs, and implement infrastructure-as-code standards for the platform.
Influence the platform's direction by creating design documents, mentoring engineers, and aligning platform development with the needs of AI products.

JOB REQUIREMENTS

Requires a minimum of 15 years' experience in building and operating large-scale distributed systems or platform infrastructure, with a strong track record of shipping production services.
Proficiency in one or more of Python, Go, Java, or C++, with a deep understanding of concurrency, networking, and systems design.
Expertise in containers, orchestration, Kubernetes, cloud networking, storage, IAM, and infrastructure-as-code.
Practical experience with GPU platforms, including Kubernetes GPU operations, scheduling, isolation, preemption, and utilization tuning.
Background in virtualization, including deploying and operating vGPU, PCIe passthrough, and mediated devices in production.
Experience in Site Reliability Engineering (SRE) or equivalent, including SLOs, incident management, performance tuning, resource management, and financial oversight.
Strong security mindset, with experience in TLS/mTLS, RBAC, secrets, policy-as-code, and secure multi-tenant architectures.
In-depth experience in GPU operations, including MIG partitioning, MPS sharing, NUMA/topology awareness, DCGM telemetry, and GPUDirect RDMA/Storage.
Exposure to inference platforms, including serving runtimes, caching/batching, autoscaling patterns, and continuous delivery.
Exposure to agentic platforms, including workflow engines, tool orchestration, policy/guardrails for tool access, and data boundaries.
Experience in traffic/data plane technologies, such as gRPC, HTTP, Protobuf, service mesh, API gateways, CDN/caching, and global traffic management.
Proficiency with tools such as Terraform, Helm, GitOps, Prometheus, Grafana, OpenTelemetry, and policy engines; bare-metal provisioning experience is a plus.

CONTACT

PEGASI – IT Recruitment Consultancy | Email: recruit@pegasi.com.vn | Tel: +84 28 3622 8666
We are PEGASI – IT Recruitment Consultancy in Vietnam. If you are looking for a new opportunity in your career, please visit our website at www.pegasi.com.vn. Thank you!

Job Summary

Company Type:

Product

Technical Skills:

DevOps, Backend, AI

Location:

Ho Chi Minh, Ha Noi - Viet Nam

Working Policy:

Onsite

Job ID:

J01969

Status:

Closed

Related Jobs:

AI Software Transformation Engineer (Distributed Computing)

Ho Chi Minh - Viet Nam


Product

  • Data Engineering
  • Backend
  • Spark
  • AI

Create an advanced AI-powered software transformation framework to speed up the modernization of complex analytical applications. Develop architectural patterns and transformation methodologies for converting outdated computational tools into scalable cloud-native solutions. Utilize AI agents, LLMs, and emerging AI engineering techniques to automate software analysis, code transformation, validation, and optimization. Work with distributed computing specialists to design target architectures that leverage Spark-based execution models for large-scale data processing. Lead technical investigations into restructuring, decomposing, or re-implementing existing software systems for efficient operation in distributed environments. Develop reusable transformation pipelines, automation tooling, and engineering frameworks for large-scale software modernization. Establish validation strategies and quality frameworks to ensure that transformed systems maintain functional correctness and reproducibility. Make architectural decisions regarding scalability, maintainability, performance, and long-term platform evolution. Collaborate with domain experts to understand application requirements and translate them into scalable technical solutions. Prototype and assess new AI-assisted engineering approaches to enhance transformation speed, engineering productivity, and software quality. Contribute to the organization's long-term strategy for AI-driven software modernization and engineering automation.

Negotiation


Senior Quality Engineer (Automation, Backend)

Ho Chi Minh - Viet Nam


Product

  • Automation Test

Lead test automation strategy and framework design for backend and cloud-based services. Drive end-to-end test automation initiatives using Cypress to ensure seamless user experiences. Perform thorough manual testing for complex workflows requiring deep attention to UX and usability details. Implement continuous integration and deployment test practices using tools such as GitHub Actions and Jenkins. Collaborate with developers and DevOps to enhance test reliability and coverage. Review code and advocate for QA best practices across teams. Identify quality risks early and actively seek solutions. Ensure release compliance through test result reporting.

Negotiation


Senior Quality Engineer (Automation, Full Stack)

Ho Chi Minh - Viet Nam


Product

  • Automation Test

Develop a test automation strategy and framework for backend and cloud-based services. Implement end-to-end (E2E) test automation initiatives using Cypress to ensure smooth user experiences. Perform thorough manual testing for complex workflows, focusing on UX and usability details. Write and manage frontend component and unit tests using Jest and React Testing Library. Create and execute API-level test suites, covering REST endpoints and validating request/response contracts and error handling. Verify data integrity from UI interactions through the API layer down to database state. Implement continuous integration and deployment test practices (e.g., GitHub Actions, Jenkins). Collaborate with developers and DevOps to enhance test reliability and coverage. Review code and advocate for QA best practices. Anticipate quality risks and drive proactive solutions. Ensure release compliance through test result reporting.

Negotiation
