MLOps Engineer

ABOUT CLIENT

Our client is a leading research company specializing in technology innovation.

JOB DESCRIPTION

Develop and maintain training and inference pipelines using PyTorch, including DDP support, mixed precision, checkpointing, experiment versioning, and reproducible evaluation workflows.
Own and advance the inference serving infrastructure built on vLLM and SGLang, debugging inference-stack components such as tool call parsers and reasoning parsers, and optimizing for throughput and latency.
Create and sustain robust tooling in Python and C++ to aid the complete training lifecycle, from data ingestion to model release.
Optimize compute workloads for bare-metal environments, encompassing CPU/GPU utilization, memory bandwidth, and I/O throughput.
Address low-level networking issues, distributed training errors, and hardware bottlenecks across NCCL, MPI, and high-speed interconnects like InfiniBand and RoCE.
Set up and manage ML environments, covering containers, package management, GPU drivers, and runtime configurations.
Establish CI/CD patterns for AI workloads, encompassing training, evaluation, quantization, and model release workflows.
Integrate monitoring, alerting, anomaly detection, and incident response for both training jobs and inference services.
Contribute to shared platform capabilities across reliability, observability, and cost management.
Develop and maintain scalable runtime infrastructure for model-backed services and APIs, including support for LLM-backed APIs, MCP servers, and agentic systems.

JOB REQUIREMENTS

Proficiency in PyTorch internals, including DDP, FSDP, mixed precision training, TorchScript, and torch.compile.
Strong programming skills in Python and C++, with the ability to understand and modify unfamiliar codebases.
Solid understanding of computer science fundamentals, including data structures, concurrency, operating systems, and memory management.
Practical experience with vLLM and SGLang for production inference serving, including serving models quantized to formats such as FP8, INT8, and NVFP4.
Experience with RLHF and PPO training pipelines, including frameworks like veRL and TRL, and integration of reward models.
Solid understanding of distributed training setups, networking, and interconnects including NCCL, MPI, InfiniBand, and RoCE.
Experience in debugging and optimizing bare-metal Linux servers, including kernel parameters, NUMA topology, and GPU driver configuration.
Familiarity with job schedulers such as Airflow and experience in operating production-grade distributed infrastructure.
Strong understanding of containerized and cloud-native environments using Docker and Kubernetes.
Familiarity with ML compiler stacks such as LLVM, MLIR, TensorRT, or XLA.
Knowledge of model quantization techniques and deployment optimization, including GPTQ, AWQ, and bitsandbytes.
Contributions to open source ML projects, including PyTorch, vLLM, SGLang, or related inference and training tooling.
Experience with infrastructure-as-code tools such as Ansible, Terraform, or Nix for reproducible cluster setup.
Experience with custom or on-premise deployments, local clusters, or edge inference.
Familiarity with observability stacks like Prometheus, Grafana, or OpenTelemetry applied to training and inference workloads.
Experience building infrastructure for agentic systems including secure tool access, orchestration, and isolation boundaries.
Passion for clean, well-documented code and detail-oriented engineering.

WHAT'S ON OFFER

Work remotely in an environment that promotes open-source collaboration.
Enjoy 14 days of leave and unlimited sick days.
Access to GPUs, AI credits, opportunities for fast career progression, and other perks.

CONTACT

PEGASI – IT Recruitment Consultancy | Email: recruit@pegasi.com.vn | Tel: +84 28 3622 8666
We are PEGASI – IT Recruitment Consultancy in Vietnam. If you are looking for a new opportunity in your career, please visit our website at www.pegasi.com.vn. Thank you!

Job Summary

Company Type: Product
Technical Skills: Machine Learning, DevOps
Location: Ho Chi Minh - Viet Nam
Working Policy: Onsite, Remote
Salary: Negotiation
Job ID: J01855
Status: Closed

Related Job:

Software Architect

Ho Chi Minh - Viet Nam


Outsource

  • Azure
  • .NET

Responsible for creating and overseeing integration architectures on Azure.
Converting business requirements into integration patterns, data flows, error handling, monitoring, and resiliency models.
Defining and directing the usage of Azure Integration Services, covering Logic Apps, Functions, API Management, Service Bus, and Event Hubs.
Leading integration platform design following Infrastructure as Code principles and cloud landing zone considerations.
Ensuring secure integration architectures utilizing OAuth, OIDC, and API security best practices.
Guiding development teams through architecture reviews, best practices, and reference implementations using C# and .NET.
Providing support for SAP integrations, including SAP S/4HANA, SAP PI/PO, and SAP BTP Integration Suite.
Contributing to integration platform modernization and legacy transformation initiatives, such as BizTalk migrations.
Collaborating with stakeholders, vendors, and delivery teams to ensure alignment and drive technical decisions.

Negotiation


Software Engineer

Ho Chi Minh - Viet Nam


Outsource

  • Azure
  • .NET

Creating API-based and event-driven integration solutions.
Developing integration solutions following Azure best practices and cloud-native patterns.
Constructing integrations using Azure Integration Services like Logic Apps, Functions, API Management, Service Bus, and Event Hubs.
Installing and managing SAP integrations, such as SAP S/4HANA, SAP PI/PO, or SAP BTP Integration Suite.
Building and maintaining integrations using C# and the .NET ecosystem.
Utilizing Infrastructure as Code practices with tools like Terraform.
Ensuring secure authentication, authorization, and API security utilizing OAuth and best practices.
Working with architects, developers, and clients to devise end-to-end integration solutions.
Assisting in deployments, monitoring, and continuous improvement of integration platforms, ensuring reliability and observability in production environments.

Negotiation


Senior .NET Engineer

Ho Chi Minh - Viet Nam


Product

  • .NET

Take charge of complex workflows: Collaborate with stakeholders to implement and integrate end-to-end processes, from claim intake to booking, stay, and payment platform.
Develop scalable, distributed systems: Build resilient backend services using .NET, with a focus on microservices and ensuring high system reliability.
Work on integration-heavy systems: Connect with external insurance and accommodation providers as well as internal systems using APIs and messaging patterns.
Ensure system quality and reliability: Write unit and integration tests, troubleshoot production issues, and maintain high standards for performance and stability.
Contribute to ongoing improvement: Refine and optimize existing systems, enhance architecture, and embrace best practices in software design.
Collaborate in a cross-functional environment: Partner with Dev, PM, and QA engineers to deliver high-quality solutions.
Drive technical documentation: Maintain clear and structured documentation to support system evolution and onboarding.

Negotiation
