MLOps Engineer

ABOUT CLIENT

Our client is a leading research company specializing in technology innovation.

JOB DESCRIPTION

Develop and maintain training and inference pipelines using PyTorch, including DDP support, mixed precision, checkpointing, experiment versioning, and reproducible evaluation workflows.
Own and advance inference serving infrastructure built on vLLM and SGLang, with a focus on debugging inference-stack components such as tool call parsers and reasoning parsers, and on optimizing throughput and latency.
Create and sustain robust tooling in Python and C++ to aid the complete training lifecycle, from data ingestion to model release.
Optimize compute workloads for bare-metal environments, encompassing CPU/GPU utilization, memory bandwidth, and I/O throughput.
Address low-level networking issues, distributed training errors, and hardware bottlenecks across NCCL, MPI, and high-speed interconnects like InfiniBand and RoCE.
Set up and manage ML environments, covering containers, package management, GPU drivers, and runtime configurations.
Establish CI/CD patterns for AI workloads, encompassing training, evaluation, quantization, and model release workflows.
Integrate monitoring, alerting, anomaly detection, and incident response for both training jobs and inference services.
Contribute to shared platform capabilities across reliability, observability, and cost management.
Develop and maintain scalable runtime infrastructure for model-backed services and APIs, including support for LLM-backed APIs, MCP servers, and agentic systems.

JOB REQUIREMENTS

Proficiency in PyTorch internals, including DDP, FSDP, mixed precision training, TorchScript, and torch.compile.
Strong programming skills in Python and C++, with the ability to understand and modify unfamiliar codebases.
Solid understanding of computer science basics including data structures, concurrency, operating systems, and memory management.
Practical experience with vLLM and SGLang for production inference serving, including serving models quantized to formats such as FP8, INT8, and NVFP4.
Experience with RLHF and PPO training pipelines, including frameworks like veRL and TRL, and integration of reward models.
Solid understanding of distributed training setups, networking, and interconnects including NCCL, MPI, InfiniBand, and RoCE.
Experience in debugging and optimizing bare-metal Linux servers, including kernel parameters, NUMA topology, and GPU driver configuration.
Familiarity with job schedulers such as Airflow and experience in operating production-grade distributed infrastructure.
Strong understanding of containerized and cloud-native environments using Docker and Kubernetes.
Familiarity with ML compiler stacks such as LLVM, MLIR, TensorRT, or XLA.
Knowledge of model quantization techniques and deployment optimization, including GPTQ, AWQ, and bitsandbytes.
Contributions to open source ML projects, including PyTorch, vLLM, SGLang, or related inference and training tooling.
Experience with infrastructure-as-code tools such as Ansible, Terraform, or Nix for reproducible cluster setup.
Experience with custom or on-premise deployments, local clusters, or edge inference.
Familiarity with observability stacks like Prometheus, Grafana, or OpenTelemetry applied to training and inference workloads.
Experience building infrastructure for agentic systems including secure tool access, orchestration, and isolation boundaries.
Passion for clean, well-documented code and detail-oriented engineering.

WHAT'S ON OFFER

Work remotely in an environment that promotes open-source collaboration.
Enjoy 14 days of leave and unlimited sick days.
Access to GPUs, AI credits, opportunities for fast career progression, and other perks.

CONTACT

PEGASI – IT Recruitment Consultancy | Email: recruit@pegasi.com.vn | Tel: +84 28 3622 8666
We are PEGASI – IT Recruitment Consultancy in Vietnam. If you are looking for a new opportunity in your career, please visit our website at www.pegasi.com.vn. Thank you!

Job Summary

Company Type: Product

Technical Skills: Machine Learning, DevOps

Location: Ho Chi Minh - Viet Nam

Working Policy: Onsite, Remote

Job ID: J01855

Status: Closed

Related Jobs:

Senior System Engineer

Ha Noi - Viet Nam


Outsource

  • System

Deployment & Integration: Install, configure, and integrate servers, storage arrays, backup solutions, and monitoring systems according to project requirements. Solution Design: Propose technical configurations that fit customer needs, ensuring efficiency and scalability. DC/DR & Virtualization Solutions: Deploy data center (DC/DR), virtualization, clustering, and replication solutions. Cross-Department Coordination: Work closely with the Network, Security, and Application teams to ensure system integration and compatibility. Training & Handover: Deliver training and technology transfer to customers. Administration & Optimization: Troubleshoot incidents, advise on upgrades, and optimize systems. Research & Technology Updates: Continuously track and research new technologies, and take part in evaluating new solutions for the company.

Negotiation


Senior Data Engineer

Ha Noi - Viet Nam


Outsource

  • Data Engineering

Data Project Delivery: Take part in delivering Data Platform, Data Warehouse, Data Lakehouse, AI Platform… projects, mainly in the finance and banking sector. Design & Operations: Design, install, configure, and operate all software and services in the data platform (database, data ingestion, data governance, orchestration, query engine…). Customer & Partner Support: Develop and optimize data pipelines and reports; support data migration from legacy systems to the new platform. Technical Point of Contact: Work directly with vendors and partners on technical discussions. Deployment Documentation: Create and manage documentation related to the deployment process. Solution Presentation: Take part in presenting and introducing data solutions to customers. Research & Development: Conduct in-depth research into data technologies and solutions in line with the company's direction. Other Duties: Perform other tasks as required.

Negotiation


Software Engineer (Node.js)

Ho Chi Minh - Viet Nam


Product

  • NodeJS
  • AWS

Take charge of designing and creating system architectures, implementing coding standards, and building cloud-native solutions. Develop and optimize high-quality Node.js code, address software integration challenges, and enhance system performance. Supervise the testing, deployment, and thorough documentation of integrated systems. Provide guidance and support to junior engineers, collaborate with cross-functional teams, and ensure that solutions align with business needs and international standards. Actively participate in every phase of Agile software development, such as generating user stories and conducting sprint planning. Interact with a diverse range of companies, with the flexibility to occasionally shift working hours to accommodate global time zones.

Negotiation
