MLOps Engineer (PyTorch)

ABOUT CLIENT

Our client is a leading research company specializing in technology innovation.

JOB DESCRIPTION

Develop and manage training and inference pipelines using PyTorch
Create and maintain efficient tooling in Python and C++ to support the training process
Take ownership of the training codebase to ensure clarity, modularity, reproducibility, and performance
Plan and establish workflows for checkpointing, resuming, versioning, and tracking experiments
Streamline compute workloads for bare-metal environments including I/O, CPU/GPU utilization, and memory optimization
Address low-level networking issues, distributed training errors, and hardware bottlenecks
Establish and oversee ML environments including containers, package management, drivers, and runtime configs
Monitor and troubleshoot training jobs across multiple nodes and GPUs
Develop enduring systems designed for scalability, maintainability, and long-term usage

JOB REQUIREMENTS

Strong proficiency in PyTorch, including DDP, mixed precision, and TorchScript
Proficient programming skills in C++ and Python
Sound understanding of computer science fundamentals including data structures, concurrency, and operating systems
Ability to debug and optimize bare-metal servers (Linux, kernel parameters, BIOS tuning)
Proficient understanding of networking, interconnects, and distributed training setups, including NCCL and MPI
Proven track record of creating dependable and reproducible pipelines for training and evaluation
Knowledge of job schedulers such as SLURM and custom batch runners, as well as monitoring tools
Experience with custom deployments (non-cloud environments, local clusters, edge devices)
Contributions to PyTorch or open-source ML tooling
Familiarity with infrastructure-as-code tools like Ansible, Terraform, and Nix
Experience in setting up logging, observability, and alerting for training runs
A strong dedication to writing clean code and meticulous engineering practices

WHAT'S ON OFFER

Work remotely in an environment that promotes open-source collaboration
Enjoy 14 days of leave and unlimited sick days
Get access to GPUs, AI credits, opportunities for fast career progression, and other perks

CONTACT

PEGASI – IT Recruitment Consultancy | Email: recruit@pegasi.com.vn | Tel: +84 28 3622 8666
We are PEGASI – IT Recruitment Consultancy in Vietnam. If you are looking for a new career opportunity, please visit our website at www.pegasi.com.vn. Thank you!

Job Summary

Company Type: Product
Technical Skills: Machine Learning, DevOps
Location: Others - Singapore
Salary: Negotiable
Job ID: J01855
Status: Active
