DevOps Engineer

JOB DESCRIPTION

Operate and evolve our Kubernetes platform across multiple clusters and environments (Prod, Dev, hybrid on-prem and public cloud), covering control plane operations, node lifecycle, upgrades, and autoscaling at every layer (Cluster Autoscaler, HPA, KEDA).
Architect and manage hybrid cloud infrastructure spanning on-premises and public clouds (GCP, AWS), including workload placement, cross-cloud networking, and unified resource management.
Own the CI/CD and GitOps experience end-to-end: container build pipelines, image optimization, and progressive delivery via ArgoCD / FluxCD.
Own the observability stack as a single pane of glass across all clusters: Grafana, Mimir, Tempo, Loki, Pyroscope, OnCall, Prometheus -- and help push toward agent-assisted SRE workflows.
Manage and improve our inference platform: vLLM serving and AIBrix for multi-model orchestration and autoscaling across a fleet of NVIDIA GPUs.
Operate platform services: Kafka, Redis, PostgreSQL, OpenSearch.
Manage identity and access via Keycloak integrated with Google Workspace; harden SSO, RBAC, and secrets management across the platform.
Harden network security across private load balancers, firewalls, and VPC segmentation; design and maintain hub-and-spoke / multi-AZ topologies.
Support training infrastructure: self-service VM provisioning, RunPod burst capacity, Weights & Biases integration.
Drive infrastructure reliability, cost efficiency, and capacity planning as the platform scales.

JOB REQUIREMENTS

Kubernetes -- deep, hands-on. Strong production experience with Kubernetes, fluent in workloads and controllers, networking (Services, Ingress, CNI basics), storage (PV/PVC, CSI), RBAC, and the autoscaling story end-to-end (HPA, VPA, Cluster Autoscaler, KEDA). Cloud-managed Kubernetes (GKE, EKS, AKS) is fine; on-premises / self-managed Kubernetes (kubeadm, Cluster API, k3s, etc.) is a strong plus.
Networking -- design-level, not just operator-level. You have designed real network topologies at some point in your career -- hub-and-spoke, multi-AZ / multi-VPC, or an equivalent enterprise pattern -- and can defend the tradeoffs. Comfortable with VPCs, firewalls, load balancers, private cluster architecture, DNS, and routing. On-premises networking experience (VLANs, BGP, L2/L3 fabrics, pfSense / Fortinet / Palo Alto / Cisco) is a strong plus.
CI/CD and Docker -- concepts over tooling. You can build and optimize Dockerfiles (multi-stage builds, layer caching, small/secure base images) and have owned full CI/CD pipelines end-to-end. Tooling is flexible -- GitHub Actions, GitLab CI, Azure Pipelines, Jenkins, Argo Workflows, etc. -- but you should be able to clearly articulate the full lifecycle of a typical pipeline, and explain how CI/CD changes when the deployment target is Kubernetes (ArgoCD / FluxCD, GitOps patterns, progressive delivery).
Observability -- you have built this before. You have stood up a full observability stack from scratch and operated it in production -- metrics, logs, traces, alerting, on-call. Familiarity with the Grafana stack (Grafana, Mimir, Tempo, Loki, Pyroscope, OnCall, Prometheus) is a strong plus. Bonus points if you have experimented with agent-assisted SRE workflows or LLM-driven incident triage.
SSO and identity. When you bring a new tool into the platform, your instinct is to wire it into a central IdP rather than leave it on local accounts. Comfortable with OpenID Connect, SAML, and traditional directory services (LDAP / Active Directory), and you have integrated tools with an IdP like Keycloak, Okta, Azure AD, or equivalent.
Linux and automation fundamentals. Strong Linux proficiency (RHEL/Ubuntu or equivalent) including basic performance and networking debugging. Comfort with infrastructure-as-code (Terraform / Terragrunt / Pulumi or equivalent) and configuration management.
Ownership mindset. Comfortable operating in a high-ownership environment where you make architecture decisions, push them to production, and own the outcomes.
Optional but valuable: hands-on experience operating any of Kafka, Redis, PostgreSQL, OpenSearch -- at production scale, including HA, backup/restore, and upgrade planning.
Bonus points for:
Experience with OpenStack in production: Nova, Neutron, Cinder, Trove, Horizon, and CLI administration.
Experience with KVM virtualization and storage backends like Ceph or Rook-Ceph on Kubernetes.
Familiarity with vLLM internals: PagedAttention, continuous batching, tensor parallelism.
Background in AI/ML infrastructure or GPU cluster operations at scale.
Experience running KEDA or event-driven autoscaling patterns in production.
Prior open-source contributions to Kubernetes, OpenStack, or adjacent projects.
Kernel-level Linux debugging and performance tuning.

WHAT'S ON OFFER

Collaborate with a world-class research team on meaningful, high-impact projects
Own and shape the core training code infrastructure used daily by the team
Work on real models, real data, and real scale - not toy problems
Help bridge the gap between research velocity and engineering quality
Flexible work environment with a culture that values depth, clarity, and curiosity

CONTACT

PEGASI – IT Recruitment Consultancy | Email: recruit@pegasi.com.vn | Tel: +84 28 3622 8666
We are PEGASI – IT Recruitment Consultancy in Vietnam. If you are looking for a new opportunity in your career, please visit our website www.pegasi.com.vn. Thank you!

Job Summary

Company Type: Product
Technical Skills: DevOps, Kubernetes, Network
Location: Others - Viet Nam
Working Policy: Onsite
Salary: Negotiation
Job ID: J02107
Status: Active
