DevOps Engineer

ABOUT CLIENT

Our client is a leading research company specializing in technology innovation.

JOB DESCRIPTION

Managing and developing our Kubernetes platform across multiple clusters and environments including production, development, on-premises and public cloud.

Designing and overseeing hybrid cloud infrastructure across on-premises and public clouds (such as GCP and AWS), including workload placement, cross-cloud networking, and unified resource management.

Taking responsibility for the end-to-end CI/CD and GitOps process, including container build pipelines, image optimization, and progressive delivery using tools like ArgoCD/FluxCD.

Taking charge of the observability stack to provide a comprehensive view across all clusters using tools like Grafana, Mimir, Tempo, Loki, Pyroscope, OnCall, and Prometheus, and supporting agent-assisted SRE workflows.

Managing and enhancing our inference platform, including vLLM serving and AIBrix for multi-model orchestration and autoscaling with a fleet of NVIDIA GPUs.

Operating platform services such as Kafka, Redis, PostgreSQL, and OpenSearch.

Managing identity and access management with Keycloak integrated with Google Workspace, strengthening SSO, RBAC, and secrets management across the platform.

Strengthening network security across private load balancers, firewalls, and VPC segmentation, and designing and maintaining hub-and-spoke/multi-AZ topologies.

Supporting training infrastructure with self-service VM provisioning, RunPod burst capacity, and Weights and Biases integration.

Driving infrastructure reliability, cost efficiency, and capacity planning as the platform scales.

JOB REQUIREMENTS

Strong production experience with Kubernetes, including workloads and controllers, networking, storage, RBAC, and autoscaling. Familiarity with both cloud-managed and on-premises/self-managed Kubernetes is a plus.

Design-level networking experience, including the ability to defend tradeoffs in real network topologies such as hub-and-spoke, multi-AZ/multi-VPC, and equivalent enterprise patterns. Comfortable with VPCs, firewalls, load balancers, private cluster architecture, DNS, and routing. On-premises networking experience is a strong plus.

Proficiency in building and optimizing Dockerfiles and owning full CI/CD pipelines. Experience with CI/CD changes when deploying to Kubernetes is a bonus.

Previous experience in setting up and operating a full observability stack in production, including metrics, logs, traces, and alerting. Familiarity with the Grafana stack is a strong plus.

Comfort with SSO and identity, including integrating tools with a central IdP.

Strong Linux proficiency, infrastructure-as-code, and configuration management skills.

An ownership mindset and familiarity with operating in high-ownership environments.

Hands-on experience with Kafka, Redis, PostgreSQL, or OpenSearch at production scale is optional but valuable.

Bonus points for experience with OpenStack and KVM virtualization, familiarity with vLLM internals, a background in AI/ML infrastructure or GPU cluster operations, experience with KEDA or event-driven autoscaling patterns, prior open-source contributions, and kernel-level Linux debugging and performance tuning.

WHAT'S ON OFFER

Join a renowned research team to work on impactful projects

Take ownership of the core training code infrastructure used by the team

Engage with real models, data, and scale, rather than small-scale problems

Contribute to bridging the gap between research velocity and engineering quality

Enjoy a flexible work environment with a culture that values depth, clarity, and curiosity

CONTACT

PEGASI – IT Recruitment Consultancy | Email: recruit@pegasi.com.vn | Tel: +84 28 3622 8666

We are PEGASI – IT Recruitment Consultancy in Vietnam. If you are looking for a new opportunity in your career, kindly visit our website www.pegasi.com.vn. Thank you!

Job Summary

Company Type: Product

Technical Skills: DevOps, Kubernetes, Network

Location: Others - Singapore

Working Policy: Hybrid, Onsite

Job ID: J02107
