AI Agent Ops Engineer

JOB DESCRIPTION

Agent Engineering & Operations
Design, build, and maintain production-grade AI agent systems, including context engineering and instruction architecture; prompt hardening and safe execution boundaries; tool integrations and multi-step orchestration; and memory strategies and reliability patterns.
Own the full agent lifecycle: prototype → evaluate → deploy → monitor → iterate.
Build and maintain an evaluation pipeline to measure agent quality, catch regressions, and enforce deployment gates (golden datasets, scenario suites, automated checks).
Instrument agents and agent platforms for production observability: structured logging, tracing, and metrics; latency and cost monitoring; tool-call success rates and failure analysis.
Define operational readiness standards, including rollback criteria, incident response playbooks, and recovery paths for common failure modes.
Team Enablement & Coaching
Embed with product engineering teams to identify high-value use cases ready for agent automation. You will operate in a central Agent Ops role, enabling AI product builders through AI enablers.
Translate business workflows into agent-executable tasks with clear contract boundaries/interfaces, assumptions and inputs/outputs, and failure modes and safe fallbacks.
Deliver targeted coaching to engineers on: context engineering best practices, harness design and regression testing patterns, agent skill design and tool-contract discipline.
Reduce onboarding time for teams adopting AI capabilities, from first conversation to a production-ready agent.
Train product engineers to extend and maintain agent skills independently.
Standards & Knowledge Operations
Author and maintain org-level standards for agents, including: naming conventions, context file structures and ownership rules, skill interface contracts (inputs/outputs, invariants, error handling), evaluation criteria and release quality bars.
Establish and enforce "repo-as-discipline" practices so agent knowledge is: versioned, reviewable, discoverable, reusable; not trapped in prompt snippets or individual heads.
Build and grow a shared agent skills library that teams can reuse and extend.
Track and aggregate AI tooling/framework updates and external best practices, serving as a central intake so product teams don't each have to follow the entire AI landscape.
Run internal knowledge-sharing sessions, showcases, and retrospectives to propagate learnings efficiently.

JOB REQUIREMENTS

12+ years of industry experience, including hands-on experience building and deploying production AI agents using modern frameworks (LangGraph, LangChain, OpenAI Agents SDK, trueAI, or equivalent).
Strong understanding of context engineering, including instruction architecture, token management, caching strategies, and latency-aware design.
Experience building evaluation pipelines: golden datasets and scenario libraries; automated quality gates and regression detection.
Familiarity with agent observability: tracing, structured logging, latency, and cost monitoring; tool-call reliability metrics and failure analysis.
Ability to design guardrails: output validation; prompt injection mitigation; safe execution boundaries for tools/actions.
Solid backend engineering skills; comfortable owning services/APIs end-to-end.
Strong communicator who can coach engineers, facilitate cross-team discussions, and write clear technical documentation.
Experience with production reliability and platform operations, including: event-driven architectures (Kafka and/or message queues); retries/backoff, DLQs, idempotency, ordering, backpressure; CDC/outbox-style patterns (or similar asynchronous reliability patterns); Kubernetes-based deployment and day-2 operations; CI/CD pipelines and infrastructure as code; on-call, incident response, postmortems, and SRE-style practices (SLOs/SLIs, runbooks).
Nice to have
Experience with RAG systems: ingestion, chunking, embeddings, hybrid search, retrieval evaluation.
Familiarity with MCP / Model Context Protocol or similar agent tooling standards (e.g., "MPTV"), and tool integration ecosystems.
Proficiency across Java/Kotlin (Spring Boot) and Python in production environments.
Who thrives in this role?
Engineers with an SRE/DevOps background pivoting into AI who naturally think about reliability, observability, and incident response.
Backend engineers with hands-on LLM/agent framework experience who want to work cross-functionally and enable multiple teams.
MLOps/LLM engineers who want to embed in product orgs and ship applied systems (not only model infrastructure).
Engineers who treat documentation, standards, and knowledge transfer as first-class engineering outputs.

CONTACT

PEGASI – IT Recruitment Consultancy | Email: recruit@pegasi.com.vn | Tel: +84 28 3622 8666
We are PEGASI – IT Recruitment Consultancy in Vietnam. If you are looking for a new opportunity in your career, please visit our website, www.pegasi.com.vn. Thank you!

Job Summary

Company Type: Product
Technical Skills: AI
Location: Ho Chi Minh - Viet Nam
Working Policy: Hybrid
Job ID: J02192
Status: Active
