AI Agent Ops Engineer

ABOUT CLIENT

Our client is a large fintech company from Japan.

JOB DESCRIPTION

Design, build, and maintain production-grade AI agent systems, covering context engineering, instruction architecture, secure execution boundaries, tool integrations, multi-step orchestration, memory strategies, and reliability patterns.

Manage the complete agent lifecycle, from prototyping through deployment, monitoring, and iteration.

Develop and maintain an evaluation pipeline that measures agent quality, identifies regressions, and enforces deployment gates using golden datasets, scenario suites, and automated checks.

Instrument agents and agent platforms for production observability, including structured logging, tracing, metrics, latency and cost monitoring, and analysis of tool-call success rates and failures.

Establish operational readiness standards, including rollback criteria, incident response playbooks, and recovery paths for common failure modes.

Collaborate with product engineering teams to identify high-value use cases suitable for agent automation, operating in a Central Agent Ops role that acts as an AI enabler for AI product builders.

Translate business workflows into tasks executable by agents and provide coaching to engineers on context engineering best practices, harness design, regression testing patterns, agent skill design, and tool-contract discipline.

Streamline the onboarding process for teams adopting AI capabilities and train product engineers to independently extend and maintain agent skills.

Develop and maintain organizational standards for agents, including naming conventions, context file structures, skill interface contracts, evaluation criteria, and release quality benchmarks.

Establish and enforce "repo-as-discipline" practices to ensure that agent knowledge is versioned, reviewable, discoverable, and reusable.

Cultivate a shared agent skills library that teams can reuse and extend, and track AI tooling/framework updates and external best practices so that product teams have a central source of information.

Facilitate internal knowledge-sharing sessions, showcases, and retrospectives to efficiently propagate learnings.

JOB REQUIREMENTS

12+ years of industry experience building and deploying production AI agents using modern frameworks.

Proficient in context engineering, including instruction architecture, token management, caching strategies, and latency-aware design.

Experienced in developing evaluation pipelines, automated quality gates, and regression detection.

Familiar with agent observability, including tracing, structured logging, latency and cost monitoring, tool-call reliability metrics, and failure analysis.

Capable of designing guardrails, output validation, prompt injection mitigation, and safe execution boundaries for tools/actions.

Strong backend engineering skills with the ability to own services/APIs end-to-end.

Effective communicator with the ability to coach engineers, facilitate cross-team discussions, and write clear technical documentation.

Experience in production reliability and platform operations, including event-driven architectures, retries/backoff, DLQs, idempotency, ordering, backpressure, CDC/outbox-style patterns, Kubernetes-based deployment and day-2 operations, CI/CD pipelines, infrastructure as code, on-call, incident response, postmortems, and SRE-style practices.

Experience with RAG systems, ingestion, chunking, embeddings, hybrid search, and retrieval evaluation.

Familiarity with the Model Context Protocol (MCP) or similar agent tooling standards and tool integration ecosystems.

Proficiency across Java/Kotlin (Spring Boot) and Python in production environments.

Engineers with SRE/DevOps background transitioning into AI, who naturally think about reliability, observability, and incident response.

Backend engineers with hands-on LLM/agent framework experience, willing to work cross-functionally and enable multiple teams.

MLOps/LLM engineers interested in embedding in product organizations and shipping applied systems, not only model infrastructure.

Engineers who prioritize documentation, standards, and knowledge transfer as first-class engineering outputs.

WHAT'S ON OFFER

CONTACT

PEGASI – IT Recruitment Consultancy | Email: recruit@pegasi.com.vn | Tel: +84 28 3622 8666

We are PEGASI – IT Recruitment Consultancy in Vietnam. If you are looking for a new career opportunity, please visit our website, www.pegasi.com.vn. Thank you!

Job Summary

Company Type: Product
Technical Skills: AI, Backend
Location: Ho Chi Minh - Viet Nam
Working Policy: Hybrid
Job ID: J02192
Status: Active
