Lead Site Reliability Engineer

JOB DESCRIPTION

Leadership & Mentorship: Lead a team of SREs, providing technical guidance and coaching, and fostering a culture of reliability and continuous improvement.
SRE Practices: Define and mature SRE practices, including SLIs/SLOs, error budgets, and incident response processes across production systems.
Architecture & Automation: Own the design and evolution of automated cloud operations, driving adoption of Infrastructure-as-Code (Terraform, CloudFormation) and CI/CD pipelines.
Incident Management: Lead major incident responses, ensuring rapid resolution, root cause analysis, and implementation of preventive measures.
Collaboration: Work closely with Development, DevOps, and Cloud Engineering teams to ensure reliability and resilience are built into every stage of delivery.
Operational Excellence: Establish and track key reliability metrics (availability, latency, error rates) and drive initiatives to continuously improve them.
Innovation & Tooling: Evaluate and implement AWS-native and third-party tools to improve monitoring, alerting, and automation.
Stakeholder Engagement: Act as the primary point of contact with clients for service reliability topics, ensuring transparency and alignment on reliability goals.
Governance: Ensure compliance with industry standards and internal policies around security, audit, and operational risk.

JOB REQUIREMENTS

Minimum 7 years of experience as a Site Reliability Engineer; exposure to data platform solutions is an advantage.
Extensive experience with AWS, including IAM, ECS, EKS, Lambda, and CloudWatch.
Expertise in deploying and managing containerized services, especially on Kubernetes.
Hands-on experience with Infrastructure-as-Code and automation tools like Terraform or Scalr.
Strong knowledge of cloud architecture, with a focus on maintaining service SLAs and ensuring high availability.
Experience with cloud security practices, SSO solutions (e.g., Auth0), and authentication protocols (e.g., SAML, OIDC, OAuth).
Familiarity with deploying and maintaining data processing frameworks and ML platforms such as Airflow, Airbyte, Superset, Metabase, Databricks, Snowflake, and MLflow is a strong advantage.
Nice-to-Have Skills
Certifications such as AWS Certified DevOps Engineer - Professional or AWS Solutions Architect - Professional.
Experience in financial services or other highly regulated industries.
Knowledge of advanced security practices and compliance frameworks (PCI-DSS, ISO 27001, SOC2).
Experience designing multi-region/multi-AZ architectures for high availability and disaster recovery.

WHAT'S ON OFFER

Competitive Compensation
Benefits package including comprehensive medical, dental, and vision coverage, among others
Company Culture based on our Core Values
Professional Development Training with Individual Development Plans to map out your career growth
Opportunity to work in a global environment with diverse teams of colleagues from around the world
Opportunity to work with technology leaders in the financial services industry
Opportunity to work for big-name clients in capital markets, banking, and other industries

CONTACT

PEGASI – IT Recruitment Consultancy | Email: recruit@pegasi.com.vn | Tel: +84 28 3622 8666
We are PEGASI – IT Recruitment Consultancy in Vietnam. If you are looking for a new opportunity in your career, please visit our website, www.pegasi.com.vn. Thank you!

Job Summary

Company Type: Information Technology & Services
Technical Skills: System, DevOps
Location: Ho Chi Minh, Ha Noi - Viet Nam
Salary: Negotiable
Job ID: J00771
Status: Active
