SRE Lead/Manager (DevOps, AWS)

JOB DESCRIPTION

As a Support Site Reliability Engineer (SRE) leader, you will lead our efforts to establish a Support SRE team that works closely with The Company's Product SRE team to increase productivity. The ideal candidate will use leadership and technical skills to streamline the operational tasks that affect Product SRE team efficiency, collaborating with SRE teams located in Japan and Vietnam.
Design and execute the Support SRE team's strategic roadmap.
Collaborate with The Company's Product SRE teams to identify opportunities for improving operational efficiency and reducing toil.
Mentor and coach team members to foster their growth and development in technical and collaboration areas.
Drive a culture of continuous improvement and knowledge sharing within the team.
Design and implement automation solutions to standardize operational tasks, reducing manual effort and improving efficiency.
Develop and maintain tools, scripts and processes to automate routine operational tasks.
Build, maintain, and improve our infrastructure, including monitoring, diagnosing, and resolving incidents promptly.
Participate in incident response, on-call rotations, and post-mortem analysis.

JOB REQUIREMENT

At least 5 years of experience as a DevOps Engineer or in a similar role (experience with on-premises environments is a plus).
3+ years of hands-on experience with AWS or other cloud platforms. Experience with managed AWS services is a plus.
Solid understanding of CI/CD pipelines and best practices.
Working understanding of containerization technologies (Docker and Kubernetes).
Experience with monitoring and logging solutions.
Proficiency with IaC (e.g., Terraform).
Deep understanding and hands-on experience with MySQL or similar relational databases.
Proven track record in training and educating team members, promoting a culture of continuous learning.
Strong ownership and responsibility, with a proactive and solutions-oriented mindset.
Experience in developing and operating web applications built in Go or Ruby is a plus.
Project management experience.
English language proficiency at a professional working level.
People management or team leadership experience is a plus.

WHAT'S ON OFFER

Caring Mental & Physical Recreation:
Hybrid working: 2 days at the office and 3 days WFH
Working hours: flexible start between 8 AM and 9 AM, Monday to Friday
Full salary during probation
Insurance (applied from the probation period):
Social Insurance, Health Insurance, Unemployment Insurance (on 100% salary)
Private health insurance & accident insurance. From manager level: extra coverage for family members
Bonus: 13th month salary
17 - 24 paid days off and more
Paternity leave: Extra 5 days
Annual company trip; Quarterly team building
Billiards & Running club
Annual health check
Well-equipped facilities: MacBook Pro, additional monitor, etc.
Caring Career & Development:
Clear Career path
Sponsorship for foreign-language and international technology-related certifications
External & internal training courses
Soft-skill workshops
Tech seminars
Monthly and biannual Recognition Awards
Performance & salary review: twice a year (June & December)

CONTACT

PEGASI – IT Recruitment Consultancy | Email: recruit@pegasi.com.vn | Tel: +84 28 3622 8666
We are PEGASI – IT Recruitment Consultancy in Vietnam. If you are looking for a new opportunity in your career, kindly visit our website www.pegasi.com.vn for your reference. Thank you!

Job Summary

Company Type: Product
Technical Skills: Devops, AWS
Location: Ha Noi - Viet Nam
Working Policy:
Salary: Negotiation
Job ID: J01508
Status: Close

Related Job:

Engineering Manager - AI for RAN and 6G Wireless Systems

Ho Chi Minh, Ha Noi - Viet Nam


Product

  • Machine Learning
  • Management
  • AI

Manage and expand an engineering team focused on AI-enabled signal processing for the Radio Access Network (RAN). Supervise the development of deep learning models for various tasks related to RAN. Work with global teams to drive proof-of-concepts and production-quality AI-RAN components. Supervise the integration of AI models into full-stack simulations and/or testbeds using various frameworks. Align project priorities with hardware-software co-design constraints and deployment scenarios. Provide mentorship and guidance to team members, ensure technical excellence, and contribute to strategic direction.

Negotiation

Director Engineering – Software Engineering and AI Inferencing Platforms

Ho Chi Minh, Ha Noi - Viet Nam


Product

  • Management
  • Backend
  • Devops
  • Data Engineering
  • Cloud
  • AI

Lead and expand engineering teams in Vietnam across system software, data science, and AI platforms. Drive the creation, structure, and delivery of high-performance system software platforms that support AI products and services. Collaborate with global teams across Machine Learning, Inference Services, and Hardware/Software integration to guarantee performance, reliability, and scalability. Oversee the development and optimization of AI delivery platforms in Vietnam, including NIMs, Blueprints, and other flagship services. Collaborate with open-source and enterprise data and workflow ecosystems to advance accelerated AI factory, data science, and data engineering workloads. Promote continuous integration, continuous delivery, and engineering best practices across multi-site R&D Centers. Work with product management and other stakeholders to ensure enterprise readiness and customer impact. Establish and implement standard processes for large-scale, distributed system testing including stress, scale, failover, and resiliency testing. Ensure security and compliance testing aligns with industry standards for cloud and data center products. Mentor and develop talent within the organization, fostering a culture of quality and continuous improvement.

Negotiation

Principal Engineer, System Software Platform Engineering

Ho Chi Minh, Ha Noi - Viet Nam


Product

  • Devops
  • Backend
  • AI

Create and manage a platform for AI that provides services for multiple users, handles identity and policy management, configures quotas, and controls costs. Additionally, this platform should offer easy paths for teams to work on AI projects. Oversee the deployment of AI models at scale, including routing, autoscaling, and implementing safety measures to ensure reliability and observability. Manage GPU resources in a Kubernetes environment, including device plugins, feature discovery, and scheduling strategies, among other responsibilities. Take charge of the entire lifecycle of GPUs, ensuring that driver, firmware, and runtime updates are implemented safely and consistently. Implement virtualization strategies for GPU resources, such as vGPU and PCIe passthrough, while defining policies for resource placement, isolation, and preemptive actions. Establish secure traffic and networking protocols, including gateways, service mesh, and authentication/authorization measures. Enhance observability and operational efficiency through monitoring tools for GPUs, response protocols for incidents, and optimization of costs. Develop reusable templates, integrate SDKs and CLIs, and implement infrastructure-as-code standards for the platform. Influence the platform's direction by creating design documents, mentoring engineers, and aligning platform development with the needs of AI products.

Negotiation
