Site Reliability Engineer

JOB DESCRIPTION

Maintain systems and troubleshoot system issues.
Identifying bottleneck in various Java applications and implement performance improvements.
Identify and analyze user requirements.
Prioritize, assign, and execute tasks throughout the software development life cycle.
Develop, configure, and deploy tools for cloud-based systems and services.
Containerize new and legacy applications.
Maintain awareness of new and emerging technologies.
Support development and operations teams.
Enhance, modify or debug developer code as needed.

JOB REQUIREMENT

Must-have
Understanding of an object-orientated language, preferably the latest version of Java (with experience in Hibernate, Multi-thread, Spring Boot)
Experience in configuration, in Jenkins for CI/CD pipeline creation, automation scripts and Kubernetes implementation with Google.
Proficiency in supporting a 24×7 critical operation.
Experience in a cloud computing platform and associated automation patterns it provides, preferably GCP.
Proficient in production systems design including High Availability, Disaster Recovery, Performance, Efficiency, and Security user, application performance, system, log, time-series, and dashboarding.
Familiarity with Open-Source concepts and tools like Prometheus, Grafana, ELK etc. 
Proficient in a modern infrastructure automation toolkit such as Terraform/Helm
Proficient in a Linux or Unix based environment.
Experience in destructive testing methodologies and tools such as chaos monkey
Experience in defensive coding practices and patterns for high availability
Nice-to-have
Experience in a cloud computing platform and associated automation patterns it provides, preferably GCP
Proficient in a modern scripting language like GO or Python
Knowledge of APM fundamentals or experience in tools like New Relic or AppDynamics.

WHAT'S ON OFFER

Open to deal base salary with additional project allowances.
Full salary during probation & Full coverage of social insurance.
Performance & salary review: twice a year
Monthly childcare support.
Premium Healthcare insurance and Health check-up services for employee and family ones.
15 Annual Leaves plus 10 days for Bereavement leave and 1.5 months for Paternity leave.
Premium package at top Gym service provider.
Diverse internal activities: Football, Billiards, Badminton, E-sport clubs & other regular company events.
Frequent opportunities to travel to US headquarter from 3-6 months.
Free parking for motorbike and car

CONTACT

PEGASI – IT Recruitment Consultancy | Email: recruit@pegasi.com.vn | Tel: +84 28 3622 8666
We are PEGASI – IT Recruitment Consultancy in Vietnam. If you are looking for new opportunity for your career path, kindly visit our website www.pegasi.com.vn for your reference. Thank you!

Job Summary

Company Type:

Outsource

Technical Skills:

Devops, Java

Location:

Ho Chi Minh, Da Nang - Viet Nam

Working Policy:

Salary:

Negotiation

Job ID:

J01196

Status:

Close

Related Job:

Partner Implementation Engineer (Security & Digital Trust)

Ha Noi - Viet Nam


Outsource

Đóng vai trò là người thực hiện triển khai chủ chốt, chịu trách nhiệm triển khai, cấu hình và tích hợp các giải pháp Security & Digital Trust (PKI, Chữ ký số, Mã hóa, MFA) vào hệ thống thực tế của khách hàng, đảm bảo hệ thống vận hành ổn định, bảo mật và đúng thiết kế. Triển khai hệ thống (Implementation) Chuẩn bị môi trường: kiểm tra hạ tầng (Server, Hệ điều hành, Cơ sở dữ liệu, Mạng) Cài đặt & cấu hình giải pháp: PKI / CA / Chữ ký số / MFA / Mã hóa Thiết lập chính sách bảo mật, quy trình nghiệp vụ Kết nối với thiết bị bảo mật (HSM, Quản lý Khóa) Triển khai trên nền tảng Cloud / Container (nếu có) Triển khai hệ thống trên Kubernetes / OpenShift Cấu hình tài nguyên (YAML: Pod, Dịch vụ, Ingress, Bản đồ Cấu hình, Bí mật) Thiết lập lưu trữ (Khối Lưu trữ Không gian); mạng nội bộ Áp dụng các chính sách bảo mật cho container Tích hợp hệ thống (Integration) Hỗ trợ tích hợp với: Trang web/ Ứng dụng/ Giao diện lập trình ứng dụng và IAM / SSO / AD / LDAP Hướng dẫn sử dụng API/SDK Kiểm tra luồng dữ liệu & bảo mật giao tiếp Phối hợp với nhóm khách hàng (Phát triển / Cơ sở hạ tầng / Bảo mật) Kiểm thử & nghiệm thu (QA/UAT) Thực hiện kiểm thử kỹ thuật & kịch bản vận hành Hỗ trợ UAT với khách hàng Kiểm tra tính đúng đắn của: Chữ ký số; Chứng thư và Luồng xác thực Vận hành & hỗ trợ Giám sát hệ thống, phân tích log, xử lý sự cố Hỗ trợ sau triển khai (L2/L3) Đảm bảo hệ thống hoạt động ổn định & HA Tài liệu & chuyển giao Xây dựng tài liệu triển khai (cấu trúc, cấu hình) Hướng dẫn vận hành cho khách hàng Đào tạo kỹ thuật cơ bản

Negotiation

View details

AI Product Builder

Ha Noi - Viet Nam


Product

  • AI
  • Backend
  • Frontend
  • Devops
  • Java
  • Golang
  • Product Management

Collaborate with domain experts to develop business requirements and constraints for designing prompt AI-assisted workflows and system specifications. Utilize AI tools, no-code/low-code, and coding to rapidly prototype UI/UX mockups and foundational implementations. Test prototypes through hypothesis validation cycles and provide detailed handovers to engineering teams. Decode legacy specifications and enhance existing products with AI-assisted analysis and implementation. Constantly enhance the product team's building-tooling, templates, and practices to adapt to changes in models and platforms.

Negotiation

View details

DevOps Engineer

Others - Viet Nam


Product

  • Devops
  • Kubernetes
  • Network

Managing and developing our Kubernetes platform across multiple clusters and environments including production, development, on-premises and public cloud. Designing and overseeing hybrid cloud infrastructure across on-premises and public clouds (such as GCP, AWS), including workload placement, cross-cloud networking, and unified resource management. Taking responsibility for the end-to-end CI/CD and GitOps process, including container build pipelines, image optimization, and progressive delivery using tools like ArgoCD/FluxCD. Taking charge of the observability stack to provide a comprehensive view across all clusters using tools like Grafana, Mimir, Tempo, Loki, Pyroscope, OnCall, Prometheus, and supporting agent-assisted SRE workflows. Managing and enhancing our inference platform, including vLLM serving and AIBrix for multi-model orchestration and autoscaling with a fleet of NVIDIA GPUs. Operating platform services such as Kafka, Redis, PostgreSQL, OpenSearch. Managing identity and access management with Keycloak integrated with Google Workspace, strengthening SSO, RBAC, and secrets management across the platform. Strengthening network security across private load balancers, firewalls, and VPC segmentation and designing and maintaining hub-and-spoke/multi-AZ topologies. Supporting training infrastructure with self-service VM provisioning, RunPod burst capacity, and Weights and Biases integration. Driving infrastructure reliability, cost efficiency, and capacity planning as the platform scales.

Negotiation

View details