Platform Reliability Engineer

JOB DESCRIPTION

We are seeking a Platform Reliability Engineer. In this role, you will play a key part in managing and optimizing the operational aspects of the server and network infrastructure for a large financial buy-side organization. Your primary focus will be on reducing operational overhead, optimizing systems, managing configurations, and ensuring the reliability and performance of critical infrastructure.
Responsibilities:
Ensure the production reliability of the firm's Linux-based research and trading platform as part of a globally distributed engineering team.
Provide rapid emergency response to production infrastructure issues.
Proactively understand internal client needs and effectively communicate them to leadership at both regional and global levels.
Identify risks, develop contingency plans, and implement solutions to mitigate them.
Develop and enhance the observability platform to monitor the performance and health of critical computing environments.
Participate in occasional (monthly) on-call rotations and support on-call staff during their shifts.
Contribute to organizational knowledge through documentation, education, and writing maintainable code.

JOB REQUIREMENTS

2+ years of experience in SRE, DevOps, or other infrastructure engineering roles, preferably within the financial industry.
Strong understanding of Linux system internals, including kernel operations, memory management, and performance optimization.
In-depth knowledge of storage technologies, particularly those used in high-performance computing (GPFS experience is a plus).
Broad understanding of IT infrastructure components, such as networking, DNS, NTP/PTP, and NIS.
Proficiency in system automation, monitoring, and self-healing (experience with Salt is a plus).
Experience with container orchestration and virtualization technologies (e.g., Kubernetes, Nomad, VMware).
Familiarity with on-premises and cloud-based HPC infrastructure (operational knowledge of Slurm and GPUs is a plus).
Understanding of AI technologies and their applications in infrastructure automation and management.
Experience with or a strong interest in implementing AI/ML solutions for infrastructure optimization, anomaly detection, or predictive analytics.
A passion for technology and automation, with a deep sense of curiosity and ownership.
A hands-on approach to problem-solving and a demonstrable enthusiasm for technology.
Excellent verbal and written English communication skills.

WHAT'S ON OFFER

Workplace:
Join a vibrant, young and dynamic team working on cutting-edge projects & emerging technologies.
Collaborate with global experts & top tech talent to enhance your skills.
Thrive in an open, forward-thinking, innovation-driven culture that encourages you to reach your full potential.
Benefits:
Competitive salary, 13th month salary and attractive performance bonuses.
Flexible hybrid working model with WFH 2 days per week.
Premium Healthcare and Accident insurance.
Annual health check package.
Free parking and allowances for lunch, marriage, newborn baby, bereavement and others.
A spacious pantry fully equipped with a coffee maker, fridge, microwave and more for a comfortable lunch break.
A wide range of sport and social activities: Yoga, Football, Badminton, Tech clubs, etc.
Annual company trip and teambuilding.
Chance to be honored quarterly and annually with recognition awards for individuals, teams, long-term service, etc.
Advanced English and appropriate soft skills training to assist your career development.
Engaging monthly events: Happy Gathering, Mini Game, Team Birthday Celebrations, Company's Year-end party, etc.
Exclusive company support funds to ease personal loans for home, vehicle, tuition, etc.

CONTACT

PEGASI – IT Recruitment Consultancy | Email: recruit@pegasi.com.vn | Tel: +84 28 3622 8666
We are PEGASI – IT Recruitment Consultancy in Vietnam. If you are looking for a new opportunity in your career, kindly visit our website www.pegasi.com.vn. Thank you!

Job Summary

Company Type: Outsource
Technical Skills:
Location: Ho Chi Minh - Viet Nam
Working Policy: Onsite
Salary: Negotiation
Job ID: J01977
Status: Active
