GPU Systems Engineer
Description
The GPU Systems Engineer scales and optimizes large HPC and AI research infrastructure in a highly performance-sensitive environment. The role sits at the intersection of compute, storage, networking, and operating systems, and owns GPU cluster design, performance engineering, and automation across globally distributed data centers.
Key Responsibilities
Design, build, and optimize large-scale distributed GPU compute clusters
Diagnose and resolve performance bottlenecks across compute, storage, and networking layers
Profile, benchmark, and fine-tune GPU workloads alongside research teams
Automate deployment, monitoring, and troubleshooting across thousands of Linux nodes
Own infrastructure projects end-to-end, from design through production support
Test and deploy new hardware and software platforms
Partner with vendors to resolve complex system and hardware issues
Required Skills
Experience designing, building, and optimizing large-scale distributed GPU compute clusters
Ability to diagnose and resolve performance bottlenecks across compute, storage, and networking layers
Experience profiling, benchmarking, and fine-tuning GPU workloads alongside research teams
Experience automating deployment, monitoring, and troubleshooting across thousands of Linux nodes
Ability to own infrastructure projects end-to-end, from design through production support
Experience testing and deploying new hardware and software platforms
Experience partnering with vendors to resolve complex system and hardware issues
Preferred Qualifications
Familiarity with NVIDIA technologies such as NCCL, GPUDirect RDMA, and NVLink
Experience with configuration management tools such as Salt, Ansible, Puppet, or Chef
Experience operating at scale in performance-critical environments
Comfortable working in fast-paced, high-impact technical environments