Description
GPU Systems Engineer focused on scaling and optimizing large-scale HPC and AI research infrastructure within a highly performance-sensitive environment. This role sits at the intersection of compute, storage, networking, and operating systems, owning GPU cluster design, performance engineering, and automation across globally distributed data centers.
Key Responsibilities
-
Design, build, and optimize large-scale distributed GPU compute clusters
-
Diagnose and resolve performance bottlenecks across compute, storage, and networking layers
-
Profile, benchmark, and fine-tune GPU workloads alongside research teams
-
Automate deployment, monitoring, and troubleshooting across thousands of Linux nodes
-
Own infrastructure projects end-to-end, from design through production support
-
Test and deploy new hardware and software platforms
-
Partner with vendors to resolve complex system and hardware issues
Required Skills
-
Design, build, and optimize large-scale distributed GPU compute clusters
-
Diagnose and resolve performance bottlenecks across compute, storage, and networking layers
-
Profile, benchmark, and fine-tune GPU workloads alongside research teams
-
Automate deployment, monitoring, and troubleshooting across thousands of Linux nodes
-
Own infrastructure projects end-to-end, from design through production support
-
Test and deploy new hardware and software platforms
-
Partner with vendors to resolve complex system and hardware issues
Preferred Qualifications
-
Familiarity with NVIDIA technologies such as NCCL, GPUDirect RDMA, NVLink
-
Experience with configuration management tools such as Salt, Ansible, Puppet, or Chef
-
Experience operating at scale in performance-critical environments
-
Comfortable working in fast-paced, high-impact technical environments
Apply Today
Thank you for your interest in this opportunity. Please complete the form below and upload any relevant documents. A member of our team will review your application and be in touch soon.