- Salary: $300k - $600k
- Locations: New York
- Job Type: Full Time
- Job Category: Infrastructure
Description
Join a core engineering group as Lead Site Reliability Engineer, designing and scaling the Linux platforms that underpin ML/AI-driven trading. You will architect and own reliability for massive simulation, HPC, and production workloads, keeping trading systems ultra-reliable and ultra-fast. This is a hands-on leadership role that weighs technical depth, strategic decision-making, and platform SRE excellence equally.
Key Responsibilities
- Lead SRE practices for Linux platforms powering low-latency, high-throughput trading workloads.
- Architect, optimize, and tune Linux for performance, resilience, and minimal latency.
- Drive incident response, root cause analysis, and continuous reliability improvement across production systems.
- Oversee system automation and reproducibility—build, deploy, and fleet-manage bare-metal Linux and containerized stacks.
- Manage and enhance Kubernetes clusters, network configuration, and large-scale orchestration.
- Set observability standards; expand monitoring, alerting, and performance metrics across platforms.
- Analyze networking, kernel-level performance, and distributed systems—solving core challenges in a multi-petabyte, multi-cluster environment.
- Build Python tools for automation, reliability engineering, and performance analysis.
- Design highly distributed systems.
Preferred Qualifications
The ideal candidate comes from a top-tier tech environment (FAANG, elite trading, hyperscale infra). They have experience building technology 0→1, owning systems end-to-end, and working close to the metal. They will operate across everything from bare-metal Linux to modern build and observability stacks.
- Deep Linux knowledge, plus scripting (Python), DevOps, and Kubernetes experience.
Apply Today
Thank you for your interest in this opportunity. Please complete the form below and upload any relevant documents. A member of our team will review your application and be in touch soon.