Team Lead Site Reliability Engineer (SRE)

Description

We are working with a leading technology-driven trading firm to hire a Team Lead SRE to drive reliability, scalability, and performance across mission-critical systems. This role is ideal for engineers from Big Tech environments who are passionate about building resilient infrastructure and leading high-performing teams in a low-latency, high-availability setting.

Key Responsibilities

  • Lead and mentor a team of Site Reliability Engineers responsible for the uptime, performance, and scalability of production systems.

  • Define and implement SRE best practices, including SLIs, SLOs, error budgets, and incident management frameworks.

  • Own production reliability across trading and research platforms, ensuring systems operate with minimal latency and maximum availability.

  • Partner with software engineering, infrastructure, and trading teams to improve system design, observability, and operational excellence.

  • Drive automation initiatives to reduce toil, improve deployment pipelines, and enhance system self-healing capabilities.

  • Lead incident response, postmortems, and continuous improvement efforts across the platform.

Required Skills

  • Proven experience in a Site Reliability Engineering or Production Engineering role, ideally within a large-scale or Big Tech environment.

  • Strong programming skills in languages such as Python, Go, or Java.

  • Deep understanding of distributed systems, networking, and systems architecture.

  • Experience with observability tooling (e.g., Prometheus, Grafana, OpenTelemetry) and incident management practices.

  • Hands-on experience with cloud platforms (AWS, GCP, or Azure) and container orchestration (Kubernetes).

  • Track record of leading teams or mentoring engineers in high-performance environments.

  • Strong troubleshooting skills and the ability to operate effectively under pressure.

Preferred Qualifications

  • Background in low-latency systems and high-performance environments.

  • Experience with CI/CD systems, infrastructure as code, and large-scale production environments.

  • Familiarity with Linux internals, networking protocols, and performance tuning.

  • Prior experience managing on-call rotations and improving operational maturity.

Apply Today

Thank you for your interest in this opportunity. Please complete the form below and upload any relevant documents. A member of our team will review your application and be in touch soon.
