Director, Site Reliability Engineering

Benchmark is a global product realization services company that specializes in providing comprehensive solutions in advanced computing, commercial aerospace, defense, medical technologies, and semiconductor capital equipment. The company offers a range of services from design engineering and precision machining to full-system electronic assembly and lifecycle management, ensuring reliable support for innovative products in demanding markets. With a collaborative approach that leverages cross-functional teams, Benchmark aims to be a trusted partner in delivering customized solutions tailored to complex challenges.

Advanced Technology • Design Engineering • Manufacturing • Order Fulfillment • Design

10,000+ employees

Founded 1979

🚀 Aerospace

⚕️ Healthcare Insurance

March 17

🇺🇸 United States – Remote

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🦅 H1B Visa Sponsor

AWS

Cloud

Distributed Systems

Grafana

Prometheus

Python

Ray

📋 Description

• We are seeking a Director of Site Reliability Engineering (SRE) to lead our SRE team in ensuring the availability, performance, and scalability of our critical systems.
• This role is responsible for defining and driving reliability strategies, operational excellence, and incident response processes at scale.
• You will collaborate closely with engineering, DevOps, and product teams to establish best practices and implement processes that enhance system resilience and service performance.
• Define and execute the vision for site reliability, balancing innovation with operational stability.
• Lead, mentor, and grow a high-performing SRE team, fostering a culture of ownership and continuous improvement.
• Partner with Engineering, DevOps, and Product teams to embed reliability best practices into the development lifecycle.
• Establish and refine SLIs, SLOs, and error budgets to measure and improve service reliability.
• Develop and drive incident management processes, including real-time incident response, on-call coordination, and postmortem analysis to prevent recurring issues.
• Implement and standardize operational readiness reviews and escalation procedures to ensure teams are equipped to handle incidents effectively.
• Drive initiatives to reduce operational toil, leveraging automation where applicable to enhance team efficiency.
• Collaborate with engineering teams to define performance testing and capacity planning strategies to proactively mitigate reliability risks.
• Champion the adoption of observability, logging, and monitoring best practices, ensuring visibility into system health and performance.

🎯 Requirements

• 8+ years of experience in Site Reliability Engineering, DevOps, or related fields, with at least 3 years in a leadership role.
• Proven track record of driving operational excellence in large-scale, distributed systems.
• Expertise in defining and implementing SLIs, SLOs, error budgets, and incident management processes.
• Strong knowledge of observability tools such as Prometheus, Grafana, Datadog, New Relic, or similar.
• Experience leading on-call rotations, postmortems, and operational readiness programs.
• Excellent leadership, communication, and stakeholder management skills.
• Deep experience with AWS cloud environments, including operational best practices for high availability and reliability.
• AWS certifications such as AWS Certified DevOps Engineer – Professional, AWS Certified Solutions Architect – Professional, or AWS Certified Advanced Networking – Specialty.
• Experience with AWS monitoring and logging tools (CloudWatch, X-Ray, AWS Config, GuardDuty).
• Experience scaling SRE practices in high-growth or regulated environments.
• Hands-on background in software engineering with Python, Bash, or similar languages.
