Senior Site Reliability Engineer - Observability and Telemetry

6 days ago

🇺🇸 United States – Remote

💵 $148k - $419.8k / year

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)


NVIDIA

GPU-accelerated computing • artificial intelligence • deep learning • virtual reality • gaming

10,000+

Description

• Design, implement and support the operational and reliability aspects of a large-scale observability and telemetry collection platform, with a focus on performance at scale, real-time monitoring, logging and alerting
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement
• Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews
• Maintain services once they are live by measuring and monitoring availability, latency and overall system health
• Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
• Practice sustainable incident response and blameless postmortems
• Be part of an on-call rotation to support production systems

Requirements

• BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
• 5+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production
• 5+ years of experience delivering foundational infrastructure and observability platforms
• Experience in one or more of the following: Python, Go, Perl or Ruby
• In-depth knowledge of Linux, networking and containers
• Interest in crafting, analyzing and fixing large-scale distributed systems
• Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
• Ability to debug and optimize code and automate routine tasks
• Experience using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker
• Experience running Grafana, OpenTelemetry, Prometheus and similar observability-focused tools

Benefits

• Eligibility for equity and benefits.


