NVIDIA

GPU-accelerated computing • artificial intelligence • deep learning • virtual reality • gaming

10,000+

Senior Site Reliability Engineer - Observability and Telemetry

6 days ago

🇺🇸 United States – Remote

💵 $148k - $419.8k / year

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Cloud

Distributed Systems

Docker

Grafana

Kubernetes

Open Source

OpenStack

Perl

Prometheus

Python

Ruby

Description

• Design, implement and support the operational and reliability aspects of a large-scale Observability & Telemetry collection platform, with a focus on performance at scale, real-time monitoring, logging and alerting
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement
• Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews
• Maintain services once they are live by measuring and monitoring availability, latency and overall system health
• Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
• Practice sustainable incident response and blameless postmortems
• Be part of an on-call rotation to support production systems

Requirements

• BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
• 5+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production
• 5+ years of experience delivering foundational infrastructure and observability platforms
• Experience in one or more of the following: Python, Go, Perl or Ruby
• In-depth knowledge of Linux, networking and containers
• Interest in crafting, analyzing and fixing large-scale distributed systems
• Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
• Ability to debug and optimize code and automate routine tasks
• Experience using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker
• Experience running Grafana, OpenTelemetry, Prometheus and similar observability-focused tools