NVIDIA

GPU-accelerated computing • artificial intelligence • deep learning • virtual reality • gaming

10,000+

Senior Site Reliability Engineer

September 15

🇮🇳 India – Remote

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

AWS

Azure

Cloud

Google Cloud Platform

Kubernetes

Microservices

Prometheus

Python

PyTorch

TensorFlow

Description

• Support and work on groundbreaking Generative AI inferencing workloads running in a globally distributed, heterogeneous environment spanning all major cloud service providers.
• Ensure the best possible performance and availability on current and next-generation GPU architectures.
• Collaborate closely with the service owner, architecture, research, and tools teams at NVIDIA to achieve ideal results for the AI problems at hand.
• Monitor and support critical high-performance, large-scale services running across multiple clouds.
• Participate in the triage and resolution of sophisticated infrastructure-related issues.
• Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.
• Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
• Practice balanced incident response and blameless postmortems.
• Be part of an on-call rotation to support production systems, and lead significant production improvements around tooling, automation, and process.
• Architect, design, and code using your expertise to optimize, deploy, and productize services.

Requirements

• 8+ years of experience operating and owning end-to-end availability and performance of mission-critical services in a live-site production environment, either as an SRE or Service Owner.
• 3+ years executing incident management and participating in an on-call shift.
• BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
• Solid understanding of containerization and microservices architecture, including Kubernetes (K8s).
• Excellent understanding of the Kubernetes ecosystem and its best practices.
• Ability to dissect complex problems into simple sub-problems and use available solutions to resolve them.
• Technical leadership beyond development, including scoping, requirements capture, and leading and influencing multiple teams of engineers on broad development initiatives.
• Experience leading significant production activities, including change management, postmortem reviews, workflow processes, software design, and delivering software automation in various languages (Python or Go) and technologies (CI/CD, auto-remediation, alert correlation).
• Strong understanding of SLOs/SLIs, error budgeting, and KPIs, and experience configuring them for highly sophisticated services.
• Experience with the ELK and Prometheus stacks as a power user and administrator.
• Excellent understanding of cloud environments and technologies, especially AWS, Azure, GCP, or OCI.
• Proven strengths in identifying, mitigating, and root-causing issues while continuously seeking ways to drive optimization, efficiency, and the bottom line.
