Senior Site Reliability Engineer

September 15

🇮🇳 India – Remote

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

NVIDIA

GPU-accelerated computing • artificial intelligence • deep learning • virtual reality • gaming

10,000+

Description

• Support and work on groundbreaking Generative AI inferencing workloads running in a globally distributed, heterogeneous environment spanning all major cloud service providers.
• Ensure the best possible performance and availability on current and next-generation GPU architectures.
• Collaborate closely with the service owner, architecture, research, and tools teams at NVIDIA to achieve ideal results for the AI problems at hand.
• Monitor and support critical, high-performance, large-scale services running across multiple clouds.
• Participate in the triage and resolution of sophisticated infrastructure-related issues.
• Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.
• Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
• Practice balanced incident response and blameless postmortems.
• Be part of an on-call rotation to support production systems, and lead significant production improvements around tooling, automation, and process.
• Architect, design, and code using your expertise to optimize, deploy, and productize services.

Requirements

• 8+ years of experience operating and owning end-to-end availability and performance of mission-critical services in a live-site production environment, as either an SRE or a Service Owner.
• 3+ years executing incident management and participating in an on-call rotation.
• BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
• Solid understanding of containerization and microservices architecture, including Kubernetes (K8s).
• Excellent understanding of the Kubernetes ecosystem and K8s best practices.
• Ability to dissect complex problems into simple sub-problems and use available solutions to resolve them.
• Technical leadership beyond development, including scoping, requirements capture, and leading and influencing multiple teams of engineers on broad development initiatives.
• Ability to lead significant production activities, including change management, post-mortem reviews, workflow processes, software design, and delivering software automation in various languages (Python or Go) and technologies (CI/CD auto-remediation, alert correlation).
• Strong understanding of SLOs/SLIs, error budgeting, and KPIs, and of configuring them for highly sophisticated services.
• Experience with the ELK and Prometheus stacks as a power user and administrator.
• Excellent understanding of cloud environments and technologies, especially AWS, Azure, GCP, or OCI.
• Proven strengths in identifying, mitigating, and root-causing issues while continuously seeking ways to drive optimization, efficiency, and the bottom line.
