GPU-accelerated computing • artificial intelligence • deep learning • virtual reality • gaming
10,000+
September 15
• Support and work on groundbreaking Generative AI inferencing workloads running in a globally distributed, heterogeneous environment spanning all major cloud service providers.
• Ensure the best possible performance and availability on current and next-generation GPU architectures.
• Collaborate closely with the service owner, architecture, research, and tools teams at NVIDIA to achieve ideal results for the AI problems at hand.
• Monitor and support critical high-performance, large-scale services running across multiple clouds.
• Participate in the triage and resolution of sophisticated infrastructure-related issues.
• Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.
• Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
• Practice balanced incident response and blameless postmortems.
• Be part of an on-call rotation to support production systems, and lead significant production improvements around tooling, automation, and process.
• Architect, design, and code using your expertise to optimize, deploy, and productize services.
• 8+ years of experience operating and owning the end-to-end availability and performance of mission-critical services in a live-site production environment, either as an SRE or Service Owner.
• 3+ years executing incident management and participating in an on-call shift.
• BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
• Solid understanding of containerization, microservices architecture, and Kubernetes (K8s).
• Excellent understanding of the Kubernetes ecosystem and K8s best practices.
• Ability to dissect complex problems into simple sub-problems and use available solutions to resolve them.
• Technical leadership beyond development, including scoping, capturing requirements, and leading and influencing multiple teams of engineers on broad development initiatives.
• Experience leading significant production activities, including change management, post-mortem reviews, workflow processes, software design, and delivering software automation in various languages (Python or Go) and technologies (CI/CD auto-remediation, alert correlation).
• Strong understanding of SLOs/SLIs, error budgeting, and KPIs, and of configuring them for highly sophisticated services.
• Experience with the ELK and Prometheus stacks as a power user and administrator.
• Excellent understanding of cloud environments and technologies, especially AWS, Azure, GCP, or OCI.
• Proven strengths in identifying, mitigating, and root-causing issues while continuously seeking ways to drive optimization, efficiency, and the bottom line.
September 15
10,000+
Site Reliability Engineer ensuring reliability for Kyndryl's technology systems.
September 13
51 - 200
DevOps Engineer needed for True Fit's high reliability consumer experience platform.
August 26
51 - 200
Manage, maintain, and troubleshoot cloud networking infrastructure for multi-cloud environments.
July 26
201 - 500
Innovate financial services and payment solutions for Australian businesses.
🇮🇳 India – Remote
💰 Series B on 2022-03
⏰ Full Time
🟡 Mid-level
🟠 Senior
⛑ DevOps & Site Reliability Engineer (SRE)
July 13
201 - 500
🇮🇳 India – Remote
💰 Venture Round on 2007-12
⏰ Full Time
🟠 Senior
⛑ DevOps & Site Reliability Engineer (SRE)