NVIDIA

GPU-accelerated computing • artificial intelligence • deep learning • virtual reality • gaming

10,000+

Principal Infrastructure SRE - Compute

October 31

🇺🇸 United States – Remote

💵 $248k - $385.3k / year

⏰ Full Time

🔴 Lead

⛑ DevOps & Site Reliability Engineer (SRE)

🗽 H1B Visa Sponsor

AWS

Azure

Cloud

DNS

Google Cloud Platform

Kubernetes

Microservices

OpenShift

Python

Terraform

Description

• Lead initiatives to transform IT Compute platform architecture to build new service offerings across On-Prem & Cloud
• Define and implement metrics to measure the efficiency of compute platforms & services
• Collect and review system data for capacity and planning purposes
• Develop and maintain tools for collecting, analyzing, and visualizing data for reporting, alerting, and monitoring
• Collaborate with NVIDIA leadership, senior engineers, program managers, and product managers

Requirements

• Bachelor’s degree in Engineering, Computer Science, Mathematics, or related field, or equivalent experience
• 12+ years of proven experience in compute platform engineering with a focus on automation
• Proven experience in designing and deploying virtualization architectures, including expertise with Kubernetes distributions
• In-depth knowledge of hardware technologies, including SR-IOV, DPU, and GPU
• Proven experience evaluating existing application architectures and identifying opportunities for containerization
• Strong analytical skills with the ability to define and track key performance metrics
• Experience in developing tools for data analysis and performance profiling
• Proficiency in programming languages such as Go and/or Python
• Experience running large environments that combine bare metal, large-scale virtualization with tens of thousands of VMs, and cloud infrastructure