NVIDIA

GPU-accelerated computing • artificial intelligence • deep learning • virtual reality • gaming

10,000+

Senior Production SRE Engineer - Storage

September 15

🇵🇱 Poland – Remote

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

Ansible

Chef

Cloud

Distributed Systems

Docker

Java

Kubernetes

OpenStack

Perl

Prometheus

Puppet

Python

Ruby

Terraform

Description

• Assist in the design, implementation, and support of large-scale storage clusters, including monitoring, logging, and alerting.
• Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.
• Work closely with peers on the team to improve the lifecycle of services – from inception and design, through deployment, operation, and refinement.
• Support services before they go live through activities such as system design consulting, developing software and frameworks, capacity management, and launch reviews.
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health, including leveraging machine learning models.
• Scale systems sustainably through mechanisms like AI/ML and automation, and evolve systems by pushing for changes that improve reliability and velocity.
• Practice sustainable incident response and blameless postmortems.
• Be part of an on-call rotation to support production systems.

Requirements

• BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
• At least 5 years of practical experience.
• Background in algorithms, data structures, complexity analysis, software design, and maintaining large-scale Linux-based systems.
• Experience in one or more of the following: C/C++, Java, Python, Go, Perl or Ruby, AI/ML frameworks and methodologies.
• Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.
• Experience using observability and tracing tools like InfluxDB, Prometheus, and the Elastic Stack.
• Experience with Git, code review, pipelines, and CI/CD.
• Strong debugging skills with a systematic problem-solving approach to identifying complex problems.
