Senior Production SRE Engineer - Storage

September 15

🇵🇱 Poland – Remote

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

NVIDIA

GPU-accelerated computing • artificial intelligence • deep learning • virtual reality • gaming

10,000+

Description

• Assist in the design, implementation, and support of large-scale storage clusters, including monitoring, logging, and alerting.
• Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.
• Work closely with peers on the team to improve the lifecycle of services, from inception and design through deployment, operation, and refinement.
• Support services before they go live through activities such as system design consulting, developing software and frameworks, capacity management, and launch reviews.
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health, including leveraging machine learning models.
• Scale systems sustainably through mechanisms like AI/ML and automation, and evolve systems by pushing for changes that improve reliability and velocity.
• Practice sustainable incident response and blameless postmortems.
• Be part of an on-call rotation to support production systems.

Requirements

• BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
• At least 5 years of practical experience.
• Background in algorithms, data structures, complexity analysis, software design, and maintaining large-scale Linux-based systems.
• Experience in one or more of the following: C/C++, Java, Python, Go, Perl, Ruby, or AI/ML frameworks and methodologies.
• Good knowledge of infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform.
• Experience using observability and tracing tools like InfluxDB, Prometheus, and the Elastic stack.
• Experience with Git, code review, pipelines, and CI/CD.
• Strong debugging skills with a systematic problem-solving approach to identifying complex problems.
