Senior HPC DevOps Engineer

6 days ago

🇩🇪 Germany – Remote

⏰ Full Time

🟠 Senior

⛑ DevOps & Site Reliability Engineer (SRE)

NVIDIA

GPU-accelerated computing • artificial intelligence • deep learning • virtual reality • gaming

10,000+

Description

• Design, implement, and maintain large-scale HPC/AI clusters with monitoring, logging, and alerting
• Manage Linux job/workload schedulers and orchestration tools
• Develop and maintain continuous integration and delivery pipelines
• Develop tooling to automate the deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources
• Deploy monitoring solutions for servers, network, and storage
• Troubleshoot from the bottom up: bare metal, operating system, software stack, and application level
• Act as a technical resource: develop, redefine, and document standard methodologies to share with internal teams
• Support Research & Development activities and engage in POCs/POVs for future improvements

Requirements

• A degree in Computer Science, Engineering, or a related field, with 5+ years of experience
• Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software
• Experience with job scheduling and orchestration tools such as Slurm and K8s
• Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking (sockets, firewalld, iptables, Wireshark, etc.) and internals, ACLs, OS-level security protection, and common protocols (e.g. TCP, DHCP, DNS)
• Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS
• Familiarity with newer and emerging storage technologies
• Python programming and Bash scripting experience
• Comfortable with automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef
• Deep knowledge of networking protocols such as InfiniBand and Ethernet
• Deep understanding of and experience with virtual systems (for example VMware, Hyper-V, KVM, or Citrix)
• Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud)
