Senior System Software Engineer - NCCL

October 22


NVIDIA

GPU-accelerated computing • artificial intelligence • deep learning • virtual reality • gaming

10,000+ employees

Founded 1993

🤖 Artificial Intelligence

🎮 Gaming

Description

• Engage with our partners and customers to root cause functional and performance issues reported with NCCL
• Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters
• Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP, etc.)
• Guide our customers and support teams on HPC knowledge and standard methodologies for running applications on multi-node clusters
• Document and conduct trainings/webinars for NCCL
• Engage with internal teams in different time zones on networking, GPUs, storage, infrastructure and support

Requirements

• B.S./M.S. degree in CS/CE or equivalent experience
• 5+ years of relevant experience
• Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
• Excellent C/C++ programming skills, including debugging, profiling, code optimization, performance analysis, and test design
• Experience working with the engineering or academic research community supporting HPC or AI
• Practical experience with high-performance networking: InfiniBand/RoCE/Ethernet networks, RDMA, topologies, congestion control
• Expertise in Linux fundamentals and a scripting language, preferably Python
• Familiarity with containers, cloud provisioning and scheduling tools (Docker, Docker Swarm, Kubernetes, SLURM, Ansible)
• Adaptability and passion to learn new areas and tools
• Flexibility to work and communicate effectively across different teams and time zones

Benefits

• equity
• benefits


