Senior System Software Engineer - NCCL

Yesterday


NVIDIA

GPU-accelerated computing • artificial intelligence • deep learning • virtual reality • gaming

10,000+

Description

• Engage with our partners and customers to root cause functional and performance issues reported with NCCL
• Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters
• Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP, etc.)
• Guide our customers and support teams on HPC knowledge and standard methodologies for running applications on multi-node clusters
• Document and conduct trainings/webinars for NCCL
• Engage with internal teams in different time zones on networking, GPUs, storage, infrastructure and support

Requirements

• B.S./M.S. degree in CS/CE or equivalent experience
• 5+ years of relevant experience
• Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
• Excellent C/C++ programming skills, including debugging, profiling, code optimization, performance analysis, and test design
• Experience working with the engineering or academic research community supporting HPC or AI
• Practical experience with high-performance networking: InfiniBand/RoCE/Ethernet networks, RDMA, topologies, congestion control
• Expert in Linux fundamentals and a scripting language, preferably Python
• Familiar with containers, cloud provisioning and scheduling tools (Docker, Docker Swarm, Kubernetes, SLURM, Ansible)
• Adaptability and passion to learn new areas and tools
• Flexibility to work and communicate effectively across different teams and time zones

Benefits

• Equity
• Benefits
