October 31
• Performance-oriented programming in CUDA, C++, Cython, and Triton.
• Accelerate high-level primitives used to train Large Language Models (LLMs) and optimize distributed communication over AWS EFAv2.
• Work on poolside's own implementation of distributed training for LLMs.
• Ensure cutting-edge performance of LLM pre-training and fine-tuning on huge, state-of-the-art GPU clusters.
• Profile CPU and CUDA code at several abstraction levels.
• Debug and profile distributed applications.
• Troubleshoot undocumented CUDA internals.
• Hack the NCCL library used for GPU communication.
• Tune vanilla CUDA, Triton, and CUTLASS kernels for the latest NVIDIA GPUs.
• Hack PyTorch internals.
• Engineering background
• Expert understanding of GPU hardware and architecture
• Strong C/C++ programming skills
• Fine-grained knowledge of CUDA programming
• Strong algorithmic skills
• Experience with systems programming on Linux
• Plus: knowledge of CPython internals and experience with native extension development
• Plus: knowledge of AWS EFA internals
• Plus: compiler development background
• Fully remote work & flexible hours
• 37 days/year of vacation & holidays
• Health insurance allowance for you and your dependents
• Company-provided equipment
• Wellbeing, always-be-learning, and home-office allowances
• Frequent team get-togethers
• Great, diverse & inclusive people-first culture
August 23
Generate training data for enterprise LLMs using a hardware design platform.