GPU-accelerated computing • artificial intelligence • deep learning • virtual reality • gaming
October 22
• Build robust AI/HPC infrastructure for new and existing customers.
• Support the operational and reliability aspects of large-scale AI clusters, focusing on performance at scale, training stability, real-time monitoring, logging, and alerting.
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
• Focus primarily on understanding the AI workload and how it interacts with other parts of the system, such as networking, storage, deep learning frameworks, and data cleaning tools.
• Help maintain services once they are live by measuring and monitoring the progress of AI jobs and helping engineering teams design solutions for more robust training at scale.
• Provide feedback to internal teams, for example by filing bugs, documenting workarounds, and suggesting improvements.
• Travel regionally for on-site visits with customers (required).
• BS/MS/PhD or equivalent experience in Computer Science, Data Science, Electrical/Computer Engineering, Physics, Mathematics, or another Engineering field.
• At least 8 years of work or research experience with Python, C++, or other software development.
• Track record of medium- to large-scale AI training and an understanding of key libraries used for NLP/LLM/VLA training (NeMo Framework, DeepSpeed, etc.).
• Enthusiasm for working with multiple levels and teams across organizations (Engineering, Product, Sales, and Marketing).
• Ability to work in a constantly evolving environment without losing focus.
• Ability to multitask in a fast-paced environment.
• Drive, with strong analytical and problem-solving skills.
• Strong time-management and organization skills for coordinating multiple initiatives, priorities, and implementations of new technology and products within very sophisticated projects.
• A self-starter with a growth mindset and a passion for continuous learning and for sharing findings across the team.
• Excellent verbal and written communication skills and technical presentation skills in English.