Title: Strategies for Training Massive AI Workloads

Abstract: The rapid advancement of deep learning for generative tasks has revealed strong scaling laws, in which model performance improves predictably with model size. This has led to the proliferation of machine learning models with billions or even trillions of parameters. Training such large-scale models presents significant challenges in memory efficiency, compute utilization, and communication overhead. Addressing these challenges requires non-trivial strategies for parallelizing and synchronizing models at scale. This talk surveys the landscape of training performant models at large scale and discusses techniques such as 5D Parallelism, DeepSpeed, and FSDP (Fully Sharded Data Parallel). We examine the trade-offs among these methods in terms of memory efficiency, communication overhead, and compute intensity, offering insights into their optimizations. Finally, we delve into best practices and practical implementation insights for training large models.

Bio: Tanmaey Gupta is a first-year Ph.D. student working with Prof. Chris De Sa and Prof. Udit Gupta at the intersection of systems and machine learning. His interests lie in designing and implementing systems, software, and algorithms that enable efficient and scalable machine learning in distributed and resource-constrained settings. Prior to joining Cornell, he was a Pre-doctoral Research Fellow at Microsoft Research India, where he worked on projects in the AI Infrastructure team and the Center for Societal Impact through Cloud and Artificial Intelligence.