Tools and Methods for Distributed and Large-Scale Training of Deep Neural Networks
Time: Thu 2025-03-27 09.00
Location: Sal-A, Electrum, Kistagången 16
Video link: https://kth-se.zoom.us/j/69403203069
Language: English
Subject area: Computer Science
Doctoral student: Sina Sheikholeslami, Software and Computer Systems, SCS
Opponent: Associate Professor Salman Toor, Uppsala University
Supervisor: Professor Vladimir Vlassov, Software and Computer Systems, SCS; Associate Professor Amir H. Payberah, Software and Computer Systems, SCS; Dr. Jim Dowling, Hopsworks AB
Abstract
Deep Neural Networks (DNNs) have been at the forefront of recent breakthroughs in Machine Learning (ML) and Deep Learning (DL). DNNs are increasingly used in a wide range of tasks, from Earth observation and the analysis of satellite images to medical diagnosis and smart chatbots. A major contributor to these advances has been the abundance of training data, computational resources, and frameworks that enable the efficient training of ever-larger and more complex DNNs, within a paradigm referred to as distributed DL and, in particular, distributed training, which is the focus of this doctoral dissertation. In distributed training, the data and computation are distributed across several workers, as opposed to single-host training, in which both the data and the computation reside on a single worker. Distributed training can thus help overcome the limitations of single-host training, such as memory constraints, computational bottlenecks, and limited data availability.
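As a minimal illustration of the data-parallel flavor of distributed training described above, the following sketch simulates, in a single process, how each worker computes gradients on its own data shard and how the averaged gradient is applied to identical model replicas. It is a conceptual example only, with a placeholder linear model and dataset, and does not reproduce the systems developed in the dissertation; in practice, frameworks such as PyTorch DistributedDataParallel or Horovod perform the gradient averaging (all-reduce) across hosts.

```python
# Conceptual single-process simulation of synchronous data-parallel training.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 8))                            # full training set
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=1024)

num_workers = 4
shards = np.array_split(np.arange(len(X)), num_workers)   # one data shard per worker

w = np.zeros(8)        # model parameters, replicated on every worker
lr = 0.1

for step in range(100):
    # Each worker computes the gradient of the squared loss on its own shard.
    local_grads = []
    for shard in shards:
        Xs, ys = X[shard], y[shard]
        grad = 2.0 * Xs.T @ (Xs @ w - ys) / len(shard)
        local_grads.append(grad)

    # Synchronous "all-reduce": average the local gradients, then apply the
    # same update everywhere so that the model replicas stay identical.
    w -= lr * np.mean(local_grads, axis=0)

print("final training MSE:", np.mean((X @ w - y) ** 2))
```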
However, distributed training comes with a number of challenges that must be carefully addressed in order to build a system that uses it efficiently. These challenges include, but are not limited to, the efficient distribution of computation and data across the workers; the presence of straggler workers in a cluster (workers that fall significantly behind the others in their computation steps), especially in synchronous execution settings; and communication and synchronization among the workers. This implies that the system should provide scalability in both the computation and data dimensions.
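To make the straggler problem concrete, the toy simulation below (an illustrative assumption, not an experiment from the dissertation) shows how a single slow worker dominates total training time under synchronous execution, where every step finishes only when the slowest worker does.

```python
# Toy simulation of the straggler effect under synchronous execution.
import random

random.seed(0)
num_workers, num_steps = 8, 1000

def step_time(worker_id, straggler_id=None):
    base = random.uniform(0.9, 1.1)          # nominal per-step compute time
    return base * (5.0 if worker_id == straggler_id else 1.0)

def total_sync_time(straggler_id=None):
    # A synchronous step completes only when the slowest worker does.
    return sum(max(step_time(w, straggler_id) for w in range(num_workers))
               for _ in range(num_steps))

print("total time, no straggler :", round(total_sync_time(), 1))
print("total time, one straggler:", round(total_sync_time(straggler_id=0), 1))
```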
On the other hand, from a programming and usability point of view, using the distributed training paradigm typically requires knowledge of distributed computing principles and experience with distributed and data-intensive computing frameworks, as well as major changes to the code used for single-host training. Furthermore, since training a DNN involves several steps and stages (e.g., data preparation, hyperparameter tuning, and model training), it is desirable to reuse the computational results of one stage in another (e.g., reusing the weights learned during hyperparameter tuning trials to initialize the weights for the model training stage) in order to reduce overall training time. Finally, when developing larger and more complex DNNs, we also need to understand the contribution of each design choice, for example through ablation studies.
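The following sketch (hypothetical code, not the dissertation's API) illustrates the kind of reuse mentioned above: the weights learned by the best hyperparameter tuning trial are kept and used to warm-start the final training run instead of a random initialization. The model, dataset, and hyperparameters are placeholders.

```python
# Reusing weights from hyperparameter tuning trials to warm-start training.
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(512, 8), torch.randn(512, 1)      # placeholder dataset

def make_model():
    return nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))

def run_trial(lr, epochs=20):
    """One hyperparameter tuning trial; returns its final loss and weights."""
    model, loss_fn = make_model(), nn.MSELoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return loss.item(), copy.deepcopy(model.state_dict())

# Hyperparameter tuning: keep the best trial's weights, not only its score.
trials = [run_trial(lr) for lr in (0.3, 0.1, 0.03)]
best_loss, best_weights = min(trials, key=lambda t: t[0])

# Model training: warm-start from the best trial instead of a random init.
final_model = make_model()
final_model.load_state_dict(best_weights)
print("loss of the reused (warm-start) weights:", best_loss)
```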
The contributions of this doctoral dissertation address the aforementioned challenges and collectively optimize large-scale DNN training, making it more accessible, efficient, and computationally sustainable, while reducing redundancy in ML/DL workflows and providing usable tools for conducting ablation studies.
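As a rough illustration of what an ablation study involves (a hedged sketch under placeholder assumptions, not the tooling contributed by the dissertation), the snippet below trains one model variant per removed component and compares the results, attributing performance to individual design choices.

```python
# Minimal ablation study: train one variant per removed component and compare.
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(512, 8), torch.randn(512, 1)      # placeholder dataset

def build_model(use_second_hidden=True, use_dropout=True):
    # Each flag toggles one design choice of the "full" architecture.
    layers = [nn.Linear(8, 32), nn.ReLU()]
    if use_dropout:
        layers.append(nn.Dropout(0.1))
    if use_second_hidden:
        layers += [nn.Linear(32, 32), nn.ReLU()]
    layers.append(nn.Linear(32, 1))
    return nn.Sequential(*layers)

def train_and_eval(model, epochs=30):
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    model.train()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        return loss_fn(model(X), y).item()

# One trial per ablated component, plus the full model as the baseline.
ablations = {
    "full model":           build_model(),
    "without hidden layer": build_model(use_second_hidden=False),
    "without dropout":      build_model(use_dropout=False),
}
for name, model in ablations.items():
    print(f"{name:22s} loss = {train_and_eval(model):.4f}")
```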