Reliable and Efficient Distributed Machine Learning
Time: Thu 2022-04-28 13.30
Location: F3, Lindstedtsvägen 26 & 28, Stockholm
Language: English
Subject area: Electrical Engineering
Doctoral student: Hao Chen, Optical Network Laboratory (ON Lab)
Opponent: Professor Yan Zhang
Supervisor: Ming Xiao, School of Electrical Engineering and Computer Science (EECS); Mikael Skoglund, School of Electrical Engineering and Computer Science (EECS)
Abstract
With the ever-increasing penetration and proliferation of smart Internet of Things (IoT) applications, machine learning (ML) is envisioned to be a key technique for big-data-driven modelling and analysis. Since the massive data generated by these IoT devices are commonly collected and stored in a distributed manner, performing ML over networks, e.g., distributed machine learning (DML), has emerged as a promising paradigm, especially for large-scale model training. In this thesis, we explore the optimization and design of DML algorithms under different network conditions. Our main research on DML is organized into the following four parts/papers, as detailed below.
In the first part of the thesis, we explore fully-decentralized ML based on the alternating direction method of multipliers (ADMM). Specifically, to address the two main challenges in DML systems, i.e., the communication bottleneck and stragglers (nodes/devices with slow responses), an error-control-coding-based stochastic incremental ADMM (csI-ADMM) algorithm is proposed. Given an appropriate mini-batch size, it is proved that the proposed csI-ADMM method has an $O(1/\sqrt{k})$ convergence rate and $O(1/\mu^2)$ communication cost, where $k$ denotes the number of iterations and $\mu$ is the target accuracy. In addition, the tradeoff between the convergence rate and the number of stragglers, as well as the relationship between the mini-batch size and the number of stragglers, is analyzed both theoretically and experimentally.
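To give a feel for an incremental ADMM pass over a ring of nodes with mini-batch (stochastic) local updates, the following self-contained Python sketch solves a toy consensus least-squares problem. It is not the csI-ADMM algorithm of the thesis: the coding-based straggler protection is omitted, and the problem, update rules and parameters (rho, batch_size, etc.) are illustrative assumptions.

# Toy ring-based stochastic incremental ADMM sketch (NOT the thesis algorithm).
import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim, n_local = 8, 5, 200
rho, batch_size, n_cycles = 5.0, 20, 100

# Synthetic local data: node i holds (A_i, b_i); the network solves the
# consensus problem  min_x  sum_i 0.5 * ||A_i x - b_i||^2.
x_true = rng.normal(size=dim)
data = []
for _ in range(n_nodes):
    A = rng.normal(size=(n_local, dim))
    b = A @ x_true + 0.1 * rng.normal(size=n_local)
    data.append((A, b))

x = np.zeros((n_nodes, dim))   # local primal variables
y = np.zeros((n_nodes, dim))   # local dual variables
s = np.zeros(dim)              # running sum of (x_i + y_i / rho), carried by the token
z = np.zeros(dim)              # global model estimate carried by the token

for _ in range(n_cycles):
    for i in range(n_nodes):   # the token visits nodes in ring order
        A, b = data[i]
        idx = rng.choice(n_local, size=batch_size, replace=False)
        grad = A[idx].T @ (A[idx] @ z - b[idx]) / batch_size   # mini-batch gradient at z
        s -= x[i] + y[i] / rho          # remove node i's stale contribution
        x[i] = z - (grad + y[i]) / rho  # linearized stochastic primal update
        y[i] += rho * (x[i] - z)        # dual ascent step
        s += x[i] + y[i] / rho          # add the fresh contribution back
        z = s / n_nodes                 # refresh the token model incrementally

print("distance to x_true:", np.linalg.norm(z - x_true))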
In the second part of the thesis, we investigate an asynchronous approach to fully-decentralized federated learning (FL). Specifically, an asynchronous parallel incremental block-coordinate descent (API-BCD) algorithm is proposed, in which multiple nodes/devices are active in an asynchronous fashion to accelerate convergence. The convergence of API-BCD is proved theoretically, and simulation results demonstrate its superior performance in terms of both running speed and communication cost compared with state-of-the-art algorithms.
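The sketch below illustrates the general flavour of asynchronous block-coordinate updates using Python threads: each node repeatedly refreshes its own block of a shared model from a possibly stale snapshot. It is a generic illustration, not the API-BCD algorithm; the objective, step size and update schedule are assumptions made for the example.

# Toy asynchronous block-coordinate descent with threads (NOT the thesis algorithm).
import threading
import numpy as np

rng = np.random.default_rng(1)
n_nodes, block, n_updates = 4, 3, 400
dim = n_nodes * block

# Shared quadratic objective  f(x) = 0.5 * ||A x - b||^2, split into blocks.
A = rng.normal(size=(4 * dim, dim))
b = rng.normal(size=4 * dim)
step = 0.5 / np.linalg.eigvalsh(A.T @ A).max()   # conservative step size

x = np.zeros(dim)             # shared model, one block per node
lock = threading.Lock()       # guards reads/writes of the shared model

def worker(i):
    sl = slice(i * block, (i + 1) * block)
    for _ in range(n_updates):
        with lock:
            snapshot = x.copy()              # read a possibly stale snapshot
        g = A.T @ (A @ snapshot - b)         # gradient evaluated at the snapshot
        with lock:
            x[sl] -= step * g[sl]            # update only this node's block

threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_nodes)]
for t in threads:
    t.start()
for t in threads:
    t.join()

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print("distance to least-squares optimum:", np.linalg.norm(x - x_star))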
The third part of the thesis is devoted to jointly optimizing communication efficiency and wireless resources for FL over wireless networks. An overall optimization problem is formulated and, for tractability, divided into two sub-problems: client scheduling and resource allocation. To reduce communication costs, a communication-efficient client scheduling policy is proposed that limits communication exchanges and reuses stale local models. To optimize resource allocation at each communication round of FL training, an optimal solution based on a linear search method is derived. The proposed communication-efficient FL (CEFL) algorithm is evaluated both analytically and by simulation.
The final part of the thesis is a case study of implementing FL in low Earth orbit (LEO) satellite communication networks. We investigate four possible architectures for combining ML with LEO-based computing networks. The learning performance of the proposed strategies is evaluated by simulation, and the results validate that FL-based computing networks can significantly reduce communication overhead and latency.
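As a toy illustration of the client-scheduling idea in the third part, the sketch below lets a client upload a fresh local model only when it has drifted sufficiently from the (stale) copy the server already holds; otherwise the server reuses the stale copy in aggregation, saving uplink communication. The threshold rule, local losses and parameters are illustrative assumptions, not the exact CEFL policy, and the wireless resource-allocation step is omitted.

# Toy communication-saving client scheduling for FL (NOT the exact CEFL policy).
import numpy as np

rng = np.random.default_rng(2)
n_clients, dim, n_rounds, threshold, lr = 10, 5, 20, 0.05, 0.1

# Toy local losses  f_i(w) = 0.5 * ||w - c_i||^2, one center per client.
centers = rng.normal(size=(n_clients, dim))
global_w = np.zeros(dim)
server_copies = np.zeros((n_clients, dim))   # stale local models held by the server

for rnd in range(n_rounds):
    uploads = 0
    for i in range(n_clients):
        # Local training: a few gradient steps starting from the global model.
        w = global_w.copy()
        for _ in range(5):
            w -= lr * (w - centers[i])
        # Upload only if the fresh model differs enough from the stale copy.
        if np.linalg.norm(w - server_copies[i]) > threshold:
            server_copies[i] = w
            uploads += 1
    # The server aggregates whatever copies it holds, fresh or stale.
    global_w = server_copies.mean(axis=0)
    print(f"round {rnd:2d}: uploads = {uploads}/{n_clients}")

As the global model stabilizes, fewer clients exceed the drift threshold, so the number of uploads per round drops, which is the kind of communication saving the scheduling policy aims for.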