Skip to main content
To KTH's start page To KTH's start page

Bandit Methods for Network Optimization

Safety, Exploration, and Coordination

Time: Mon 2023-10-30 09.00

Location: Kollegiesalen, Brinellvägen 8, Stockholm

Language: English

Subject area: Electrical Engineering

Doctoral student: Filippo Vannella , Reglerteknik

Opponent: Associate Professor Vincent Tan, Department of Mechanical Engineering, National University of Singapore, Singapore

Supervisor: Alexandre Proutiere, Reglerteknik

Export to calendar

QC 20231009


The increasing complexity of modern mobile networks poses unprecedented challenges to their optimization. Mobile Network Operators (MNOs) need to control a large number of network parameters to satisfy the users’ demands. The antenna tilt is an important example of controllable network parameter that has significant effects on the network coverage and capacity. Recently, sequential learning algorithms have emerged as promising tools for controlling network parameters efficiently and reducing operational expenses. Bandits are a class of sequential learning methods in which an agent interacts with an environment to learn an optimal policy by selecting actions and receiving rewards from an environment. However, these methods come with their own challenges with respect to safety, exploration, and coordination. In this thesis, we investigate the antenna tilt optimization problem using bandit methods with a focus on these challenges.

We investigate the safety aspect using offline learning methods in Contextual Bandits (CBs), where the goal is to learn an improved target policy, solely from offline data collected by a logging policy. Based on this data, a target policy is derived by minimizing its counterfactual estimated risk. We learn and evaluate several antenna tilt policies using real-world network data, showing that offline learning approaches can safely learn improved tilt control policies.

We then focus on the exploration aspect by investigating the Best Policy Identification (BPI) problem in Contextual Linear Bandit (CLB), in which the reward has a convenient linear structure. The goal is to identify an optimal policy using the least possible amount of data. We derive lower bounds on the number of samples required by any algorithm returning an approximately optimal policy. We devise algorithms learning optimal tilt control policies from existing data (passive learning) or from data actively generated by the algorithm (active learning). We demonstrate experimentally that our algorithms produce optimal tilt control policies using fewer samples than rule-based algorithms.

We explore the coordination aspect in a multi-agent bandit setting where the reward of each agent depends on the actions of other agents. We investigate both Best Arm Identification (BAI) and Regret Minimization (RM): in BAI, the objective is to find an optimal global action with minimal sample complexity, while in RM the objective is to maximize the cumulative reward over rounds. We derive lower bounds on the sample complexity and the regret, which characterize the corresponding optimal sampling strategy. Unfortunately, these bounds are obtained by solving combinatorial optimization problems that are hard to solve and cannot be leveraged in the design of efficient algorithms. Inspired by Mean Field (MF) methods, we devise a family of approximations to these problems. By leveraging these approximations, we devise algorithms for BAI and RM trading off the achievable statistical and computational complexity. We apply these algorithms to solve the antenna tilt optimization problem showcasing the scalability and statistical efficiency of the proposed methods.