Fundamental Limits in Stochastic Bandits
Time: Fri 2024-10-11 10.00
Location: D37, Lindstedtsvägen 5, Stockholm
Video link: https://kth-se.zoom.us/j/61569027581
Language: English
Doctoral student: Po-An Wang, Reglerteknik
Opponent: Alexandra Carpentier, Institut für Mathematik, Universität Potsdam
Supervisor: Alexandre Proutiere, Reglerteknik
Abstract
This thesis contributes to the field of stochastic bandits by exploring the fundamental limits (information-theoretic lower bounds) of three prevalent objectives arising in various reinforcement learning applications, through a collection of five papers. Each paper presents a novel perspective under a specific structure and scenario. The first paper investigates regret minimization in decentralized multi-agent multi-armed bandits. The second and third papers delve into pure exploration with fixed confidence in a broad spectrum of structured bandits. The last two papers offer new insights into best arm identification with a fixed budget.
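To fix ideas, an instance-specific fundamental limit typically takes the following form; the bound below is the classical regret lower bound of Lai and Robbins (1985), quoted here as background rather than as a result of the thesis. Any uniformly good algorithm on a K-armed bandit with reward distributions \nu_1, \dots, \nu_K and means \mu_1, \dots, \mu_K satisfies

\[
\liminf_{T\to\infty} \frac{R(T)}{\log T} \;\ge\; \sum_{a:\,\mu_a < \mu^\star} \frac{\mu^\star - \mu_a}{\mathrm{KL}(\nu_a, \nu^\star)},
\]

where \mu^\star and \nu^\star denote the mean and distribution of the best arm and KL is the Kullback-Leibler divergence. The lower bounds studied in the thesis are of this instance-dependent type, adapted to the structures and objectives at hand.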
In the first paper, two popular scenarios in the decentralized multi-agent setting are addressed, one involving collisions and the other not. For each of them, we propose an algorithm that achieves instance-specific optimality. Interestingly, our results show that the fundamental limits match those of the centralized analogue.

The second paper introduces a simple but versatile algorithm, Frank-Wolfe Sampling, which achieves instance-specific optimality across a wide collection of pure exploration problems in structured bandits. Numerical results, both ours and those reported in recent studies, demonstrate the strong empirical performance of the algorithm on various pure exploration problems. However, Frank-Wolfe Sampling is not computationally efficient when the number of arms is extremely large. To address this issue, the third paper introduces Perturbed Frank-Wolfe Sampling, which can be implemented in polynomial time while retaining the instance-specific minimal sample complexity in combinatorial semi-bandits.
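For orientation, in fixed-confidence pure exploration the instance-specific limit is commonly expressed through a characteristic time; the formulation below is the standard one of Garivier and Kaufmann (2016) for best arm identification, quoted as background. Any \delta-correct algorithm satisfies

\[
\mathbb{E}_{\mu}[\tau_\delta] \;\ge\; T^\star(\mu)\,\log\frac{1}{2.4\,\delta},
\qquad
T^\star(\mu)^{-1} \;=\; \sup_{w \in \Sigma_K}\, \inf_{\lambda \in \mathrm{Alt}(\mu)}\, \sum_{a=1}^{K} w_a\,\mathrm{KL}(\mu_a, \lambda_a),
\]

where \Sigma_K is the simplex of sampling proportions and \mathrm{Alt}(\mu) is the set of instances whose correct answer differs from that of \mu. The outer maximization over the simplex is a concave program, which is why a projection-free Frank-Wolfe scheme is a natural device for tracking the optimal allocation.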
Unlike the sample complexity and regret objectives discussed above, characterizing the fundamental limit on the error probability in best arm identification with a fixed budget remains a challenge. The fourth paper addresses this challenge in two-armed bandits by introducing a new class of algorithms, stable algorithms, which encompasses a broad range of reasonable algorithms. We demonstrate that no consistent and stable algorithm surpasses the algorithm that samples both arms evenly, answering open problems formulated in prior work. For general multi-armed bandits, the final paper of the thesis presents, to our knowledge, the first large deviation theorem that applies to generic adaptive algorithms. Building on this result, we provide an exact analysis of the celebrated Successive Rejects algorithm. Furthermore, the new large deviation technique allows us to devise and analyze a new adaptive algorithm which, to the best of our knowledge, is the current state of the art. Overall, this thesis provides new insights into deriving fundamental limits in various online stochastic learning problems, an understanding that guides the development of more efficient algorithms and systems.
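For reference, Successive Rejects (Audibert, Bubeck and Munos, 2010) is a simple fixed-budget elimination scheme; the sketch below is a minimal Python rendering of that published algorithm, not of the thesis's analysis. The reward callback pull, the arm count K, and the budget are illustrative parameters, and the budget is assumed large enough that every phase pulls each surviving arm at least once.

import math
import random

def successive_rejects(pull, K, budget):
    """Fixed-budget best-arm identification via Successive Rejects
    (Audibert, Bubeck & Munos, 2010).

    pull(a) -- returns one stochastic reward for arm a (illustrative callback)
    K       -- number of arms
    budget  -- total number of pulls allowed
    """
    # log-bar(K) = 1/2 + sum_{i=2}^{K} 1/i, the normalizer of the phase lengths
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))

    def n(k):
        # cumulative number of pulls per surviving arm after phase k
        return math.ceil((budget - K) / (log_bar * (K + 1 - k)))

    active = list(range(K))
    sums = [0.0] * K
    counts = [0] * K
    n_prev = 0
    for k in range(1, K):                      # K - 1 elimination phases
        for a in active:
            for _ in range(n(k) - n_prev):
                sums[a] += pull(a)
                counts[a] += 1
        # reject the arm with the lowest empirical mean
        worst = min(active, key=lambda a: sums[a] / counts[a])
        active.remove(worst)
        n_prev = n(k)
    return active[0]                           # recommended best arm

# Illustrative usage on Gaussian arms (the means are made up for the example):
if __name__ == "__main__":
    means = [0.1, 0.3, 0.5, 0.7]
    best = successive_rejects(lambda a: random.gauss(means[a], 1.0),
                              K=4, budget=4000)
    print("recommended arm:", best)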