
Statistical Learning in Linearly Structured Systems: Identification, Control, and Reinforcement Learning

Time: Wednesday 2023-06-14 at 16.00

Location: D2, Lindstedtsvägen 5, Stockholm

Language: English

Subject area: Electrical Engineering

Respondent: Yassir Jedra, Automatic Control

Opponent: Professor Alexander Rakhlin, Massachusetts Institute of Technology

Supervisor: Professor Alexandre Proutiere, Automatic Control



Abstract

In this thesis, we investigate the design and statistical efficiency of learning algorithms in systems with a linear structure. The study spans three main domains, namely identification, control, and reinforcement learning, and is presented as a collection of five papers.

In the first paper, we consider the problem of best-arm identification in linear bandits. We devise a simple learning algorithm based on the track-and-stop design whose sample complexity asymptotically matches known instance-optimal lower bounds. These instance-optimal lower bounds are obtained by solving an optimization problem, which in turn inspires the design of our algorithm. We also show that our algorithm is computationally competitive and performs remarkably well experimentally.
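
To make the track-and-stop template concrete, the following is a minimal, self-contained sketch of such a loop on a small linear bandit. The arm set, noise level, uniform tracking target, and stopping threshold are all illustrative placeholders; the actual algorithm tracks the instance-optimal allocation obtained from the lower-bound optimization problem and uses a calibrated stopping rule.

```python
import numpy as np

# Simplified track-and-stop loop for best-arm identification in a linear
# bandit with rewards a^T theta + noise.  The tracking target and stopping
# threshold below are heuristic placeholders, not the instance-optimal
# quantities derived in the paper.

rng = np.random.default_rng(0)
arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.5]])   # arm feature vectors
theta = np.array([1.0, 0.6])                             # unknown parameter
sigma, delta = 1.0, 0.05                                  # noise level, risk

d = arms.shape[1]
V = 1e-6 * np.eye(d)          # regularized design matrix
b = np.zeros(d)
counts = np.zeros(len(arms))

for t in range(1, 100_000):
    # Tracking rule (placeholder): pull the arm most under-sampled relative
    # to a uniform target; the paper tracks the optimal allocation instead.
    target = np.full(len(arms), t / len(arms))
    a_idx = int(np.argmin(counts - target))
    a = arms[a_idx]

    r = a @ theta + sigma * rng.standard_normal()
    V += np.outer(a, a)
    b += r * a
    counts[a_idx] += 1

    theta_hat = np.linalg.solve(V, b)
    means = arms @ theta_hat
    best = int(np.argmax(means))

    # Stopping rule (heuristic): stop when the empirical gap to every other
    # arm is large relative to its estimation uncertainty.
    stop = True
    for j in range(len(arms)):
        if j == best:
            continue
        diff = arms[best] - arms[j]
        var = diff @ np.linalg.solve(V, diff)
        if means[best] - means[j] < np.sqrt(2 * var * np.log(t / delta)):
            stop = False
            break
    if stop:
        print(f"stopped at t={t}, recommended arm {best}")
        break
```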

In the second paper, we investigate the linear system identification problem at the finite-sample level. More precisely, we study the problem in the so-called Probably Approximately Correct (PAC) framework. We establish tight instance-specific sample complexity lower bounds via change-of-measure arguments for generic systems. In the case of stable linear dynamical systems, we derive the first non-asymptotic instance-dependent lower bounds and prove that the Least-Squares Estimator (LSE) achieves these limits in the high-accuracy regime. Our analysis of the LSE is sharper, simpler, and easier to interpret than existing analyses, and relies on novel concentration results for covariate matrices with dependent data.
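
As an illustration of the estimator analyzed here, the following is a minimal sketch of the LSE computed from a single trajectory of a stable linear system. The dynamics matrix, noise distribution, and trajectory length are arbitrary choices for demonstration.

```python
import numpy as np

# Minimal sketch of the least-squares estimator (LSE) for a stable linear
# system x_{t+1} = A x_t + w_t, estimated from a single trajectory.

rng = np.random.default_rng(1)
A = np.array([[0.9, 0.1], [0.0, 0.8]])    # illustrative ground-truth dynamics
T, d = 2000, A.shape[0]

X = np.zeros((T + 1, d))
for t in range(T):
    X[t + 1] = A @ X[t] + rng.standard_normal(d)   # unit-variance process noise

# LSE: A_hat minimizes sum_t ||x_{t+1} - A x_t||^2 over A.
Xt, Xnext = X[:-1], X[1:]
A_hat = np.linalg.solve(Xt.T @ Xt, Xt.T @ Xnext).T

print("spectral-norm estimation error:", np.linalg.norm(A_hat - A, 2))
```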

In the third paper, we consider the problem of learning linear quadratic regulators. We devise online learning algorithms and provide guarantees on their expected regret. We consider two distinct setups, namely when the dynamics of the system are partially known or not known at all, and achieve minimal regret scalings in both. The algorithm we propose is a simple variant of certainty-equivalence control, in which the estimates of the dynamics are continuously updated, and relies on a hysteresis switching mechanism. Empirical results suggest that this switching mechanism leads to a short burn-in time without harming the theoretical guarantees.
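
The sketch below illustrates the general certainty-equivalence-with-switching idea on a toy system: the dynamics estimate is refreshed at every step, while the control gain is only re-synthesized once the estimate has moved past a threshold. The system matrices, thresholds, warm-up length, and exploration noise are illustrative assumptions, not the quantities prescribed by the paper.

```python
import numpy as np

# Toy certainty-equivalence LQR with hysteresis switching: update the
# estimate of (A, B) at every step, but recompute the gain only when the
# estimate has drifted enough since the last synthesis.

def lqr_gain(A, B, Q, R, iters=500):
    """Gain K for u = -K x via fixed-point iteration of the Riccati equation."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

rng = np.random.default_rng(2)
A = np.array([[0.95, 0.2], [0.0, 0.9]])   # illustrative true dynamics
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
d, m = 2, 1

Z, Y = 1e-3 * np.eye(d + m), np.zeros((d + m, d))   # regression statistics
K = np.zeros((m, d))
last_synth = np.zeros((d, d + m))                    # estimate behind current K

x = np.zeros(d)
for t in range(2000):
    u = -K @ x + 0.1 * rng.standard_normal(m)        # small exploration noise
    x_next = A @ x + B @ u + 0.1 * rng.standard_normal(d)

    z = np.concatenate([x, u])
    Z += np.outer(z, z)
    Y += np.outer(z, x_next)
    theta = np.linalg.solve(Z, Y).T                  # concatenated [A_hat, B_hat]
    A_hat, B_hat = theta[:, :d], theta[:, d:]

    # Hysteresis switching (illustrative warm-up and threshold): re-synthesize
    # the gain only when the estimate has moved since the last synthesis.
    if t >= 50 and np.linalg.norm(theta - last_synth) > 0.05:
        K = lqr_gain(A_hat, B_hat, Q, R)
        last_synth = theta.copy()
    x = x_next

print("final gain:", K)
```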

In the fourth paper, we consider the problem of best policy identification in discounted linear Markov Decision Processes (MDPs), under both the generative and the forward models. We derive instance-specific sample complexity lower bounds for identifying a near-optimal policy, and devise algorithms matching these limits. Under the generative model, our algorithm exhibits a sample complexity guarantee matching existing minimax and gap-dependent lower bounds. Under the forward model, we identify sufficient conditions, weaker than ergodicity or communication, under which learnability is achievable. When these conditions are met, our algorithm asymptotically matches an instance-specific complexity measure corresponding to the value of an optimal experiment-design problem.
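
The following toy sketch illustrates best-policy identification under a generative model in the simplest, tabular special case: sample every state-action pair uniformly, build the empirical model, and return the greedy policy of the estimated MDP. The MDP, the uniform sampling allocation, and the sample budget are placeholder assumptions; the paper's algorithm instead allocates samples via an optimal experiment design and addresses linear MDPs.

```python
import numpy as np

# Toy best-policy identification with a generative model on a small tabular
# MDP: estimate the transition kernel from i.i.d. samples of each (s, a),
# then act greedily with respect to the empirical model.

rng = np.random.default_rng(3)
S, A, gamma, n_samples = 4, 2, 0.9, 500

# Random ground-truth MDP (transition kernel P and rewards R), for illustration.
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a] is a distribution over S
R = rng.uniform(size=(S, A))

# Generative model: draw n_samples next states per (s, a), uniformly allocated.
P_hat = np.zeros_like(P)
for s in range(S):
    for a in range(A):
        next_states = rng.choice(S, size=n_samples, p=P[s, a])
        P_hat[s, a] = np.bincount(next_states, minlength=S) / n_samples

# Value iteration on the empirical model, then return the greedy policy.
V = np.zeros(S)
for _ in range(1000):
    Q = R + gamma * P_hat @ V
    V = Q.max(axis=1)
policy = Q.argmax(axis=1)
print("estimated near-optimal policy:", policy)
```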

In the fifth paper, we study model estimation in episodic block MDPs. In these MDPs, the decision maker observes rich contexts generated from a small number of latent states. We are interested in recovering the latent-state decoding function (the mapping from observations to latent states) from data generated under a fixed behavior policy. We derive an information-theoretic lower bound on the error rate for estimating this function and propose an algorithm approaching this limit. We apply our results to the problem of learning near-optimal policies in the reward-free setting. Based on our efficient model estimation algorithm, we show that we can learn a policy converging (as the number of collected samples grows large) to the optimal policy at the best possible asymptotic minimax rate. Our analysis provides necessary and sufficient conditions under which exploiting the block structure yields improved sample complexity guarantees for identifying near-optimal policies.
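
The sketch below conveys the flavor of latent-state decoding in a block MDP: contexts emitted from the same latent state have (near-)identical next-context statistics under the behavior policy, so grouping contexts with close empirical statistics recovers the latent labels. The block MDP instance, the clustering rule, and the threshold are illustrative assumptions and not the estimator proposed in the paper.

```python
import numpy as np

# Toy latent-state decoding in a block MDP: estimate each context's empirical
# next-context distribution from a trajectory collected under a fixed behavior
# policy, then group contexts whose distributions are close.

rng = np.random.default_rng(4)
n_latent, n_contexts, T = 2, 6, 20_000

# Illustrative ground truth: context-to-latent-state labels, latent dynamics
# (under the behavior policy), and uniform emissions within each latent state.
true_label = np.array([0, 0, 0, 1, 1, 1])
P_latent = np.array([[0.8, 0.2], [0.3, 0.7]])

# Generate a trajectory of observed contexts.
obs = np.zeros(T, dtype=int)
z = 0
for t in range(T):
    members = np.flatnonzero(true_label == z)
    obs[t] = rng.choice(members)                  # emit a context of state z
    z = rng.choice(n_latent, p=P_latent[z])       # latent transition

# Empirical next-context distribution for each observed context.
counts = np.zeros((n_contexts, n_contexts))
for t in range(T - 1):
    counts[obs[t], obs[t + 1]] += 1
freq = counts / counts.sum(axis=1, keepdims=True)

# Greedy clustering (illustrative threshold): contexts with close empirical
# rows receive the same label.
labels = -np.ones(n_contexts, dtype=int)
next_label = 0
for i in range(n_contexts):
    if labels[i] >= 0:
        continue
    labels[i] = next_label
    for j in range(i + 1, n_contexts):
        if labels[j] < 0 and np.linalg.norm(freq[i] - freq[j], 1) < 0.3:
            labels[j] = next_label
    next_label += 1

print("recovered labels:", labels)   # should match true_label up to renaming
```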

urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-327357