In decision-making problems with continuous state and action spaces, linear dynamical models are widely employed. In particular, policies for stochastic linear systems subject to quadratic cost functions capture a large number of applications in reinforcement learning. Randomized policies that address the trade-off between identification and control have recently been studied in the literature. However, little is known about policies based on bootstrapping observed states and actions. In this work, we show that bootstrap-based policies achieve regret that scales as the square root of time. We also obtain results on the accuracy with which the model's dynamics are learned. Corroborative numerical analysis illustrating the technical results is also provided.
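
The abstract above does not specify the algorithm, so the snippet below is only a plausible illustrative reading of a bootstrap-based adaptive LQR policy: resample observed transitions with replacement, refit the dynamics by least squares, and apply the certainty-equivalent gain computed from the resampled estimate. All function names, conventions, and the cost/noise choices are assumptions made for the example.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def bootstrap_gain(xs, us, Q, R, rng):
    """One bootstrap round: resample observed transitions with replacement,
    refit the dynamics by least squares, and return the certainty-equivalent
    LQR gain computed from the resampled estimate."""
    T = len(us)                                      # xs holds T+1 states, us holds T inputs
    idx = rng.integers(0, T, size=T)                 # bootstrap indices of transitions
    Z = np.hstack([xs[idx], us[idx]])                # regressors [x_t, u_t]
    Y = xs[idx + 1]                                  # targets x_{t+1}
    theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)    # least-squares fit of [A B]^T
    n = xs.shape[1]
    A_hat, B_hat = theta[:n].T, theta[n:].T
    P = solve_discrete_are(A_hat, B_hat, Q, R)       # Riccati solution for the estimate
    return np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)  # u_t = -K x_t
```

In an adaptive loop, the gain would be recomputed from a fresh resample at the start of each epoch and applied as u_t = -K x_t; under this reading, the randomness induced by resampling supplies the exploration that the identification/control trade-off requires.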

We consider adaptive control of the Linear Quadratic Regulator (LQR), where an unknown linear system is controlled subject to quadratic costs. Leveraging recent developments in the estimation of linear systems and in robust controller synthesis, we present the first provably polynomial time algorithm that provides high probability guarantees of sub-linear regret on this problem. We further study the interplay between regret minimization and parameter estimation by proving a lower bound on the expected regret in terms of the exploration schedule used by any algorithm. Finally, we conduct a numerical study comparing our robust adaptive algorithm to other methods from the adaptive LQR literature, and demonstrate the flexibility of our proposed method by extending it to a demand forecasting problem subject to state constraints.
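
For reference, the regret notion invoked in this and the surrounding abstracts measures the excess cost accumulated by the learning algorithm relative to the optimal controller that knows the true system. The display below is one standard formulation; the specific noise model and quadratic stage cost are illustrative assumptions, not taken verbatim from any of the papers.

```latex
% Illustrative adaptive-LQR setting and regret definition (assumed conventions)
\[
  x_{t+1} = A_\ast x_t + B_\ast u_t + w_t, \qquad
  c_t = x_t^{\top} Q \, x_t + u_t^{\top} R \, u_t, \qquad
  \mathrm{Regret}(T) = \sum_{t=0}^{T-1} \bigl( c_t - J_\ast \bigr),
\]
```

where J_* is the optimal infinite-horizon average cost attainable with knowledge of the true (A_*, B_*). Sub-linear regret means Regret(T)/T tends to zero, and the square-root scaling claimed in the first abstract corresponds to Regret(T) growing as the square root of T up to logarithmic factors.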

One of the most straightforward methods for controlling a dynamical system with unknown transitions is based on the certainty equivalence principle: a model of the system is fit by observing its time evolution, and a control policy is then designed by treating the fitted model as the truth [8]. Despite the simplicity of this method, it is challenging to guarantee its efficiency because small modeling errors may propagate into large, undesirable behaviors over long time horizons. As a result, most work on controlling systems with unknown dynamics has explicitly incorporated robustness against model uncertainty [11, 12, 20, 25, 35, 36]. In this work, we show that for the standard baseline of controlling an unknown linear dynamical system with a quadratic objective function, known as the Linear Quadratic Regulator (LQR), certainty equivalent control synthesis achieves better cost than prior methods that account for model uncertainty. In the case of offline control, where one collects some data and then designs a fixed control policy to be run on an infinite time horizon, we show that the gap between the performance of the certainty equivalent controller and the optimal control policy scales quadratically with the error in the model parameters.
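
As a concrete sketch of certainty equivalent synthesis in the offline setting described above (illustrative code, not taken from the paper; names and conventions are assumptions): fit (A, B) by ordinary least squares from a recorded trajectory, then design the controller by solving the discrete Riccati equation as if the estimate were exact.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def fit_dynamics(xs, us):
    """Ordinary least-squares fit of x_{t+1} ~ A x_t + B u_t from one trajectory."""
    Z = np.hstack([xs[:-1], us])                     # regressors [x_t, u_t]
    theta, *_ = np.linalg.lstsq(Z, xs[1:], rcond=None)
    n = xs.shape[1]
    return theta[:n].T, theta[n:].T                  # (A_hat, B_hat)

def certainty_equivalent_gain(A_hat, B_hat, Q, R):
    """Treat the fitted model as the truth: solve the Riccati equation for it
    and return the corresponding LQR gain (control law u_t = -K x_t)."""
    P = solve_discrete_are(A_hat, B_hat, Q, R)
    return np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
```

The quadratic gap referred to above means that if the fitted parameters are within distance epsilon of the true ones, the excess infinite-horizon cost of this controller over the optimal one is of order epsilon squared, for sufficiently small epsilon.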

The effectiveness of model-based versus model-free methods is a long-standing question in reinforcement learning (RL). Motivated by the recent empirical success of RL on continuous control tasks, we study the sample complexity of popular model-based and model-free algorithms on the Linear Quadratic Regulator (LQR). We show that for policy evaluation, a simple model-based plug-in method requires asymptotically fewer samples than the classical least-squares temporal difference (LSTD) estimator to reach the same quality of solution; the sample-complexity gap between the two methods can be at least a factor of the state dimension. For policy optimization, we study a simple family of problem instances and show that nominal (certainty equivalent) control also requires a factor of the state dimension fewer samples than the policy gradient method to reach the same level of control performance on these instances. Furthermore, the gap persists even when employing baselines commonly used in practice. To the best of our knowledge, this is the first theoretical result that demonstrates a separation in sample complexity between model-based and model-free methods on a continuous control task.
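
The model-based plug-in estimator contrasted with LSTD above can be sketched as follows (an illustrative reading under assumed conventions, with the policy fixed as u_t = -K x_t): estimate the dynamics by least squares, form the fitted closed loop, and solve the resulting Lyapunov (Bellman) equation for the quadratic value function. LSTD would instead regress the value directly on quadratic features of the observed states.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def plugin_policy_evaluation(xs, us, K, Q, R):
    """Model-based plug-in evaluation of the fixed policy u_t = -K x_t:
    fit (A, B) by least squares, then solve the Lyapunov (Bellman) equation
    for the value-function matrix P, where V(x) = x^T P x."""
    Z = np.hstack([xs[:-1], us])                     # regressors [x_t, u_t]
    theta, *_ = np.linalg.lstsq(Z, xs[1:], rcond=None)
    n = xs.shape[1]
    A_hat, B_hat = theta[:n].T, theta[n:].T
    A_cl = A_hat - B_hat @ K                         # fitted closed-loop dynamics
    Q_cl = Q + K.T @ R @ K                           # per-step cost under the policy
    # Solve A_cl^T P A_cl - P + Q_cl = 0, the Bellman equation for this policy.
    return solve_discrete_lyapunov(A_cl.T, Q_cl)
```

The sample-complexity gap described in the abstract concerns how many observed transitions each method needs before its estimate of P reaches a given accuracy.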