Reinforcement Learning and Artificial Intelligence (RLAI)
M. G. Lagoudakis and R. Parr. "Model-Free Least-Squares Policy Iteration", Journal of Machine Learning Research 4 (2003) 1107-1149

Author: Anna (October 2004)


Abstract:

       
We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems.
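To make the combination of least-squares evaluation and policy iteration concrete, here is a minimal sketch on a toy problem. It is an illustration under stated assumptions, not the authors' implementation: the four-state chain, the `step` function, the tabular `phi` features, and all constants are hypothetical and chosen only so the example is small and checkable. The evaluation step uses a fixed-point least-squares solve in the style of LSTD, applied to state-action features, and the samples are collected once from uniformly random states and actions, which is the off-policy setting the abstract describes.

```python
import numpy as np

N_STATES, ACTIONS, GAMMA = 4, (0, 1), 0.9   # toy chain; 0 = left, 1 = right
K = N_STATES * len(ACTIONS)                 # one indicator feature per (s, a)

def step(s, a):
    """Hypothetical deterministic chain: reward 1 for landing on the last state."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    return (1.0 if s_next == N_STATES - 1 else 0.0), s_next

def phi(s, a):
    """Tabular (one-hot) features; any linear basis could be used instead."""
    f = np.zeros(K)
    f[s * len(ACTIONS) + a] = 1.0
    return f

def greedy_policy(w):
    """Policy that is greedy with respect to Q(s, a) = phi(s, a) . w."""
    return lambda s: max(ACTIONS, key=lambda a: phi(s, a) @ w)

def lstdq(samples, policy):
    """Fixed-point least-squares estimate of the weights of Q^policy."""
    A, b = np.zeros((K, K)), np.zeros(K)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - GAMMA * f_next)
        b += r * f
    # lstsq rather than solve, so a rank-deficient A does not raise
    return np.linalg.lstsq(A, b, rcond=None)[0]

def lspi(samples, tol=1e-8, max_iter=50):
    """Alternate policy evaluation (lstdq) and greedy improvement."""
    w = np.zeros(K)
    for _ in range(max_iter):
        w_new = lstdq(samples, greedy_policy(w))
        if np.linalg.norm(w_new - w) < tol:
            break
        w = w_new
    return w_new

# Off-policy data: uniformly random states and actions, collected once.
rng = np.random.default_rng(0)
samples = []
for _ in range(200):
    s, a = int(rng.integers(N_STATES)), int(rng.integers(len(ACTIONS)))
    r, s_next = step(s, a)
    samples.append((s, a, r, s_next))

w = lspi(samples)
learned = [greedy_policy(w)(s) for s in range(N_STATES)]
print(learned)
```

The same batch of samples is reused at every iteration; only the action chosen at `s_next` changes with the current policy. On this toy chain the learned policy should move right in every state, but that depends on the samples covering every state-action pair, which a small or poorly distributed batch would not guarantee.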

Keywords:
Reinforcement Learning, Markov Decision Processes, Approximate Policy Iteration, Value-Function Approximation, Least-Squares Methods.
 

Bibtex:

@article{ LSPI03,
author = "Michail G. Lagoudakis and Ronald Parr",
title = "Least-Squares Policy Iteration",
journal = "Journal of Machine Learning Research",
volume = "4",
pages = "1107--1149",
year = "2003",
note = "Submitted 8/02; Published 12/03",
url = "citeseer.ist.psu.edu/661096.html" }


Comments:

    I think this is a solid paper with a clean mathematical treatment that covers the most important issues around least-squares TD methods. My main concern is that the way the samples for LSPI are gathered can make a huge difference. As the authors note for the bicycle problem, almost all of the trajectories after the 20th time step were useless because they were generated by a random policy, and they would bias the results. So what if the results are biased in a limited part of the state representation and we are not aware of it?
   
    The discussion of the two approaches (Bellman residual minimization and the least-squares fixed-point approximation) was comprehensive.
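
    For reference, the two approaches named in the comment above can be written side by side for a linear architecture. The notation here is an assumption following common usage rather than a quotation from the paper: \Phi is the matrix of basis functions over state-action pairs, P the transition model, \Pi_\pi the matrix that selects actions according to policy \pi, R the reward vector, and \gamma the discount factor.

```latex
% Bellman residual minimization: least-squares solution of
% (\Phi - \gamma P \Pi_\pi \Phi) w \approx R
w^{\pi}_{\mathrm{BR}} =
  \Bigl( (\Phi - \gamma P \Pi_\pi \Phi)^{\top} (\Phi - \gamma P \Pi_\pi \Phi) \Bigr)^{-1}
  (\Phi - \gamma P \Pi_\pi \Phi)^{\top} R

% Least-squares fixed-point approximation: fixed point of the Bellman
% operator followed by orthogonal projection onto the span of \Phi
w^{\pi}_{\mathrm{FP}} =
  \Bigl( \Phi^{\top} (\Phi - \gamma P \Pi_\pi \Phi) \Bigr)^{-1} \Phi^{\top} R
```

    The two solutions differ only in the matrix that multiplies from the left: (\Phi - \gamma P \Pi_\pi \Phi)^{\top} for the residual form and \Phi^{\top} for the fixed-point form.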

    I liked the idea of scaling the feature vectors to make gradient methods more stable. I am also interested to see how easily this idea can be transferred to an on-line, on-policy form. Comparing the results with the LSTD method would also help reveal the less well-understood areas. (Alborz)