Abstract:
TD(λ) is a popular family of algorithms for approximate
policy evaluation in large MDPs. TD(λ) works by incrementally updating
the value function after each observed transition. It has two major
drawbacks: it may make inefficient use of data, and it requires the
user to manually tune a stepsize schedule for good performance. For the
case of linear value function approximations and λ=0, the Least-Squares
TD (LSTD) algorithm of Bradtke and Barto (1996, Machine Learning,
22(1–3):33–57) eliminates all stepsize parameters and improves data
efficiency. This paper updates Bradtke and Barto’s work in three
significant ways. First, it presents a simpler derivation of the LSTD
algorithm. Second, it generalizes from λ=0 to arbitrary values of λ; at
the extreme of λ=1, the resulting new algorithm is shown to be a
practical, incremental formulation of supervised linear regression.
Third, it presents a novel and intuitive interpretation of LSTD as a
model-based reinforcement learning technique.
Keywords: reinforcement learning, temporal difference learning,
value function approximation, linear least-squares methods
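
To make the idea in the abstract concrete, the following is a minimal sketch of a batch LSTD(λ) solver. It is not taken from the paper. It assumes the commonly stated form of the method: accumulate a matrix A and a vector b from observed transitions using an eligibility trace z, then solve A w = b for the linear weights w. The names lstd_lambda and solve_linear_system, the episode format, and the discount parameter gamma are hypothetical choices made for this illustration.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [A | b] so row operations update both.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("A is singular; more data or regularization needed")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def lstd_lambda(episodes, n_features, lam=0.0, gamma=1.0):
    """Sketch of batch LSTD(lambda) for a linear value function.

    episodes: list of trajectories; each trajectory is a list of
        (phi, reward, phi_next) tuples, where phi and phi_next are feature
        vectors and phi_next is all zeros on the terminal transition.
    gamma: discount factor (an assumption of this sketch; 1.0 gives the
        undiscounted episodic case).
    Returns the weight vector w such that V(x) is approximated by w . phi(x).
    """
    n = n_features
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for episode in episodes:
        z = None  # eligibility trace, reset at the start of each trajectory
        for phi, reward, phi_next in episode:
            if z is None:
                z = list(phi)
            # Feature difference phi(x) - gamma * phi(x').
            d = [p - gamma * q for p, q in zip(phi, phi_next)]
            for i in range(n):
                for j in range(n):
                    A[i][j] += z[i] * d[j]
                b[i] += z[i] * reward
            # Decay the trace and add the features of the next state.
            z = [lam * gamma * zi + q for zi, q in zip(z, phi_next)]
    return solve_linear_system(A, b)


# Two-state chain with one-hot features: state 0 -> state 1 -> terminal,
# with rewards 1 and 2. The true values are V(0) = 3 and V(1) = 2.
episode = [([1.0, 0.0], 1.0, [0.0, 1.0]),
           ([0.0, 1.0], 2.0, [0.0, 0.0])]
print(lstd_lambda([episode], n_features=2, lam=0.0))
print(lstd_lambda([episode], n_features=2, lam=1.0))
```

Note that no stepsize appears anywhere in the sketch: the weights come from a single linear solve. With lam=1.0 the accumulated system coincides with the normal equations of linear regression on the observed returns, which is the connection to supervised regression that the abstract mentions. On this tabular example both settings recover the same values.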