 |
Reinforcement Learning and
Artificial
Intelligence (RLAI)
|
R. Schoknecht and A. Merke. TD(0) converges
provably faster than the residual gradient algorithm.
ICML, pp 680–687, 2003. |
Download
Here
Second Author HomePage
Abstract:
In Reinforcement Learning (RL) there has been
some experimental evidence that the residual gradient algorithm
converges slower than the TD(0) algorithm. In this paper, we use the
concept of asymptotic convergence rate to prove that under certain
conditions the synchronous off-policy TD(0) algorithm converges faster
than the synchronous off- policy residual gradient algorithm if the
value function is represented in tabular form. This is the 1st
theoretical result comparing the convergence behaviour of two RL
algorithms. We also show that as soon as linear function approximation
is involved no general statement concerning the superiority of one of
the algorithms can be made.
Keywords: Reinforcement Learning, TD(0), Residual
Gradient, Convergence
Bibtex:
@inproceedings{DBLP:conf/icml/SchoknechtM03,
author = {Ralf Schoknecht and
Artur Merke},
title = {TD(0) Converges Provably Faster than the Residual Gradient
Algorithm.},
booktitle = {ICML},
year = {2003},
pages = {680-687},
crossref = {DBLP:conf/icml/2003},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
Comments:
- This paper is the first work on
comparing the asymptotic convergence of TD(0) and Residual Gradient
(RG) method in synchronous off-policy form.
- Using Tabular case TD(0)
converges faster than RG.
- If linear function approximation
is used either of the methods can perform better with different set of
parameters. So TD(0) is not always superior.
- I was really confused at first
sight when looking at the A and b matrices, but then after a bit of
math I realized that they are the same as our A and b with less
readability but in shorter notation.
- The sum of mean squared error is
weighted by the frequency of visiting states.
- P,D and R in equ. (3) are
evaluated thorough maximum likelihood.
- P(i,j) = # transactions
Si->Sj / # visits of Si
- R(i) = Sum of Rewards from Si /
# visits of Si
- I still need to think more about
the fact that they defined a new criteria for alpha and then used it to
define the optimal convergence rate. What if we change the definition
of alpha, would the proof hold?
- If tabular state representation
is used the TD(0) method is minimizing the Bellman error, but if linear
function approximation is used it is not longer minimizes the Bellman
error. Instead it minimizes the ||w^n-w^*_TD||^2.
- What if the problem and the
trajectory led us to a an A matrix with imaginary eigenvalues ?
- What does it mean for Residual
Gradient matrix (A_RG) to have same eigenvalues?
- It is not clear to me how easy
it is to compute the optimum alpha values using lemma 1 in general case
when eigenvalues have nonzero imaginary parts.<>
- I like the Figure 3. I think
it
shows what many people in RL where looking for.(Alborz)
Further, the paper has a final (numerical) result that shows that RG converges faster than TD(0) for large discount factors. So what is then the significance of this result?