Reinforcement Learning and Artificial Intelligence (RLAI)
Chapter 2 Exercises



Special Instructions for exercise 2.5
For exercise 2.5 we would like Python-like pseudo code.  As pseudo code, it should express the idea of the algorithm without getting bogged down in details of implementation.  It doesn't have to be code that will run.  Python is meant to encourage you to think this way, so we encourage you to use Python conventions.  You may write array references either in Python style (A[i]) or in pseudo code style, like Ai.  Assume that you have any support functions that you may need (argmax, for example).  Finally, you don't have to write all of the code.  Please fill in what is missing from the following algorithm:

for i in [1,2,3,...,n]:
    Qi = 0
    ki = 0

Repeat Forever:
    <Insert your pseudo code here>
    <Insert your pseudo code here>
    ...
    <Insert your pseudo code here>

I am having trouble understanding ex 2.6. When the question says:
"...then the estimate Qk(a) is a weighted average of previously received rewards with a weighting different from that in (2.7)"
is it referring to the alpha values discussed in section 2.5?
The question continues: "What is the weighting on each prior reward for the general case?"
What is the general case?
Answer:  It is referring to equations 2.6 and 2.7 from the textbook.  By general case, it means that alpha may be an arbitrary sequence; it does not need to be constant.  If you look at equation 2.6, alpha has no subscript because it is constant with respect to the step k.  The idea here is for you to derive something like what was done in equation 2.7, using the general case of alpha[k] at each timestep instead of a constant alpha.  It is not important to define what alpha[k] is, just that it need not be a constant value.  Hope that helps.  If not, send me an e-mail and we can talk about it.
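
If it helps to experiment before deriving anything, here is a rough Python sketch (not part of the exercise, and not the closed form the question asks for).  It assumes the update Q <- Q + alpha[k] * (R[k] - Q) with an arbitrary step-size sequence, and recovers the weight on each reward numerically by feeding in one "unit" reward at a time.  The names run_updates and empirical_weights, and the example step sizes, are made up for illustration.

```python
def run_updates(q0, rewards, alphas):
    """Apply Q <- Q + alpha_k * (R_k - Q) once per reward; return the final Q."""
    q = q0
    for r, a in zip(rewards, alphas):
        q = q + a * (r - q)
    return q


def empirical_weights(alphas):
    """Recover the weight on Q0 and on each reward numerically.

    The final estimate is linear in (Q0, R_1, ..., R_n), so setting exactly
    one of those inputs to 1 and the rest to 0 returns that input's weight.
    """
    n = len(alphas)
    w_q0 = run_updates(1.0, [0.0] * n, alphas)
    w_rewards = []
    for i in range(n):
        impulse = [0.0] * n
        impulse[i] = 1.0
        w_rewards.append(run_updates(0.0, impulse, alphas))
    return w_q0, w_rewards


if __name__ == "__main__":
    # Hypothetical, non-constant step sizes chosen only for illustration.
    alphas = [0.5, 0.3, 0.2, 0.4]
    w_q0, w_rewards = empirical_weights(alphas)
    print("weight on Q0:", w_q0)
    print("weights on rewards:", w_rewards)
    print("sum of all weights:", w_q0 + sum(w_rewards))
```

Trying a few different alpha sequences (a constant one, or alpha[k] = 1/k) and comparing the printed weights with equation 2.7 may suggest what the general formula should look like.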

I have what may be a related question: If we are using the step-size, Qk, and Rk+1 to create the new Qk+1, should we label the associated step-size alpha[k] (to go with Qk) or alpha[k+1] (to go with reward k+1)? Either labelling option should work the same mathematically, but is one more "proper" than the other?

Thanks :) 
Answer: I'd say the alpha that creates Qk is alpha(k).  But that's just me.
-Brian

And a question about 2.5

The textbook says to use greedy action selection, but the code given sets epsilon. So do we do epsilon-greedy action selection?

And do we restart a "max-steps" run an infinite number of times, or do we just have an infinite run? 
Answers: Purely greedy.  The epsilon was a typo.  Our apologies.  An infinite run is probably easiest, so do that.
-Brian
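
For anyone unsure of the difference between the two selection rules mentioned above, here is a small Python sketch.  It is only an illustration under assumed names (greedy_action, epsilon_greedy_action, and the example estimates are all hypothetical), and it is not the pseudo code requested for exercise 2.5.  It shows that epsilon-greedy with epsilon = 0 never explores and so behaves like plain greedy selection.

```python
import random


def greedy_action(q_values):
    """Return the index of the largest estimate (first index on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


if __name__ == "__main__":
    estimates = [0.1, 0.7, 0.3]  # made-up action-value estimates
    print("greedy choice:", greedy_action(estimates))
    print("epsilon = 0 choice:", epsilon_greedy_action(estimates, 0.0))
    print("epsilon = 0.5 choice:", epsilon_greedy_action(estimates, 0.5))
```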