Reinforcement Learning and Artificial Intelligence (RLAI)
Chapter 2 Exercises
Special Instructions for exercise 2.5
For exercise 2.5 we would like Python-like pseudocode. As
pseudocode, it should express the idea of the algorithm without
getting bogged down in details of implementation. It doesn't have
to be code that will run. Python is meant to encourage you to
think this way, so we encourage you to use Python conventions.
You may write array references either in Python
style (A[i]) or in pseudocode style (Ai).
Assume that you have any support functions that you may need
(argmax, for example). Finally, you don't have to write all of the
code. Please fill in what is missing from the following algorithm:
for i in [1,2,3,...,n]
    Qi = 0
    ki = 0
Repeat Forever:
    <Insert your pseudocode here>
    <Insert your pseudocode here>
    ...
    <Insert your pseudocode here>
I am having a problem understanding ex 2.6. When the question mentions:
"...then the estimate Qk(a) is a weighted average of previously
received rewards with a weighting different from that in (2.7)"
is it referring to the alpha values as discussed in section 2.5?
Continuing with the question: "what is the weighting on each prior
reward for the general case?"
What is the general case?
Answer: It is referring
to equations 2.6 and 2.7 from the textbook. By general case, it
means that alpha may be an arbitrary sequence; it does not need to be
constant. If you look at equation 2.6, alpha has no subscript
because it is constant with respect to the step k. The idea here
is for you to derive something like what was done in equation 2.7,
using the general case of alpha_k at each timestep instead of
alpha. It is not important to define what alpha_k is,
just that it need not be a constant value. Hope that helps.
If not, send me an e-mail and we can talk about it.
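One way to sanity-check a derivation for the general case is to compare it numerically against the incremental update. The sketch below is not from the textbook or the course staff. It assumes the update Q <- Q + alpha_k * (R_k - Q), with the step size indexed by the reward it is applied to (an indexing choice), and the helper names incremental_estimate and reward_weights are hypothetical.

```python
import random

def incremental_estimate(q0, rewards, alphas):
    """Apply Q <- Q + alpha_k * (R_k - Q) once per reward, in order."""
    q = q0
    for r, a in zip(rewards, alphas):
        q += a * (r - q)
    return q

def reward_weights(alphas):
    """Return (weight on Q_0, list of weights on R_1..R_k).

    Assumed closed form being checked: R_i gets
    alpha_i * prod_{j > i} (1 - alpha_j), and Q_0 gets prod_j (1 - alpha_j).
    """
    weights = []
    tail = 1.0  # running product of (1 - alpha_j) over the later steps j > i
    for a in reversed(alphas):
        weights.append(a * tail)
        tail *= 1.0 - a
    weights.reverse()
    return tail, weights

# Arbitrary (non-constant) step sizes and rewards, made up for the check.
random.seed(0)
k = 10
alphas = [random.uniform(0.05, 0.95) for _ in range(k)]
rewards = [random.gauss(0.0, 1.0) for _ in range(k)]
q0 = 0.5

w0, w = reward_weights(alphas)
closed_form = w0 * q0 + sum(wi * ri for wi, ri in zip(w, rewards))
recursive = incremental_estimate(q0, rewards, alphas)
```

If the derivation is right, closed_form and recursive agree up to floating-point error, and w0 + sum(w) equals 1. With a constant alpha the weights reduce to alpha * (1 - alpha)**(k - i), the exponentially decaying case.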
I have what may be a related question: If we are using the step-size,
Qk, and Rk+1 to create the new Qk+1, should we label the associated
step-size alpha[k] (to go with Qk) or alpha[k+1] (to go with reward
k+1)? Either labelling option should work the same mathematically,
but is one more "proper" than the other?
Thanks :)
Answer: I'd say the alpha that
creates Qk is alpha(k). But that's just me.
-Brian
And a question about 2.5
The textbook says to use greedy action selection, and the code given
sets epsilon. So do we do epsilon greedy action selection?
And do we restart a "max-steps" run an infinite number of times, or do
we just have an infinite run?
Answers: Greedy, as in purely greedy.
The epsilon was a typo. Our apologies. An infinite run is
probably easiest, so do that.
-Brian
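Putting these answers together with the skeleton from the special instructions, one possible shape for the loop is sketched below. This is an illustration, not the official solution. It assumes sample-average estimates updated incrementally, breaks ties by lowest index, runs for a finite number of steps instead of forever so that it terminates, and uses a hypothetical pull(a) function standing in for the bandit.

```python
import random

def greedy_bandit(pull, n, steps):
    """Greedy action selection with incremental sample-average estimates.

    pull(a) is a hypothetical reward function for arm a (0-based).
    """
    Q = [0.0] * n  # value estimate for each arm
    k = [0] * n    # number of times each arm has been chosen
    for _ in range(steps):  # stands in for "Repeat Forever"
        a = max(range(n), key=lambda i: Q[i])  # argmax; ties go to lowest index
        r = pull(a)
        k[a] += 1
        Q[a] += (r - Q[a]) / k[a]  # incremental sample average
    return Q, k

# Toy bandit, made up for illustration: arm i pays a Gaussian reward with mean i.
random.seed(1)
Q, k = greedy_bandit(lambda a: random.gauss(a, 1.0), n=3, steps=100)
```

Because selection is purely greedy and the estimates start at 0, an arm whose early rewards push its estimate below the others may never be tried again. That follows from the setup described above and is not a bug in the sketch.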
"...then the estimate Qk(a) is a weighted average of previously received rewards with a weighting different from that in (2.7)"
is it referring to the alpha values as discussed in section 2.5?
continuing with the question :"what is the weighting on each prior reward for the general case?"
What is the general case?
Answer: It is referring to equation 2.6 and 2.7 from the textbook. By general case, it means that alpha may be an arbitrary sequence, it does not need to be constant. If you look at equation 2.6, alpha has no subscript because it is constant with respect to the step k. The idea here is for you to derive something like was done in equation 2.7, using the general case of alphak at each timestep instead of alpha. It is not important to define what alphak is, just that it need not be a constant value. Hope that helps. If not, send me an e-mail and we can talk about it.