Reinforcement Learning and
Artificial
Intelligence (RLAI)
Chapter
2 Programming Exercises
Programming Exercises (Bandit
Problem):
Due: Thursday,
September 30th
You will need the bandit.py source code that is later on this page
or can be downloaded here:
Tasks
Find the value of numbandits such that performance runs in about
30 seconds
Repeat the performance calculation for values of alpha in [2**0,
2**-1,..., 2**-8] (for the most reliable results, leave Qstar
unchanged as you vary alpha)
Briefly describe and explain the trend in performance as a
function of alpha
Find the best alpha in this set
Using the best alpha, repeat for epsilon in [2**-2, 2**-3,...,
2**-10]
Briefly describe and explain the trend in performance as a
function of epsilon
Why is the relative performance of the greedy method so unlike
that in the book?
Here is the Bandit Program - copy or download here
# This program implements the action-value method from Section 2.6 for
solving
# bandit problems. This method uses a fixed step-size parameter,
alpha, and
# epsilon-greedy action selection with a fixed exploration parameter,
epsilon.
from random import random #
random() returns a random real number in [0,1)
from random import randrange # randrange(n) returns a
random integer in {0,1,2,...,n}
from random import
gauss # gauss(0,1)
returns a normally distributed number
# with zero mean and unit variance
n =
10
# number of possible actions or arms
numbandits = 200 # number of
different bandits (with different true action values)
Qstar = [[gauss(0,1) for i in range(n)] # array of true action
values for each bandit
for j in range(numbandits)]
def bandit(banditnumber,action):
"""Do one play with the bandit; return reward"""
global Qstar
return Qstar[banditnumber][action] + gauss(0,1)
def playbandit(banditnum=0, numplays=1000, alpha=0.1, epsilon=0.1):
"""Play a bandit for some number of plays; return
average reward"""
global n
Q = [0.0 for i in
range(n)] # array to
hold estimated action values
totalreward = 0
for t in range(numplays):
if random() <
epsilon: #
epsilon-greedy action selection
action = randrange(n)
else:
action = argmax(Q)
reward =
bandit(banditnum,action)
totalreward = totalreward +
reward
Q[action] = Q[action] +
alpha * (reward - Q[action]) # learning update
return totalreward/numplays
def argmax(array):
"""Return the index of the maximal value in the
array/list, breaking ties randomly;
array must have at least one element"""
bestindex = 0
bestvalue = array[0]
numbests = 1
for i in range(1,len(array)):
value = array[i]
if value < bestvalue:
pass
elif value > bestvalue:
bestvalue = value
bestindex = i
numbests = 1
elif value == bestvalue:
numbests += 1
if
random() < 1.0/numbests:
bestindex = i
return bestindex
def mean(list):
"""Return arithmetic mean (average) of a list's
values"""
sum = 0
for item in list:
sum += item
return sum/float(len(list))
def performance(alpha,epsilon):
return mean([playbandit(i,1000,alpha,epsilon) for i
in range(numbandits)])
print performance(0.1,0.1)
Submission
instructions
Please e-mail a document with the answers to these questions to the
course account if possible (c499@ugrad.cs.ualberta.ca). I would like a
.txt, .rtf, or .pdf -
please no Word files or crazy linux/unix word processor files.
Alternatively, you may also
hand in the assignment hand written at the beginning of
class - but electronic format is preferred. Finally, please
include your name and student number for either type of submission.
Question: Does
"Repeat the performance calculation" in
task 2 mean find the new value for numbandits which takes 30 seconds,
or report on how fast it ran with the value for numbandits found in
part 1?
Answer: I interpret this
as.
Part 1: Find out how the code runs so that it doesn't take forever on
your machine. More bandits means more accurate results, so the
higher this number the better, but if it takes more than 30 seconds you
can scale it back.
Part 2: Calculate the performance (in terms of score) of the agent in
the bandit scenario that you found in part 1 for all sorts of various
parameters for the rest of the assignment.
-Brian
Question: For question 7 of
the Bandit assignment, it is asked that we compare perfomance of the
greeedy method with the book. However, the Bandit program uses an
epsilon-greedy action selection method. Was assignment suppose to
say epsilon-greedy instead of just greedy? Or should we configure
the program to use greedy action selection?
Answer: Specifically,
first I should mention that you are (in my option) looking specifically
at Figure 2.1 from the text when making comparisons. The
experimental setup described on page 28 is exactly what you are doing,
except you will probably be using less than 2000 bandits in order to
keep your experimental run time manageable. The graph in Figure
2.1 (top) is displaying the same sort of result that your
"Performance(alpha,epsilon)" method is calculating in the
assignment.
Note: from the errata,
the bottom line of that top graph is the greedy agent, although it is
unlabeled.
Now, although we haven't asked you to run bandit.py with a strictly
greedy (e=0) setting, you have run it with something close to
that. There is also nothing stopping you from explicitly running
it with e=0.
Some of these questions are a little open ended because the idea is not
for you to fill in the blank, but for you to run some experiments,
present your observations, and explain variances between your results
and those in the book. When you're experimental setup is the
same, and your answers are different - someone has some explaining to
do. One of the mainy joys of research.
:P
-Brian
Answer: I interpret this as.
Part 1: Find out how the code runs so that it doesn't take forever on your machine. More bandits means more accurate results, so the higher this number the better, but if it takes more than 30 seconds you can scale it back.
Part 2: Calculate the performance (in terms of score) of the agent in the bandit scenario that you found in part 1 for all sorts of various parameters for the rest of the assignment.
-Brian
Question: For question 7 of the Bandit assignment, it is asked that we compare perfomance of the greeedy method with the book. However, the Bandit program uses an epsilon-greedy action selection method. Was assignment suppose to say epsilon-greedy instead of just greedy? Or should we configure the program to use greedy action selection?
Answer: Specifically, first I should mention that you are (in my option) looking specifically at Figure 2.1 from the text when making comparisons. The experimental setup described on page 28 is exactly what you are doing, except you will probably be using less than 2000 bandits in order to keep your experimental run time manageable. The graph in Figure 2.1 (top) is displaying the same sort of result that your "Performance(alpha,epsilon)" method is calculating in the assignment.
Note: from the errata, the bottom line of that top graph is the greedy agent, although it is unlabeled.
Now, although we haven't asked you to run bandit.py with a strictly greedy (e=0) setting, you have run it with something close to that. There is also nothing stopping you from explicitly running it with e=0.
Some of these questions are a little open ended because the idea is not for you to fill in the blank, but for you to run some experiments, present your observations, and explain variances between your results and those in the book. When you're experimental setup is the same, and your answers are different - someone has some explaining to do. One of the mainy joys of research. :P
-Brian