RLAI Reinforcement Learning and Artificial Intelligence (RLAI)
Chapter 2 Programming Exercises

Programming Exercises (Bandit Problem):

    Due: Thursday, September 30th
 

You will need the bandit.py source code that is later on this page or can be downloaded here:

Tasks

  1. Find the value of numbandits such that performance runs in about 30 seconds
  2. Repeat the performance calculation for values of alpha in [2**0, 2**-1,..., 2**-8] (for the most reliable results, leave Qstar unchanged as you vary alpha)
  3. Briefly describe and explain the trend in performance as a function of alpha
  4. Find the best alpha in this set
  5. Using the best alpha, repeat for epsilon in [2**-2, 2**-3,..., 2**-10]
  6. Briefly describe and explain the trend in performance as a function of epsilon
  7. Why is the relative performance of the greedy method so unlike that in the book?
Here is the Bandit Program - copy or download here

# This program implements the action-value method from Section 2.6 for solving
# bandit problems.  This method uses a fixed step-size parameter, alpha, and
# epsilon-greedy action selection with a fixed exploration parameter, epsilon.

from random import random       # random() returns a random real number in [0,1)
from random import randrange    # randrange(n) returns a random integer in {0,1,2,...,n}
from random import gauss         # gauss(0,1) returns a normally distributed number
                                             #     with zero mean and unit variance
                                       

n = 10                       # number of possible actions or arms
numbandits = 200       # number of different bandits (with different true action values)

Qstar = [[gauss(0,1) for i in range(n)]  # array of true action values for each bandit
             for j in range(numbandits)] 
   
def bandit(banditnumber,action):
    """Do one play with the bandit; return reward"""
    global Qstar
    return Qstar[banditnumber][action] + gauss(0,1)

def playbandit(banditnum=0, numplays=1000, alpha=0.1, epsilon=0.1):
    """Play a bandit for some number of plays; return average reward"""
    global n
    Q = [0.0 for i in range(n)]         # array to hold estimated action values
    totalreward = 0
    for t in range(numplays):
        if random() < epsilon:          # epsilon-greedy action selection
            action = randrange(n)
        else:
            action = argmax(Q)
        reward = bandit(banditnum,action)
        totalreward = totalreward + reward
        Q[action] = Q[action] + alpha * (reward - Q[action])    # learning update
    return totalreward/numplays
       
def argmax(array):
    """Return the index of the maximal value in the array/list, breaking ties randomly;
    array must have at least one element"""
    bestindex = 0
    bestvalue = array[0]
    numbests = 1
    for i in range(1,len(array)):
        value = array[i]
        if value < bestvalue:
            pass
        elif value > bestvalue:
            bestvalue = value
            bestindex = i
            numbests = 1
        elif value == bestvalue:
            numbests += 1
            if random() < 1.0/numbests:
                bestindex = i
    return bestindex

def mean(list):
    """Return arithmetic mean (average) of a list's values"""
    sum = 0
    for item in list:
        sum += item
    return sum/float(len(list))

def performance(alpha,epsilon):
    return mean([playbandit(i,1000,alpha,epsilon) for i in range(numbandits)])

print performance(0.1,0.1)


    Submission instructions

Please e-mail a document with the answers to these questions to the course account if possible (c499@ugrad.cs.ualberta.ca).  I would like a .txt, .rtf, or .pdf - please no Word files or crazy linux/unix word processor files.  Alternatively, you may also hand in the assignment hand written at the beginning of class - but electronic format is preferred.  Finally, please include your name and student number for either type of submission.

Back to main page
Question: Does "Repeat the performance calculation" in task 2 mean find the new value for numbandits which takes 30 seconds, or report on how fast it ran with the value for numbandits found in part 1? 

Answer:  I interpret this as.
Part 1: Find out how the code runs so that it doesn't take forever on your machine.  More bandits means more accurate results, so the higher this number the better, but if it takes more than 30 seconds you can scale it back.

Part 2: Calculate the performance (in terms of score) of the agent in the bandit scenario that you found in part 1 for all sorts of various parameters for the rest of the assignment.
-Brian

Question: For question 7 of the Bandit assignment, it is asked that we compare perfomance of the greeedy method with the book.  However, the Bandit program uses an epsilon-greedy action selection method.  Was assignment suppose to say epsilon-greedy instead of just greedy?  Or should we configure the program to use greedy action selection?

Answer:  Specifically, first I should mention that you are (in my option) looking specifically at Figure 2.1 from the text when making comparisons.  The experimental setup described on page 28 is exactly what you are doing, except you will probably be using less than 2000 bandits in order to keep your experimental run time manageable.  The graph in Figure 2.1 (top) is displaying the same sort of result that your "Performance(alpha,epsilon)" method is calculating in the assignment. 

Note: from the errata, the bottom line of that top graph is the greedy agent, although it is unlabeled.

Now, although we haven't asked you to run bandit.py with a strictly greedy (e=0) setting, you have run it with something close to that.  There is also nothing stopping you from explicitly running it with e=0.

Some of these questions are a little open ended because the idea is not for you to fill in the blank, but for you to run some experiments, present your observations, and explain variances between your results and those in the book.  When you're experimental setup is the same, and your answers are different - someone has some explaining to do.  One of the mainy joys of research.  :P
-Brian