Reinforcement Learning and Artificial Intelligence (RLAI)
Reinforcement learning interface documentation (Python) version 5

--Rich Sutton & Steph Schaeffer, mostly May-June 2004
The ambition of this web page is to fully describe how to use the Python module defining a standardized reinforcement learning interface. We describe 1) how to construct an interface object for a given agent and environment, 2) the inputs and outputs of the interface object, and 3) the inputs and outputs of the functions (procedures) defining the agent and environment. Not covered are the internal workings of the interface object or of any particular agent or environment.

The RLI (Reinforcement Learning Interface) module provides a standard interface for computational experiments with reinforcement-learning agents and environments. The interface is designed to facilitate comparison of different agent designs and their application to different problems (environments). This documentation presents the general ideas of the interface. At the end is a pointer to the source code for the RLinterface class and its three methods (episode, steps, and episodes) to answer any remaining questions.

An RLinterface is a Python object, created by calling RLinterface(agentFunction, environmentFunction). The agentFunction and environmentFunction define the agent and environment that will participate in the interface. There will be libraries of standard agentFunctions and environmentFunctions, and of course you can write your own. An environmentFunction normally takes an action from the agentFunction and produces a sensation and reward, while the agentFunction does the reverse:

environmentFunction(action) ==> sensation, reward

agentFunction(sensation, reward) ==> action

(An action is defined as anything accepted by environmentFunction and a sensation is defined as anything produced by environmentFunction; rewards must be numbers.) Together, the agentFunction and environmentFunction can be used to generate episodes -- sequences of sensations s, actions a, and rewards r:

from RLinterface import RLinterface
rli = RLinterface(myAgent, myEnv)

rli.episode(maxSteps) ==> s0, a0, r1, s1, a1, r2, s2, a2, ..., rT, 'terminal'

where 'terminal' is a special sensation recognized by RLinterface and agentFunction. (In a continuing problem there would be just one never-terminating episode.)
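To make the shape of an episode concrete, here is a minimal sketch of a loop that produces such a sequence. It is not the RLinterface source: run_episode, make_countdown_env, and always_go are hypothetical names, and we assume that a "step" is one agent decision, that the first calls to each function use fewer arguments, and that the agent is shown the final reward together with 'terminal' (its returned action then being ignored).

```python
def run_episode(agent, env, max_steps):
    """Return one episode as a flat list [s0, a0, r1, s1, a1, ...]."""
    s = env()              # start the episode: sensation only, no reward
    a = agent(s)           # first action: there is no reward to process yet
    trace = [s, a]
    for _ in range(max_steps - 1):
        s, r = env(a)      # normal call: action in, (sensation, reward) out
        trace += [r, s]
        if s == 'terminal':
            agent(s, r)    # assumption: agent sees the final reward
            break
        a = agent(s, r)
        trace.append(a)
    return trace


def make_countdown_env(n=3):
    """Hypothetical environment that terminates after n actions."""
    left = [n]

    def env(action=None):
        if action is None:
            left[0] = n
            return left[0]
        left[0] -= 1
        if left[0] == 0:
            return 'terminal', 1.0
        return left[0], 0.0

    return env


def always_go(sensation, reward=None):
    """Hypothetical agent that ignores its inputs."""
    return 'go'


print(run_episode(always_go, make_countdown_env(3), 100))
# [3, 'go', 0.0, 2, 'go', 0.0, 1, 'go', 1.0, 'terminal']
```

With max_steps set to 1 the same function returns just [3, 'go'], matching the s0, a0 form.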

To produce the initial s0 and a0, the agentFunction and environmentFunction must also support being called with fewer arguments:

environmentFunction() ==> sensation

agentFunction(sensation) ==> action

When the environmentFunction is called in this way (with no arguments) it should start a new episode -- reset the environment to a characteristic initial state (or distribution of states) and produce just a sensation without a reward. When the agentFunction is called in this way (with just one argument) it should not try to process a reward on this step and should also initialize itself for the beginning of an episode. The agentFunction and environmentFunction will always be called in this "reduced" way before being called in the "normal" way.
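One way to support both calling conventions in Python is a default argument that marks the "reduced" call. The sketch below is only an illustration: the corridor task, the names make_corridor_env and make_random_agent, and the use of None as the "no argument" marker are all assumptions (in particular, None then cannot itself be an action).

```python
import random


def make_corridor_env(length=3):
    """Return an environmentFunction for a hypothetical corridor task."""
    state = {"pos": 0}  # mutable cell holding the current position

    def environment(action=None):
        if action is None:
            # Reduced call: start a new episode, return only a sensation.
            state["pos"] = 0
            return state["pos"]
        # Normal call: apply the action, return (sensation, reward).
        state["pos"] = max(0, state["pos"] + action)
        if state["pos"] >= length:
            return "terminal", 1.0
        return state["pos"], 0.0

    return environment


def make_random_agent(actions=(-1, +1), seed=0):
    """Return an agentFunction that acts randomly and tallies rewards."""
    rng = random.Random(seed)
    log = {"episodes": 0, "total_reward": 0.0}

    def agent(sensation, reward=None):
        if reward is None:
            # Reduced call: beginning of an episode, no reward to process.
            log["episodes"] += 1
        else:
            # Normal call: a learning agent would update itself here.
            log["total_reward"] += reward
        return rng.choice(actions)

    agent.log = log  # exposed only so the tallies can be inspected
    return agent
```

Closures keep the per-episode state private to each function; a class with a __call__ method would serve equally well.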

Episodes can be generated by calling rli.episode(maxNumSteps) as above or, alternatively (and necessarily for continuing problems), segments of an episode can be generated by calling rli.steps(numSteps), which returns the sequence of experience on the next numSteps steps. For example, suppose rli is a freshly made RLinterface and we run it for a single step, then for one more step, and then for two steps after that:

rli.steps(1) ==> s0, a0

rli.steps(1) ==> r1, s1, a1

rli.steps(2) ==> r2, s2, a2, r3, s3, a3

Each call to rli.steps continues the current episode. To start a new episode, call rli.episode(1), which returns the same result as the first line above. Note that if rli.steps(numSteps) is called on an episodic problem it will run for numSteps steps even if episodes terminate and start along the way. Thus, for example,

rli.episode(1) ==> s0, a0

rli.steps(4) ==> r1, s1, a1, r2, 'terminal', s0, a0, r1, s1, a1
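The behavior in these examples can be imitated with a small stateful object. The class below is an illustrative stand-in, not the RLinterface source: SketchInterface, its private helpers, and the demo agent and environment are hypothetical, and we assume that starting an episode and reaching 'terminal' each count as one step, as the example above suggests.

```python
class SketchInterface:
    """Illustrative stand-in for the interface object (an assumption)."""

    def __init__(self, agent, env):
        self.agent, self.env = agent, env
        self.in_episode = False
        self.action = None

    def _start(self):
        # Reduced calls: new episode, yields s0, a0.
        s = self.env()
        self.action = self.agent(s)
        self.in_episode = True
        return [s, self.action]

    def _step(self):
        if not self.in_episode:
            return self._start()
        s, r = self.env(self.action)
        if s == 'terminal':
            self.agent(s, r)         # final reward; returned action ignored
            self.in_episode = False  # next step starts a new episode
            return [r, s]
        self.action = self.agent(s, r)
        return [r, s, self.action]

    def steps(self, num_steps):
        trace = []
        for _ in range(num_steps):
            trace += self._step()
        return trace

    def episode(self, max_steps):
        trace = self._start()
        for _ in range(max_steps - 1):
            if not self.in_episode:
                break
            trace += self._step()
        return trace


def make_countdown_env(n):
    """Hypothetical environment that terminates after n actions."""
    left = [n]

    def env(action=None):
        if action is None:
            left[0] = n
            return left[0]
        left[0] -= 1
        return ('terminal', 1.0) if left[0] == 0 else (left[0], 0.0)

    return env


def always_go(sensation, reward=None):
    return 'go'


rli = SketchInterface(always_go, make_countdown_env(2))
print(rli.episode(1))  # [2, 'go']
print(rli.steps(4))
# [0.0, 1, 'go', 1.0, 'terminal', 2, 'go', 0.0, 1, 'go']
```

The second printed list has the same pattern as the rli.steps(4) line above: the episode ends, a new one begins with s0, a0, and the step count keeps running.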

The method rli.episodes(numEpisodes, maxStepsPerEpisode, maxStepsTotal) is also provided for efficiently running multiple episodes.
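This page does not say how the three limits interact or what rli.episodes returns, so the following is only one plausible reading, written as a standalone sketch. We assume that each episode is capped at maxStepsPerEpisode, that the run stops once maxStepsTotal steps have been used, and that the result is a list of per-episode sequences. All function names here are hypothetical; consult the source code for the actual behavior.

```python
def run_episode(agent, env, max_steps):
    """Run one episode; return (trace, number of steps used)."""
    s = env()
    a = agent(s)
    trace, steps = [s, a], 1
    while steps < max_steps:
        s, r = env(a)
        steps += 1
        trace += [r, s]
        if s == 'terminal':
            agent(s, r)  # assumption: agent sees the final reward
            break
        a = agent(s, r)
        trace.append(a)
    return trace, steps


def run_episodes(agent, env, num_episodes, max_steps_per_episode,
                 max_steps_total):
    """Assumed semantics: stop at whichever limit is reached first."""
    traces, used = [], 0
    for _ in range(num_episodes):
        budget = min(max_steps_per_episode, max_steps_total - used)
        if budget < 1:
            break  # total step budget exhausted
        trace, steps = run_episode(agent, env, budget)
        traces.append(trace)
        used += steps
    return traces


def make_countdown_env(n):
    """Hypothetical environment that terminates after n actions."""
    left = [n]

    def env(action=None):
        if action is None:
            left[0] = n
            return left[0]
        left[0] -= 1
        return ('terminal', 1.0) if left[0] == 0 else (left[0], 0.0)

    return env


def always_go(sensation, reward=None):
    return 'go'


for trace in run_episodes(always_go, make_countdown_env(2), 3, 10, 100):
    print(trace)
```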

Source Code for RLinterface Module