Reinforcement Learning Benchmarks and Bake-offs
RL-Glue concepts
This page introduces the main concepts of the RL-Glue,
a standard interface for interconnecting reinforcement
learning agents (controllers) and RL environments
(problems, systems, plants). The interface specifies routines
that must be implemented by authors of agents
and of environments so that they can be plugged together. This
page describes the inputs and outputs of each routine.
The RL-Glue is intended to provide a foundation for building
benchmarks for RL agents. A secondary goal is to support RL
competitions in which agents are compared in their performance on new
problems that are only revealed at the time of a competition.
The RL-Glue is kept simple so as to maximize the
ease with which environments and agents can be written. The Glue
consists of a small number of routines that must be
defined, plus optional
routines that provide additional functionality or convenience.
Given these, the Glue provides a further set of routines that can be
used to write
general benchmarks.
The RL-Glue presented here, let's call it RL-Glue 1.0,
assumes that only one agent and one environment exist at
the same time. (Although it would be natural to generalize this
to multiple agents and multiple environments in an object-oriented
fashion, this was not done in this version so as to maximize
simplicity and language independence.)
A user will bring together three things: a learning agent, an
environment, and a benchmark or test that the user would like to run on
their combination. Each would be compiled and then combined into one
executable, which the user then runs, perhaps several times with parameter
variations. The Glue does not make strong assumptions about
the benchmark. It just provides routines for interconnecting the
agent
and environment, leaving it up to the benchmark writer how to use
them.
Typically, a benchmark involves averaging over a sequence of
independent runs,
each starting with a naive (before learning) agent and proceeding
through a number of episodes or a single long episode. A
performance measure is computed for each run (e.g., the average reward
per episode on the final 100 episodes) and then averaged over
runs to produce an overall performance measure for this
agent-environment combination. An informal example is given below. The agent and environment must be
such that each run is completely independent of the others. In
particular, the agent cannot in any way use experience on earlier runs
to influence its performance on later runs. The agent should
define agent_init in such a way
that this is true. Code for some example benchmarks is given in Python and C.
In what follows, we use the term "observation" for the information
returned by the environment on each time step. An important special
case is that in which the observation is the state of the
environment. The general case, which we treat here, includes
partially-observable Markov decision processes.
Episodic and continuing tasks
An episodic task is one in which the agent-environment interaction is
divided into a sequence of trials, or episodes.
Each episode starts in the same state, or in a state chosen from the
same distribution, and ends when the environment reaches a terminal
state, which it signals by returning a special terminal observation.
The environment must not retain any state from episode to episode; it
must generate observations with the same probability distribution on
every episode with the same history of observations and actions (since
the beginning of the episode). The agent, on the other hand, is
normally expected to change state across episodes via its learning
process.
Formally, an episodic environment is any environment that might
generate the terminal observation. A terminal agent is an agent
that can respond appropriately to the terminal observation (by
appropriately implementing agent_start
and agent_end).
Formally, a continuing task is one in which there is one episode that
starts once and goes on forever.
Environment routines
Every environment (plant, simulator) must implement the following two
routines.
env_start() --> first_observation
For a continuing task, this is done once. For an episodic task,
this is done at the beginning of each episode. Note that no reward is
returned. In the case of an episodic environment, end-of-episode
is signaled by a special observation. This special observation
cannot be returned by env_start.
env_step(action) --> reward, observation
Do one time step of the environment.
No other functionality is required from the environment. The
routines described below are optional
and need only be implemented if the environment writer finds them
convenient or desires the additional functionality.
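As a rough illustration of the two required routines, here is a minimal sketch of an episodic environment. The corridor task, the action encoding (-1 or +1), and the string used for the terminal observation are all assumptions made for this example; the Glue itself only requires that some special terminal observation exist.

```python
# Hypothetical example environment: a 5-cell corridor.
# TERMINAL_OBSERVATION is an assumed encoding of the special observation.
TERMINAL_OBSERVATION = "terminal"
CORRIDOR_LENGTH = 5

_position = 0  # module-level state: only one environment exists at a time


def env_start():
    """Begin an episode and return the first observation (no reward)."""
    global _position
    _position = 0
    return _position


def env_step(action):
    """Do one time step; action is -1 (left) or +1 (right)."""
    global _position
    _position = max(0, _position + action)
    if _position >= CORRIDOR_LENGTH - 1:
        # Reaching the right end terminates the episode.
        return 1.0, TERMINAL_OBSERVATION
    return 0.0, _position
```

Note that env_start resets all state, so every episode begins identically, and that the terminal observation is only ever returned by env_step.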
env_init() --> task_specification
This routine will be called exactly once. It can be used to
initialize the environment and/or to provide, as its returned value,
a specification of its i/o interface: the space of actions that the
environment accepts and the space of rewards and observations it
returns. The task_specification is optional and will be made
available to the agent via the routine agent_init. See the proposal
for a task specification language.
env_get_state() --> state_key
Saves the current state of the environment such that it can be
recreated later upon presentation of state_key. The state_key could
in fact be the state object, but returning just a key (a logical
pointer to the state information) saves passing the state back and
forth and avoids giving the agent direct access to the state.
env_set_state(state_key)
Restores the environment to the state it was in when state_key was
obtained. Generates an error if state_key was not previously
generated by env_get_state with this environment.
env_get_random_seed() --> random_seed_key
Saves the random seed object used by
the environment such that it can be restored upon
presentation of random_seed_key.
Same comments as above for env_get_state.
env_set_random_seed(random_seed_key)
Restores the random seed used by the
environment such that the environment will behave exactly the same way
it has previously when it was in this state and given the same
actions. Typically used in conjunction with env_set_state. Generates
an error if random_seed_key was not previously generated by env_get_random_seed with this environment.
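The key-based save and restore routines could be sketched as follows. This is only one possible arrangement: the integer keys, the two dictionaries, and the toy noisy dynamics are assumptions for illustration, and the random source is Python's standard random.Random.

```python
import random

_rng = random.Random(0)   # the environment's private random source
_position = 0             # toy environment state
_saved_states = {}        # state_key -> saved state
_saved_seeds = {}         # random_seed_key -> saved generator state


def env_step(action):
    """Toy noisy dynamics: move by action plus a random slip of 0 or 1."""
    global _position
    _position += action + _rng.randint(0, 1)
    return 0.0, _position


def env_get_state():
    key = len(_saved_states)          # a logical pointer, not the state itself
    _saved_states[key] = _position
    return key


def env_set_state(state_key):
    global _position
    if state_key not in _saved_states:
        raise KeyError("state_key was not produced by env_get_state")
    _position = _saved_states[state_key]


def env_get_random_seed():
    key = len(_saved_seeds)
    _saved_seeds[key] = _rng.getstate()
    return key


def env_set_random_seed(random_seed_key):
    if random_seed_key not in _saved_seeds:
        raise KeyError("random_seed_key was not produced by env_get_random_seed")
    _rng.setstate(_saved_seeds[random_seed_key])
```

Restoring both the state and the random seed before replaying the same actions should reproduce the same sequence of rewards and observations, which is the behavior the two routines are meant to provide together.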
env_cleanup()
This routine is called once per call to RL_init. RL_init
may allocate memory or other resources that will be released by this
routine.
Agent routines
Every agent (controller) must
implement the following two routines.
agent_start(first_observation) --> first_action
Do the first step of a run or
episode. Note that there is no reward input.
agent_step(reward,observation) --> action
Do one step of the agent.
If an agent is to be used with episodic environments (environments that
return terminal observations) then it must implement the following
routine.
agent_end(reward)
Do the final step of the episode.
If multiple runs will be made with the agent, then it must be returned
to its initial pre-learning state prior to each run. The
following
routine is called at the beginning of each run.
agent_init(task_specification)
Initializes the agent to a naive (pre-learning) state. The
task_specification, if given, is a description of the environment's
i/o interface according to a standard description language. See
the proposal for a task specification language. The agent may ignore
the task_specification.
If memory or other resources are allocated when the agent is
initialized, then the following routine can be used to release them.
agent_cleanup()
Interface routines provided by the RL-Glue
Benchmark writers will typically access the RL-Glue entirely through the
interface routines described in this section. These routines are meant
to never be changed by users, but to be a permanent, defining part of
the RL-Glue. They are implemented by appropriate calls to
the agent and environment routines described in the preceding
sections. The complete code for these routines can be found in Python and C.
The interface routines can be used to write a variety of specific
benchmarks, examples of which were noted earlier. Python-like
pseudocode is given here for each routine to
suggest its specific relationship to the agent and environment
routines. To understand the following, it is helpful to
think of an episode as consisting of observations, actions, and rewards
that are time-step indexed as follows:
    o_0, a_0, r_1, o_1, a_1, r_2, o_2, a_2, ..., r_T, terminal_observation
where the episode lasts T
time steps (T may be infinite) and terminal_observation
is a special, designated observation signaling the end of the episode.
RL_init()
    agent_init(env_init())
Initialize everything, passing the
environment's i/o specification to the agent.
RL_start() --> o_0, a_0
    global upcoming_action
    s = env_start()
    a = agent_start(s)
    upcoming_action = a
    return s, a
Do the first step of a run or
episode. The action is saved in upcoming_action
so that it can be used on the next step.
RL_step() --> r_t, o_t, a_t
    global upcoming_action
    r, s = env_step(upcoming_action)
    if s == terminal_observation:
        agent_end(r)
        return r, s
    else:
        a = agent_step(r, s)
        upcoming_action = a
        return r, s, a
Do one time step. RL_step
uses the saved action and saves the
returned action for the next step. The action returned from one
call must be used in the next, so it is better to handle this
implicitly so that the user doesn't have to keep track of the
action. If the end-of-episode observation
occurs, then no action is returned.
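For readers who want something executable, the two routines above can be sketched in Python roughly as follows. The tiny environment and agent at the top are hypothetical stand-ins included only so that the sketch runs on its own, and the string encoding of the terminal observation is likewise an assumption.

```python
TERMINAL_OBSERVATION = "terminal"  # assumed encoding of the special observation

# --- Hypothetical stand-ins for a real environment and agent ---
_t = 0
ended_with = []  # records the rewards passed to agent_end


def env_start():
    global _t
    _t = 0
    return _t


def env_step(action):
    global _t
    _t += 1
    return (1.0, TERMINAL_OBSERVATION) if _t == 3 else (0.0, _t)


def agent_start(observation):
    return 0


def agent_step(reward, observation):
    return observation  # dummy policy: echo the observation as the action


def agent_end(reward):
    ended_with.append(reward)


# --- Glue routines ---
_upcoming_action = None


def RL_start():
    """First step of a run or episode; the action is saved for RL_step."""
    global _upcoming_action
    observation = env_start()
    _upcoming_action = agent_start(observation)
    return observation, _upcoming_action


def RL_step():
    """One time step, using the saved action and saving the next one."""
    global _upcoming_action
    reward, observation = env_step(_upcoming_action)
    if observation == TERMINAL_OBSERVATION:
        agent_end(reward)
        return reward, observation  # no action at the end of an episode
    _upcoming_action = agent_step(reward, observation)
    return reward, observation, _upcoming_action
```

A caller therefore has to be prepared for RL_step to return two values rather than three on the final step of an episode.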
RL_episode(steps) --> o_0, a_0, r_1, o_1, a_1, ..., r_T
    or o_0, a_0, r_1, o_1, a_1, ..., r_steps, o_steps, a_steps
    s, a = RL_start()
    list = [s, a]
    while s != terminal_observation:
        r, s, a = RL_step()
        list = list + [r, s, a]
    return list minus last two elements
Do one episode until termination or
until steps steps have
elapsed, whichever comes first. As you might imagine, this is
done by calling
RL_start, then RL_step until the terminal
observation occurs. The pseudocode shown is
specific to the case in which the episode is completed in fewer than steps steps.
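A sketch that also covers the other case, in which the step limit is reached before termination, might look like the following. RL_start and RL_step are replaced here by hypothetical scripted stand-ins (an episode that terminates on its fourth step) so that the example is self-contained; only RL_episode is the point of the sketch.

```python
TERMINAL_OBSERVATION = "terminal"  # assumed encoding of the special observation
EPISODE_LENGTH = 4                 # hypothetical: stand-in episode ends on step 4

_t = 0


def RL_start():
    """Scripted stand-in for the real RL_start."""
    global _t
    _t = 0
    return 0, "a0"


def RL_step():
    """Scripted stand-in for the real RL_step."""
    global _t
    _t += 1
    if _t == EPISODE_LENGTH:
        return 1.0, TERMINAL_OBSERVATION
    return 0.0, _t, "a%d" % _t


def RL_episode(steps):
    """Run until termination or until `steps` steps have elapsed."""
    observation, action = RL_start()
    trajectory = [observation, action]
    for _ in range(steps):
        result = RL_step()
        if result[1] == TERMINAL_OBSERVATION:
            trajectory.append(result[0])  # keep r_T, drop the terminal observation
            break
        trajectory.extend(result)         # append r_t, o_t, a_t
    return trajectory
```

With these stand-ins, RL_episode(10) ends with the final reward, while RL_episode(2) is cut off after two steps and ends with an observation and an action.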
RL_return() --> return
Return the cumulative total reward of
the current or just completed episode. Any discounting must be
done inside the environment.
RL_num_steps() --> num_steps
Return the number of steps elapsed in the current or just completed episode.
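One plausible way to support the return and step-count routines is for the Glue to keep two counters that RL_start resets and RL_step updates. The sketch below assumes exactly that; the counter names and the stand-in environment and agent (reward 1.0 per step, termination on step 3) are hypothetical, and RL_return is assumed here to take no argument.

```python
TERMINAL_OBSERVATION = "terminal"  # assumed encoding of the special observation

# --- Hypothetical stand-ins: reward 1.0 per step, episode ends at step 3 ---
_t = 0


def env_start():
    global _t
    _t = 0
    return 0


def env_step(action):
    global _t
    _t += 1
    return 1.0, (TERMINAL_OBSERVATION if _t == 3 else _t)


def agent_start(observation):
    return 0


def agent_step(reward, observation):
    return 0


def agent_end(reward):
    pass


# --- Glue routines with bookkeeping ---
_upcoming_action = None
_total_reward = 0.0
_num_steps = 0


def RL_start():
    global _upcoming_action, _total_reward, _num_steps
    _total_reward, _num_steps = 0.0, 0   # a new episode starts from zero
    observation = env_start()
    _upcoming_action = agent_start(observation)
    return observation, _upcoming_action


def RL_step():
    global _upcoming_action, _total_reward, _num_steps
    reward, observation = env_step(_upcoming_action)
    _total_reward += reward              # undiscounted sum of rewards
    _num_steps += 1
    if observation == TERMINAL_OBSERVATION:
        agent_end(reward)
        return reward, observation
    _upcoming_action = agent_step(reward, observation)
    return reward, observation, _upcoming_action


def RL_return():
    return _total_reward


def RL_num_steps():
    return _num_steps
```

Because the counters are only reset in RL_start, both routines remain valid after an episode has ended, until the next episode begins.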
RL_cleanup()
    env_cleanup()
    agent_cleanup()
Provides an opportunity to reclaim
resources allocated by RL_init.
Proposed extensions to the RL-Glue concepts
To increase the functionality of the interface we propose new agent and
environment routines and their corresponding interface routines to
allow freezing an agent's policy for testing and the standardization of
environment randomness.
agent_freeze()
Signals to the agent that training has
ended and the testing phase will begin. Allows the agent to freeze its
current policy and stop learning and exploration.
env_standardize_randomness()
Tells the environment to generate a set
of random numbers to be used in its decision making. Typically, the
environment would store the random values in a data file, which the
env_step method would query on each time step.
Corresponding interface routines:
RL_freeze()
Calls the agent_freeze method to freeze
the agent's policy.
RL_standardize_randomness()
Calls the environment to establish a standard set of random numbers,
which are platform and invocation independent.
Benchmarking routines
Given all of the above, users will write benchmark routines that
produce clear performance measures, perhaps something like the average
return over 1000 episodes, averaged again over 100 runs:
RL_benchmark() --> performance
    performance = 0
    for run = 1..100
        RL_init()
        sum = 0
        for episode = 1..1000
            RL_episode(10000000)
            sum = sum + RL_return(1.0)
        performance = performance + sum/1000.0
        RL_cleanup()
    return performance/100.0
The idea is to provide one overall measure of performance defining the
benchmark.
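To make the structure of such a benchmark concrete, here is a small self-contained sketch in Python. Everything other than the RL_ routine names is an assumption: the corridor environment, the agent that always moves right, the string used for the terminal observation, and the compact RL_episode that folds RL_start and RL_step together. RL_return is assumed to take no argument, and the run and episode counts are parameters so that the sketch can be tried with small values.

```python
TERMINAL_OBSERVATION = "terminal"  # assumed encoding of the special observation

# --- Hypothetical environment: 5-cell corridor, reward -1 per step ---
_position = 0


def env_init():
    return None  # no task specification is provided


def env_start():
    global _position
    _position = 0
    return _position


def env_step(action):
    global _position
    _position = max(0, _position + action)
    return -1.0, (TERMINAL_OBSERVATION if _position >= 4 else _position)


def env_cleanup():
    pass


# --- Hypothetical agent: always moves right and learns nothing ---
def agent_init(task_specification):
    pass


def agent_start(observation):
    return 1


def agent_step(reward, observation):
    return 1


def agent_end(reward):
    pass


def agent_cleanup():
    pass


# --- Compact glue ---
_total_reward = 0.0


def RL_init():
    agent_init(env_init())


def RL_cleanup():
    env_cleanup()
    agent_cleanup()


def RL_episode(steps):
    global _total_reward
    _total_reward = 0.0
    action = agent_start(env_start())
    for _ in range(steps):
        reward, observation = env_step(action)
        _total_reward += reward
        if observation == TERMINAL_OBSERVATION:
            agent_end(reward)
            break
        action = agent_step(reward, observation)


def RL_return():
    return _total_reward


# --- The benchmark: average return per episode, averaged over runs ---
def RL_benchmark(num_runs=100, num_episodes=1000, max_steps=10000000):
    performance = 0.0
    for run in range(num_runs):
        RL_init()                      # each run starts with a naive agent
        total = 0.0
        for episode in range(num_episodes):
            RL_episode(max_steps)
            total += RL_return()
        performance += total / num_episodes
        RL_cleanup()
    return performance / num_runs
```

For example, RL_benchmark(num_runs=3, num_episodes=5) runs quickly; with this fixed agent every episode takes four steps, so the measure is the same for every run.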