Reinforcement Learning and Artificial Intelligence (RLAI)
Definitions and Interfaces
Edited by Leah Hackman
The ambition
of this page is to provide concise and coherent explanations of the
RL-Glue function protocol.
Agent Functions
Every agent must define all of the following routines.
Note these functions are only accessed by the RL-Glue. Experiment
programs should not try to bypass the Glue and directly access these
functions.
agent_start: agent_start(first_observation) --> first_action
Given the
first_observation (the observation of the agent in the start state) the
agent must then return the action it wishes to
perform. This is called once if the task is continuing, else it happens
at the beginning of each episode.
agent_step: agent_step(reward, observation) --> action
This is the most
important function of the agent. Given the reward garnered by the
agent's previous action, and the resulting observation,
choose the next action to take. Any learning (policy improvement)
should be done
through this function.
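As one illustration of learning inside agent_step, the following Python
sketch applies a tabular SARSA update. The update rule, the constants
ALPHA, GAMMA, EPSILON and NUM_ACTIONS, and the assumption that
observations are hashable are all choices made for this example; the
RL-Glue protocol does not prescribe any particular learning method.

```python
import random
from collections import defaultdict

# Illustrative constants; RL-Glue does not prescribe any of these.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
NUM_ACTIONS = 2

q = defaultdict(float)      # q[(observation, action)] -> estimated value
rng = random.Random(0)
last = {"obs": None, "act": None}

def choose_action(observation):
    """Epsilon-greedy selection over the tabular estimates."""
    if rng.random() < EPSILON:
        return rng.randrange(NUM_ACTIONS)
    return max(range(NUM_ACTIONS), key=lambda a: q[(observation, a)])

def agent_start(first_observation):
    action = choose_action(first_observation)
    last["obs"], last["act"] = first_observation, action
    return action

def agent_step(reward, observation):
    # Choose the next action first, then learn from the completed transition.
    action = choose_action(observation)
    key = (last["obs"], last["act"])
    target = reward + GAMMA * q[(observation, action)]
    q[key] += ALPHA * (target - q[key])
    last["obs"], last["act"] = observation, action
    return action

def agent_end(reward):
    # Final update: no successor state, so the target is the reward alone.
    key = (last["obs"], last["act"])
    q[key] += ALPHA * (reward - q[key])
```

Note that agent_step both learns from the previous transition and
commits to the next action, which is why the previous observation and
action are remembered between calls.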
agent_end: agent_end(reward)
If the agent is in an episodic environment, this function will be
called after the terminal state is entered. This allows for any final
learning updates. If the episode is terminated prematurely (i.e., by a
benchmark cutoff before entering a terminal state), agent_end is NOT
called.
agent_init: agent_init(task_specification)
This function will be called first, even before agent_start. The
task_specification is a description of important experiment
information, including but not limited to a description of the state
and action space. The RL-Glue standard for writing task_specification
strings is found here.
In agent_init, information
about the environment is extracted from the task_specification and then
used to set up any necessary resources (for example, initialize the
value function to a prelearning state).
agent_cleanup: agent_cleanup()
This function is called at the end of a run/trial and can be used to
free any resources which may have been allocated in agent_init. Calls
to agent_cleanup should be in a one-to-one ratio with the calls to
agent_init.
agent_freeze: agent_freeze()
Signals to the agent that training has ended. Requests that the agent
freeze its current policy and value function (i.e., stop learning and
exploration).
agent_message: agent_message(input_message) --> output_message
The agent_message function is a jack of all trades and master of none.
It has no prescribed functionality; it is up to the user to determine
what agent_message should implement. Any information which needs to be
passed in or out of the agent should go through this function. For
example, if it is desirable that an agent's learning parameters be
tweaked mid-experiment, the author could establish an input string that
triggers this action. Likewise, if the author wished to extract a
representation of the value function, they could establish an input
string which would cause agent_message to return the desired
information.
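The two uses just described could be sketched in Python as follows. The
message vocabulary ("set <name> <value>" and "get values"), the params
dictionary, and the values table are hypothetical and invented for this
example; RL-Glue leaves the message format entirely to the author.

```python
# Hypothetical learning parameters and value table for illustration.
params = {"alpha": 0.1, "epsilon": 0.1}
values = {"s0": 0.5, "s1": -1.0}

def agent_message(input_message):
    """Dispatch on a small, made-up message vocabulary."""
    words = input_message.split()
    if len(words) == 3 and words[0] == "set" and words[1] in params:
        # Tweak a learning parameter mid-experiment.
        params[words[1]] = float(words[2])
        return "ok"
    if words == ["get", "values"]:
        # Serialize the value function as "state=value" pairs.
        return " ".join(f"{s}={v}" for s, v in sorted(values.items()))
    return "error: unknown message"
```

An experiment program would reach this function through
RL_agent_message rather than calling it directly.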
Environment Functions
Every environment must define all of the following
routines. Note these functions are only
accessed by the RL-Glue.
Experiment programs should not try to bypass the Glue and directly
access these functions.
env_start: env_start() --> first_observation
For a continuing task this is done once. For an episodic task, this is
done at the beginning of each episode. env_start assembles the
first_observation for the agent in the start state. Note that the start
state cannot also be a terminal state.
env_step: env_step(action) --> reward, observation, terminal
Complete one step in the environment. Take the action passed in and
determine the reward and the next observation for that transition, and
whether the resulting state is terminal.
env_init: env_init() --> task_specification
This routine will be called exactly once for each trial/run. This
function is an ideal place to initialize all environment information
and allocate any resources required to represent the environment. It
must return a task_specification which adheres to the task
specification language. A task_specification stores information
regarding the observation and action space, as well as whether the task
is episodic or continuing.
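To make the env_start, env_step and env_init contracts concrete, here
is a minimal Python sketch of an episodic corridor environment. The
corridor itself, the constant LENGTH, the reward values, and above all
the spec string returned by env_init are assumptions for illustration;
the string is a made-up placeholder and does not follow the real
RL-Glue task specification language.

```python
# Hypothetical corridor world: positions 0..LENGTH-1, goal at the right end.
LENGTH = 5
env = {"position": None}

def env_init():
    # ASSUMPTION: a made-up spec string; the real RL-Glue task
    # specification language is defined elsewhere and not reproduced here.
    env["position"] = None
    return f"episodic observations=0..{LENGTH - 1} actions=0..1"

def env_start():
    env["position"] = 0          # the start state is never terminal
    return env["position"]

def env_step(action):
    # Action 1 moves right; any other action moves left (clamped at 0).
    move = 1 if action == 1 else -1
    env["position"] = max(0, env["position"] + move)
    terminal = env["position"] == LENGTH - 1
    reward = 0.0 if terminal else -1.0
    return reward, env["position"], terminal

def env_cleanup():
    env["position"] = None
```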
env_get_state: env_get_state() --> state_key
The state_key is a compact representation of the current state of the
environment such that at any point in the future, provided with the
state_key, the environment could return to that state. Note that this
does not include the agent's value function; it merely restores the
details of the environment. For example, in a static grid world this
would be as simple as the position of the agent.
env_set_state: env_set_state(state_key)
Given the state_key, the environment should return to its exact
configuration when the state_key was obtained.
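For the static grid world mentioned under env_get_state, saving and
restoring state might look like the Python sketch below. The env
dictionary, its row/col fields, and the move_right helper are
hypothetical; the only point shown is that the key is an independent
copy of everything needed to restore the environment.

```python
import copy

# Hypothetical static grid world: the only mutable state is the agent's cell.
env = {"row": 0, "col": 0}

def env_get_state():
    # A deep copy, so later environment changes cannot alter the saved key.
    return copy.deepcopy(env)

def env_set_state(state_key):
    env.clear()
    env.update(copy.deepcopy(state_key))

def move_right():
    """Stand-in for the effect of an env_step call."""
    env["col"] += 1
```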
env_get_random_seed: env_get_random_seed() --> random_seed_key
Saves the random seed object used by the environment such that it can
be restored upon presentation of random_seed_key.
env_set_random_seed: env_set_random_seed(random_seed_key)
Sets the random seed used by the environment. Typically it is
advantageous for the experiment program to control the randomness of
the environment. env_set_random_seed can be used in conjunction with
env_set_state to save and restore a random_seed such that the
environment will behave exactly the same way it has previously when it
was in this state and given the same actions.
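One way to realize the random seed pair in Python is to give the
environment its own random.Random instance and use that generator's
internal state as the random_seed_key. This is a sketch under that
assumption; the rng object and the noisy_value helper are hypothetical
names, not part of RL-Glue.

```python
import random

rng = random.Random(42)      # the environment's private random source

def env_get_random_seed():
    # The generator's full internal state serves as the random_seed_key.
    return rng.getstate()

def env_set_random_seed(random_seed_key):
    rng.setstate(random_seed_key)

def noisy_value():
    """Stand-in for a stochastic part of env_step."""
    return rng.random()
```

Restoring the key and replaying the same calls then reproduces the same
random draws, which is what lets an experiment program repeat a
stochastic trajectory exactly.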
env_cleanup: env_cleanup()
This
can be used
to release any allocated resources. It will be called once for every
call to env_init.
env_message: env_message(input_string) --> output_string
Similar to agent_message, this function allows for any message passing
to the environment required by the experiment program. This may be used
to modify the environment mid-experiment. Any information that needs to
be passed in or out of the environment can be handled by this function.
Interface Routines Provided by the RL-Glue
The following built-in RL-Glue functions are provided primarily for the
use of the experiment program writers. Using these functions, the
experiment program gains access to the corresponding environment and
agent functions. The implementation of these routines is to be standard
across all RL-Glue users. To ensure agents/environments/experiment
programs can be exchanged between authors with no changes necessary,
users should not change the RL-Glue interface code provided.
To understand the
following, it is helpful to think of an episode as consisting of
sequences of observations, actions, and rewards that are indexed by
time-step as follows:
o0, a0, r1, o1, a1, r2, o2, a2, ..., rT, terminal_observation
where the episode lasts T time steps (T may be infinite) and
terminal_observation is a special, designated observation signaling the
end of the episode.
RL_init: RL_init()
    agent_init(env_init())
This initializes everything, passing the
environment's task_specification
to the agent.
This should be called at the beginning of every trial.
RL_start: RL_start() --> o0, a0
    global upcoming_action
    o = env_start()
    a = agent_start(o)
    upcoming_action = a
    return o, a
Do the first step of a run or episode. The action is saved in
upcoming_action so that it can be used on the next step.
RL_step: RL_step() --> rt, ot, terminal, at
    global upcoming_action
    r, o, terminal = env_step(upcoming_action)
    if terminal == true
        agent_end(r)
        return r, o, terminal
    else
        a = agent_step(r, o)
        upcoming_action = a
        return r, o, terminal, a
Take
one step. RL_step
uses the saved action and saves the
returned action for the next step. The action returned from one
call must be used in the next, so it is better to handle this
implicitly so that the user doesn't have to keep track of the
action. If the end-of-episode observation
occurs, then no action is returned.
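The RL_start and RL_step control flow can be exercised in Python with
toy stand-ins for the agent and environment. Everything except the two
RL_* functions is invented for this example, and returning None in the
action slot on the terminal step is a choice made here to keep the
return shape uniform; it is not something the protocol specifies.

```python
# Toy stand-ins for the agent and environment functions; everything
# except the RL_* control flow is invented for this example.
calls = []
position = 0

def env_start():
    global position
    position = 0
    return position

def env_step(action):
    global position
    position += action
    terminal = position >= 2
    return (1.0 if terminal else 0.0), position, terminal

def agent_start(observation):
    return 1

def agent_step(reward, observation):
    return 1

def agent_end(reward):
    calls.append(("agent_end", reward))

upcoming_action = None

def RL_start():
    global upcoming_action
    o = env_start()
    a = agent_start(o)
    upcoming_action = a          # saved for the next RL_step call
    return o, a

def RL_step():
    global upcoming_action
    r, o, terminal = env_step(upcoming_action)
    if terminal:
        agent_end(r)
        upcoming_action = None
        return r, o, terminal, None   # no action at the end of an episode
    a = agent_step(r, o)
    upcoming_action = a
    return r, o, terminal, a
```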
RL_episode: RL_episode(steps)
    num_steps = 0
    o, a = RL_start()
    num_steps = num_steps + 1
    list = [o, a]
    while o != terminal_observation {
        if (steps != 0 and num_steps >= steps)
            break
        else
            r, o, terminal, a = RL_step()
            list = list + [r, o, a]
            num_steps = num_steps + 1
    }
Do one episode until a termination observation occurs or until steps
steps have elapsed, whichever comes first. As you might imagine, this
is done by calling RL_start, then RL_step until the terminal
observation occurs. If steps is set to 0, there is no limit on the
number of steps taken and RL_episode will continue until a termination
observation occurs. If no terminal observation is reached before steps
steps have elapsed, agent_end is not called; the episode simply stops.
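The cutoff behaviour can be checked with a small Python sketch. The
RL_start and RL_step stand-ins below are hypothetical (an episode that
terminates on its third step), and the values returned by RL_episode
are a convenience for this example only. As in the description above,
the call to RL_start counts as the first step, and agent_end is reached
only through a terminal RL_step.

```python
# Hypothetical stand-ins: an episode that terminates on its 3rd step.
log = []
clock = {"t": 0}

def RL_start():
    clock["t"] = 0
    return "o0", "a0"

def RL_step():
    clock["t"] += 1
    t = clock["t"]
    if t == 3:
        log.append("agent_end")       # RL_step notifies the agent itself
        return float(t), f"o{t}", True, None
    return float(t), f"o{t}", False, f"a{t}"

def RL_episode(steps):
    """Run until termination, or until `steps` steps when steps != 0."""
    o, a = RL_start()
    num_steps = 1                     # RL_start counts as the first step
    history = [o, a]
    terminal = False
    while not terminal:
        if steps != 0 and num_steps >= steps:
            break                     # cutoff: agent_end is not called
        r, o, terminal, a = RL_step()
        history += [r, o] if terminal else [r, o, a]
        num_steps += 1
    return history, num_steps, terminal
```

Calling RL_episode(0) in this sketch runs to termination, while
RL_episode(2) stops early without recording an agent_end call.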
RL_return: RL_return() --> return
Return the cumulative total reward of the current or just completed
episode. The collection of all the rewards received in an episode (the
return) is done within RL_return; however, any discounting of rewards
must be done inside the environment or agent.
RL_num_steps: RL_num_steps() --> num_steps
Return
the number of steps elapsed in the current or just completed episode.
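A glue layer could back RL_return and RL_num_steps with simple
per-episode counters, as in the Python sketch below. The episode
dictionary and the begin_episode and record_step helpers are
hypothetical names for bookkeeping that RL_start and RL_step would
perform; counting the start as step 1 follows the RL_episode
pseudocode above.

```python
# Hypothetical bookkeeping that a glue layer could keep per episode.
episode = {"total_reward": 0.0, "num_steps": 0}

def begin_episode():
    """Reset the counters; RL_start would do this."""
    episode["total_reward"] = 0.0
    episode["num_steps"] = 1

def record_step(reward):
    """Accumulate one reward; RL_step would do this."""
    episode["total_reward"] += reward   # undiscounted sum
    episode["num_steps"] += 1

def RL_return():
    return episode["total_reward"]

def RL_num_steps():
    return episode["num_steps"]
```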
RL_cleanup: RL_cleanup()
    env_cleanup()
    agent_cleanup()
Provides an opportunity to reclaim resources allocated by RL_init.
RL_set_state: RL_set_state(state_key)
    env_set_state(state_key)
Provides an opportunity to reset the state (see env_set_state for
details).
RL_set_random_seed: RL_set_random_seed(random_seed_key)
    env_set_random_seed(random_seed_key)
Provides an opportunity to reset the random seed key (see
env_set_random_seed for details).
RL_get_state: RL_get_state() --> state_key
    return env_get_state()
Provides an opportunity to extract the state key from the environment
(see env_get_state for details).
RL_get_random_seed: RL_get_random_seed() --> random_seed_key
    return env_get_random_seed()
Provides an opportunity to extract the random seed key from the
environment (see env_get_random_seed for details).
RL_freeze: RL_freeze()
    agent_freeze()
Calls the agent_freeze method to freeze the agent's policy. This is
typically used to switch the agent from training mode to testing mode,
where no more learning and exploring will happen.
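The switch from training to testing might look as follows in Python.
The agent dictionary, its frozen flag, the updates counter, and the
fixed greedy action 0 are hypothetical stand-ins for a real agent's
internals; only the RL_freeze to agent_freeze delegation comes from the
description above.

```python
import random

# Hypothetical agent internals for illustration only.
agent = {"frozen": False, "epsilon": 0.5, "updates": 0,
         "rng": random.Random(1)}

def agent_freeze():
    agent["frozen"] = True

def agent_step(reward, observation):
    if not agent["frozen"]:
        agent["updates"] += 1                     # stand-in for a learning update
        if agent["rng"].random() < agent["epsilon"]:
            return agent["rng"].randrange(2)      # exploratory action
    return 0                                      # stand-in for the greedy action

def RL_freeze():
    agent_freeze()
```

After RL_freeze, the sketch performs no further updates and always
returns its greedy action.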
RL_agent_message: RL_agent_message(input_message_string) --> output_message_string
    return agent_message(input_message_string)
This routine passes the input string to the agent and returns the reply
string given by the agent. See agent_message for more details.
RL_env_message: RL_env_message(input_message_string) --> output_message_string
    return env_message(input_message_string)
This routine passes the input string to the environment and returns the
reply string given by the environment. See env_message for more details.