Reinforcement Learning and Artificial Intelligence (RLAI)
Quick Start Guide to Writing Agents, Environments and Experiments
Edited by Leah Hackman
The ambition of this web page is to provide a bare minimum guide to
writing a first agent, environment and/or experiment. For a more
detailed discussion on getting started, see the more detailed guide.
The Agent
The following discussion is based on this sample pseudo code, which is
modeled after this Sarsa Agent from RL-Glue. Implementation details
have been hidden to avoid getting caught up in minor matters such as
memory management. The non-RL-Glue functions are named after what
those portions of the code should do. No further details are provided,
but it should be apparent where the corresponding code lies within the
Sarsa Agent.
No matter the language, you must include the RL_common file for your
language in your code. For example, in C/C++ you must
#include "RL_common.h" in your Agent file.
In this example, agent_init takes in the task_specification, parses it
with the parser included in the RL-Glue utilities, allocates memory to
store the actions and observations, and initializes the value function.
This task_specification parser is currently available only for C/C++,
though it is not hard to write one for any given language. Note that
agent_init is called once at the beginning of a trial, not once per
episode, so values which should persist between episodes (such as the
value function) should be initialized here.
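As a rough illustration of the agent_init pattern described above, here
is a minimal sketch in Python. It does not use the RL-Glue API: the
"num_states:num_actions" task specification format, the parse_task_spec
helper and the agent dictionary are all assumptions made up for this
example.

```python
# Hypothetical agent state; these names are illustrative, not RL-Glue's.
agent = {}

def parse_task_spec(task_spec):
    """Stand-in parser for an assumed 'num_states:num_actions' string."""
    num_states, num_actions = (int(x) for x in task_spec.split(":"))
    return num_states, num_actions

def agent_init(task_spec):
    """Called once per trial: set up whatever must persist across episodes."""
    num_states, num_actions = parse_task_spec(task_spec)
    agent["num_states"] = num_states
    agent["num_actions"] = num_actions
    # Value function Q[state][action], zeroed once per trial (not per episode).
    agent["Q"] = [[0.0] * num_actions for _ in range(num_states)]
    agent["frozen"] = False

agent_init("6:2")
```

Because the value function is created here rather than in agent_start,
whatever is learned in one episode carries over to the next.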
Agent_start decides what the first action should be based on the
initial observation. In this example the actions are chosen
epsilon-greedily unless the agent has been frozen by a call to
RL_freeze. A frozen agent should carry out no learning or random
behaviour, so in that case the action is picked greedily instead. If
you choose to fully implement agent_freeze(), you will have to keep
this in mind when writing your own agent_start and agent_step.
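The action choice described above might be sketched as follows. This is
an illustration only: EPSILON, choose_action and the argument list of
agent_start are assumptions for this example, not the signatures used
by RL-Glue.

```python
import random

EPSILON = 0.1  # assumed exploration rate

def greedy(q_row):
    """Index of the highest-valued action (the first one on ties)."""
    return max(range(len(q_row)), key=lambda a: q_row[a])

def choose_action(q_row, frozen, rng=random):
    """Epsilon-greedy while learning; purely greedy once frozen."""
    if not frozen and rng.random() < EPSILON:
        return rng.randrange(len(q_row))
    return greedy(q_row)

def agent_start(Q, observation, frozen):
    """Pick the first action of an episode from the initial observation."""
    return choose_action(Q[observation], frozen)
```

With frozen set to True the random branch is never taken, so the same
observation always yields the same action.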
In agent_step, a new action is chosen and then, if
the agent hasn't been frozen, the
value function and the policy are updated.
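A minimal sketch of such a step, assuming a tabular Sarsa update, could
look like this. ALPHA, GAMMA, the agent dictionary and its keys are
assumptions for illustration, and action selection is purely greedy
here to keep the example short.

```python
ALPHA = 0.5  # assumed step size
GAMMA = 1.0  # assumed discount factor

def greedy(Q, observation):
    """Greedy action for an observation (exploration omitted for brevity)."""
    row = Q[observation]
    return max(range(len(row)), key=lambda a: row[a])

def agent_step(agent, reward, observation):
    """Choose the next action, then learn only if the agent is not frozen."""
    next_action = greedy(agent["Q"], observation)
    if not agent["frozen"]:
        s, a = agent["last_obs"], agent["last_action"]
        # The Sarsa target uses the action actually chosen for the next step.
        target = reward + GAMMA * agent["Q"][observation][next_action]
        agent["Q"][s][a] += ALPHA * (target - agent["Q"][s][a])
    agent["last_obs"], agent["last_action"] = observation, next_action
    return next_action
```

The previous observation and action are remembered on every call, frozen
or not, so that learning can resume cleanly from the next step.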
If the task is episodic, agent_end will be called at
the end of every episode to allow for the last value function and
policy updates. If the task is
not episodic you can leave this function
empty.
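For a tabular Sarsa agent, the final update could be sketched as below.
Since there is no next state after the terminal step, the target is
just the last reward. ALPHA and the agent dictionary are assumptions
for this example.

```python
ALPHA = 0.5  # assumed step size

def agent_end(agent, reward):
    """Final update of an episode: no next state, so the target is the reward."""
    if not agent["frozen"]:
        s, a = agent["last_obs"], agent["last_action"]
        agent["Q"][s][a] += ALPHA * (reward - agent["Q"][s][a])
```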
Agent_cleanup deallocates all the memory that was
set up in agent_init. Note that every call to agent_init should have a
corresponding call to
agent_cleanup.
A call to agent_freeze should halt any learning the agent is doing, as
well as remove any randomness from the agent's policy. Agent_freeze
exists to allow for training and testing phases. Typically an agent
will train for some period, freeze its value function and policy, and
then "test" by running the agent through the environment and gathering
results.
Agent_message can be used to do almost anything not covered above. A
fuller description of this customizable function is found in the more
detailed guide.
The Environment
The following discussion is based on this sample pseudo code, which is
modeled after this Mines Environment. Implementation details have been
hidden to avoid getting caught up in minor matters such as memory
management. The non-RL-Glue functions are named after what those
portions of the code should do. No further details are provided, but it
should be apparent where the corresponding code lies within the Mines
Environment.
No matter the language, you must include the RL_common file for your
language in your code. For example, in C/C++ you must
#include "RL_common.h" in your Environment file.
In this example, env_init has two priorities: allocate memory for
necessary structures (such as an Observation or an Action) and generate
a task_specification string. In other examples, the representation of
the environment may also need to have values initialized in env_init.
Env_init is called once per trial, not once per episode, so values
which should persist over episodes should be initialized here. Values
which need to be set per episode should instead be initialized in
env_start. Finally, env_init should return the task_specification
string as described in the documentation.
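The split between per-trial and per-episode setup might be sketched like
this. The five-cell corridor, the env dictionary and the
"num_states:num_actions" string are assumptions for illustration; the
real task_specification format is the one given in the RL-Glue
documentation.

```python
import random

# Hypothetical environment state; these names are illustrative only.
env = {}

def env_init():
    """Once per trial: build persistent structures and return the task spec."""
    env["num_states"] = 5          # corridor cells 0..4; cell 4 is the goal
    env["num_actions"] = 2         # 0 = left, 1 = right
    env["rng"] = random.Random(0)  # fixed seed so runs are repeatable
    # Assumed 'num_states:num_actions' format, not the real RL-Glue one.
    return f"{env['num_states']}:{env['num_actions']}"

def env_start():
    """Once per episode: draw a random non-goal start state and remember it."""
    env["state"] = env["rng"].randrange(env["num_states"] - 1)
    return env["state"]
```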
Env_start creates the initial observation in the environment. In some
environments this may be random, while others may have the same initial
state for every episode. A copy of this observation is saved (as
previous_observation) so that it may be used to generate the next
observation.
Env_step takes a step in the environment and returns the reward earned
and the next observation. For languages which do not allow returning
more than one value from a function, some sort of struct/object for
this purpose is provided, as described for the RL_common file.
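One way to bundle the reward and the next observation into a single
return value is sketched below. RewardObservation, its terminal field
and the corridor dynamics are hypothetical names and choices for this
example, not the definitions in RL_common.

```python
from typing import NamedTuple

class RewardObservation(NamedTuple):
    """Hypothetical bundle for everything env_step hands back."""
    reward: float
    observation: int
    terminal: bool

GOAL = 4  # assumed goal cell at the right end of the corridor

def env_step(state, action):
    """Move left (0) or right (1) in a corridor of cells 0..GOAL."""
    move = 1 if action == 1 else -1
    next_state = min(max(state + move, 0), GOAL)
    terminal = next_state == GOAL
    reward = 0.0 if terminal else -1.0
    return RewardObservation(reward, next_state, terminal)
```

The caller can then read result.reward and result.observation by name
instead of relying on argument order.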
Env_cleanup deallocates all the memory that was
allocated in env_init. A call to env_cleanup is made for every call to
env_init.
This is the basic functionality you should need for
a simple experiment. Descriptions of how to use env_set_state,
env_get_state, env_get_random_seed, env_set_random_seed, and
env_message are in the more detailed guide.
The Experiment
The following discussion is based on this sample pseudo code, which is
modeled after this Experiment. Implementation details have been hidden
to avoid getting caught up in minor matters such as memory management.
The non-RL-Glue functions are named after what those portions of the
code should do. No further details are provided, but it should be
apparent where the corresponding code lies within the sample Experiment
code. One thing to note is that only the RL_Glue functions are
available to the experiment program. These belong to the already
implemented RL_Glue interface and all follow the pattern
RL_<functionname>. No agent or environment functions should be accessed
directly by the experiment program.
No matter the language, your experiment program must
include/import the functions in RL_Glue. In some
languages, this will require the equivalent of a header file
which lists all
the RL_Glue functions. Click here for
details for your language.
Your experiment program (sample_example here) must have a main()
function, or whatever the equivalent of a "main" function is in your
language, as the Experiment Program is where the execution of the
learning experiment begins.
Each trial consists of four basic steps: 1) initialize the agent and
environment, 2) run an episode, 3) gather data, and 4) clean up the
agent and environment. In this example there is only one trial, and
therefore only one call to RL_init and RL_cleanup. If you want many
trials, however, it is important to call RL_init and RL_cleanup each
time to allow the agent to reset its value function etc.
In this example, RL_episode is called using 0 as its parameter. 0 is a
special input telling the Glue to allow the agent to go on forever or
until it reaches a terminal state. If you want to ensure your agent is
not allowed to wander too long, you can put an integer maximum number
of steps in here. If you put in 1000, the agent would be stopped after
1000 steps, as if it had reached a terminal state, and the next episode
would be allowed to start. RL_num_steps and RL_return are two functions
which can be used to learn about how the agent performed. RL_num_steps
returns the number of steps in the most recent episode (if you are in
the middle of an episode, it will return the number of steps taken so
far) and RL_return returns the return (total reward) for that episode.
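The meaning of the step limit can be illustrated with a toy episode
loop. This is not the Glue's actual implementation: run_episode,
make_countdown and the simplified step counting are assumptions made
for this sketch.

```python
def run_episode(step_fn, max_steps):
    """step_fn() returns (reward, terminal); max_steps == 0 means no cap."""
    num_steps, total_reward = 0, 0.0
    terminal = False
    while not terminal and (max_steps == 0 or num_steps < max_steps):
        reward, terminal = step_fn()
        num_steps += 1
        total_reward += reward
    return num_steps, total_reward

def make_countdown(n):
    """Toy step function: reward -1 per step, terminal on the n-th step."""
    remaining = [n]
    def step():
        remaining[0] -= 1
        return -1.0, remaining[0] == 0
    return step
```

With a cap of 1000, an episode that would otherwise need 5000 steps is
cut off after 1000, while an episode that terminates in 5 steps is
unaffected. The two returned values play the roles of RL_num_steps and
RL_return.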
If we had wanted a training and then testing period,
we could have run RL_episode for a long time to train, called RL_freeze
to halt learning and exploration,
and then run RL_episode again for a testing period and collected data
on the fully trained agent's behaviour.
For more details on the other auxiliary RL_Glue functions check the more detailed guide.