Reinforcement Learning and Artificial Intelligence (RLAI)
Writing Agents, Environments and Experiments in RL-Glue
Edited by Leah Hackman
This web page provides instructions on how to get started with RL-Glue, as well as a guide for starting your first RL-Glue experiment and writing all the components for it.
If you haven't already, you will want to start by downloading the most recent update of RL-Glue.
There are three main ways in which you can use RL-Glue to set up your experiments: direct, socket/network, and mixed. In a direct approach, the agent, environment and experiment are all written in C or C++, and all three pieces are compiled together with the RL-Glue code into one executable that runs your experiment. A socketed approach has a plug-and-play feel: each piece can be written in a different language and compiled separately. The executables for your agent, environment and experiment program can even be compiled on different machines, as RL-Glue now supports sockets over networks. The final possibility is a mixed approach, where certain pieces are written in the same language and compiled together while others are compiled separately, possibly in different locations.
Below are instructions for how to write for these different situations. As the direct approach is for C/C++ code in particular, that discussion is much more code oriented, with detailed examples. The socketed/networked and mixed discussion takes a more abstract approach, to show the simplicity of the RL-Glue agent/environment/experiment interfaces and the versatility they add to your code. Details about compiling for a mixed approach are included in the socketed compilation discussion.
Working with a Direct Approach
    Writing an Experiment Program
    Writing an Environment
    Writing an Agent
    How to Compile and Run your Experiment
    C Specific Utilities
Working with a Socketed or Mixed Approach
    Writing an Experiment Program
    Writing an Environment
    Writing an Agent
    How to Compile and Run your Experiment
    Running your Experiment over the Internet/A Network
RL-Glue Variable Type Definitions and Important Files for each Language
One change in RL-Glue 2.0 is the standardization of all parameter types. The Observations, Actions, Random_seed_keys, and State_keys all share the same type. In every language they have one thing in common: they contain a list of integer values and a list of double values. Some languages, like C, also store a count of the ints in the intArray and of the doubles in the doubleArray. Others, like Java, may not need this extra information. The page linked above lists the data types for each of the supported languages of RL-Glue 2.0. It also discusses which provided files must be imported or connected to your own code for a given language.
Working with a Direct Approach
Browsing the RL Glue Interfaces before continuing is advised. Should the interface and the explanations below still leave questions or ambiguities, please feel free to contact us (by extending the FAQ page). Another way to learn to use RL-Glue is to look at some of the sample agents/environments/experiments in the library. The following assumes a C implementation in a direct hook-up scenario.
Writing an Experiment Program
Usually the shortest and easiest part of the experiment to complete, the experiment program has no interface to implement and consists mostly of calls to the already existing RL-Glue functions. The experiment program has four main duties: a) start the experiment, b) specify how long/how many times to run the experiment, c) extract data and possibly analyze it, and d) end the experiment and clean up. Note that only the RL-Glue functions are available to the experiment program. These belong to the already implemented RL_Glue interface and all follow the pattern RL_<functionname>. No agent or environment implemented functions should be directly accessed by the experiment program.
To start, experiment_program.c (this name is arbitrary) must include RL_glue.h and have a main function. The simplest experiment program you can write, without extracting any data, is as follows:
#include <stdio.h>
#include "RL_glue.h"

int main(int argc, char *argv[])
{
    /* This experiment program runs one episode. */

    /* Calls agent_init and env_init to let the agent and environment
     * create and initialize all resources necessary. */
    RL_init();

    /* Runs one episode of the experiment. The argument of 0 allows the
     * episode to run until a terminal observation is reached. */
    RL_episode(0);

    /* Calls agent_cleanup and env_cleanup to release all resources
     * allocated. */
    RL_cleanup();

    return 0;
}
However, agents need experience to learn, and one episode of an episodic task gives an agent very little to work with. A more likely experiment program runs multiple trials with multiple episodes per trial. It will also likely need to retrieve results to analyze how well an agent has performed. The following is actual code written to test an agent on a gridworld:
#include <stdio.h>
#include "RL_glue.h"

int main(int argc, char *argv[])
{
    /* A benchmark that runs the agent through the maze environment
     * 1000 episodes per trial and repeats this 100 times. I want to
     * measure the success of my agent by testing how many steps on
     * average it took to get through the maze, as well as determine
     * how many steps it took the agent at the end of each trial. */
    double episode_performance = 0.0; /* must be zero before accumulating */
    double total_performance = 0.0;
    int episode_count = 0;
    int trial_count = 0;
    int number_of_episodes = 1000;
    int number_of_trials = 100;
    int max_steps = 100; /* Using a small grid, I chose max_steps
                          * to be quite small when testing. */

    for (trial_count = 0; trial_count < number_of_trials; trial_count++) {
        /* Calling RL_init at the beginning of each trial resets the
         * value function for each trial. */
        RL_init();

        for (episode_count = 0; episode_count < number_of_episodes; episode_count++) {
            RL_episode(max_steps);
            episode_performance += RL_num_steps() * (1.0 / number_of_episodes);
        }

        total_performance += episode_performance * (1.0 / number_of_trials);
        episode_performance = 0.0;

        /* It is important to call RL_cleanup once for every call to
         * RL_init to de-allocate resources properly. */
        RL_cleanup();
    }

    printf("the agent takes %f steps on average\n", total_performance);
    return 0;
}
In this experiment, we gather data over 100 trials, each of which includes 1000 episodes. Note that RL_init and RL_cleanup are called at the beginning and end of each trial. A call to RL_init will clear the value function and any other values that need to be reset with each trial. If you do not call RL_cleanup after each call to RL_init you may wind up with memory leaks in your experiment.
New to the above example is RL_num_steps. This function provides the number of steps at the end of each episode. The other basic RL_Glue function for performance evaluation is RL_return(), which returns the return from the most recently run episode. Functions like RL_freeze, RL_get_state/RL_set_state and RL_get_random_seed/RL_set_random_seed also exist to allow users to analyze the agent's actions in particular situations. However, the use of these functions varies according to the user's needs and the user's own implementation of the complementary functions like agent_freeze, env_get_state etc. Please refer to the RL Glue Interfaces for more details on these functions.
Any need to send information or gather
information from the environment and agent should be handled through RL_agent_message or RL_env_message. The details of what
can be passed in and out of the agents and environments are dependent
upon the user written agent and environment code. For a more in
depth discussion see agent_message
and env_message
below.
Lastly, you may have noticed RL_start
and RL_step were used
in RL_episode. The experiment
program author has access to these as well. Should the experiment
program need to print out a trace of the actions taken, the
observations seen, or the rewards given, using RL_start and RL_step instead of RL_episode will provide access to
each observation, reward, and action per step.
Writing an Environment
Often the easiest place to start is to make a list of what needs to be done. There are two things absolutely necessary to write an environment: all functions from the RL Glue Interface must be defined (this is different than RL-Glue 1.0) and the environment code must include RL_common.h. Accordingly, the following header file would be a good place to start. Again, the choice of environment.h as a name was arbitrary.
#ifndef Environment_h
#define Environment_h

#include <RL_common.h>

Task_specification env_init();
Observation env_start();
Reward_observation env_step(Action a);
void env_cleanup();
void env_set_state(State_key sk);
void env_set_random_seed(Random_seed_key rsk);
State_key env_get_state();
Random_seed_key env_get_random_seed();
char* env_message(char *);

#endif
Following this, the next best place to
start is to decide how to represent the states and actions for
the environment. One change from RL-Glue 1.0 is the
standardization of the observation and action types. Both observations
and actions are now required to be of the following form (which can be
found here along with all other data type
requirements).
typedef struct
{
    unsigned int numInts;    /* the number of ints in the int array */
    unsigned int numDoubles; /* the number of doubles in the double array */
    int* intArray;
    double* doubleArray;
} Observation; /* the same definition applies for actions */
Now the representation of the state information and the actions must be pared down to a list of ints and/or a list of doubles. For example, an observation struct for a grid world could have two ints: one each for the x and y coordinates of the agent. The action struct in such a situation could hold a single int between one and four, representing north, east, south, and west. Determining how best to condense state and action information into a succinct numeric representation is a skill that comes with practice.
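As a rough illustration of the grid world encoding described above, the sketch below fills an observation with the agent's x and y coordinates. It is only a sketch: Sketch_observation is a local stand-in that copies the struct layout shown earlier so the example compiles on its own, and the helper names and the NORTH..WEST constants are hypothetical, not part of RL-Glue.

```c
#include <stdlib.h>

/* Local stand-in mirroring the Observation layout shown above;
 * real code would take the type from RL_common.h instead. */
typedef struct {
    unsigned int numInts;
    unsigned int numDoubles;
    int *intArray;
    double *doubleArray;
} Sketch_observation;

/* Hypothetical action codes: the numbers one to four. */
enum { NORTH = 1, EAST = 2, SOUTH = 3, WEST = 4 };

/* Build an observation holding the agent's (x, y) cell as two ints.
 * Returns 0 on success, -1 if the allocation fails. */
int make_grid_observation(Sketch_observation *o, int x, int y)
{
    o->numInts = 2;
    o->numDoubles = 0;
    o->doubleArray = NULL;
    o->intArray = (int *)malloc(2 * sizeof(int));
    if (o->intArray == NULL) {
        o->numInts = 0;
        return -1;
    }
    o->intArray[0] = x;
    o->intArray[1] = y;
    return 0;
}

/* Release the array allocated by make_grid_observation. */
void free_grid_observation(Sketch_observation *o)
{
    free(o->intArray);
    o->intArray = NULL;
    o->numInts = 0;
}
```

An action can be built the same way, with numInts set to 1 and the single int holding one of the four direction codes.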
Another factor to consider when
beginning to write an environment is how to capture the state
transition function and the reward function. In a small state space,
a case statement where the states are enumerated may suffice, or, in
a
continuous problem, it may be more appropriate to use a function of one
or two factors
representing the environment. For example, in the mountain car
environment the state may be represented by distance from the goal and
velocity, and the transitions are calculated by a function of the
two state factors.
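For a small state space, the case-statement idea might look like the following sketch of a 3x3 gridworld. The Grid_state type, the function names, the wall behaviour and the reward values are all assumptions made for illustration; your own environment defines these however it needs to.

```c
/* Hypothetical 3x3 gridworld: cells are (x, y) with 0 <= x, y <= 2.
 * Actions: 1 = north, 2 = east, 3 = south, 4 = west. */
#define GRID_SIZE 3

typedef struct {
    int x;
    int y;
} Grid_state;

/* State transition function: move one cell, staying put at an edge. */
Grid_state grid_transition(Grid_state s, int action)
{
    switch (action) {
    case 1: if (s.y < GRID_SIZE - 1) s.y++; break; /* north */
    case 2: if (s.x < GRID_SIZE - 1) s.x++; break; /* east */
    case 3: if (s.y > 0) s.y--; break;             /* south */
    case 4: if (s.x > 0) s.x--; break;             /* west */
    default: break;                                /* unknown action: no move */
    }
    return s;
}

/* Reward function (an assumption for this sketch): -1 per step,
 * 0 on reaching the goal in the corner cell (2, 2). */
double grid_reward(Grid_state next)
{
    return (next.x == GRID_SIZE - 1 && next.y == GRID_SIZE - 1) ? 0.0 : -1.0;
}
```

A continuous problem such as mountain car would replace the switch with arithmetic on the state factors.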
Once the representation for the observations and actions has been established, writing the task_specification is a cakewalk. Note that the third and fourth parts of the task_specification should reflect the int and double arrays from the observation and action structs respectively. If no memory or other resources are required by the environment, returning the task_specification is the only required functionality of env_init. If any variables need to be initialized every trial (as opposed to every episode), this should be done in env_init as well. After writing env_init, env_cleanup is a natural progression. The job of env_cleanup is the opposite of env_init: free all the resources allocated throughout the environment.
Now that env_init has set up the environment, you will need to write env_start to start it. The only duty of env_start is to choose a start state and return the observation struct which represents that state. After env_start is called, the action, observation, reward cycle between the agent and environment begins. The agent takes the start state, chooses an action, and then waits for the environment to specify what the consequences are. These "consequences" come from env_step. In env_step, the environment accepts the agent's action and uses it to determine what state to transition to, and what reward was earned as a result of taking that action. As such, it is apparent that env_step is where the implementation of the state transition and reward functions belongs. To implement this, information about the previous state is most likely necessary, so it is a good idea to store this information from step to step. One important thing to notice about env_start and env_step is that while env_start returns an Observation, env_step returns a Reward_observation struct. When a data type or struct appears that is unfamiliar, the first place to look is this page about RL_common.h. The definition from RL_common.h is as follows:
typedef struct Reward_observation_t
{
    Reward r;      /* typedef double Reward */
    Observation o;
    int terminal;  /* 0 for false, 1 for true */
} Reward_observation;
Normally the variable terminal is set to 0. Changing terminal to 1 signifies terminal
state has been reached and RL-Glue will then proceed to call
agent_end and cleanup. Remember to set the terminal variable if the
terminal state is reached. Again, the best place to get a feel
for what these functions should look like is in the examples provided
in the library.
The following is a simple sample env_step function:
Reward_observation env_step(Action a)
{
    /* NOTE: In C, assigning one struct to another (o1 = o2) copies only
     * the array pointers, not the arrays they point to. Similarly,
     * struct comparison cannot be done using the == operator.
     * To avoid requiring specific knowledge about the observation and
     * action representation for this example, real struct copying and
     * comparison have been omitted. */
    Reward_observation ro;
    Observation next_observation;
    Reward next_reward;

    next_observation = compute_next_state(a, old_o);
    next_reward = compute_reward(a, old_o);

    ro.o = next_observation;
    ro.r = next_reward; /* the reward field is named r in Reward_observation */

    if (next_observation == TERMINAL) /* placeholder comparison, see NOTE */
        ro.terminal = 1; /* TRUE */
    else
        ro.terminal = 0; /* FALSE */

    old_o = ro.o;
    return ro;
}
The functions compute_next_state and compute_reward are placeholders for the environment's state transition function and reward function, which are up to the environment author to design. TERMINAL and old_o in this case are previously stored values which hold the terminal observation and the previous observation in this environment.
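The struct copying and comparison that the sample env_step leaves out could be written along the following lines. This is a sketch under assumptions: Sketch_observation is a local copy of the observation layout so the example stands alone, and copy_observation and observations_equal are hypothetical helper names rather than RL-Glue functions.

```c
#include <stdlib.h>
#include <string.h>

/* Local stand-in mirroring the Observation layout from RL_common.h. */
typedef struct {
    unsigned int numInts;
    unsigned int numDoubles;
    int *intArray;
    double *doubleArray;
} Sketch_observation;

/* Deep-copy src into dst, allocating fresh arrays for dst.
 * dst must not already own memory. Returns 0 on success, -1 on failure. */
int copy_observation(Sketch_observation *dst, const Sketch_observation *src)
{
    dst->numInts = 0;
    dst->numDoubles = 0;
    dst->intArray = NULL;
    dst->doubleArray = NULL;

    if (src->numInts > 0) {
        dst->intArray = (int *)malloc(src->numInts * sizeof(int));
        if (dst->intArray == NULL)
            return -1;
        memcpy(dst->intArray, src->intArray, src->numInts * sizeof(int));
    }
    if (src->numDoubles > 0) {
        dst->doubleArray = (double *)malloc(src->numDoubles * sizeof(double));
        if (dst->doubleArray == NULL) {
            free(dst->intArray);
            dst->intArray = NULL;
            return -1;
        }
        memcpy(dst->doubleArray, src->doubleArray,
               src->numDoubles * sizeof(double));
    }
    dst->numInts = src->numInts;
    dst->numDoubles = src->numDoubles;
    return 0;
}

/* Return 1 when both observations hold the same values, 0 otherwise. */
int observations_equal(const Sketch_observation *a, const Sketch_observation *b)
{
    unsigned int i;
    if (a->numInts != b->numInts || a->numDoubles != b->numDoubles)
        return 0;
    for (i = 0; i < a->numInts; i++)
        if (a->intArray[i] != b->intArray[i])
            return 0;
    for (i = 0; i < a->numDoubles; i++)
        if (a->doubleArray[i] != b->doubleArray[i])
            return 0;
    return 1;
}
```

With helpers like these, the terminal test becomes a call to observations_equal, and storing old_o becomes a call to copy_observation after freeing the previous copy.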
Should the functionality of saving states or random seeds be required, there is one detail which should be observed. State keys and random seed keys are stored in the same abstract type as observations and actions. They can be represented in any way desired, whether that be a hash table in the environment returning a key or compressing all important information into a unique collection of doubles and ints. However, once env_cleanup is called, all memory of that key should be removed.
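One simple way to follow the second option is to pack the whole environment state into the key itself. The sketch below does this for a gridworld whose only state is the agent's position. Sketch_key is a local stand-in for the key type, the function names are hypothetical, and the single static buffer means only the most recently issued key is valid, which is an assumption made to keep the example short.

```c
#include <stddef.h>

/* Local stand-in for State_key, which shares the observation layout. */
typedef struct {
    unsigned int numInts;
    unsigned int numDoubles;
    int *intArray;
    double *doubleArray;
} Sketch_key;

static int agent_x = 0;     /* hypothetical internal gridworld state */
static int agent_y = 0;
static int key_storage[2];  /* buffer owned by the environment */

/* Pack the current position into a key. */
Sketch_key sketch_get_state(void)
{
    Sketch_key k;
    key_storage[0] = agent_x;
    key_storage[1] = agent_y;
    k.numInts = 2;
    k.numDoubles = 0;
    k.intArray = key_storage;
    k.doubleArray = NULL;
    return k;
}

/* Restore the position from a key; malformed keys are ignored. */
void sketch_set_state(Sketch_key k)
{
    if (k.numInts == 2 && k.intArray != NULL) {
        agent_x = k.intArray[0];
        agent_y = k.intArray[1];
    }
}
```

Because the buffer is static, nothing needs to be freed here. An environment that hands out heap-allocated keys would release them in env_cleanup.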
Looking at the RL-Glue functions, it is easy to see which environment functions are called by which RL-Glue functions, and so only those functions being called need to be meaningfully implemented. For example, should the experiment program never call RL_get_state, then env_get_state can return a dummy State_key. To run a useful experiment, the base layer of RL-Glue functionality is minimally required. This includes: env_init, env_start, env_step, and env_cleanup. The other functions can be left as stubs, where the bodies of the functions are empty and the return values are meaningless. As your experiment comes to require the more specialized functions like RL_get_state and RL_env_message, you may choose to flesh out your implementation of these functions. If your environment is being publicly released, or used in a competition, it is prudent to fully implement all the functions according to the RL-Glue specification.
If any desired
functionality is missing in the environment functions, it can easily be
accommodated using the env_message
function. The env_message has no required functionality. When an
experimenter requires the ability to change an element of the
environment mid experiment, or be able to gather data from the
environment, env_message is
the solution. For example, if the goal is to test an agent in a
changing environment, after a few episodes a message could be passed
through RL_env_message telling the
environment to add a wall to a gridworld or change the physics of the
mountain car environment. Alternatively, the env_message functionality could
be used to gain information about the current state of the environment
(given that the requested information can be conveyed in a string). The
functionality provided by env_message is up to the author
of the environment.
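To make the gridworld wall example concrete, here is one possible shape for a message handler. The message strings "add_wall X Y" and "count_walls", the reply strings, and the function name sketch_env_message are all invented for this sketch, since the protocol is entirely up to the environment author.

```c
#include <stdio.h>
#include <string.h>

#define GRID_SIZE 3

static int wall[GRID_SIZE][GRID_SIZE]; /* 1 where a wall blocks the cell */
static char reply[64];                 /* buffer returned to the caller */

/* Hypothetical message protocol:
 *   "add_wall X Y"  marks cell (X, Y) as a wall
 *   "count_walls"   reports how many walls exist */
char *sketch_env_message(const char *msg)
{
    int x, y;

    if (sscanf(msg, "add_wall %d %d", &x, &y) == 2) {
        if (x >= 0 && x < GRID_SIZE && y >= 0 && y < GRID_SIZE) {
            wall[x][y] = 1;
            strcpy(reply, "ok");
        } else {
            strcpy(reply, "error: out of range");
        }
    } else if (strcmp(msg, "count_walls") == 0) {
        int i, j, n = 0;
        for (i = 0; i < GRID_SIZE; i++)
            for (j = 0; j < GRID_SIZE; j++)
                n += wall[i][j];
        sprintf(reply, "%d", n);
    } else {
        strcpy(reply, "error: unknown message");
    }
    return reply;
}
```

The experiment program would reach a handler like this through RL_env_message, for example after a fixed number of episodes.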
Writing an Agent
As with the environment, the only two major restrictions on an agent are that it implements all the necessary functions as described in the RL Glue Interface, and that it includes RL_common.h. The following basic header file (the title Agent.h is entirely arbitrary) provides a starting point for an agent:
#ifndef Agent_h
#define Agent_h

#include <RL_common.h>

void agent_init(Task_specification task_spec);
Action agent_start(Observation o);
Action agent_step(Reward r, Observation o);
void agent_end(Reward r);
void agent_cleanup();
void agent_freeze();
Message agent_message(Message);

#endif
The purpose of agent_init is to allocate any memory/resources required by the agent as well as initialize any values which need to be reset at the beginning of every trial/run. Values which should persist across episodes can be initialized here, while values which need to be initialized each episode should be set in agent_start. An example of something to initialize in agent_init is your value function. The task_specification is received from the environment as a description of the task and the action/state space. The agent should parse the task_specification string to learn what the observation and action spaces are for the environment and, based on this new knowledge, set up an appropriate value function. A parser already exists in both C and C++ in the RL-Glue Utils folder (click for more details). For example, in the case of a 3x3 gridworld, a sample task_specification would be 1:e:2_[i,i]_[0,2]_[0,2]:1_[i]_[0,3]. This task spec tells the agent that the world is represented by 2 integers (the x and y coordinates), both of which are between zero and two (i.e. a 3x3 gridworld), and that it can take one of four actions, represented by the numbers zero through three. Having received this information, the agent could then set up a 3x3x4 array to contain the state-action values. Again, agent_cleanup is closely related to agent_init in that everything agent_init creates, agent_cleanup should destroy.
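The sketch below shows roughly how an agent might turn the sample task_specification above into a 3x3x4 table. It is not the parser from the RL-Glue Utils folder: read_grid_spec is a hypothetical fixed-format reader that only understands strings shaped exactly like the example, and Grid_spec, alloc_value_table and value_index are names made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Ranges read from a task spec shaped like the gridworld example. */
typedef struct {
    int x_min, x_max;
    int y_min, y_max;
    int a_min, a_max;
} Grid_spec;

/* Fixed-format reader for two int observations and one int action.
 * Returns 0 on success, -1 if the string has a different shape. */
int read_grid_spec(const char *task_spec, Grid_spec *g)
{
    int n = sscanf(task_spec,
                   "%*d:%*c:2_[i,i]_[%d,%d]_[%d,%d]:1_[i]_[%d,%d]",
                   &g->x_min, &g->x_max, &g->y_min, &g->y_max,
                   &g->a_min, &g->a_max);
    return (n == 6) ? 0 : -1;
}

/* Allocate a zeroed table with one entry per (x, y, action) triple.
 * The caller frees it, typically in agent_cleanup. */
double *alloc_value_table(const Grid_spec *g, size_t *count)
{
    size_t nx = (size_t)(g->x_max - g->x_min + 1);
    size_t ny = (size_t)(g->y_max - g->y_min + 1);
    size_t na = (size_t)(g->a_max - g->a_min + 1);
    *count = nx * ny * na;
    return (double *)calloc(*count, sizeof(double));
}

/* Flatten (x, y, action) into an index into the table. */
size_t value_index(const Grid_spec *g, int x, int y, int a)
{
    size_t ny = (size_t)(g->y_max - g->y_min + 1);
    size_t na = (size_t)(g->a_max - g->a_min + 1);
    return ((size_t)(x - g->x_min) * ny + (size_t)(y - g->y_min)) * na
           + (size_t)(a - g->a_min);
}
```

A flat array with an index function is used here because the table dimensions are only known once the task spec has been read.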
Agent_start is called once at the beginning of each episode as the first step of the episode. Given the first observation (coming from env_start), the agent must decide what action to take based on the agent's initial policy. As there is no reward for the first step, no learning update needs to be done. In agent_step, the agent must again consult its policy to determine what action to take; however, the agent should also perform any value function updates/policy improvement required for learning. The parameters passed in with agent_step are the reward received for the previous action and the current observation, so it is likely helpful to save the previous action and observation for learning updates. An example of an agent_step method for a tabular sarsa agent in a gridworld is given below:
Action agent_step(Reward r, Observation o)
{
    /* NOTE: assigning structs in an "s1 = s2" fashion copies only the
     * array pointers, and arrays cannot be indexed by structs. The
     * following code uses these simplifications for clarity and brevity. */

    /* egreedy picks the highest valued action most of the time, and
     * picks a random action with probability epsilon. */
    Action action = egreedy(o);

    value[previous_observation, previous_action] +=
        alpha * (r + gamma * value[o, action]
                 - value[previous_observation, previous_action]);

    previous_observation = o;
    previous_action = action;
    return action;
}
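For comparison, the same sarsa step can be written as compilable C once states and actions are reduced to plain integer indices. This is a sketch under assumptions: the state encoding (an int from 0 to 8), the parameter values, and every function name here are illustrative, and a real agent would first extract these indices from the Observation and Action structs.

```c
#include <stdlib.h>

#define N_STATES 9  /* 3x3 gridworld, assumed encoding: state = y * 3 + x */
#define N_ACTIONS 4

static double value[N_STATES][N_ACTIONS]; /* zero-initialized */
static double alpha = 0.1;    /* step size */
static double discount = 0.9; /* called gamma in the text */
static double epsilon = 0.1;  /* exploration probability */
static int previous_state = 0;
static int previous_action = 0;

/* Index of the highest valued action in a state (first one on ties). */
int greedy_action(int state)
{
    int a, best = 0;
    for (a = 1; a < N_ACTIONS; a++)
        if (value[state][a] > value[state][best])
            best = a;
    return best;
}

/* Epsilon-greedy: random action with probability epsilon, else greedy. */
int egreedy(int state)
{
    if ((double)rand() / ((double)RAND_MAX + 1.0) < epsilon)
        return rand() % N_ACTIONS;
    return greedy_action(state);
}

/* The sarsa update for one (s, a, r, s', a') transition. */
void sarsa_update(int s, int a, double r, int next_s, int next_a)
{
    double target = r + discount * value[next_s][next_a];
    value[s][a] += alpha * (target - value[s][a]);
}

/* Index-based counterpart of agent_step. */
int sketch_agent_step(double r, int state)
{
    int action = egreedy(state);
    sarsa_update(previous_state, previous_action, r, state, action);
    previous_state = state;
    previous_action = action;
    return action;
}
```

Keeping the update in its own function makes it easy to reuse from agent_end, where there is no next state and the target is just the final reward.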
If the
environment is episodic, the agent_end
method will be called to allow
for the experimenter to complete the last step of his/her learning
algorithm.
Agent_freeze is a new function
to RL_Glue 2.0. Agent_freeze
should
freeze the agent's policy and value
function so that the agent is no longer learning and is behaving
consistently. One easy mistake is to forget any randomness in a policy.
If an agent is implementing an epsilon greedy policy as above, the
agent will have to remove the epsilon randomness after agent_freeze is
called.
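One way to handle the epsilon randomness is a flag that both action selection and learning consult. The sketch below assumes a hypothetical frozen flag and helper names; the only part taken from the interface is the idea that agent_freeze switches the agent into a non-learning, consistent mode.

```c
#include <stdlib.h>

#define N_ACTIONS 4

static double epsilon = 0.1; /* exploration probability while learning */
static int frozen = 0;       /* set once the agent has been frozen */

/* What agent_freeze might do in this sketch. */
void sketch_agent_freeze(void)
{
    frozen = 1;
}

/* Epsilon-greedy while learning, purely greedy once frozen. */
int choose_action(const double *action_values)
{
    int a, best = 0;

    if (!frozen && (double)rand() / ((double)RAND_MAX + 1.0) < epsilon)
        return rand() % N_ACTIONS;

    for (a = 1; a < N_ACTIONS; a++)
        if (action_values[a] > action_values[best])
            best = a;
    return best;
}

/* Learning updates should be skipped when this returns 0. */
int learning_enabled(void)
{
    return !frozen;
}
```

With this arrangement, agent_step would call learning_enabled before applying its value function update.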
Almost anything that isn't covered by these functions can be attempted using the agent_message function. There are no guidelines for agent_message other than that it takes in a string and returns a string. Perhaps the input string could be a signal to change the current value of the learning rate alpha, or a request that the agent return a string representation of the value function. Either way, it is up to the experimenter to define agent_message according to his/her own needs.
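As an illustration of the alpha example, the handler below accepts two made-up messages, "set_alpha V" and "get_alpha". The message names, the accepted range for alpha and the reply strings are assumptions of this sketch, not anything RL-Glue prescribes.

```c
#include <stdio.h>
#include <string.h>

static double alpha = 0.1; /* current learning rate */
static char reply[64];     /* buffer returned to the caller */

/* Hypothetical message protocol:
 *   "set_alpha V"  sets the learning rate to V, where 0 < V <= 1
 *   "get_alpha"    reports the current learning rate */
char *sketch_agent_message(const char *msg)
{
    double new_alpha;

    if (sscanf(msg, "set_alpha %lf", &new_alpha) == 1) {
        if (new_alpha > 0.0 && new_alpha <= 1.0) {
            alpha = new_alpha;
            strcpy(reply, "ok");
        } else {
            strcpy(reply, "error: alpha out of range");
        }
    } else if (strcmp(msg, "get_alpha") == 0) {
        sprintf(reply, "%.3f", alpha);
    } else {
        strcpy(reply, "error: unknown message");
    }
    return reply;
}
```

Returning an explicit error string for unrecognized messages makes typos in the experiment program easier to spot.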
How to Compile and Run your Experiment
In a direct environment, compilation of the RL-Glue code is done every time a new or changed agent, environment or experiment program is introduced. The process is not difficult, however while working on a project it can grow tedious to enter the compilation commands repeatedly and can leave room for errors. To help alleviate these issues, and make compilation even simpler, a makefile is suggested. Sample makefiles are included with the RL-Glue software with the different example experiments. As it is best to learn by example, and learning to write makefiles is beyond the scope of this page, we will only discuss modifying these makefiles to work with your code. The makefile described below is found with the direct project in your Examples directory.
Firstly, this example makefile makes some assumptions about
the locations of your files. The first assumption is that you have not
moved your examples folder at all. If you have moved your examples
folder, this is easily fixed by changing the RL-GLUE variable in the
makefile. This line is as follows:
# Path to RL-Glue to find RL_glue and related files.
RL-GLUE = ../../RL-Glue
You must replace ../../RL-Glue with the correct path of your RL-GLUE
directory. In general, ./ denotes to look in the current directory
(where the makefile is held), ../denotes to look up one directory (and
../../ would look up two etc), and ../Code/RL-Glue would look up one
directory and then go into the directory named Code to find the RL-Glue
directory (For a more detailed explanation of traversing a file system,
please look up a more detailed guide).
The second assumption is that your Agent, Environment and Experiment
are in the src directory in the same directory as the makefile. If you
have moved them, you will have to change the following lines:
# Compile our agent, environment, and experiment
$(OBJECTS): %.o: src/%.c
	$(CC) -c $(CFLAGS) $< -o Build/$@
If you have moved your
agent, environment and experiment program code to the same place,
replace src/ with the path
for the objects. If you wish to put your agent, environment and
experiment program in separate locations you will have to write a
target (the above is a target for all three) for each piece. For
example, if we created an agent directory, an environment directory and
an experiment program directory we would change the above to look like
this:
# Compile our agent, environment, and experiment
Agent.o: Agent_Folder/Agent.c
	$(CC) -c $(CFLAGS) $< -o Build/$@

Environment.o: Environment_Folder/Environment.c
	$(CC) -c $(CFLAGS) $< -o Build/$@

Experiment.o: Experiment_program_folder/Experiment.c
	$(CC) -c $(CFLAGS) $< -o Build/$@
If
you haven't moved your files but wish to change the name of the agent,
environment or experiment (I imagine this would be a popular option, as
the whole world may not want to name their environment mines or their agent SarsaAgent) you must change the
following line:
OBJECTS = SarsaAgent.o mines.o experiment.o
Note that these are the
object files, not the C/C++ files. You don't need to put in the C/C++
files, only the name of the object file (However, if you are changing
the build commands, your object files must
have the same name as your C/C++ files: ie MyAgent.cpp becomes
MyAgent.o etc).
If you need to compile other files that are in the same directory as your agent, environment and experiment, you can add the name of their object file (File.c becomes File.o) to the OBJECTS list. If they exist elsewhere, you will want to compile them separately (like we did above when the environment, agent and experiment were not in the same folder) and add them to the list of things linked in with RL_glue (as done with File.o in the example below).
# Link our objects into RL_glue
RL_glue: RL_glue.o $(OBJECTS) File.o
	$(CC) -o $@ Build/RL_glue.o $(addprefix Build/, $(OBJECTS)) Build/File.o
The only other portion of this makefile that should
be changed is the compiler flags. To change the compiler and compiler
flags being used
you can change the following lines:
# Compiler flags
CC = gcc
CFLAGS = -I$(RL-GLUE)/ -ansi -pedantic -Wall
LDFLAGS =
If your code is in C++, remember to change the compiler, denoted by the CC variable, to g++ or your own preferred C++ compiler. The compiler flags are held in CFLAGS. Our example uses four flags, so I will briefly discuss why each is there. -I$(RL-GLUE)/ is slightly confusing as it is a fusion of a compiler flag and a makefile variable. The -I is a compiler flag telling the compiler to also search another directory for header files, and it should be followed by the directory to search. Here it is followed by a makefile variable: $(RL-GLUE) inserts whatever is stored in the RL-GLUE variable, so -I$(RL-GLUE)/ really says -I../../RL-Glue/. This lets the compiler see RL_common.h and RL_glue.h, which your agent/environment/experiment program need to compile. If you need to access other files, such as the Glue Utilities, you should add another -I flag to your CFLAGS pointing to where those files reside. The -ansi flag asks gcc to hold your code to the ANSI C standard, while -pedantic and -Wall request additional warnings from the compiler. These can be harmlessly removed, however they help you write safer code (for more details look up the man pages for gcc). In this example there was no need for linking flags, denoted by LDFLAGS, however you can add your own.
After saving any changes and quitting your makefile you should be able to type make in the directory with your makefile and it will create an executable named RL_glue. This executable, depending upon the example, will either appear in the working directory or in a bin directory within your working directory. The RL_glue executable is your experiment: just type ./RL_glue (or ./bin/RL_glue if the executable was put in a bin directory) to run it and wait for your results! Two other useful things to know: if you want to get rid of the .o files you can type make clean, and make tidy will remove old saved versions of your code (i.e. the agent.c~ files created by emacs).
Working with a Socketed or Mixed Approach
Writing the agents/environments/experiments with a socketed approach has no added restrictions or protocol changes in comparison to a direct approach. The only real differences are the compilation process involved and the advantage of being able to use different languages for each piece (which of course comes with the features and drawbacks of the various languages as well). It is entirely possible to take a C agent, environment, and experiment which were previously all directly connected together, and compile them to run with sockets without any changes being made to the agent/environment/experiment files. That being said, the following sections take a more abstract look at writing agents, environments and experiments, with less emphasis on implementation details than above. Language specific details can be located on the RL_common page. Lastly, we will discuss how to compile your files to run with a socketed approach.
Writing an
Experiment Program
As was mentioned
above, the most important thing to keep in mind when
writing an Experiment, Agent or Environment is knowing the RL Glue Interfaces.
In any language, the RL-Glue functions are available to the experiment
program and the agent/environment functions should not/cannot be
directly accessed. The duties of the experiment program are still the
same as described above: a) start the experiment b) specify how
long/how many
times to run the experiment c) extract data and possibly analyze d) end
the experiment and clean up.
The simplest experiment program to write runs exactly one episode and consists of only three function calls: RL_init(), RL_episode(int), and RL_cleanup(). Most variations of the experiment program are simple loops of this form. A major detail to keep in mind is that for every call to RL_init, a corresponding call to RL_cleanup must be made.
Typically experiments will require many trials/runs consisting of many episodes. This is done by adding loops in the appropriate spots in the above basic experiment program. The basic form is as follows:
for(number of trials)
    RL_init()
    for(number of episodes per trial)
        RL_episode(int)
    RL_cleanup()
The most pivotal detail to note is that RL_init and RL_cleanup are only called on a per trial basis, not a per episode basis. This is because a call to RL_init will reset the value function and policy.
While the above three functions will handle agent learning, evaluating the agent's progress requires knowledge of the RL-Glue helper functions. The RL_return and RL_num_steps functions provide information about the agent's performance. RL_freeze enables the experiment program to freeze an agent's policy so that the experiment can consist of a period of training and a period of testing to evaluate the final value function and policy. RL_get_state and RL_set_state allow the experiment program to capture the state of the environment and reload it later. This does NOT affect the agent's policy however. One use would be to save a particularly difficult position in the environment and place the agent back in that situation later, after much training, to see if the agent's performance has improved. RL_get_random_seed and RL_set_random_seed will save and restore the "random" numbers being used by the environment to simulate any randomness in the environment's state transition/reward functions. Note that you will likely want to initialize the random seed at the beginning of the environment program. If the experiment program's goal is to train the agent on a particular string of events repeatedly, the random_seed functions can be used to ensure "consistent randomness" over the events. RL_env_message and RL_agent_message can be used to talk to the environment or agent. This may be done to change internal parameters or to extract data from the environment/agent which cannot be obtained through any of the other RL functions. For more details check out env_message or agent_message.
Lastly, there are
two functions, RL_start and RL_step which are used in RL_episode, however the user has
access to these as well. Should the experiment program need to print
out a trace of the actions taken, the observations seen, or the
rewards given, using RL_start and
RL_step instead of RL_episode will provide access to
the required data. The data type returned by RL_start and RL_step is provided with the RL_common information.
Writing an Environment
One of the changes in RL-Glue 2.0 is that all functions in the interface must be defined. For the Environment, this means writing the following nine functions:
Task_specification env_init()
Observation env_start()
Reward, Observation, Terminal env_step(Action)
void env_cleanup()
void env_set_state(State_key)
void env_set_random_seed(Random_seed_key)
State_key env_get_state()
Random_seed_key env_get_random_seed()
Message env_message(Message)
Depending upon which language is being used, the true underlying data
structures for the above data types (such as the Observation, Reward,
Task_specification etc.) will change. For each language RL-Glue
currently works with, it comes with an RL_common file (see the RL_common
page for details on your language) containing the definitions of these
data types. For example, when writing in C one would use RL_common.h,
and when using Python one would use RL_common.py. It is imperative to
use the types defined in the RL_common file. In C, the header file
(RL_common.h) must be #included by the environment file. Other
languages have an analogous mechanism, such as Python's import
module_name.
The best place to
start when writing an
environment is deciding how to represent the state of the environment.
According to the RL_common definition of Observations, a state must be
expressible as a collection of ints and/or doubles. Internally, you may
represent a state with any data type you choose, be it a string, an
int, or even images. The caveat is that there must be a way of
extracting a representation made up entirely of ints and/or doubles
from the state to pass out of the Environment. An example of
this is the cat
and mouse game. The environment could internally keep a map of the
environment, knowing where the edges and traps and walls are in any
manner it likes just so long as it is able to relay the position of the
mouse and the position of the cat to the agent through ints and
doubles. One implementation may be to enumerate all the possible
combinations of where the mouse and the cat are and return the single
int to represent the state Observation. Another possibility
is to return the x and y coordinates of both the mouse and the cat.
Determining how to represent the Observation is a key decision to
make.
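The two cat and mouse encodings mentioned above can be sketched as follows. The grid size and every function name here are hypothetical, chosen only for illustration; plain ints and an int array stand in for the Observation type from RL_common.

```c
#include <assert.h>

#define GRID_W 4  /* hypothetical grid width  */
#define GRID_H 3  /* hypothetical grid height */

/* Map one (x, y) cell to an index in [0, GRID_W * GRID_H). */
static int cell_index(int x, int y)
{
    assert(x >= 0 && x < GRID_W && y >= 0 && y < GRID_H);
    return y * GRID_W + x;
}

/* Option 1: enumerate every (mouse, cat) combination as a single int. */
int encode_single_int(int mouse_x, int mouse_y, int cat_x, int cat_y)
{
    int cells = GRID_W * GRID_H;
    return cell_index(mouse_x, mouse_y) * cells + cell_index(cat_x, cat_y);
}

/* Option 2: report the four coordinates directly. */
void encode_coordinates(int mouse_x, int mouse_y, int cat_x, int cat_y,
                        int out[4])
{
    out[0] = mouse_x;
    out[1] = mouse_y;
    out[2] = cat_x;
    out[3] = cat_y;
}
```

The single-int form suits a tabular agent, since it can index a value table directly; the coordinate form keeps the geometry visible, which may help an agent that generalizes across nearby positions.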
After determining
how to represent the current state of the
environment, the next step is to determine what the state transition
function and the reward function
will be. These can be highly explicit, where each state and
action pair is stored in a table with a corresponding entry for the
next state or reward. Others, where the state/action space is much too
large or is continuous, can be based on some calculation on the current
state/action values. For example, in mountain
car the state can be represented by the distance from the goal and the
current velocity and therefore the next state can be calculated by some
combination of the two and the next action.
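The explicit, table-driven case described above might look like the following sketch. The four-state corridor, its rewards, and the function names transition and reward are all assumptions made for illustration.

```c
#define N_STATES  4
#define N_ACTIONS 2  /* 0 = left, 1 = right */

/* next_state[s][a]: the state reached by taking action a in state s.
 * State 3 is terminal and loops back to itself. */
static const int next_state[N_STATES][N_ACTIONS] = {
    {0, 1}, {0, 2}, {1, 3}, {3, 3}
};

/* reward_table[s][a]: the reward earned by taking action a in state s. */
static const double reward_table[N_STATES][N_ACTIONS] = {
    {-1.0, -1.0}, {-1.0, -1.0}, {-1.0, 0.0}, {0.0, 0.0}
};

/* State transition function: a plain table lookup. */
int transition(int s, int a)
{
    return next_state[s][a];
}

/* Reward function: a plain table lookup. */
double reward(int s, int a)
{
    return reward_table[s][a];
}
```

For a large or continuous problem such as mountain car, the two lookups would be replaced by functions that compute the next state and reward from the current state and action values.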
Env_init
provides a place to initialize any data structures necessary to
represent the environment and also provides an opportunity to
initialize any data which must persist between episodes (recall an
episode is a sequence of RL_start,
RL_step). Lastly env_init
must return the task_specification.
This is a
string (or string-like data structure, you can check the RL_common page for a
particular language) which specifies the details of the
Observation and Action space. The protocol for writing one is found
here: task_specification.
Env_start marks the official start of each episode. It initializes any
values which need to be set on an episodic basis and returns the
initial
Observation based on placing the agent in the start state. After
env_start
is called, the action-observation-reward cycle between the agent and
environment begins. The agent takes the start state, chooses an
action, and then waits for the environment to specify what the
consequences are. These "consequences" come from env_step. In env_step,
the environment accepts the agent's action and uses it to determine
what state to transition to, and what reward was earned as a result of
taking that action. One thing to note is that env_start
returns only an Observation, whereas env_step
returns the reward, the
Observation, and the "terminal" variable. Again, determining how to
return these for your language is critical: in some languages the
three elements must be put into a wrapper because a function can
return only one value. Check the RL_common page.
If the agent enters into a
terminal state (in the episodic case) the terminal variable is set. In
some languages this is done using an int set to 1, in others a boolean
may be used. Again, it is important to check RL_common
for the data
type used for the language of choice.
Env_cleanup is
called at the end of each set of episodes. Every call to env_init
should have a corresponding call to env_cleanup, which is used to
de-allocate any memory allocated in env_init.
All the above environment functions are the basic core group of
functions required by RL-Glue. The
rest of the environment functions must be defined, but can be empty
functions or functions which return dummy values if they are not
needed.
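As a rough illustration, the four core functions could be laid out as in the sketch below. The Demo_ types are hypothetical stand-ins for the RL_common definitions, the corridor world is invented for the example, and the task specification string is a placeholder rather than a string in the real task_specification format.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the RL_common types; real code should use
 * the definitions from RL_common.h instead. */
typedef const char *Demo_task_spec;
typedef int Demo_observation;
typedef int Demo_action;          /* 0 = left, 1 = right */
typedef struct {
    double reward;
    Demo_observation observation;
    int terminal;                 /* 1 when the episode is over */
} Demo_step_result;

#define CORRIDOR_LEN 5

static int *visit_counts;  /* data that persists between episodes */
static int position;       /* data that is reset every episode */

Demo_task_spec env_init(void)
{
    visit_counts = calloc(CORRIDOR_LEN, sizeof *visit_counts);
    if (visit_counts == NULL) {
        return NULL;
    }
    return "placeholder task specification";  /* not the real format */
}

Demo_observation env_start(void)
{
    position = 0;              /* place the agent in the start state */
    visit_counts[position]++;
    return position;
}

Demo_step_result env_step(Demo_action action)
{
    Demo_step_result result;
    position += (action == 1) ? 1 : -1;
    if (position < 0) position = 0;
    if (position > CORRIDOR_LEN - 1) position = CORRIDOR_LEN - 1;
    visit_counts[position]++;
    result.reward = -1.0;
    result.observation = position;
    result.terminal = (position == CORRIDOR_LEN - 1);
    return result;
}

void env_cleanup(void)
{
    free(visit_counts);        /* release the memory allocated above */
    visit_counts = NULL;
}
```

The remaining five functions would sit alongside these, either implemented or left as empty functions returning dummy values.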
Env_set_state and
env_get_state are
available to allow the experimenter to capture the environment in a
state and reload that state later in the experiment. The captured state
does not include the agent's value function however, so upon returning
to that state later the agent may behave differently. One use for the
env state functions may be to focus training in a region by continually
replacing the agent in a certain portion of the state space.
Env_get_random_seed
and env_set_random_seed are
similar to env_get_state and
env_set_state in their ability
to
help capture the state of the environment. If an environment has some
randomness in its behaviour, the random_seed functions can be used to
"save" and "reload" the randomness to regenerate the same sequence of
events. For example, if an agent was behaving oddly on a particular
sequence of state transitions, by getting the state and the random
seed, one could reset the state and random_seed over and over to
generate experience on that particular sequence of events. Also it is
often a good idea to set the random seed from the experiment program at
the beginning of the learning experiment.
If any functionality
is missing in the environment functions, it can easily be accommodated
using the env_message
function. Env_message has no required functionality. When an
experimenter requires the ability to change an element of the
environment mid experiment, or to gather data from the
environment, env_message is
the solution. For example, if the goal is to test an agent in a
changing environment, after a few episodes a message could be passed
through RL_env_message telling the
environment to add a wall to a gridworld or change the physics of the
mountain car environment. Alternatively, the env_message
functionality could be used to gain information about the current state
of the environment, given that this information can be conveyed in a
string. The functionality of env_message is up to the author
of the environment.
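Since the message format is left to the environment author, the sketch below invents one. The "add_wall X Y" and "count_walls" commands, the reply strings, the grid size, and the use of const char * for the Message type are all assumptions made for this example.

```c
#include <stdio.h>
#include <string.h>

#define GRID_W 5
#define GRID_H 5

static int walls[GRID_H][GRID_W];  /* 1 where a wall blocks the cell */

/* Hypothetical message protocol (not defined by RL-Glue):
 *   "add_wall X Y"  places a wall and replies "ok"
 *   "count_walls"   replies with the number of walls as a string
 *   anything else   replies "unknown message" */
const char *env_message(const char *msg)
{
    static char reply[32];
    int x, y;

    if (sscanf(msg, "add_wall %d %d", &x, &y) == 2) {
        if (x < 0 || x >= GRID_W || y < 0 || y >= GRID_H) {
            return "out of range";
        }
        walls[y][x] = 1;
        return "ok";
    }
    if (strcmp(msg, "count_walls") == 0) {
        int count = 0;
        for (y = 0; y < GRID_H; y++) {
            for (x = 0; x < GRID_W; x++) {
                count += walls[y][x];
            }
        }
        sprintf(reply, "%d", count);
        return reply;
    }
    return "unknown message";
}
```

An experiment program would send these strings through RL_env_message, for example to add a wall after a few episodes and then query the count.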
Writing an Agent
Just like the environment,
the agent must implement all of the agent functions in the RL-Glue
interface and follow the type definitions
for Observations, Actions, rewards and all the other input/output
variable types. The following seven functions must be defined in agent
code
(though if they are empty and/or return empty values RL-Glue will not
complain):
void agent_init(Task_specification)
Action agent_start(Observation)
Action agent_step(Reward, Observation)
void agent_end(Reward)
void agent_cleanup()
void agent_freeze()
Message agent_message(Message)
The purpose of agent_init is very similar to that of env_init. Any
memory or resources that are required throughout the agent's code
should be allocated and initialized in agent_init. Also, values which
must be reset every trial (but persist across the episodes within it)
should be initialized here. Within agent_init you can parse the
task_specification to discover more details about the action and state
space. Based on this knowledge the agent can determine how to store its
value function. In the case of small discrete action and observation
spaces, a tabular value function may be stored in an array-like
structure, whereas a continuous space may require function
approximation or some other approach. Determining how to store the
value function will affect what memory is allocated and initialized.
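For the small discrete case, the allocation step might be sketched as below. The function names are hypothetical, and the table sizes are passed in directly on the assumption that they have already been parsed out of the task_specification.

```c
#include <stdlib.h>

static double *q_table;   /* flat array of num_obs * num_act values */
static int num_obs;
static int num_act;

/* Allocate a zero-initialised tabular value function.
 * Returns 1 on success and 0 on failure. */
int allocate_value_table(int num_observations, int num_actions)
{
    if (num_observations <= 0 || num_actions <= 0) {
        return 0;
    }
    q_table = calloc((size_t)num_observations * (size_t)num_actions,
                     sizeof *q_table);
    if (q_table == NULL) {
        return 0;
    }
    num_obs = num_observations;
    num_act = num_actions;
    return 1;
}

/* Pointer to the value stored for one (observation, action) pair. */
double *q_entry(int obs, int act)
{
    return &q_table[obs * num_act + act];
}

/* Counterpart to the allocation, suitable for agent_cleanup. */
void free_value_table(void)
{
    free(q_table);
    q_table = NULL;
}
```

A continuous observation space would need a different structure here, such as the weight vector of a function approximator.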
Every episode takes its first step from a call to agent_start. Given
the first_observation (coming from env_start), the agent chooses an
action according to its initial policy. No learning is done in
agent_start as no rewards have been earned.
In agent_step the agent again
selects an action based on the current policy but it should also
perform value updates and policy improvements required for learning.
The parameters
passed in with agent_step are
the reward received for the previous
action, and the current observation. To perform useful learning updates
it will likely be helpful to save
the previous action and observation.
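One way to use the saved action and observation is a tabular Sarsa update, sketched below. The table sizes, the step size and discount values, and the purely greedy action choice are assumptions made to keep the example small and deterministic; a learning agent would normally explore as well, and plain ints and doubles stand in for the RL_common types.

```c
#define N_OBS 4
#define N_ACT 2

static double q[N_OBS][N_ACT];   /* tabular action values, start at 0 */
static double step_size = 0.5;   /* assumed learning rate */
static double discount = 1.0;    /* assumed discount factor */
static int prev_obs;             /* observation seen on the previous step */
static int prev_act;             /* action chosen on the previous step */

/* Pick the action with the highest value; ties go to the lowest index. */
static int greedy_action(int obs)
{
    int a, best = 0;
    for (a = 1; a < N_ACT; a++) {
        if (q[obs][a] > q[obs][best]) {
            best = a;
        }
    }
    return best;
}

/* First step of an episode: choose an action, learn nothing yet. */
int agent_start(int obs)
{
    prev_obs = obs;
    prev_act = greedy_action(obs);
    return prev_act;
}

/* Later steps: choose the next action, then apply the Sarsa update to
 * the previous (observation, action) pair using the reward received. */
int agent_step(double reward, int obs)
{
    int act = greedy_action(obs);
    q[prev_obs][prev_act] += step_size *
        (reward + discount * q[obs][act] - q[prev_obs][prev_act]);
    prev_obs = obs;
    prev_act = act;
    return act;
}

/* Read-only access to the table, handy for inspection. */
double q_value(int obs, int act)
{
    return q[obs][act];
}
```

Without prev_obs and prev_act the update would have no way to know which table entry the incoming reward belongs to.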
In an episodic
environment, RL-Glue will call agent_end
when the agent enters a
terminal state. A call to agent_end
allows the agent to do any final updates to the value function and
policy. At the end of a trial agent_cleanup
will be called to allow for the deallocation of any resources created
in agent_init. Calls to agent_init and agent_cleanup
should always be in a one to one ratio.
Two functions new to RL-Glue as of version 2.0 are agent_freeze
and agent_message. A call to agent_freeze should
stop all learning in the agent and also inhibit random moves by the
agent. This allows for distinct training and testing phases for an
agent. Agent_message is a
very generic function open to the author's interpretation. The only
requirement is that agent_message
take in a string and pass out a string. There are no stipulations on
the organization or content of the string. One use could be to allow an
external source to change the alpha learning parameter in a Sarsa
agent. The author of agent_message
could specify that any input string of the form "Alpha:x" will change
the alpha value to x. As another example, if the agent_message author
wanted to view the value of a certain state, an input string describing
the state could be passed in and the corresponding value returned as a
string. Anything
you can imagine and implement is allowed with agent_message.
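The "Alpha:x" convention could be handled along the lines of the sketch below. The reply strings, the default alpha value, the helper current_alpha, and the use of const char * for the Message type are assumptions for illustration only.

```c
#include <stdio.h>

static double alpha = 0.1;  /* assumed default learning rate */

/* Handle messages of the form "Alpha:x" by setting alpha to x.
 * Any other message leaves the agent unchanged. */
const char *agent_message(const char *msg)
{
    static char reply[64];
    double value;

    if (sscanf(msg, "Alpha:%lf", &value) == 1) {
        alpha = value;
        sprintf(reply, "alpha set to %g", alpha);
        return reply;
    }
    return "unrecognized message";
}

/* Read-only access to the current learning rate. */
double current_alpha(void)
{
    return alpha;
}
```

From the experiment program this would be driven through RL_agent_message, for instance to lower alpha partway through training.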
How to Compile and Run your Experiment
In a socketed environment, only the parts of the code which have
changed need to be recompiled. If an agent changes and nothing else,
only the agent needs to be recompiled. A sample makefile in 5 parts is
provided with the example network C code and can easily be adapted to
suit your own needs. The 5 parts of the makefile are: makefile,
RL_agent.makefile, RL_environment.makefile, RL_glue.makefile, and
RL_experiment.makefile. The main makefile should not require any
changes; it only points to all the other makefiles that need to be
included. There are two settings in the four sub-makefiles which should
be changed in unison: the location of RL-Glue and the location of the
build path. Each makefile has two variables for these: RL-GLUE and
BUILD_PATH. For the makefiles to work, RL-GLUE must hold the correct
path to the RL-Glue directory on your computer. In general, ./ means
look in the current directory (where the makefile is held), ../ means
look up one directory (and ../../ would look up two, etc.), and
../Code/RL-Glue would look up one directory and then go into the
directory named Code to find the RL-Glue directory (for a fuller
explanation of traversing a file system, please consult a dedicated
guide). The BUILD_PATH doesn't need to be the same in all four files,
but keeping it the same is generally good practice. It can be changed
in the same way as the RL-GLUE path.
Another similarity across all four makefiles in this example is the
section that sets the compiler and the compiler flags. These can be set
differently in each file. The section appears as follows in all four
files:
CC = gcc
CFLAGS = -I$(RL-GLUE)/ -ansi -pedantic -Wall
LDFLAGS =
If your code is in C++ you should change the compiler, denoted by the
CC variable, to g++ or your own preferred C++ compiler. The compiler
flags are held in CFLAGS. Our example uses four flags, each briefly
explained here. -I$(RL-GLUE) is slightly confusing as it is a fusion of
a compiler flag and a makefile variable reference. The -I is a compiler
flag telling the compiler to search an additional directory for header
files. The -I should be followed by the directory to search; in this
case it is followed by a makefile variable. The $(RL-GLUE) inserts
whatever is stored in the RL-GLUE variable, so -I$(RL-GLUE) really
says, for example, -I../../RL-Glue. This lets the compiler see
RL_common.h and RL_glue.h, which your agent/environment/experiment
programs need in order to compile. If you need to access other files,
such as the Glue Utilities, you should add another -I flag to your
CFLAGS pointing to where those files reside. The -ansi flag asks gcc to
compile the code as ANSI C, while -pedantic and -Wall request
additional warnings from the compiler. These can be harmlessly removed,
however they help ensure your code is safer (for more details look up
the man pages for gcc). No linking flags, denoted by LDFLAGS, were
needed in this example, but they can be added here.
RL_glue.makefile is probably the most confusing, or at least the
largest, of the four sub-makefiles. The rest of this file should not be
touched unless you are going to compile in mixed mode. If you are
compiling in mixed mode you will need to change a number of things
throughout the other makefiles as well. For example, if we wanted to
compile the agent in with the glue, we would have to change:
OBJECTS = RL_glue.o RL_network.o RL_server_agent.o RL_server_environment.o RL_server_experiment.o
by adding our Agent
object files to this list. For example, in this makefile the above line
would become:
OBJECTS = RL_glue.o RL_network.o RL_server_agent.o RL_server_environment.o RL_server_experiment.o SarsaAgent.o Glue_utilities.o RL_client_agent.o
You also need to remove the following lines from the RL_agent.makefile:
RL_agent: $(OBJECTS)
	$(CC) -o $@ $(addprefix $(BUILD_PATH)/, $(OBJECTS))
Lastly, you need to
ensure the Build Path in the RL_agent.makefile and the RL_glue.makefile
are the same, or ensure that the RL_glue.makefile knows where to look
for the agent object files.
If you are sticking to a pure network compilation process, you can read
below to learn how to change the provided makefile in the Network C
example for yourself. If you are interested in a mixed mode, there are
examples of makefiles in mixed compilation mode with some of the other
Examples provided with RL-Glue.
To change RL_agent.makefile to work with your own C/C++ agents you only
need to change all instances of SarsaAgent to the name of your agent.
If your agent requires other files (for example if you wrote your own
helper functions in a separate file), you only need to add the object
file to the list of object files and then add a build rule for your
new file. For example, if your agent needed the file helper.c, the
list of object files would be extended with helper.o at the end:
AGENT_OBJECTS = SarsaAgent.o Glue_utilities.o RL_client_agent.o RL_network_agent.o helper.o
After which you would add a build rule for the helper file like this:
helper.o: helper.c
	$(CC) -c $(CFLAGS) $< -o $(BUILD_PATH)/$@
The same is true of RL_environment.makefile and RL_experiment.makefile:
you only need to replace mines or experiment with the name of your own
environment or experiment. Other files can also be included by adding
the object file to the OBJECTS variable and then adding a compilation
rule for the new file.
To compile components written in other languages, you can replace
RL_environment.makefile, RL_agent.makefile and RL_experiment.makefile
with makefiles which compile your environment/agent/experiment in their
own language. The only stipulation is that they compile with the
provided RL_client files for your language. As support for each
language is added to RL-Glue, an example using that language will be
provided to help you develop your own makefile.
After saving any changes and quitting your makefile
you should be able to type make
in the directory with your makefile and it will create four executables
named RL_glue, RL_agent,
RL_experiment, and RL_environment.
Run RL_glue by typing ./RL_glue
into your terminal; you can then connect your
agent/environment/experiment by running the three corresponding
executables in any order you choose. You can connect and disconnect
your agent/environment/experiment as many times as you want; however,
once all three have connected the experiment will run through to the
end. Other useful things to know: if you want to get rid of the .o
files you can type make clean.
make tidy will remove old
saved versions of your code (i.e. the agent.c~ files created by emacs).