Reinforcement Learning and Artificial Intelligence (RLAI)

Writing Agents, Environments and Experiments in RL-Glue

Edited by Leah Hackman, June 19, 2007

   This page explains how to get started with RL-Glue and guides you through setting up your first RL-Glue experiment and writing each of its components.




If you haven't already, start by downloading the most recent version of RL-Glue. There are three main ways to use RL-Glue to set up your experiments: direct, socket/network, and mixed. In a direct approach, the agent, environment and experiment program are all written in C or C++ and compiled together with the RL-Glue code into a single executable that runs your experiment. A socketed approach gives a plug-and-play feel: each piece can be written in a different language and compiled separately. The executables for your agent, environment and experiment program can even be compiled and run on different machines, since RL-Glue now supports sockets over networks. The final possibility is a mixed approach, where some pieces are written in the same language and compiled together, while others are compiled separately, possibly in different locations.

Below are instructions for each of these situations. Because the direct approach is specific to C/C++ code, its discussion is code-oriented, with detailed examples. The discussion of the socketed/networked and mixed approaches takes a more abstract view, to show how simple the RL-Glue agent, environment and experiment interfaces are and how much flexibility they add to your code. Details about compiling for a mixed approach are included in the socketed compilation discussion.

Working with a Direct Approach

       Writing an Experiment Program 
       Writing an Environment       
       Writing an Agent
       How to Compile and Run your Experiment
    C Specific Utilities

Working with a Socketed or Mixed Approach


       Writing an Experiment Program 
       Writing an Environment       
       Writing an Agent
       How to Compile and Run your Experiment

Running your Experiment over the Internet or a Network


RL-Glue Variable Type Definitions and Important Files for each Language


One change in RL-Glue 2.0 is the standardization of all parameter types. Observations, Actions, Random_seed_keys, and State_keys all share the same structure: in every language, they contain a list of integer values and a list of double values. Some languages, like C, also store a count of the ints in the intArray and of the doubles in the doubleArray; others, like Java, may not need this extra information. The above link gives the data types for each of the supported languages of RL-Glue 2.0, and explains which provided files must be imported or connected to your own code for a given language.





Working with a Direct Approach

It is advisable to browse the RL Glue Interfaces before continuing. If the interfaces and the explanations below still leave questions or ambiguities, please feel free to contact us (by extending the FAQ page). Another way to learn RL-Glue is to look at the sample agents, environments and experiments in the library. The following assumes a C implementation in a direct hookup scenario.

Writing an Experiment Program

Usually the shortest and easiest part of the experiment to write, the experiment program has no interface to implement and consists mostly of calls to the existing RL-Glue functions. The experiment program has four main duties: a) start the experiment, b) specify how long/how many times to run the experiment, c) extract and possibly analyze data, and d) end the experiment and clean up. Note that only the RL-Glue functions are available to the experiment program (the already implemented RL-Glue interface defined here; these functions all follow the pattern RL_<functionname>). The experiment program should never directly call functions implemented by the agent or environment.

To start, experiment_program.c (the name is arbitrary) must include RL_glue.h and define a main function. The simplest experiment program you can write, without extracting any data, is as follows:

    #include "RL_glue.h"

    int main(int argc, char *argv[])
    {
        /* This experiment program runs one episode. */

        RL_init();      /* calls agent_init and env_init so that the agent
                           and environment can create and initialize all
                           the resources they need */

        RL_episode(0);  /* runs one episode; an argument of 0 sets no step
                           limit, so the episode runs until a terminal
                           observation is reached */

        RL_cleanup();   /* calls agent_cleanup and env_cleanup to release
                           all allocated resources */
        return 0;
    }

However, agents need experience to learn, and a single episode of an episodic task gives an agent very little to work with. A more typical experiment program runs multiple trials with multiple episodes per trial. It will also usually retrieve results to analyze how well the agent performed. The following code was written to test an agent on a gridworld:

    #include <stdio.h>
    #include "RL_glue.h"

    int main(int argc, char *argv[])
    {
        /* A benchmark that runs the agent through the maze environment for
           1000 episodes per trial and repeats this 100 times. It measures
           the agent's success by the average number of steps it takes to
           get through the maze, and reports how many steps the agent took
           in the last episode of each trial. */

        double episode_performance = 0.0; /* average steps per episode, this trial */
        double total_performance = 0.0;   /* average of the per-trial averages */
        int episode_count, trial_count;
        int number_of_episodes = 1000;
        int number_of_trials = 100;
        int max_steps = 100; /* a small step limit, chosen for a small grid */

        for (trial_count = 0; trial_count < number_of_trials; trial_count++)
        {
            RL_init(); /* calling RL_init at the start of each trial
                          resets the value function */
            for (episode_count = 0; episode_count < number_of_episodes; episode_count++)
            {
                RL_episode(max_steps);
                episode_performance += RL_num_steps() * (1.0 / number_of_episodes);
            }
            printf("trial %d: last episode took %d steps\n",
                   trial_count, (int)RL_num_steps());
            total_performance += episode_performance * (1.0 / number_of_trials);
            episode_performance = 0.0;
            RL_cleanup(); /* call RL_cleanup once for every RL_init so that
                             resources are de-allocated properly */
        }
        printf("the agent takes %f steps on average\n", total_performance);
        return 0;
    }

In this experiment, we gather data over 100 trials of 1000 episodes each. Note that RL_init and RL_cleanup are called at the beginning and end of each trial. A call to RL_init clears the value function and any other values that need to be reset for each trial. If you do not call RL_cleanup after each call to RL_init, your experiment may leak memory.

New in the above example is RL_num_steps, which returns the number of steps taken in the most recent episode.
The other basic RL-Glue function for performance evaluation is RL_return(), which returns the total reward accumulated in the last episode run. Functions like RL_freeze, RL_get_state/RL_set_state, and RL_get_random_seed/RL_set_random_seed also exist to let users analyze the agent's actions in particular situations; however, how these functions are used depends on the user's needs and on the user's own implementation of the complementary functions, such as agent_freeze and env_get_state. Please refer to the RL Glue Interfaces for more details on these functions.

Any need to send information to, or gather information from, the environment and agent should be handled through RL_env_message or RL_agent_message. What can be passed in and out of the agent and environment depends on the user-written agent and environment code. For a more in-depth discussion, see agent_message and env_message below.

Lastly, RL_episode is built on two other RL-Glue functions, RL_start and RL_step, and the experiment program author has access to these as well. If the experiment program needs to print a trace of the actions taken, the observations seen, or the rewards given, using RL_start and RL_step instead of RL_episode provides access to each observation, reward, and action at every step.
   
Writing an Environment

Often the easiest place to start is to make a list of what needs to be done. Two things are absolutely necessary to write an environment: all functions from the
RL Glue Interface must be defined (this is different from RL-Glue 1.0), and the environment code must include RL_common.h. Accordingly, the following header file is a good place to start. Again, the name environment.h is arbitrary.

    #ifndef Environment_h
    #define Environment_h
    #include <RL_common.h>

    Task_specification env_init();
    Observation env_start();
    Reward_observation env_step(Action a);
    void env_cleanup();
    void env_set_state(State_key sk);
    void env_set_random_seed(Random_seed_key rsk);
    State_key env_get_state();
    Random_seed_key env_get_random_seed();
    char* env_message(char *);
   
    #endif

Next, decide how to represent the states and actions for the environment. One change from RL-Glue 1.0 is the standardization of the observation and action types. Both observations and actions must now have the following form (which can be found here along with all other data type requirements).

    typedef struct
    {
      unsigned int numInts; /* the number of ints in the int Array*/
      unsigned int numDoubles; /* the number of doubles in the double Array*/
      int* intArray;
      double* doubleArray;   
    } Observation; /* the same definition applies for actions*/

Now the state information and the actions must be pared down to a list of ints and/or a list of doubles. For example, an observation struct for a gridworld could have two ints: one each for the x and y coordinates of the agent. The action struct in such a situation could hold a single int between zero and three, representing north, east, south, and west. Condensing state and action information into a succinct numeric representation is a skill that comes with practice.
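A minimal sketch of building the gridworld observation described above, assuming the agent's position is sent as two ints and no doubles. The struct is redeclared only so the example compiles on its own (real code would #include <RL_common.h> instead), and make_grid_observation and free_observation are hypothetical helper names.

```c
#include <stdlib.h>

/* Mirrors the Observation definition shown above; real code would
 * #include <RL_common.h> instead of redeclaring it. */
typedef struct
{
  unsigned int numInts;
  unsigned int numDoubles;
  int* intArray;
  double* doubleArray;
} Observation;

/* Hypothetical helper: package the agent's (x, y) grid position as an
 * observation with two ints and no doubles. If allocation fails, the
 * returned observation has numInts == 0. */
Observation make_grid_observation(int x, int y)
{
  Observation o;
  o.numInts = 2;
  o.numDoubles = 0;
  o.doubleArray = NULL;
  o.intArray = malloc(2 * sizeof(int));
  if (o.intArray == NULL) {
    o.numInts = 0;
    return o;
  }
  o.intArray[0] = x;
  o.intArray[1] = y;
  return o;
}

/* Hypothetical helper: release the arrays owned by an observation. */
void free_observation(Observation* o)
{
  free(o->intArray);
  free(o->doubleArray);
  o->intArray = NULL;
  o->doubleArray = NULL;
  o->numInts = 0;
  o->numDoubles = 0;
}
```

Who allocates and who frees these arrays is a design decision for the environment author; this sketch assumes the environment owns them.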

Another factor to consider when beginning to write an environment is how to capture the state transition function and the reward function. In a small state space, a switch (case) statement that enumerates the possibilities may suffice; in a continuous problem, it may be more appropriate to compute transitions as a function of one or two variables describing the environment. For example, in the mountain car environment the state may be represented by the distance from the goal and the velocity, and transitions are calculated by a function of these two state variables.
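For a state space as small as a 3x3 gridworld, the transition and reward functions can be written directly with a switch statement. The sketch below switches over the action and assumes that the codes 0 to 3 stand for north, east, south and west, that y increases northward, and that the goal is the corner cell (2, 2). The names grid_transition and grid_reward, and the reward of -1 per step, are hypothetical choices for illustration.

```c
/* Hypothetical action codes for the gridworld described above. */
enum { NORTH = 0, EAST = 1, SOUTH = 2, WEST = 3 };

#define GRID_SIZE 3 /* assumed 3x3 grid: cells 0..2 on each axis */
#define GOAL_X 2    /* assumed goal cell */
#define GOAL_Y 2

/* State transition: move one cell in the chosen direction, staying put
 * if the move would leave the grid. */
void grid_transition(int* x, int* y, int action)
{
  switch (action) {
  case NORTH: if (*y < GRID_SIZE - 1) (*y)++; break;
  case EAST:  if (*x < GRID_SIZE - 1) (*x)++; break;
  case SOUTH: if (*y > 0) (*y)--; break;
  case WEST:  if (*x > 0) (*x)--; break;
  default:    break; /* unknown action: no movement */
  }
}

/* Reward function: -1 for every step that does not end on the goal,
 * 0 on arrival at the goal. */
double grid_reward(int x, int y)
{
  return (x == GOAL_X && y == GOAL_Y) ? 0.0 : -1.0;
}
```

Inside env_step, the environment would apply grid_transition to its stored position, compute the reward for the new position, and set terminal when the goal is reached.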


Once the representation for the observations and actions has been established, writing the task_specification is straightforward. Note that the third and fourth parts of the task_specification should reflect the int and double arrays of the observation and action structs, respectively. If the environment requires no memory or other resources, returning the task_specification is the only required functionality of env_init.
If any variables need to be initialized every trial (as opposed to every episode), that should be done in env_init as well. After writing env_init, env_cleanup is a natural next step. Its job is the opposite of env_init's: free all the resources allocated by the environment.
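As a sketch, env_init for a gridworld could build its task specification with sprintf, assuming the format of the example specification 1:e:2_[i,i]_[0,2]_[0,2]:1_[i]_[0,3] that appears in the agent section of this page. grid_task_spec and TASK_SPEC_MAX are hypothetical names; the buffer size is simply chosen to be large enough for any int values.

```c
#include <stdio.h>

#define TASK_SPEC_MAX 80 /* large enough for any int values below */

/* Hypothetical helper: write the task specification for a width x height
 * gridworld (two int observation dimensions) with one int action in
 * [0, num_actions - 1]. buf must hold at least TASK_SPEC_MAX chars.
 * Returns the number of characters written, excluding the terminator. */
int grid_task_spec(char* buf, int width, int height, int num_actions)
{
  return sprintf(buf, "1:e:2_[i,i]_[0,%d]_[0,%d]:1_[i]_[0,%d]",
                 width - 1, height - 1, num_actions - 1);
}
```

Calling grid_task_spec(buf, 3, 3, 4) produces the 3x3 example string; deriving the bounds from the grid size keeps the specification in sync with the observation representation if the grid changes.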

Now that env_init has set up the environment, you need to write env_start to begin each episode. The only duty of env_start is to choose a start state and return the Observation struct that represents it. After env_start is called, the action-observation-reward cycle between the agent and environment begins. The agent receives the start observation, chooses an action, and then waits for the environment to specify the consequences. These consequences come from env_step: the environment accepts the agent's action and uses it to determine which state to transition to and what reward was earned by taking that action. env_step is therefore where the state transition and reward functions belong. Implementing them usually requires information about the previous state, so it is a good idea to store that information from step to step. Notice that while env_start returns an Observation, env_step returns a Reward_observation struct. When an unfamiliar data type or struct appears, the first place to look is this page about RL_common.h. The definition from RL_common.h is as follows:
  
    typedef struct Reward_observation_t
    {
      Reward r;       /* typedef double Reward */
      Observation o;
      int terminal;   /* 0 for false, 1 for true */
    } Reward_observation;

Normally, terminal is set to 0. Setting terminal to 1 signals that a terminal state has been reached, and RL-Glue will then call agent_end and end the episode. Remember to set terminal when the terminal state is reached. Again, the best place to get a feel for what these functions should look like is the examples provided in the library. The following is a simple sample env_step function:

    Reward_observation env_step(Action a)
    {
        /* NOTE: In C, o1 = o2 on these structs copies only the array
           pointers (a shallow copy), not the arrays themselves, and
           structs cannot be compared with the == operator at all. To
           avoid requiring specific knowledge about the observation and
           action representation, this example omits real (deep) struct
           copying and comparison: the assignment to old_o and the
           comparison with TERMINAL below stand in for them. */

        Reward_observation ro;
        Observation next_observation;
        Reward next_reward;

        next_observation = compute_next_state(a, old_o);
        next_reward = compute_reward(a, old_o);

        ro.o = next_observation;
        ro.r = next_reward;

        if (next_observation == TERMINAL) /* pseudocode: see note above */
            ro.terminal = 1; /* TRUE */
        else
            ro.terminal = 0; /* FALSE */

        old_o = ro.o;
        return ro;
    }

compute_next_state and compute_reward are placeholders for the environment's state transition and reward functions, which are up to the environment author to design. TERMINAL and old_o are previously stored values holding the terminal observation and the previous observation of this environment.
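Because Observation holds pointers, plain assignment copies only the pointers (both structs then share the same arrays), and == cannot compare structs at all. The env_step example above omits real copying and comparison; a minimal sketch of both follows. The names copy_observation and observations_equal are hypothetical, and the struct is redeclared only so the block compiles on its own.

```c
#include <stdlib.h>
#include <string.h>

/* Same layout as the Observation struct shown earlier. */
typedef struct
{
  unsigned int numInts;
  unsigned int numDoubles;
  int* intArray;
  double* doubleArray;
} Observation;

/* Deep copy: allocate fresh arrays for dest and copy src's contents.
 * Returns 0 on success, -1 if allocation fails (dest then owns nothing). */
int copy_observation(Observation* dest, const Observation* src)
{
  dest->numInts = src->numInts;
  dest->numDoubles = src->numDoubles;
  dest->intArray = NULL;
  dest->doubleArray = NULL;
  if (src->numInts > 0) {
    dest->intArray = malloc(src->numInts * sizeof(int));
    if (dest->intArray == NULL) return -1;
    memcpy(dest->intArray, src->intArray, src->numInts * sizeof(int));
  }
  if (src->numDoubles > 0) {
    dest->doubleArray = malloc(src->numDoubles * sizeof(double));
    if (dest->doubleArray == NULL) {
      free(dest->intArray);
      dest->intArray = NULL;
      return -1;
    }
    memcpy(dest->doubleArray, src->doubleArray,
           src->numDoubles * sizeof(double));
  }
  return 0;
}

/* Element-by-element equality. Doubles are compared exactly, which suits
 * identifying a stored state but not values produced by arithmetic. */
int observations_equal(const Observation* a, const Observation* b)
{
  unsigned int i;
  if (a->numInts != b->numInts || a->numDoubles != b->numDoubles) return 0;
  for (i = 0; i < a->numInts; i++)
    if (a->intArray[i] != b->intArray[i]) return 0;
  for (i = 0; i < a->numDoubles; i++)
    if (a->doubleArray[i] != b->doubleArray[i]) return 0;
  return 1;
}
```

With helpers like these, env_step could keep its own copy of the previous observation in old_o (freeing the old copy first) and test for the terminal observation with observations_equal(&next_observation, &TERMINAL).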

If saving states or random seeds is required, there is one detail to observe. State keys and random seed keys are stored in the same abstract type as observations and actions. They can be represented in any way desired, whether as a key into a hash table held by the environment or as a unique collection of doubles and ints that encodes all the important information. However, once env_cleanup is called, all memory of that key should be removed.
   
Looking at the RL-Glue functions, it is easy to see which environment functions each one calls, so only the functions that are actually called need meaningful implementations. For example, if the experiment program never calls RL_get_state, then env_get_state can return a dummy State_key. To run a useful experiment, only the base layer of RL-Glue functionality is required: env_init, env_start, env_step, and env_cleanup. The other functions can be left as stubs, with empty bodies and meaningless return values. As your experiments come to require more specialized functions like RL_get_state and RL_env_message, you can flesh out your implementations of the corresponding environment functions. If your environment is being publicly released or used in a competition, it is prudent to fully implement all the functions according to the RL-Glue specification.
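A sketch of the optional environment functions left as stubs, for an experiment that never saves state, sets seeds or sends messages. The key types are redeclared with the same layout as Observation only so the example stands alone (in real code they come from RL_common.h), and the helper empty_key is a hypothetical name.

```c
#include <stddef.h> /* NULL */

/* Stand-ins for the RL_common.h key types (same layout as Observation). */
typedef struct
{
  unsigned int numInts;
  unsigned int numDoubles;
  int* intArray;
  double* doubleArray;
} State_key;
typedef State_key Random_seed_key;

/* An empty key: no ints, no doubles, no arrays to free. */
static State_key empty_key(void)
{
  State_key key;
  key.numInts = 0;
  key.numDoubles = 0;
  key.intArray = NULL;
  key.doubleArray = NULL;
  return key;
}

/* Stubs: meaningless but well-defined results, nothing allocated. */
State_key env_get_state(void) { return empty_key(); }
void env_set_state(State_key sk) { (void)sk; }
Random_seed_key env_get_random_seed(void) { return empty_key(); }
void env_set_random_seed(Random_seed_key rsk) { (void)rsk; }

char* env_message(char* msg)
{
  static char empty_reply[] = ""; /* writable, unlike a string literal */
  (void)msg;
  return empty_reply;
}
```

Returning empty keys rather than uninitialized structs means a caller that does invoke these functions gets harmless values instead of garbage pointers.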

Any functionality missing from the environment functions can easily be accommodated using env_message, which has no required functionality. When an experimenter needs to change an element of the environment mid-experiment, or to gather data from the environment, env_message is the solution. For example, to test an agent in a changing environment, after a few episodes a message could be passed through RL_env_message
telling the environment to add a wall to a gridworld or to change the physics of the mountain car environment. Alternatively, env_message could be used to obtain information about the current state of the environment (provided the requested information can be conveyed in a string). What env_message does is entirely up to the author of the environment.
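A sketch of env_message for the add-a-wall example, assuming a hypothetical plain-text protocol with two commands, add_wall X Y and count_walls. The command names, the reply strings and the walls array are all illustrative assumptions, not part of RL-Glue.

```c
#include <stdio.h>
#include <string.h>

#define GRID_SIZE 3

/* Hypothetical environment data: 1 marks a wall cell. */
static int walls[GRID_SIZE][GRID_SIZE];

/* Handles two hypothetical commands:
 *   "add_wall X Y" - place a wall at cell (X, Y)
 *   "count_walls"  - report the number of walls as a string
 * The reply lives in a static buffer overwritten by the next call. */
char* env_message(char* msg)
{
  static char reply[64];
  int x, y, i, j, count = 0;

  if (sscanf(msg, "add_wall %d %d", &x, &y) == 2) {
    if (x < 0 || x >= GRID_SIZE || y < 0 || y >= GRID_SIZE) {
      strcpy(reply, "error: cell out of range");
    } else {
      walls[x][y] = 1;
      strcpy(reply, "ok");
    }
  } else if (strcmp(msg, "count_walls") == 0) {
    for (i = 0; i < GRID_SIZE; i++)
      for (j = 0; j < GRID_SIZE; j++)
        count += walls[i][j];
    sprintf(reply, "%d", count);
  } else {
    strcpy(reply, "unknown message");
  }
  return reply;
}
```

An experiment program would then pass the string add_wall 1 1 through RL_env_message after a few episodes, and the environment's transition function would consult walls when moving the agent.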

Writing an Agent

As with the environment, the only two major requirements for an agent are that it implements all the functions described in the RL Glue Interface and that it includes RL_common.h. The following basic header file (the name Agent.h is entirely arbitrary) provides a starting point:

    #ifndef Agent_h   
    #define Agent_h

    #include <RL_common.h>

    void agent_init(Task_specification task_spec);
    Action agent_start(Observation o);
    Action agent_step(Reward r, Observation o);
    void agent_end(Reward r);
    void agent_cleanup();
    void agent_freeze();
    Message agent_message(Message);

    #endif

The purpose of agent_init is to allocate any memory or resources required by the agent and to initialize any values that need to be reset at the beginning of every trial/run. Values that should persist across episodes can be initialized here; values that need to be initialized every episode should be set in agent_start. An example of something to initialize in agent_init is your value function. The task_specification received from the environment describes the task and its observation and action spaces. The agent should parse the task_specification string to learn the environment's observation and action spaces and, based on this knowledge, set up an appropriate value function. A parser already exists in both C and C++ in the RL-Glue Utils folder (click for more details). For example, for a 3x3 gridworld, a sample task_specification would be 1:e:2_[i,i]_[0,2]_[0,2]:1_[i]_[0,3]. After the version number and the e marking the task as episodic, this task spec tells the agent that the world is represented by 2 integers (the x and y coordinates), both between zero and two (i.e., a 3x3 gridworld), and that it can take one of four actions, represented by the numbers zero through three. Having received this information, the agent could then set up a 3x3x4 array to hold the state-action values. Again, agent_cleanup is closely related to agent_init: everything agent_init creates, agent_cleanup should destroy.
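A minimal sketch of the value table agent_init could allocate once the example task spec has been parsed: the ranges [0,2], [0,2] and [0,3] give 3 x values, 3 y values and 4 actions. The 3x3x4 array is stored as one flat block so its dimensions can come from the task spec at run time. Value_table and its functions are hypothetical names, and the parsing step itself is assumed to have been done already, for example with the parser in the RL-Glue Utils folder.

```c
#include <stdlib.h>

/* Hypothetical flat state-action value table: value[x][y][a] stored in
 * one contiguous block and addressed through value_index(). */
typedef struct
{
  int nx, ny, na; /* number of x values, y values and actions */
  double* v;      /* nx * ny * na entries */
} Value_table;

/* For agent_init: allocate a zero-initialized table.
 * Returns 0 on success, -1 if allocation fails. */
int value_table_init(Value_table* t, int nx, int ny, int na)
{
  t->nx = nx;
  t->ny = ny;
  t->na = na;
  t->v = calloc((size_t)nx * ny * na, sizeof(double));
  return t->v == NULL ? -1 : 0;
}

/* Map (x, y, a) to a position in the flat array (row-major order). */
int value_index(const Value_table* t, int x, int y, int a)
{
  return (x * t->ny + y) * t->na + a;
}

/* For agent_cleanup: free what value_table_init allocated. */
void value_table_free(Value_table* t)
{
  free(t->v);
  t->v = NULL;
}
```

For the example spec, agent_init would call value_table_init(&table, 3, 3, 4) and agent_cleanup would call value_table_free(&table), keeping the create/destroy pairing described above.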

Agent_start is called once at the beginning of each episode, as its first step. Given the first observation (from env_start), the agent must choose an action based on its initial policy. As there is no reward for the first step, no learning updates need to be done. In agent_step, the agent again consults its policy to determine which action to take, but it should also perform any value function updates or policy improvement required for learning. The parameters passed to agent_step are the reward received for the previous action and the current observation, so it is helpful to save the previous action and observation for learning updates. An example of an agent_step function for a tabular Sarsa agent in a gridworld is given below:

    Action agent_step(Reward r, Observation o)
    {
        /* NOTE: this is simplified pseudocode. Assigning one struct to
           another with "s1 = s2" copies only the array pointers, not the
           arrays, and C arrays cannot be indexed by structs. Both
           simplifications are used here for clarity and brevity. */

        action = egreedy(o); /* picks the highest-valued action most of
                                the time, and a random action with
                                probability epsilon */
        value[previous_observation, previous_action] +=
            alpha * (r + gamma * value[o, action]
                     - value[previous_observation, previous_action]);
        previous_observation = o;
        previous_action = action;
        return action;
    }
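The pseudocode above cannot index arrays with structs. Below is a runnable sketch of the same Sarsa update, assuming the observation's two ints (x, y) are first turned into a row number by a hypothetical state_index function and the values live in a fixed 9x4 array. The parameter discount plays the role of gamma, and sarsa_end_update shows the update agent_end could make, where there is no next action and the target is just the final reward.

```c
#define GRID_SIZE 3
#define NUM_STATES (GRID_SIZE * GRID_SIZE)
#define NUM_ACTIONS 4

/* Tabular state-action values, zero-initialized as a static array. */
static double value[NUM_STATES][NUM_ACTIONS];

/* Hypothetical mapping from the grid observation (x, y) to a table row. */
int state_index(int x, int y)
{
  return x * GRID_SIZE + y;
}

/* One Sarsa update for the previous state-action pair (s, a), given the
 * reward r and the next state-action pair (s_next, a_next):
 * Q(s,a) += alpha * (r + discount * Q(s_next,a_next) - Q(s,a)). */
void sarsa_update(int s, int a, double r, int s_next, int a_next,
                  double alpha, double discount)
{
  double td_error = r + discount * value[s_next][a_next] - value[s][a];
  value[s][a] += alpha * td_error;
}

/* Terminal update for agent_end: no next state, so the target is r. */
void sarsa_end_update(int s, int a, double r, double alpha)
{
  value[s][a] += alpha * (r - value[s][a]);
}
```

In agent_step, s and a would be state_index of the saved previous observation and the saved previous action, and s_next and a_next the current observation's row and the action just chosen.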

If the environment is episodic, agent_end will be called with the final reward at the end of each episode, so the agent can complete the last step of its learning algorithm.

Agent_freeze is new in RL-Glue 2.0. It should freeze the agent's policy and value function so that the agent stops learning and behaves consistently. One easy mistake is to forget about randomness in the policy: if an agent follows an epsilon-greedy policy as above, it has to remove the epsilon randomness after agent_freeze is called.
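One way to avoid that mistake, sketched under the assumption that the agent keeps a frozen flag which agent_freeze sets: the epsilon-greedy choice explores only while the flag is clear. The flag, epsilon, greedy_action and this egreedy signature (taking the action values for the current state) are hypothetical.

```c
#include <stdlib.h>

#define NUM_ACTIONS 4

static int frozen = 0;       /* set once agent_freeze is called */
static double epsilon = 0.1; /* exploration rate while learning */

void agent_freeze(void)
{
  frozen = 1;
}

/* Index of the largest action value; ties go to the lowest index. */
int greedy_action(const double* q)
{
  int a, best = 0;
  for (a = 1; a < NUM_ACTIONS; a++)
    if (q[a] > q[best]) best = a;
  return best;
}

/* Epsilon-greedy while learning; purely greedy (deterministic) once
 * frozen, so the frozen agent behaves consistently. */
int egreedy(const double* q)
{
  if (!frozen && (double)rand() / ((double)RAND_MAX + 1.0) < epsilon)
    return rand() % NUM_ACTIONS;
  return greedy_action(q);
}
```

agent_step should likewise skip its value function update while frozen is set, so that neither the policy nor the values change.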

Almost anything not covered by these functions can be handled with agent_message. There are no guidelines for agent_message other than that it takes in a string and returns a string. For example, the input string could be a signal to change the current value of the learning-rate parameter alpha, or a request for the agent to return a string representation of its value function. Either way, it is up to the experimenter to define agent_message according to his or her own needs.
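A sketch of agent_message for the change-alpha example, assuming Message is a plain char* string (as in the env_message declaration earlier) and using a hypothetical protocol with the commands set_alpha V and get_alpha. The command names and replies are illustrative only.

```c
#include <stdio.h>
#include <string.h>

static double alpha = 0.1; /* hypothetical learning-rate parameter */

/* Handles two hypothetical commands:
 *   "set_alpha V" - change the learning rate to V
 *   "get_alpha"   - report the current learning rate
 * The reply lives in a static buffer overwritten by the next call. */
char* agent_message(char* msg)
{
  static char reply[64];
  double new_alpha;

  if (sscanf(msg, "set_alpha %lf", &new_alpha) == 1) {
    alpha = new_alpha;
    strcpy(reply, "ok");
  } else if (strcmp(msg, "get_alpha") == 0) {
    sprintf(reply, "%g", alpha);
  } else {
    strcpy(reply, "unknown message");
  }
  return reply;
}
```

Replying with "unknown message" instead of ignoring unrecognized input makes a typo in the experiment program visible rather than silently leaving alpha unchanged.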
      
How to Compile and Run your Experiment

In a direct approach, the RL-Glue code must be recompiled every time a new or changed agent, environment or experiment program is introduced. The process is not difficult, but while working on a project it can become tedious to enter the compilation commands repeatedly, and doing so leaves room for errors. To alleviate these issues and make compilation even simpler, a makefile is suggested. Sample makefiles are included with the example experiments in the RL-Glue software. As it is best to learn by example, and writing makefiles is beyond the scope of this page, we only discuss how to modify these makefiles to work with your code. The makefile described below is found with the direct project in your Examples directory.

First, this example makefile makes some assumptions about the locations of your files. The first assumption is that you have not moved your examples folder. If you have, this is easily fixed by changing the RL-GLUE variable in the makefile, on the following line:

# Path to RL-Glue to find RL_glue and related files.
RL-GLUE = ../../RL-Glue

You must replace ../../RL-Glue with the correct path to your RL-Glue directory. In general, ./ refers to the current directory (where the makefile is), ../ refers to the directory one level up (../../ goes up two levels, and so on), and ../Code/RL-Glue goes up one directory and then into the directory named Code to find the RL-Glue directory. (For a more detailed explanation of navigating a file system, consult a dedicated guide.)

The second assumption is that your agent, environment and experiment program are in the src directory inside the directory that holds the makefile. If you have moved them, you will have to change the following lines:

# Compile our agent, environment, and experiment
$(OBJECTS): %.o: src/%.c
    $(CC) -c $(CFLAGS) $< -o Build/$@

If you have moved your agent, environment and experiment program code together to a new location, replace src/ with the path to that location. If you want to put your agent, environment and experiment program in separate locations, you will have to write a target for each piece (the rule above is a single target covering all three). For example, if we created an agent directory, an environment directory and an experiment program directory, we would change the above to look like the following. (Note that in a makefile, each command line under a target must begin with a tab character, not spaces.)

# Compile our agent, environment, and experiment
Agent.o: Agent_Folder/Agent.c
    $(CC) -c $(CFLAGS) $< -o Build/$@

Environment.o: Environment_Folder/Environment.c
    $(CC) -c $(CFLAGS) $< -o Build/$@

Experiment.o: Experiment_program_folder/Experiment.c
    $(CC) -c $(CFLAGS) $< -o Build/$@


If you haven't moved your files but want to change the name of the agent, environment or experiment program (likely a popular option, as not everyone will want to name their environment mines or their agent SarsaAgent), you must change the following line:

OBJECTS = SarsaAgent.o mines.o experiment.o

Note that these are object files, not C/C++ source files: list only the object file names. (Unless you change the build rules, each object file must have the same base name as its source file, e.g., MyAgent.c becomes MyAgent.o.)

If you need to compile other files that are in the same directory as your agent, environment and experiment program, add their object file names (File.c becomes File.o) to the OBJECTS list. If they are elsewhere, compile them with a separate target (as we did above when the agent, environment and experiment program were not in the same folder) and add them to the list of objects linked into RL_glue (File.o in the example below).

# Link our objects into RL_glue
RL_glue: RL_glue.o $(OBJECTS) File.o
    $(CC) -o $@ Build/RL_glue.o $(addprefix Build/, $(OBJECTS)) Build/File.o


The only other part of this makefile you are likely to change is the compiler settings. To change the compiler and compiler flags, edit the following lines:

# Compiler flags
CC      = gcc
CFLAGS  = -I$(RL-GLUE)/ -ansi -pedantic -Wall
LDFLAGS =  


If your code is in C++, remember to change the compiler, set by the CC variable, to g++ or your preferred C++ compiler. The compiler flags are stored in CFLAGS. Our example uses four flags; here is briefly why each is there. -I$(RL-GLUE) can be confusing because it combines a compiler flag with a makefile variable. -I is a compiler flag telling the compiler to search an additional directory for header files, and it is followed by that directory. Here it is followed by a makefile variable: $(RL-GLUE) inserts whatever is stored in the RL-GLUE variable, so -I$(RL-GLUE)/ actually means -I../../RL-Glue/. This lets the compiler find RL_common.h and RL_glue.h, which your agent, environment and experiment program need in order to compile. If you need to access other files, such as the Glue Utilities, add another -I flag to CFLAGS pointing to where those files reside. The other three flags control standard conformance and warnings: -ansi selects the ANSI C (C89) standard, -pedantic requests the warnings that standard requires, and -Wall enables many common warnings. They can be removed harmlessly, but they help you write safer code (for more details, see the gcc man page). This example needs no linker flags, which are set by LDFLAGS, but you can add your own.

After saving your changes, you should be able to type make in the directory containing your makefile to create an executable named RL_glue. Depending on the example, this executable will appear either in the working directory or in a bin directory within it. The RL_glue executable is your experiment: type ./RL_glue (or ./bin/RL_glue if the executable was put in a bin directory) to run it and wait for your results. Two other useful things to know: make clean removes the .o files, and make tidy removes old backup versions of your code (i.e., the agent.c~ files created by emacs).



Working with a Socketed or Mixed Approach
Writing agents, environments and experiments for a socketed approach involves no added restrictions or protocol changes compared with a direct approach. The only real differences are the compilation process and the ability to use a different language for each piece (which, of course, brings the strengths and weaknesses of each language along with it). It is entirely possible to take a C agent, environment, and experiment program that were previously connected directly and compile them to run over sockets without changing any of their files. That said, the following sections approach writing agents, environments and experiments more abstractly, with less emphasis on implementation details than above. Language-specific details can be found on the RL_common page. Lastly, we discuss how to compile your files to run with a socketed approach.

Writing an Experiment Program

As mentioned above, the most important thing to keep in mind when writing an experiment program, agent or environment is the RL Glue Interfaces. In any language, the RL-Glue functions are available to the experiment program, while the agent and environment functions should not (or cannot) be accessed directly. The duties of the experiment program are the same as described above: a) start the experiment, b) specify how long/how many times to run the experiment, c) extract and possibly analyze data, and d) end the experiment and clean up.

The simplest experiment program runs exactly one episode and consists of only three function calls: RL_init(), RL_episode(int), and RL_cleanup(). Most experiment programs are variations on this form. A major detail to keep in mind is that every call to RL_init must have a corresponding call to RL_cleanup. Typically, experiments require many trials/runs, each consisting of many episodes. This is done by adding loops in the appropriate places in the basic experiment program above. The basic form is as follows:

     for (number of trials)
         RL_init()
         for (number of episodes per trial)
             RL_episode(int)
         RL_cleanup()

The most important detail to note is that RL_init and RL_cleanup are called once per trial, not once per episode, because a call to RL_init resets the value function and policy.

While the above three functions handle agent learning, evaluating the agent's progress requires knowledge of the RL-Glue helper functions. RL_return and RL_num_steps provide information about the agent's performance. RL_freeze enables the experiment program to freeze an agent's policy, so that the experiment can consist of a training period followed by a testing period that evaluates the final value function and policy. RL_get_state and RL_set_state allow the experiment program to capture the state of the environment and restore it later; this does NOT affect the agent's policy. One use would be to save a particularly difficult position in the environment and return the agent to that situation after much more training, to see whether its performance has improved. RL_get_random_seed and RL_set_random_seed save and restore the state of the "random" numbers the environment uses to simulate randomness in its state transition and reward functions. Note that you will likely want to initialize the random seed when the environment starts up. If the experiment program's goal is to train the agent on a particular sequence of events repeatedly, the random seed functions can be used to ensure "consistent randomness" over those events.
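For the random seed functions to work, the environment needs randomness whose state can be captured. One way, sketched here, is for the environment to own its random number generator instead of relying on the global rand(). The sketch assumes a simple linear congruential generator (1664525 and 1013904223 are the well-known Numerical Recipes constants); env_random, save_rng and restore_rng are hypothetical names for what env_get_random_seed and env_set_random_seed could pack into and unpack from a Random_seed_key.

```c
/* Hypothetical environment-owned generator state, seeded at start-up. */
static unsigned long rng_state = 12345UL;

/* Advance a 32-bit linear congruential generator and return its new
 * state, a value in [0, 2^32). Illustrative only: not a statistically
 * strong generator. */
unsigned long env_random(void)
{
  rng_state = (1664525UL * rng_state + 1013904223UL) & 0xFFFFFFFFUL;
  return rng_state;
}

/* What env_get_random_seed could store in a Random_seed_key. */
unsigned long save_rng(void)
{
  return rng_state;
}

/* What env_set_random_seed could restore from a saved key. */
void restore_rng(unsigned long saved)
{
  rng_state = saved;
}
```

Restoring a saved state replays exactly the same sequence of "random" transitions, which is the consistent randomness described above.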

RL_env_message and RL_agent_message can be used to communicate with the environment or agent, for example to change internal parameters or to extract data that cannot be obtained through any of the other RL functions. For more details, see env_message or agent_message.

Lastly, RL_start and RL_step are used internally by RL_episode, but the experiment program can call them directly as well. If the experiment program needs to print a trace of the actions taken, the observations seen, or the rewards given, using RL_start and RL_step instead of RL_episode provides access to that data. The data types returned by RL_start and RL_step are described on the RL_common page.
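The sketch below drives one episode by hand with RL_start and RL_step and prints each transition. The return types and the RL_start/RL_step stubs (a three-step corridor) are hypothetical stand-ins so that it compiles alone; the real types in RL_common.h carry full Observation and Action records rather than single ints.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the RL_common.h return types. */
typedef struct { int o; int a; } Observation_action;
typedef struct { double r; int o; int a; int terminal; } Reward_observation_action_terminal;

/* Hypothetical stubs for a 3-step corridor; delete them and include
 * RL_glue.h when linking against RL-Glue. */
static int stub_pos = 0;
static Observation_action RL_start(void)
{
    Observation_action oa;
    stub_pos = 0;
    oa.o = stub_pos;
    oa.a = 1;
    return oa;
}
static Reward_observation_action_terminal RL_step(void)
{
    Reward_observation_action_terminal roat;
    stub_pos++;
    roat.r = -1.0;
    roat.o = stub_pos;
    roat.a = 1;
    roat.terminal = (stub_pos == 3);
    return roat;
}

/* Run one episode step by step, printing every transition.
 * Returns the number of steps taken (at most max_steps). */
static int trace_episode(int max_steps)
{
    int steps = 0;
    Observation_action oa = RL_start();
    Reward_observation_action_terminal roat;

    printf("start: obs=%d action=%d\n", oa.o, oa.a);
    do {
        roat = RL_step();
        steps++;
        printf("step %d: reward=%.1f obs=%d action=%d terminal=%d\n",
               steps, roat.r, roat.o, roat.a, roat.terminal);
    } while (!roat.terminal && steps < max_steps);
    return steps;
}
```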

Writing an Environment

One of the changes in RL-Glue 2.0 is that all functions in the interface must be defined. For the Environment, this means writing the following nine functions:

    Task_specification env_init()
    Observation env_start()
    Reward, Observation, Terminal env_step(Action)
    void env_cleanup()
    void env_set_state(State_key)
    void env_set_random_seed(Random_seed_key)
    State_key env_get_state()
    Random_seed_key env_get_random_seed()
    Message env_message(Message)

Depending on which language is being used, the underlying data structures for the above data types (such as Observation, Reward and Task_specification) will differ. For each language RL-Glue currently supports, it provides an RL_common file with the definitions of these data types (see the RL_common page for your language). For example, when writing in C one would use RL_common.h, and when using Python one would use RL_common.py. It is imperative to use the types defined in the RL_common file. In C, the header file RL_common.h must be #included by the environment file; each other language has an analogous step, such as an import statement in Python.

The best place to start when writing an environment is deciding how to represent its state. According to the RL_common definition of an Observation, whatever is passed to the agent must be expressible as a collection of ints and/or doubles. Internally, you may represent a state with any data type you choose, be it a string, an int, or even an image; the caveat is that there must be a way of extracting a representation made entirely of ints and/or doubles to pass out of the environment. Take the cat and mouse game as an example. The environment could internally keep a map of the world, storing where the edges, traps and walls are in any manner it likes, so long as it can relay the position of the mouse and the position of the cat to the agent through ints and doubles. One implementation may enumerate all possible combinations of mouse and cat positions and return a single int as the Observation. Another possibility is to return the x and y coordinates of both the mouse and the cat. Deciding how to represent the Observation is a key design decision.
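Both Observation encodings from the cat and mouse example can be sketched in C as follows. The Observation struct is a simplified stand-in for the RL_common.h definition (a fixed-size array instead of a pointer), and the 5 x 5 grid size is an assumption for illustration.

```c
/* Hypothetical stand-in for the RL_common.h Observation record. */
typedef struct {
    int numInts;
    int intArray[4];   /* fixed size only to keep the sketch short */
} Observation;

#define GRID_W 5
#define GRID_H 5

/* Option 1: enumerate every (mouse, cat) placement as a single int. */
static Observation encode_enumerated(int mx, int my, int cx, int cy)
{
    Observation obs;
    int mouse_cell = my * GRID_W + mx;     /* 0 .. GRID_W*GRID_H-1 */
    int cat_cell   = cy * GRID_W + cx;
    obs.numInts = 1;
    obs.intArray[0] = mouse_cell * (GRID_W * GRID_H) + cat_cell;
    return obs;
}

/* Option 2: report the raw x and y coordinates as four ints. */
static Observation encode_coordinates(int mx, int my, int cx, int cy)
{
    Observation obs;
    obs.numInts = 4;
    obs.intArray[0] = mx;
    obs.intArray[1] = my;
    obs.intArray[2] = cx;
    obs.intArray[3] = cy;
    return obs;
}
```

On a 5 x 5 grid the enumerated form yields one int in the range 0 to 624, which suits a tabular agent; the coordinate form keeps the spatial structure, which suits function approximation.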

After determining how to represent the current state of the environment, the next step is to define the state transition function and the reward function. These can be highly explicit, where each state-action pair is stored in a table with a corresponding entry for the next state and reward. Alternatively, when the state/action space is much too large or is continuous, they can be computed from the current state and action. For example, in mountain car the state can be represented by the distance from the goal and the current velocity, so the next state can be calculated from these two values and the chosen action.
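As an illustration of the explicit, table-driven option, here is a hypothetical four-cell corridor whose transition and reward functions are stored as lookup tables indexed by state and action. A large or continuous problem such as mountain car would instead compute the next state from the current state and action.

```c
#define N_STATES  4
#define N_ACTIONS 2   /* 0 = left, 1 = right */

/* next_state[s][a]: successor of state s under action a.
 * Cell 3 is the goal and is absorbing. */
static const int next_state[N_STATES][N_ACTIONS] = {
    {0, 1}, {0, 2}, {1, 3}, {3, 3}
};

/* reward_table[s][a]: -1 per move, 0 for reaching or staying at the goal. */
static const double reward_table[N_STATES][N_ACTIONS] = {
    {-1.0, -1.0}, {-1.0, -1.0}, {-1.0, 0.0}, {0.0, 0.0}
};

/* One transition: look up the reward and successor for (state, action). */
static int table_step(int state, int action, double *reward)
{
    *reward = reward_table[state][action];
    return next_state[state][action];
}
```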

Env_init provides a place to initialize any data structures needed to represent the environment, as well as any data which must persist between episodes (recall that an episode is one call to RL_start followed by a sequence of calls to RL_step). Lastly, env_init must return the task specification. This is a string (or string-like data structure; check the RL_common page for your language) which specifies the details of the Observation and Action spaces. The protocol for writing one is described on the task_specification page.
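A minimal sketch of an env_init/env_cleanup pair for a gridworld follows. Task_specification is a stand-in typedef, the wall map is a hypothetical piece of data that persists across episodes, and the returned string is only a placeholder: a real environment must build it according to the task_specification protocol.

```c
#include <stdlib.h>

typedef char *Task_specification;   /* hypothetical stand-in */

/* Data that must persist across episodes: a 5 x 5 wall map. */
static int *wall_map = NULL;
static int map_cells = 0;

/* Placeholder only: this text is NOT a valid task specification. */
static char task_spec[] = "<task specification goes here>";

Task_specification env_init(void)
{
    map_cells = 25;
    wall_map = (int *)calloc((size_t)map_cells, sizeof(int));  /* all open */
    if (wall_map != NULL)
        wall_map[12] = 1;           /* one wall in the centre cell */
    return task_spec;
}

void env_cleanup(void)
{
    free(wall_map);                 /* matches the calloc in env_init */
    wall_map = NULL;
    map_cells = 0;
}
```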

Env_start is the official start of an episode. It initializes any values which need to be set on a per-episode basis and returns the initial Observation based on placing the agent in the start state. After env_start is called, the action-observation-reward cycle between the agent and environment begins: the agent takes the start state, chooses an action, and then waits for the environment to specify the consequences. These consequences come from env_step. In env_step, the environment accepts the agent's action and uses it to determine which state to transition to and what reward was earned for taking that action.

Note that env_start returns only an Observation, whereas env_step returns the reward, the Observation, and the "terminal" flag. Again, determining how to return these for your language is critical: in languages limited to a single return value, the three elements must be put into a wrapper (check the RL_common page). If the agent enters a terminal state (in the episodic case), the terminal flag is set. In some languages this is done with an int set to 1, in others with a boolean; again, check RL_common for the data type used in your language of choice.
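The sketch below shows env_start and env_step for a hypothetical five-cell corridor (action 0 moves left, action 1 moves right, cell 4 is the goal). In C, env_step bundles its three results into one struct; the types here are simplified stand-ins for those in RL_common.h, so take the real field names from that header.

```c
/* Hypothetical stand-ins for RL_common.h types. */
typedef struct { int numInts; int *intArray; } Observation;
typedef struct { int numInts; int *intArray; } Action;
typedef double Reward;
/* C has one return value, so reward, observation and terminal flag
 * travel together in a wrapper struct. */
typedef struct { Reward r; Observation o; int terminal; } Reward_observation;

#define CORRIDOR_LEN 5          /* cells 0..4, goal is cell 4 */
static int position;            /* per-episode state */
static int obs_buffer[1];       /* storage behind the returned Observation */

static Observation make_obs(void)
{
    Observation o;
    obs_buffer[0] = position;
    o.numInts = 1;
    o.intArray = obs_buffer;
    return o;
}

Observation env_start(void)
{
    position = 0;               /* reset per-episode values */
    return make_obs();
}

Reward_observation env_step(Action a)
{
    Reward_observation ro;
    /* Move, but never leave the corridor. */
    if (a.intArray[0] == 1 && position < CORRIDOR_LEN - 1)
        position++;
    else if (a.intArray[0] == 0 && position > 0)
        position--;
    ro.o = make_obs();
    ro.terminal = (position == CORRIDOR_LEN - 1);
    ro.r = ro.terminal ? 0.0 : -1.0;   /* -1 per step until the goal */
    return ro;
}
```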

Env_cleanup is called at the end of each set of episodes (that is, at the end of a trial). Every call to env_init should be matched by a corresponding call to env_cleanup, and env_cleanup is used to de-allocate any memory allocated in env_init. The environment functions above form the core group required by RL-Glue. The remaining environment functions must still be defined, but can be empty functions, or functions that return dummy values, if they are not needed.
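For an environment that does not need the optional functions, the required-but-unused ones can be written as empty functions returning dummy values, roughly as below. The State_key, Random_seed_key and Message types are stand-ins for the RL_common.h definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for RL_common.h types. */
typedef struct { int numInts; int *intArray; } State_key;
typedef struct { int numInts; int *intArray; } Random_seed_key;
typedef char *Message;

/* Required by the interface but unused here: do nothing. */
void env_set_state(State_key sk) { (void)sk; }
void env_set_random_seed(Random_seed_key rsk) { (void)rsk; }

/* Dummy values: empty keys and an empty reply string. */
State_key env_get_state(void)
{
    State_key empty = {0, NULL};
    return empty;
}

Random_seed_key env_get_random_seed(void)
{
    Random_seed_key empty = {0, NULL};
    return empty;
}

Message env_message(Message in)
{
    static char reply[] = "";
    (void)in;
    return reply;
}
```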

Env_set_state and env_get_state allow the experimenter to capture the environment in a state and reload that state later in the experiment. The captured state does not include the agent's value function, however, so upon returning to that state later the agent may behave differently. One use for these functions is to focus training on a region by repeatedly placing the agent back in a certain portion of the state space.
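For the cat and mouse world, a state capture could copy the two positions into a State_key and restore them later, as sketched here. State_key is a simplified stand-in type, and the two global cell indices are a hypothetical internal representation.

```c
/* Hypothetical stand-in for the RL_common.h State_key record. */
typedef struct { int numInts; int intArray[2]; } State_key;

static int mouse_cell, cat_cell;   /* the environment's complete state */

/* Snapshot everything needed to recreate the current situation. */
State_key env_get_state(void)
{
    State_key sk;
    sk.numInts = 2;
    sk.intArray[0] = mouse_cell;
    sk.intArray[1] = cat_cell;
    return sk;
}

/* Put the world back exactly as it was when the key was taken. */
void env_set_state(State_key sk)
{
    mouse_cell = sk.intArray[0];
    cat_cell = sk.intArray[1];
}
```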

Env_get_random_seed and env_set_random_seed are similar to env_get_state and env_set_state in their ability to help capture the state of the environment. If an environment has some randomness in its behaviour, the random seed functions can be used to "save" and "reload" that randomness to regenerate the same sequence of events. For example, if an agent was behaving oddly on a particular sequence of state transitions, one could get the state and the random seed once and then reset both over and over to generate repeated experience on that particular sequence of events. It is also often a good idea to set the random seed from the experiment program at the beginning of the learning experiment.
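One way to make the randomness capturable is for the environment to own its random number generator, so that the generator's internal state is the "seed" that gets saved and restored. This sketch uses a simple 32-bit linear congruential generator (the well-known constants 1664525 and 1013904223) purely for illustration; Random_seed_key is a stand-in type and env_random is a hypothetical helper.

```c
/* Hypothetical stand-in for the RL_common.h Random_seed_key record. */
typedef struct { int numInts; unsigned long intArray[1]; } Random_seed_key;

static unsigned long rng_state = 12345UL;   /* the environment's RNG state */

/* 32-bit linear congruential generator: state = a * state + c (mod 2^32). */
static unsigned long env_random(void)
{
    rng_state = (rng_state * 1664525UL + 1013904223UL) & 0xFFFFFFFFUL;
    return rng_state;
}

Random_seed_key env_get_random_seed(void)
{
    Random_seed_key k;
    k.numInts = 1;
    k.intArray[0] = rng_state;
    return k;
}

void env_set_random_seed(Random_seed_key k)
{
    rng_state = k.intArray[0];
}
```

Restoring a saved key makes env_random repeat exactly the numbers it produced after that key was captured.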

If some functionality is missing from the environment functions, it can easily be added using the env_message function, which has no required behaviour. When an experimenter needs to change an element of the environment mid-experiment, or to gather data from the environment, env_message is the solution. For example, if the goal is to test an agent in a changing environment, after a few episodes a message could be passed through RL_env_message telling the environment to add a wall to a gridworld or change the physics of the mountain car environment. Alternatively, env_message could be used to gain information about the current state of the environment, provided that this information can be conveyed in a string. The functionality of env_message is up to the author of the environment.
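The wall example can be sketched as below. The message strings ("addwall N" and "count walls") form a hypothetical protocol invented for this illustration, since RL-Glue leaves the content of messages entirely to the environment's author; Message is a stand-in typedef.

```c
#include <stdio.h>
#include <string.h>

typedef char *Message;            /* hypothetical stand-in */

#define CELLS 25
static int walls[CELLS];          /* 1 = wall, 0 = open */
static char reply[64];

/* Hypothetical protocol: "addwall N" puts a wall in cell N,
 * "count walls" replies with the number of walls. */
Message env_message(Message in)
{
    int cell, i, n = 0;

    if (sscanf(in, "addwall %d", &cell) == 1 && cell >= 0 && cell < CELLS) {
        walls[cell] = 1;
        strcpy(reply, "ok");
    } else if (strcmp(in, "count walls") == 0) {
        for (i = 0; i < CELLS; i++)
            n += walls[i];
        sprintf(reply, "%d", n);
    } else {
        strcpy(reply, "unknown message");
    }
    return reply;
}
```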

Writing an Agent

Just like the environment, the agent must implement all of the agent functions in the RL-Glue interface and follow the type definitions for Observations, Actions, Rewards and all the other input/output variable types. The following seven functions must be defined in agent code (though RL-Glue will not complain if they are empty or return empty values):
   
    void agent_init(Task_specification)
    Action agent_start(Observation)
    Action agent_step(Reward , Observation)
    void agent_end(Reward)
    void agent_cleanup()
    void agent_freeze()
    Message agent_message(Message)

The purpose of agent_init is very similar to that of env_init. Any memory or resources required throughout the agent's code should be allocated and initialized in agent_init. Values which must be reset every trial (but persist across the episodes of a trial) should also be initialized here. Within agent_init you can parse the task specification to discover more details about the action and observation spaces. Based on this knowledge the agent can determine how to store its value function: with small discrete action and observation spaces, a tabular value function may be stored in an array-like structure, whereas a continuous space may require function approximation or some other approach. How the value function is stored determines what memory is allocated and initialized.
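A minimal sketch of the tabular case, allocating and initializing a state-action value table as agent_init might, is shown below. The numbers of states and actions are passed in directly because parsing the task specification is not shown, and the helper names are hypothetical.

```c
#include <stdlib.h>

static double *Q = NULL;          /* value table, n_states x n_actions */
static int n_states = 0, n_actions = 0;

/* Allocate Q and set every entry to `initial`. Returns 1 on success. */
static int allocate_value_table(int states, int actions, double initial)
{
    int i;
    n_states = states;
    n_actions = actions;
    Q = (double *)malloc((size_t)states * (size_t)actions * sizeof(double));
    if (Q == NULL)
        return 0;
    for (i = 0; i < states * actions; i++)
        Q[i] = initial;
    return 1;
}

/* Row-major access to Q(s, a). */
static double *q_entry(int s, int a) { return &Q[s * n_actions + a]; }

/* What agent_cleanup would call to release the table. */
static void free_value_table(void)
{
    free(Q);
    Q = NULL;
    n_states = n_actions = 0;
}
```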
   
Every episode takes its first step from a call to agent_start. Given the first observation (coming from env_start), the agent chooses an action according to its initial policy. No learning is done in agent_start, as no rewards have yet been earned. In agent_step the agent again selects an action based on the current policy, but it should also perform the value updates and policy improvements required for learning. The parameters passed to agent_step are the reward received for the previous action and the current observation. To perform useful learning updates, it will likely be helpful to save the previous action and observation.
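A sketch of agent_start and agent_step for a tabular Sarsa agent with epsilon-greedy action selection follows; it saves the previous state and action for the update Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a)). Observations and actions are plain ints here, and the table size and parameter values are assumptions; a real agent would use the RL_common.h types and size its table in agent_init.

```c
#include <stdlib.h>

#define N_S 4                      /* assumed number of states */
#define N_A 2                      /* assumed number of actions */

static double Q[N_S][N_A];         /* zero-initialized value table */
static double alpha = 0.1, discount = 0.9, epsilon = 0.1;
static int last_state, last_action;   /* remembered for the next update */

/* Epsilon-greedy: explore with probability epsilon, else act greedily. */
static int choose_action(int s)
{
    int a, best = 0;
    if ((double)rand() / RAND_MAX < epsilon)
        return rand() % N_A;
    for (a = 1; a < N_A; a++)
        if (Q[s][a] > Q[s][best])
            best = a;
    return best;
}

/* First step of an episode: no reward yet, so no learning. */
int agent_start(int obs)
{
    last_state = obs;
    last_action = choose_action(obs);
    return last_action;
}

/* Sarsa update for the previous (state, action), then act. */
int agent_step(double reward, int obs)
{
    int next_action = choose_action(obs);
    Q[last_state][last_action] += alpha *
        (reward + discount * Q[obs][next_action] - Q[last_state][last_action]);
    last_state = obs;
    last_action = next_action;
    return next_action;
}
```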

In an episodic environment, RL-Glue calls agent_end when the agent enters a terminal state. A call to agent_end allows the agent to make any final updates to the value function and policy. At the end of a trial, agent_cleanup is called to allow the deallocation of any resources created in agent_init. Calls to agent_init and agent_cleanup should always be paired one-to-one.
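At a terminal state there is no next state-action pair, so a Sarsa-style agent_end can update toward the final reward alone. This sketch assumes a value table allocated in agent_init (not shown) and the previous state and action saved by agent_step; the parameter values are illustrative.

```c
#include <stdlib.h>

static double *Q = NULL;          /* allocated in agent_init (not shown) */
static int n_actions = 2;
static int last_state, last_action;
static double alpha = 0.1;

/* Terminal update: the target is just the final reward. */
void agent_end(double reward)
{
    double *q = &Q[last_state * n_actions + last_action];
    *q += alpha * (reward - *q);
}

/* Release what agent_init allocated; called once per trial. */
void agent_cleanup(void)
{
    free(Q);
    Q = NULL;
}
```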

Two functions new to RL-Glue as of version 2.0 are agent_freeze and agent_message. A call to agent_freeze should stop all learning in the agent and also inhibit exploratory (random) moves. This allows for distinct training and testing phases. Agent_message is a very generic function, open to the author's interpretation. The only requirements are that agent_message take in a string and return a string; there are no stipulations on the organization or content of the string. One use could be to let an external source change the alpha learning parameter in a Sarsa agent: the author of agent_message could specify that any input string of the form "Alpha:x" changes the alpha value to x. As another example, if the author wanted to view the values for a certain state, a string describing that state could be passed in and its values returned as a string. Anything you can imagine and implement is allowed with agent_message.
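The freeze behaviour and the "Alpha:x" message from the text could be implemented roughly as follows. The reply format "alpha=x" and the frozen flag (which a real agent_step would check before learning) are assumptions of this sketch, and Message is a stand-in typedef.

```c
#include <stdio.h>
#include <string.h>

typedef char *Message;            /* hypothetical stand-in */

static double alpha = 0.1, epsilon = 0.1;
static int frozen = 0;            /* agent_step would skip updates when set */
static char reply[64];

/* Stop learning and exploring for a test phase. */
void agent_freeze(void)
{
    frozen = 1;
    epsilon = 0.0;                /* purely greedy from now on */
}

/* "Alpha:x" sets the learning rate to x; anything else is rejected. */
Message agent_message(Message in)
{
    double value;
    if (sscanf(in, "Alpha:%lf", &value) == 1) {
        alpha = value;
        sprintf(reply, "alpha=%g", alpha);
    } else {
        strcpy(reply, "unrecognized");
    }
    return reply;
}
```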

      
How to Compile and Run your Experiment

In a socketed setup, only the parts of the code which have changed need to be recompiled: if only the agent changes, only the agent is recompiled. A sample makefile in five parts is provided with the example network C code and can easily be adapted to your own needs. The five parts are: makefile, RL_agent.makefile, RL_environment.makefile, RL_glue.makefile, and RL_experiment.makefile.

The main makefile should not require any changes; it only points to the other makefiles that need to be included. Two settings in the four sub-makefiles should be changed in unison: the location of RL-Glue and the location of the build directory. Each makefile defines two variables for these, RL-GLUE and BUILD_PATH. For the makefiles to work, RL-GLUE must hold the correct path to the RL-Glue directory on your computer.
In general, ./ refers to the current directory (where the makefile is held), ../ refers to the directory one level up (../../ goes up two levels, and so on), and ../Code/RL-Glue goes up one level and then into the directory named Code to find the RL-Glue directory (consult a guide on file-system paths for more detail). BUILD_PATH doesn't need to be the same in all four files, but keeping it the same is good practice. It can be changed in the same way as the RL-GLUE path.

Another similarity across all four makefiles in this example is the compiler and compiler-flags section, which can be set differently for each file. It appears as follows in all four files:

CC      = gcc 
CFLAGS  = -I$(RL-GLUE)/ -ansi -pedantic -Wall
LDFLAGS =

If your code is in C++, change the compiler, denoted by the CC variable, to g++ or your preferred C++ compiler. The compiler flags are held in CFLAGS; this example uses four flags, discussed briefly here. -I$(RL-GLUE) is slightly confusing, as it fuses a compiler flag with a makefile variable reference. -I is a compiler flag telling the compiler to search an additional directory for header files, and it should be followed by that directory; here it is followed by $(RL-GLUE), which make replaces with whatever is stored in the RL-GLUE variable. So if RL-GLUE is set to ../../RL-Glue, the flag really says -I../../RL-Glue/. This lets the compiler find RL_common.h and RL_glue.h, which your agent, environment and experiment program need in order to compile. If you need to access other files, such as the Glue Utilities, add another -I flag to CFLAGS pointing to where those files reside. Of the remaining flags, -ansi selects the ISO C90 standard, while -pedantic and -Wall request additional warnings from the compiler. These can be removed, but they help you catch questionable code (see the gcc man page for details). No linker flags, denoted by LDFLAGS, were needed in this example, but they can be added here.

RL_glue.makefile is probably the most confusing, or at least the largest, of the four sub-makefiles. Apart from the RL-GLUE and BUILD_PATH variables, it should not be touched unless you are going to compile in mixed mode. If you are compiling in mixed mode, you will need to change a number of things throughout the other makefiles as well. For example, if we wanted to compile the agent in with the glue, we would have to change:

OBJECTS = RL_glue.o RL_network.o RL_server_agent.o RL_server_environment.o RL_server_experiment.o

by adding our agent object files to this list. In this example, the above line would become:

OBJECTS = RL_glue.o RL_network.o RL_server_agent.o RL_server_environment.o RL_server_experiment.o SarsaAgent.o Glue_utilities.o RL_client_agent.o

You also need to remove the following lines from the RL_agent.makefile:

RL_agent: $(OBJECTS)
    $(CC) -o $@ $(addprefix $(BUILD_PATH)/, $(OBJECTS))

Lastly, ensure that BUILD_PATH is the same in RL_agent.makefile and RL_glue.makefile, or that RL_glue.makefile knows where to look for the agent object files.

If you are sticking to a pure network compilation process, you can read below to learn how to change the provided makefile in the Network C example for yourself. If you are interested in a mixed mode, there are examples of makefiles in mixed compilation mode with some of the other Examples provided with RL-Glue.

To adapt RL_agent.makefile to your own C/C++ agent, you only need to change every instance of SarsaAgent to the name of your agent. If your agent requires other files (for example, if you wrote your own helper functions in a separate file), add the corresponding object file to the agent's object list and then add a build rule for the new file. For example, if your agent needed the file helper.c, the object list would become:

AGENT_OBJECTS = SarsaAgent.o Glue_utilities.o RL_client_agent.o RL_network_agent.o helper.o

You would then add a build rule for the helper file like this:

helper.o: helper.c
    $(CC) -c $(CFLAGS) $< -o $(BUILD_PATH)/$@

The same is true of RL_environment.makefile and RL_experiment.makefile: you only need to replace the example names mines and experiment with the names of your own environment and experiment program. Other files can be included by adding their object files to the OBJECTS variable and adding a build rule for each new file.

To compile components written in other languages, replace RL_environment.makefile, RL_agent.makefile and RL_experiment.makefile with makefiles that build your environment, agent and experiment in their own languages. The only stipulation is that they are built with the provided RL_client files for that language. As support for each language is added to RL-Glue, an example using that language is provided to help you write your own makefile.

After saving your changes to the makefiles, type make in the directory containing the main makefile; it will create four executables named RL_glue, RL_agent, RL_experiment and RL_environment. Start RL_glue by typing ./RL_glue in your terminal, then connect your agent, environment and experiment by running the three corresponding executables in any order. You can connect and disconnect your agent, environment and experiment as many times as you want, but once all three have connected the experiment will run through to the end. Other useful things to know: make clean removes the .o files, and make tidy removes old backup versions of your code (i.e. the agent.c~ files created by emacs).