Reinforcement Learning and Artificial Intelligence (RLAI)
Frequently Asked Questions about the RL-Glue

Edited by Leah Hackman, June 19, 2007

The ambition of this page is to provide answers to common questions about the RL-Glue.




       How ...
          ... can RL-Glue handle sampling the same trajectory a number of times consecutively?
          ... can I create a set of random starting states?
          ... can my agent distinguish between training (e.g. exploration) and testing phases?
          ... would I test after every 10 training runs (i.e., collect statistics after every tenth iteration)?
          ... is the "environmental state" stored in RL-Glue?

       What ...
          ... does Observation mean?
          ... task specification are we supposed to use?

       Why...
          ... is there no RL_unfreeze or RL_frozen?

       Problems:
          Using multiple dimensional reward signals (typedef double* Reward).
          Some combinations of environments and agents did not seem to work properly.
          Average reward and cumulative reward output.
 


Question: What languages can I write my agent and environment files in?

Answer: As of RL-Glue 2.0, C, C++, and Python are supported. We hope to add Java and Lisp soon.


Question: What does Observation mean? Why does the RL-Glue not pass around "states"?

Answer: Observation is a more general term, of which the concept of state (for example, the state of mountain car) is a special case. An observation can be an array of doubles, an int, or whatever else the task requires. This is controlled in the common types file for each agent-environment pair.


Question: Where is the "environmental state" stored in RL-Glue? In other systems, such as CLSquare, the old state is passed to the environment step function.

Answer: The environment in RL-Glue is responsible for keeping track of the current "state" and computing the next "state" given an action. The old state need not be passed in: the state stays within the environment. The next_state method in CLSquare is basically the same as env_step in RL-Glue.
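
As a minimal sketch of what "the state stays within the environment" can look like, the toy chain environment below keeps its position in a file-level static and lets env_step receive only the action. The typedefs, the Reward_observation struct layout, and the GOAL constant are stand-ins defined for this example, not copied from the RL-Glue headers.

```c
#include <assert.h>

/* Stand-in types for this sketch; a real project takes these from the
   common types file of the agent-environment pair. */
typedef int Observation;
typedef int Action;
typedef double Reward;

typedef struct {
    Reward r;        /* reward for the transition just taken */
    Observation o;   /* observation of the new state */
    int terminal;    /* 1 if the episode has ended */
} Reward_observation;

#define GOAL 5       /* hypothetical chain with positions 0..GOAL */

/* The current state is private to the environment file. */
static int current_position = 0;

Observation env_start(void)
{
    current_position = 0;
    return current_position;
}

/* Only the action comes in; the old state is read from the static above. */
Reward_observation env_step(Action a)
{
    Reward_observation ro;

    assert(a == 0 || a == 1);             /* 0 = left, 1 = right */
    current_position += (a == 1) ? 1 : -1;
    if (current_position < 0)
        current_position = 0;             /* wall at the left end */

    ro.terminal = (current_position == GOAL);
    ro.r = ro.terminal ? 0.0 : -1.0;
    ro.o = current_position;
    return ro;
}
```

The caller never holds or passes the position; it only sees the observation that comes back from each step.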


Question: Can RL-Glue handle sampling the same trajectory a number of times consecutively?

Answer: This can be done in RL-Glue using a couple of optional methods. Use env_get_state to get a key for the current environmental state, then use env_set_state to reset the environment to that state given the key. Likewise, env_get_random_seed returns a key for the random seed used by the environment, and env_set_random_seed restores it. More details on this can be found here.
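
The save-and-restore idea can be sketched with a toy noisy environment. Everything below is illustrative: the key types are assumed to be plain integers, the small random generator is private to the example, and rollout and replay_matches are hypothetical helpers, not RL-Glue functions.

```c
#include <assert.h>

/* Stand-in key types; assumed here to be plain integers. */
typedef int State_key;
typedef unsigned int Random_seed_key;

static int position = 0;                 /* environment state */
static unsigned int rng_state = 12345u;  /* environment's private RNG */

/* Small linear congruential generator so the sketch is reproducible. */
static unsigned int next_random(void)
{
    rng_state = rng_state * 1103515245u + 12345u;
    return (rng_state >> 16) & 0x7fffu;
}

State_key env_get_state(void)                 { return position; }
void env_set_state(State_key key)             { position = key; }
Random_seed_key env_get_random_seed(void)     { return rng_state; }
void env_set_random_seed(Random_seed_key key) { rng_state = key; }

/* Noisy step: the intended move is reversed whenever the draw is a
   multiple of four. */
int env_step(int action)
{
    int move = (action == 1) ? 1 : -1;
    if (next_random() % 4 == 0)
        move = -move;
    position += move;
    return position;
}

/* Apply an action sequence and record the visited positions. */
void rollout(const int *actions, int n, int *visited)
{
    int i;
    for (i = 0; i < n; i++)
        visited[i] = env_step(actions[i]);
}

/* Sample a trajectory, rewind state and seed, then sample it again.
   Returns 1 if both passes visited the same positions. */
int replay_matches(const int *actions, int n, int *first, int *second)
{
    State_key s = env_get_state();
    Random_seed_key seed = env_get_random_seed();
    int i;

    rollout(actions, n, first);
    env_set_state(s);            /* back to the saved state ...       */
    env_set_random_seed(seed);   /* ... with the same random sequence */
    rollout(actions, n, second);

    for (i = 0; i < n; i++)
        if (first[i] != second[i])
            return 0;
    return 1;
}
```

Restoring the state alone would not be enough here, because the noise in env_step depends on the generator; both keys have to be saved and restored together.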


Question: I tried to specify the reward function using multiple dimensions (typedef double* Reward) but I got a compilation error in: RL_Interface.c:78:  total_reward = total_reward + last_reward, since it assumes a single dimension. Why can't I use multi-dimensional reward signals?

Answer: The RL-Glue subscribes to the reward hypothesis: "All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)." We believe all problems can and should be represented this way, which is why the interface assumes a scalar reward.


Question: The RL-Glue gives me the following string on initialization of an old mines environment: "(1:1_1_108:1_1_4)" but it is not a valid task specification if you believe the website. The other environments use another format for the string. What specification are we supposed to use?

Answer: The inconsistencies exist because we had not settled on a task description language when the first distribution was released. We have only recently settled on a suitable one, and we are currently in the process of updating our library to bring it in line with that language. This format is not guaranteed to be the last one. We want to make a standard, so deciding these important issues is very time consuming. The website format is the one you should use.


Question: How can I create a set of random starting states?

Answer: You can use the optional methods. In particular, you would need the pairs env_get_state / env_set_state and env_get_random_seed / env_set_random_seed. To use these, the environment code must define them AND the types file (e.g. mine_common.h) must contain the lines

        #define ENV_GET_STATE
        #define ENV_SET_STATE
        #define ENV_GET_RANDOM_SEED
        #define ENV_SET_RANDOM_SEED

These lines tell the interface you will provide definitions for these methods. Then you could do something like (in a main file):

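        //Disclaimer, may not compile... :)
        #include "RLcommon.h"

        const int NUM_STARTS = 1000;
        //ASSUME State_key is defined as an int in the common types file
        State_key *vec = new State_key[NUM_STARTS];
        RL_init();
        for (int i = 0; i < NUM_STARTS; i++)
        {
            RL_start();                //begin an episode from a fresh start state
            vec[i] = RL_get_state();   //save the key for that start state
        }

        //Other code .....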

Now you have an array holding 1000 starting states.


Question: How can my agent distinguish between training (e.g. exploration) and testing phases?

Answer: A call to RL_freeze will invoke agent_freeze. agent_freeze should stop all learning and random behaviour, allowing for testing. Checking whether an agent is frozen and unfreezing an agent can optionally be implemented through RL_agent_message and agent_message; there is no standard RL function for unfreezing or checking.

Question: Why is there no RL_unfreeze or RL_frozen?

Answer: These two functions can easily be implemented through RL_agent_message and agent_message, so there is no standard RL function for unfreezing or for checking whether the agent is frozen. RL_freeze is implemented explicitly because it is very frequently used. Experiments often have a testing and a training period, which RL_freeze facilitates; it is less frequent that an agent will test, train, and then re-test. RL_freeze is provided out of convenience, not necessity. We are trying to keep the RL-Glue interface from becoming bloated with redundant functions, but when a function is requested as often as RL_freeze, we feel it is an appropriate exception to make.
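
One way an agent might handle this on its side is sketched below. The Message typedef is a stand-in assumed to be a C string, and the message strings "unfreeze" and "frozen?" and the replies are choices made for this example only; agent and experiment simply have to agree on them.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for the message type; assumed here to be a C string. */
typedef const char *Message;

static int frozen = 0;   /* 1 while learning and exploration are off */

/* Invoked when the experiment calls RL_freeze. */
void agent_freeze(void)
{
    frozen = 1;
}

/* Hypothetical message protocol: the strings handled here are not part
   of any standard, only an agreement between agent and experiment. */
Message agent_message(Message in)
{
    assert(in != NULL);

    if (strcmp(in, "unfreeze") == 0) {
        frozen = 0;              /* resume learning and exploration */
        return "ok";
    }
    if (strcmp(in, "frozen?") == 0)
        return frozen ? "yes" : "no";

    return "unknown message";
}
```

The agent's learning and action-selection code would then consult the frozen flag before updating its values or taking an exploratory action.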


Question: Some combinations of environments and agents did not seem to work since they assume a different specification (states specified as an array, single dimension, etc). For example, the agents I added do not work with problems with single dimensions, since I added for-loops which assume an array specification of the observations and actions.

Answer: Although we would all like to see it, a single agent that solves many problems is not the current state of the art. Most people write agents for particular tasks. However, we expect that an agent that works with multi-dimensional continuous state should be able to solve mountain car and, say, cart-pole. The RL-Glue fully supports this, with some setup work required in the agent_init method. That said, it's not likely that an agent made for a multi-dimensional task will work with a tabular problem. Reasonable subclasses exist, and the workshop organizers have done a good job illustrating them in the benchmark announcement. Take a look at SarsaAgent.cpp, which should work with any tabular problem. A similar multi-dimensional continuous agent could easily be extended from TileAgent.cpp.


Question:  I used the main.c file and at the end of a test run I got the following output:

        The final average reward is = 0.500000
        The final cumulative reward is = -15.000000

These two statements seem inconsistent to me. Since there were 100 episodes, shouldn't the average reward be -15/100? That is definitely not equal to 0.5. Am I doing something wrong?

Answer: No! The average reward is computed per episode (it is recomputed for each call of episode), so the value displayed is the average reward from the last episode run. The cumulative reward is not cleared between episodes: it is simply the sum of rewards from the beginning of the experiment to the last step of the last episode. This output is correct.
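
The difference between the two numbers can be illustrated with a small sketch. It assumes the average is taken per step within an episode; run_episode and the two variables are hypothetical stand-ins for this example, not the code in main.c.

```c
#include <assert.h>

static double cumulative_reward = 0.0;  /* never cleared between episodes */
static double average_reward = 0.0;     /* recomputed for every episode   */

/* Hypothetical stand-in for one episode: consume a fixed list of rewards. */
void run_episode(const double *rewards, int num_steps)
{
    double episode_sum = 0.0;
    int i;

    assert(num_steps > 0);
    for (i = 0; i < num_steps; i++) {
        episode_sum += rewards[i];
        cumulative_reward += rewards[i];   /* keeps growing across episodes */
    }
    average_reward = episode_sum / num_steps;  /* this episode only */
}
```

If a first episode yields rewards -1, -1, -1, -1 and a second yields 0, 1, then after the second episode average_reward is 0.5 while cumulative_reward is -3.0: a positive final average alongside a negative cumulative total, just as in the question.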


Question: I want to test after every 10 training runs (i.e., collect statistics after every tenth iteration). Can I do this?

Answer: So you want to run 10 episodes, find the average reward over those 10, run 10 more episodes, find the average reward over those 10, and so on?

If so, you would have to write a special collection routine for that in RL_util.h, maybe called collectStatsSkip(int x, int n), where n is how often you average (n=10 for you) and x is the episode number. So imagine a main_episodic_online.c with a run method that looked like the following:

        void run(int num_episodes)
        {
            printf("\nBeginning online learning ");
            for (int x = 0; x < num_episodes; x++)
            {
                RL_episode();
                printf(".");
                fflush(stdout);
                collectStatsSkip(x, 10);   //<<<< HERE <<<<<<<
            }
            printf("\nDone.");
        }

It should be easy to write collectStatsSkip(int x, int n). RL_util is well commented and the existing methods (collectStats) are very easy to understand.
The main take-home message here is as follows. All the main files and the stats routines in RL_util.h are examples of how your experiment might look. They by no means cover all possible setups. They were written very clearly so that others could look at them and easily modify them or create new methods for their own needs. After all, people want to run experiments their own way and collect data in their own way, so we believe this is a good approach.
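
One possible shape for collectStatsSkip is sketched below; it is not the routine from RL_util.h. The get_episode_return helper is a hypothetical stand-in for however your experiment obtains the return of the episode that just finished, and the int return value is a convenience added for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the return of the episode that just ended;
   here it is simply a value the caller sets. */
static double last_episode_return = 0.0;
static double get_episode_return(void) { return last_episode_return; }

static double block_sum = 0.0;           /* returns accumulated in this block */
static double last_block_average = 0.0;  /* average of the last full block    */

/* x is the zero-based episode number, n is how often to average.
   Returns 1 when a block of n episodes has just been completed. */
int collectStatsSkip(int x, int n)
{
    assert(n > 0);

    block_sum += get_episode_return();
    if ((x + 1) % n != 0)
        return 0;                        /* block not finished yet */

    last_block_average = block_sum / n;
    printf("\nEpisodes %d-%d: average return %f",
           x - n + 1, x, last_block_average);
    block_sum = 0.0;                     /* start the next block */
    return 1;
}
```

Called as collectStatsSkip(x, 10) from the loop in run, this would print one average for episodes 0-9, one for 10-19, and so on.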



 


Question: Does RL-Glue support multi-agent problems?

Answer: No. RL-Glue was designed for single-agent RL.