Question: What languages can I write
my agent and
environment files in?
Answer:
As of RL-Glue 2.0, C, C++, and Python are supported. We hope to
add Java and Lisp soon.
Question: What does Observation
mean? Why does the RL-Glue not pass
around "states"?
Answer: Observation is a
more general term, of which the concept of
state (for example, the state of mountain car) is a special case. An
Observation can be an array of doubles, an int, or whatever else the
task needs. This is controlled in the
common types file for each agent-environment pair.
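As a rough illustration, a common types file might look something like the sketch below. The names OBS_DIM and observation_sum are hypothetical, and the typedefs are assumptions about what a mountain-car style pairing could use, not a copy of any file shipped with RL-Glue.

```c
#include <assert.h>

/* Hypothetical common types file for one agent-environment pair. */
#define OBS_DIM 2                 /* assumed: position and velocity */

typedef double *Observation;      /* an array of OBS_DIM doubles */
typedef int Action;               /* a single discrete action */
typedef double Reward;            /* always a scalar */

/* A tabular pairing could instead use: typedef int Observation; */

/* Example consumer: sums the components of an observation. */
double observation_sum(Observation o)
{
    assert(o != 0);
    double total = 0.0;
    for (int i = 0; i < OBS_DIM; i++)
        total += o[i];
    return total;
}
```

Because both the agent and the environment include the same file, changing the typedef changes what is passed between them.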
Question: Where is the
"environmental state" stored in RL-Glue? In
other systems, such as CLSquare, the old state is passed to the
environment step function.
Answer: The environment in
RL-Glue is responsible for keeping
track of the current "state" and computing the next "state" given an
action. Old state need not be passed. The state stays within the
environment. The next_state
method in CLSquare is basically the same as env_step in RL-Glue.
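A minimal sketch of an environment that keeps its own state is shown below. The signatures are simplified assumptions (the real env_step returns more than an observation), and the one-dimensional walk is a made-up task used only for illustration.

```c
#include <assert.h>

typedef int Action;          /* assumed: -1 = left, +1 = right */
typedef int Observation;     /* assumed: the agent sees the position */

/* The state lives inside the environment and is never passed in. */
static int position = 0;

/* Hypothetical, simplified signatures for illustration only. */
Observation env_start(void)
{
    position = 0;            /* reset the stored state */
    return position;
}

Observation env_step(Action a)
{
    assert(a == -1 || a == 1);
    position += a;           /* next state computed from the stored state */
    return position;
}
```

The caller only ever supplies an action; the old state is read from the environment's own variable.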
Question: Can RL-Glue
handle sampling the same trajectory a number of
times consecutively?
Answer: This can be done in
RL-Glue using a couple of optional
methods. Use env_get_state
to get a key for the current environmental state, then use env_set_state
to initialize the environment to a particular state given the key. In
the same fashion, env_get_random_seed returns the random seed used by
the environment, and env_set_random_seed resets it. More details on
this can be found here.
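The idea can be sketched with a toy environment, as below. Everything here is an assumption made for illustration: the key types are plain integers, the signatures are simplified, and noisy_move, run_trajectory and replay_matches are hypothetical helpers, not RL-Glue functions.

```c
#include <assert.h>

typedef int State_key;                 /* assumed: the key is the state itself */
typedef unsigned int Random_seed_key;  /* assumed: the key is the generator state */

static int position = 0;               /* environment state */
static unsigned int rng = 12345u;      /* environment's random generator state */

/* Small linear congruential generator, used here only for illustration. */
static int noisy_move(void)
{
    rng = rng * 1103515245u + 12345u;
    return (int)((rng >> 16) % 3u) - 1;    /* -1, 0 or +1 */
}

State_key env_get_state(void)               { return position; }
void env_set_state(State_key k)             { position = k; }
Random_seed_key env_get_random_seed(void)   { return rng; }
void env_set_random_seed(Random_seed_key k) { rng = k; }

int env_step(int action)
{
    position += action + noisy_move();
    return position;
}

/* Runs `steps` steps with a fixed action and returns the final position. */
int run_trajectory(int steps)
{
    assert(steps >= 0);
    for (int i = 0; i < steps; i++)
        env_step(1);
    return position;
}

/* Samples the same trajectory twice and reports whether the runs matched. */
int replay_matches(int steps)
{
    State_key s = env_get_state();
    Random_seed_key r = env_get_random_seed();
    int first = run_trajectory(steps);

    env_set_state(s);            /* rewind the state */
    env_set_random_seed(r);      /* rewind the randomness */
    int second = run_trajectory(steps);
    return first == second;
}
```

Restoring only the state would not be enough for a stochastic environment; the seed has to be restored as well for the two runs to agree.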
Question:
I tried to specify the reward function using multiple dimensions
(typedef double* Reward) but I got a compilation error in:
RL_Interface.c:78: total_reward = total_reward + last_reward,
since it assumes a single dimension. Why can't I use multi-dimensional
reward signals?
Answer: The RL-Glue
subscribes to the reward hypothesis: All of what we mean by goals and
purposes can be well thought of as maximization of the expected value
of the cumulative sum of a received scalar signal (reward). We believe
all problems can and should be represented this way.
Question: The RL-Glue gives me the
following string on initialization of an old mines environment:
"(1:1_1_108:1_1_4)" but it is not a valid task specification if you
believe the website. The other environments use another format for
the string. What specification are we supposed to use?
Answer: There are some
inconsistencies because we had not settled on a task description
language when the first distribution was released. We have only
recently settled on a suitable task description language, and we are
currently updating our library to bring it in line with that language.
This is not guaranteed to be the last one. We want
to make a standard, so deciding these important issues is very time
consuming. The website format is the one you should use.
Question: How can I create a set of
random starting states?
Answer: You can use the
optional methods. In particular you would need [env_get_state(),
env_set_state()] and [env_get_random_seed(),
env_set_random_seed()]. To use these, the environment code must define
them AND the types file (e.g. mine_common.h) must contain the lines
that enable them. These lines tell the interface that you will provide
definitions for these methods. Then you could do something like (in a
main file):
//Disclaimer, may not compile... :)
#include "RLcommon.h"
//ASSUME State_key is defined as an int in that header
State_key *vec = new State_key[1000];
RL_init();
for (int i = 0; i < 1000; i++)
{
    RL_start();              //start a new episode
    vec[i] = RL_get_state(); //save the key of its starting state
}
//Other code .....
Now you have an array holding 1000 starting states.
Question: How can my agent
distinguish between training (e.g. exploration) and testing phases?
Answer: A call to RL_freeze
will invoke agent_freeze. agent_freeze should stop all learning and
random behaviour, allowing for testing. Checking whether an agent is
frozen, and unfreezing it, can optionally be implemented through
RL_agent_message and agent_message; there is no standard RL function
for unfreezing or checking.
Question:
Why is there no RL_unfreeze or RL_frozen?
Answer: These two functions can easily be implemented through
RL_agent_message and agent_message. There is no standard RL function
for unfreezing or checking if the agent is frozen. RL_freeze is
implemented explicitly because it is very frequently used. Experiments
often have a training period followed by a testing period, which
RL_freeze facilitates; it is less common for an agent to
test, train and then re-test. RL_freeze is provided out of convenience,
not necessity. To keep the RL-Glue interface from becoming bloated, we
try to avoid adding redundant functions. However, when a
function is frequently asked for, like RL_freeze, we feel it is an
appropriate exception to make.
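One possible way to do this is sketched below. The message strings "unfreeze" and "frozen?" and the reply strings are our own hypothetical convention, and the signatures are simplified assumptions rather than the exact RL-Glue declarations.

```c
#include <assert.h>
#include <string.h>

static int frozen = 0;   /* 1 while learning and exploration are disabled */

/* Called through RL_freeze: stop learning and random behaviour. */
void agent_freeze(void) { frozen = 1; }

/* Hypothetical message protocol handled by the agent. */
const char *agent_message(const char *msg)
{
    assert(msg != NULL);
    if (strcmp(msg, "unfreeze") == 0) {
        frozen = 0;                        /* resume learning */
        return "ok";
    }
    if (strcmp(msg, "frozen?") == 0)
        return frozen ? "true" : "false";  /* report current mode */
    return "unknown message";
}
```

The experiment program would send these strings through RL_agent_message, and the agent's learning and exploration code would check the frozen flag before updating or acting randomly.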
Question: Some combinations of
environments and agents did not seem to work since they assume a
different specification (states specified as an array, single
dimension, etc). For example, the agents I added do not work with
problems with single dimensions, since I added for-loops which assume
an array specification of the observations and actions.
Answer: Although we would all
like to see it, the current state of the art does not offer a single
agent that solves many problems. Most people write agents for particular
tasks. However, we expect that an agent that works with
multi-dimensional continuous state should be able to solve both mountain
car and, say, cart-pole. The RL-Glue fully supports this, with some setup
work required in the agent_init method. That said, it's not likely that
an agent made for a multi-dimensional continuous task will work with a
tabular problem. Reasonable subclasses exist, and the workshop organizers
have done a good job illustrating them in the benchmark announcement.
Take a look at SarsaAgent.cpp. This should work with any tabular
problem. A similar multi-dimensional continuous agent could easily be
extended from TileAgent.cpp.
Question: I used the main.c
file and at the end of a test run I got the following output:
The final average reward is = 0.500000
The final cumulative reward is = -15.000000
These two statements seem inconsistent to me. Since there were
100 episodes, shouldn't the average reward be -15/100? That is
definitely not equal to 0.5. Am I doing something wrong?
Answer: No! The average reward
is over each episode (recomputed for each call of episode). The one
displayed is the average reward from the last episode run. The
cumulative reward is not cleared between episodes. It is simply the sum
of rewards from beginning of experiment to the last step of the last
episode. This output is correct.
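The difference between the two numbers can be illustrated with the sketch below. It assumes the average is taken per step within an episode; run_episode and cumulative_reward are hypothetical helpers, not the routines in RL_util.h.

```c
#include <assert.h>

static double total_reward = 0.0;  /* never cleared between episodes */

/* Feeds one episode's rewards into the running total and returns
   that episode's average reward per step. */
double run_episode(const double *rewards, int steps)
{
    assert(steps > 0);
    double episode_sum = 0.0;      /* cleared for every episode */
    for (int i = 0; i < steps; i++) {
        episode_sum += rewards[i];
        total_reward += rewards[i];
    }
    return episode_sum / steps;
}

/* Sum of all rewards since the start of the experiment. */
double cumulative_reward(void) { return total_reward; }
```

For instance, a four-step episode with reward -4 per step followed by a two-step episode with rewards 1 and 0 gives a last-episode average of 0.5 and a cumulative reward of -15, so the two figures need not agree.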
Question: I want to test after
every 10 training runs (i.e. collect statistics after every tenth
iteration). Can I do this?
Answer: So you want to run 10 episodes,
find the average reward over those 10, run 10 more episodes, find the
average reward over those 10, and so on?
If so, you would have to write a special collection routine for that in
RL_util.h, perhaps called collectStatsSkip(int x, int n), where n is how
often you average (n=10 for you) and x is the episode number. Then
imagine a main_episodic_online.c whose run method calls it after every
episode.
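A minimal sketch of such a run method and collection routine is given below. Only the name collectStatsSkip(int x, int n) comes from the text above; run_one_episode is a dummy stand-in for the real RL-Glue calls, x is assumed to count from zero, and the storage layout (block_means, MAX_BLOCKS) is hypothetical.

```c
#include <assert.h>

#define MAX_BLOCKS 100

static double last_episode_return = 0.0;  /* set by each episode */
static double block_sum = 0.0;            /* return summed over the current block */
static double block_means[MAX_BLOCKS];    /* one mean per finished block */
static int num_blocks = 0;

/* Stand-in for running one episode through RL-Glue; the dummy
   return of episode x is simply x. Replace with the real calls. */
static void run_one_episode(int x)
{
    last_episode_return = (double)x;
}

/* Accumulates the last episode's return and, after every n-th
   episode, stores the mean over the block of n episodes. */
void collectStatsSkip(int x, int n)
{
    assert(n > 0);
    block_sum += last_episode_return;
    if ((x + 1) % n == 0) {
        if (num_blocks < MAX_BLOCKS)
            block_means[num_blocks++] = block_sum / n;
        block_sum = 0.0;                   /* start the next block */
    }
}

/* Runs the experiment, collecting statistics after every episode. */
void run(int num_episodes, int n)
{
    for (int x = 0; x < num_episodes; x++) {
        run_one_episode(x);
        collectStatsSkip(x, n);
    }
}
```

Calling run(100, 10) would leave ten block means in block_means, one per group of ten episodes.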
It should be easy to write collectStatsSkip(int x, int n). RL_util is
well commented and the existing methods (such as collectStats) are very
easy to understand.
The main take-home message here is as follows. All the main files and
the stats routines in RL_util.h are examples of how your experiment
might look. They by no means cover all possible setups. They were
written very clearly so that others could look at them and easily
modify them or create new methods for their own needs. After all,
people want to run experiments their own way and collect data
in their own way, so we believe this is a good approach.