Reinforcement Learning and Artificial Intelligence (RLAI)
RL-Glue Task Specification Language


Edited by Leah Hackman, June 19, 2007

The ambition of this page is to present a specific proposal for a language for describing tasks -- agent-environment interfaces -- to be used as a Task_specification in calls from env_init and to agent_init in the RL-framework.


Task Description Language

To give the agent writer simple and concise information about the environment, a Task_specification is passed from the environment, through the interface, to the agent. The environment's init method (env_init) encodes information about the problem in an ASCII string. The string is then passed to the agent's init method (agent_init). This information can also be used to check that the agent and environment are suitable for each other. A few example Task_specifications are provided below.

The agent is responsible for parsing any relevant information out of the Task_specification in its init method. A generic Task_specification parsing function is provided with RL-Glue 2.0 for all C/C++ users. This simple parser returns a structure containing the information encoded in the Task_specification, such as the Observation and Action dimensions, arrays of Observation and Action variable ranges, and arrays of Observation and Action variable types. More information about the parser and the parsed task spec struct can be found here.



Task_specification


The Task_specification is stored as a string with the following format:

        "V:E:O:A:R"

For example, this is a sample Task_specification, taken from the examples below:

        "2:e:1_[i]_[0,N-1]:1_[i]_[0,3]:[-1,0]"

V is the version number of the task specification language. E is the type of task being solved: it has a character value of 'e' if the task is episodic and 'c' if the task is continuing. O and A carry the Observation and Action information respectively. Finally, R is the range of rewards for the task. Each of O, A and R can include a range. If a bound is unknown it may be left empty, and if it is infinite in magnitude the special values inf and -inf may be used.
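As a rough illustration of the top-level layout, the following Python sketch splits a Task_specification into its five fields. The function name split_task_spec and the dictionary keys are hypothetical and not part of RL-Glue. The sketch assumes that no field contains a colon, which holds for the format described on this page.

```python
def split_task_spec(spec):
    """Split a "V:E:O:A:R" string into its five top-level fields.

    Hypothetical helper for illustration; not the RL-Glue parser.
    """
    fields = spec.split(":")
    if len(fields) != 5:
        raise ValueError("expected 5 colon-separated fields, got %d" % len(fields))
    version, task_type, observations, actions, rewards = fields
    if task_type not in ("e", "c"):
        raise ValueError("task type must be 'e' or 'c', got %r" % task_type)
    return {
        "version": version,            # kept as text, e.g. "2" or "2.0"
        "episodic": task_type == "e",  # False means a continuing task
        "observations": observations,  # still unparsed: "1_[i]_[0,N-1]"
        "actions": actions,            # still unparsed: "1_[i]_[0,3]"
        "rewards": rewards,            # still unparsed: "[-1,0]"
    }


parts = split_task_spec("2:e:1_[i]_[0,N-1]:1_[i]_[0,3]:[-1,0]")
print(parts["observations"])  # 1_[i]_[0,N-1]
```

The O, A and R fields are left as strings here, since their internal format is described separately.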

The formats of O and A are identical, so we describe the form of O only. O contains three components, separated by underscore characters ("_"):

        #dimensions_dimensionTypes_dimensionRanges

#dimensions is an integer value specifying the number of dimensions in the Observation space. dimensionTypes is a list specifying the type of each dimension variable. The dimensionTypes list is composed of #dimensions components separated by comma characters (",") within square brackets ([x1,x2,x3,...,xn], where xi represents the ith value). Each comma-separated value in the list describes the type of values assigned to the corresponding Observation variable in the environment. In general, Observation variables can have one of the following 2 types:

        'i' - integer value
        'f' - float value

Thus a dimensionTypes list corresponding to an Observation space with 1 dimension has the following form:

       [a]
where a is an element of ['i','f']
 
So a dimensionTypes list with one integer value would be:         [i]

An Observation space with 2 dimensions would have a dimensionTypes with the following form:

       [a,b]
where a and b are elements of ['i','f']

indicating the value type of the first (a) and second (b) Observation variables. Thus a three dimensional Observation with one float dimension, one integer dimension and another float dimension would have the following dimensionTypes:

       [f,i,f]
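A minimal sketch of how an agent might read a dimensionTypes list in Python is shown below. The name parse_dimension_types and its expected_count parameter are assumptions made for illustration; they do not come from the RL-Glue parser.

```python
def parse_dimension_types(text, expected_count=None):
    """Turn a dimensionTypes list such as "[f,i,f]" into ['f', 'i', 'f'].

    Hypothetical helper; expected_count, if given, is the #dimensions value.
    """
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("dimensionTypes must be enclosed in square brackets")
    types = [t.strip() for t in text[1:-1].split(",")]
    for t in types:
        if t not in ("i", "f"):
            raise ValueError("unknown dimension type %r" % t)
    if expected_count is not None and len(types) != expected_count:
        raise ValueError("expected %d types, got %d" % (expected_count, len(types)))
    return types


print(parse_dimension_types("[f,i,f]", expected_count=3))  # ['f', 'i', 'f']
```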

The dimensionRanges is a list specifying the range of each dimension variable in the Observation space. The dimensionRanges is composed of #dimensions components separated by underscore characters. Each dimensionRanges component specifies the lower and upper bound of the values of one Observation variable. If a bound is unknown or unspecified, you can leave an empty space in place of its value. If a bound is positive or negative infinity, you can use inf or -inf to represent it. These can be used in combination. For example, one valid range could be an unknown lower bound and an infinite upper bound, or a lower bound of -inf and an upper bound of 1. You can be as precise as you wish, but whatever you do specify must be accurate. A dimensionRanges corresponding to an Observation space with a single dimension variable would have the following form:

        [O1MIN, O1MAX]

So a dimensionRanges list for a binary dimension variable would be: [0,1]
A dimensionRanges list for a variable with both bounds unspecified would be: [] or [,] (both are valid).
A dimensionRanges with one or two unbounded values can use inf or -inf for those bounds, e.g. [0,inf] or [-inf,1] or [-inf,inf].

An Observation space with 2 dimensions would have a dimensionRanges with the following form:

        [O1MIN, O1MAX]_[O2MIN, O2MAX]

indicating the minimal and maximal values of Observation variables O1 and O2 respectively. This definition can be then trivially extended to Observation spaces with N dimensions.

NOTE: the dimensionRanges of an Observation space with 1 or more unbounded values may not be representable in this way. An unbounded value has no minimal or maximal value. Thus, we simply do not specify the range in the dimensionRanges for any Observation variables with unbounded values. For example, consider a problem with 3 Observation dimensions. The first and third Observation variables have interval values and the second has an unbounded ratio value. The corresponding dimensionRanges for this problem is encoded as:

       [O1MIN, O1MAX]_[,]_[O3MIN, O3MAX]

indicating the minimal and maximal values of Observation variables O1 and O3.
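The range rules above (empty for unspecified, inf and -inf for infinite bounds, underscores between dimensions) could be handled along the following lines. This is only a sketch: parse_bound, parse_range and parse_dimension_ranges are hypothetical names, unspecified bounds are represented as None by choice, and all numeric bounds are read as Python floats, even for integer variables. Symbolic placeholders such as N-1 or O1MIN are not numbers and would raise a ValueError.

```python
def parse_bound(text):
    """Return None for an unspecified bound, otherwise a float.

    Python's float() accepts "inf" and "-inf" directly.
    """
    text = text.strip()
    if text == "":
        return None
    return float(text)


def parse_range(text):
    """Parse one range such as "[0,1]", "[,]", "[]" or "[-inf,1]"."""
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("a range must be enclosed in square brackets")
    inner = text[1:-1].strip()
    if inner == "":
        return (None, None)  # "[]" means both bounds are unspecified
    pieces = inner.split(",")
    if len(pieces) != 2:
        raise ValueError("a range needs exactly two bounds: %r" % text)
    return (parse_bound(pieces[0]), parse_bound(pieces[1]))


def parse_dimension_ranges(text):
    """Parse underscore-separated ranges, one per dimension."""
    return [parse_range(piece) for piece in text.split("_")]


print(parse_dimension_ranges("[0,1]_[,]_[-inf,inf]"))
# [(0.0, 1.0), (None, None), (-inf, inf)]
```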

The format of A (Action space information) is identical to that of O (Observation space information) and thus the definitions above hold for Action spaces.
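Because O and A share one format, a single routine can serve both. The sketch below is one possible way to parse a whole #dimensions_dimensionTypes_dimensionRanges component in Python. The name parse_space and the layout of the returned dictionary are assumptions made for this example; they are not the structure returned by the RL-Glue C/C++ parser.

```python
def parse_space(text):
    """Parse an O or A component, e.g. "2_[f,f]_[-1.2,0.5]_[-.07,.07]".

    Hypothetical helper. Unspecified bounds become None; inf and -inf
    become the corresponding Python floats.
    """
    pieces = text.split("_")
    if len(pieces) < 2:
        raise ValueError("expected #dimensions_dimensionTypes_dimensionRanges")
    dimensions = int(pieces[0])
    types = [t.strip() for t in pieces[1].strip().strip("[]").split(",")]
    ranges = []
    for piece in pieces[2:]:
        inner = piece.strip().strip("[]")
        low, _, high = inner.partition(",")
        ranges.append(tuple(float(b) if b.strip() else None for b in (low, high)))
    # Both lists must agree with the declared number of dimensions.
    if len(types) != dimensions or len(ranges) != dimensions:
        raise ValueError("type or range count does not match #dimensions")
    return {"dimensions": dimensions, "types": types, "ranges": ranges}


observations = parse_space("2_[f,f]_[-1.2,0.5]_[-.07,.07]")
actions = parse_space("1_[i]_[0,2]")
print(observations["ranges"])  # [(-1.2, 0.5), (-0.07, 0.07)]
print(actions["types"])        # ['i']
```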

Lastly, R (Reward space information) is merely a range specifier. By the Reward Hypothesis there is only ever one reward signal (which in RL-Glue is always a floating point number), so the #dimensions and dimensionTypes information would be meaningless. The reward range can again be specified to be unknown or infinite in the same manner as the Observation ranges. A rewardRange has the following form:
   
    [rewardMin, rewardMax]

For a task whose rewards are always -1 or 0, the rewardRange would be: [-1,0].
If no lower bound was known and the upper bound was positive infinity, the rewardRange would be: [,inf]



Example Task_specifications


Consider a simple gridworld with Actions North, South, East and West and a single-dimension Observation of grid position. If we encode actions as 0, 1, 2, 3 and position as an integer between 0 and N-1, we get the following Task_specification:

        "2:e:1_[i]_[0,N-1]:1_[i]_[0,3]:[-1,0]"

This Task_specification provides the following information:

        - RL-Glue version 2 supported
        - the task is episodic
        - Observation space has one dimension
        - the Observation variable has integer values (discrete state)
        - range of Observation variable is 0 to N-1
        - Action space has one dimension
        - the Action variable has integer values (discrete actions >> tabular)
        - range of Action variable is 0 to 3
        - range of the rewards is -1 to 0

For a more complex illustration of the expressiveness of the Task_specification language, consider the Mountain Car problem. The Actions available to the agent are full throttle reverse, zero throttle, and full throttle forward. The Observation consists of the car's position and velocity. If we encode Actions as 0, 1, and 2 respectively and position and velocity as real values with finite ranges, we get the following Task_specification:

        "2.0:e:2_[f,f]_[-1.2,0.5]_[-.07,.07]:1_[i]_[0,2]:[-1,0]"

This Task_specification provides the following information:

        - RL-Glue version 2.0 supported
        - the task is episodic
        - Observation space has two dimensions
        - the first Observation variable has float values
        - the second Observation variable has float values (2D continuous state)
        - range of Observation variable one is -1.2 to 0.5
        - range of Observation variable two is -0.07 to 0.07
        - Action space has one dimension
        - the Action variable has integer values (discrete actions)
        - range of Action variable is 0 to 2
        - range of the Rewards is -1 to 0
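Going the other way, an environment's env_init could assemble such a string from structured values. The Python sketch below is one way to do that under stated assumptions: format_bound, format_space and format_task_spec are hypothetical names, None stands for an unspecified bound, and numbers are written with Python's default formatting, so -0.07 is emitted rather than the shorter -.07 used above.

```python
def format_bound(value):
    """Write a bound as text; None (unspecified) becomes an empty string."""
    return "" if value is None else str(value)


def format_space(types, ranges):
    """Build "#dimensions_dimensionTypes_dimensionRanges" for O or A."""
    if len(types) != len(ranges):
        raise ValueError("need exactly one range per dimension type")
    type_list = "[" + ",".join(types) + "]"
    range_list = "_".join(
        "[%s,%s]" % (format_bound(low), format_bound(high)) for low, high in ranges
    )
    return "%d_%s_%s" % (len(types), type_list, range_list)


def format_task_spec(version, episodic, obs_types, obs_ranges,
                     act_types, act_ranges, reward_range):
    """Assemble the full "V:E:O:A:R" string."""
    reward = "[%s,%s]" % (format_bound(reward_range[0]), format_bound(reward_range[1]))
    return ":".join([
        version,
        "e" if episodic else "c",
        format_space(obs_types, obs_ranges),
        format_space(act_types, act_ranges),
        reward,
    ])


# Mountain Car, using the values listed above.
spec = format_task_spec("2.0", True,
                        ["f", "f"], [(-1.2, 0.5), (-0.07, 0.07)],
                        ["i"], [(0, 2)],
                        (-1, 0))
print(spec)  # 2.0:e:2_[f,f]_[-1.2,0.5]_[-0.07,0.07]:1_[i]_[0,2]:[-1,0]
```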


A final example from a fabricated problem:

        "2.0:e:2_[i,f]_[,]_[-inf,inf]:1_[i]_[0,2]:[-1,0]"

This Task_specification provides the following information:

        - RL-Glue version 2.0 supported
        - the task is episodic
        - Observation space has two dimensions
        - the first Observation variable has integer values
        - the second Observation variable has float values
        - range of Observation variable one is unspecified
        - range of Observation variable two is unbounded
        - Action space has one dimension
        - the Action variable has integer values (discrete actions)
        - range of Action variable is 0 to 2
        - range of the Rewards is -1 to 0
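As noted earlier, the Task_specification can be used to check that an agent and an environment are suitable for each other. The following Python sketch shows one such check for a hypothetical tabular agent that needs every Observation and Action variable to be an integer with known, finite bounds. All function names are illustrative, and the gridworld string uses N = 25 purely as an assumed value so that the range is numeric.

```python
import math


def parse_space(text):
    """Return (types, ranges) for an O or A component; None = unspecified bound."""
    pieces = text.split("_")
    types = [t.strip() for t in pieces[1].strip("[] ").split(",")]
    ranges = []
    for piece in pieces[2:]:
        low, _, high = piece.strip("[] ").partition(",")
        ranges.append(tuple(float(b) if b.strip() else None for b in (low, high)))
    return types, ranges


def is_finite_discrete(types, ranges):
    """True if every variable is an integer with known, finite bounds."""
    for kind, (low, high) in zip(types, ranges):
        if kind != "i" or low is None or high is None:
            return False
        if math.isinf(low) or math.isinf(high):
            return False
    return True


def tabular_agent_can_handle(spec):
    """Suitability check a hypothetical tabular agent might run in agent_init."""
    version, task_type, observations, actions, rewards = spec.split(":")
    return (is_finite_discrete(*parse_space(observations))
            and is_finite_discrete(*parse_space(actions)))


# Gridworld with an assumed N = 25: integer observations and actions, all bounded.
print(tabular_agent_can_handle("2:e:1_[i]_[0,24]:1_[i]_[0,3]:[-1,0]"))  # True
# The fabricated example: a float observation and unspecified ranges.
print(tabular_agent_can_handle("2.0:e:2_[i,f]_[,]_[-inf,inf]:1_[i]_[0,2]:[-1,0]"))  # False
```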