Exploration world



Actions:
    - Up, Down, Right, Left {U,D,R,L}

Observations:
    - up, down, right, left {u,d,r,l}

State:
    - possibly the grid cell number (the agent's position)
    - how hungry/thirsty the agent is
       - the agent has two meters, one for food and one for water, each ranging from, say, 1 to 100. At each timestep the agent receives a reward of -1 for each meter that is below 10 (so -2 if both are). Standing in a reservoir square restores the corresponding meter (say, by 10 for every timestep spent in that square).
    - there could be fire squares, which give a negative reward for every timestep spent in them.
    - there are rewards scattered around the complicated area of the world; each is consumed (disappears) when collected and stays gone until the agent returns to the reservoir (which acts as a reset for the world)
    - to start with, we may only have one reservoir and consumable rewards
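
A minimal sketch of the food/water meter rule described above, to make the numbers concrete. The range 1-100, the threshold of 10, and the refill of 10 per timestep come from the notes. The decay of 1 per timestep, the choice that a meter does not decay while it is being refilled, and the names Meters, step_meters and meter_reward are assumptions made only for illustration.

```python
from dataclasses import dataclass

METER_MIN, METER_MAX = 1, 100   # range suggested in the notes
LOW_THRESHOLD = 10              # penalty applies while a meter is below this
REFILL_PER_STEP = 10            # gain per timestep spent in a reservoir
DECAY_PER_STEP = 1              # ASSUMPTION: the notes give no decay rate


@dataclass
class Meters:
    food: int = METER_MAX
    water: int = METER_MAX


def clamp(value):
    """Keep a meter inside [METER_MIN, METER_MAX]."""
    return max(METER_MIN, min(METER_MAX, value))


def step_meters(meters, reservoir=None):
    """Advance both meters by one timestep.

    reservoir is None, "food" or "water": hypothetical labels for the kind
    of reservoir square the agent is standing in. ASSUMPTION: a meter that
    is being refilled does not also decay on that timestep.
    """
    food_delta = REFILL_PER_STEP if reservoir == "food" else -DECAY_PER_STEP
    water_delta = REFILL_PER_STEP if reservoir == "water" else -DECAY_PER_STEP
    return Meters(clamp(meters.food + food_delta),
                  clamp(meters.water + water_delta))


def meter_reward(meters):
    """-1 for each meter below the threshold, so -2 when both are low."""
    return -int(meters.food < LOW_THRESHOLD) - int(meters.water < LOW_THRESHOLD)
```

For example, starting from Meters(food=12, water=50) and taking three steps outside any reservoir leaves food at 9, at which point meter_reward returns -1.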




IDEAS:
    - get the agent to learn a model without reward and then give it tasks to test the strength of the model it learned (Cosmin)
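
One way the idea above could be prototyped, as a rough sketch rather than a settled design: the agent first wanders with no reward and records the transitions it sees, then is handed a goal and plans using only the recorded model. Everything here is an assumption for illustration: the open square grid with no obstacles, the deterministic moves, random exploration as the learning phase, breadth-first search as the test task, and the names true_step, explore and plan.

```python
import random
from collections import deque

# Actions from the notes, mapped to (row, column) offsets.
ACTIONS = {"U": (-1, 0), "D": (1, 0), "R": (0, 1), "L": (0, -1)}


def true_step(state, action, size):
    """Hypothetical ground truth: move on a size x size grid, stay put at walls."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < size and 0 <= new_col < size:
        return (new_row, new_col)
    return state


def explore(size, steps, seed=0):
    """Phase 1: act randomly with no reward and record observed transitions."""
    rng = random.Random(seed)
    model = {}
    state = (0, 0)
    for _ in range(steps):
        action = rng.choice(sorted(ACTIONS))
        next_state = true_step(state, action, size)
        # The world is deterministic here, so one observation per pair suffices.
        model[(state, action)] = next_state
        state = next_state
    return model


def plan(model, start, goal):
    """Phase 2: breadth-first search through the learned model only."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for action in sorted(ACTIONS):
            next_state = model.get((state, action))
            if next_state is not None and next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, path + [action]))
    return None  # the goal is not reachable with what the model knows
```

A task is then a call such as plan(explore(4, 5000), (0, 0), (3, 3)), and the strength of the model can be judged by whether the returned action sequence actually reaches the goal when executed in the real world.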


board1
board2
board3
board4
board5