Exploration world

Actions:
- Up, Down, Right, Left {U,D,R,L}
Observations:
- up, down, right, left {u,d,r,l}
State:
- possibly the grid cell number (the agent's position)
- how hungry/thirsty the agent is
- the agent has two meters, one for food and one for
water, each ranging from, say, 1 to 100. While either meter is below 10, the
agent receives a reward of -1 at each timestep (-2 if both are below 10).
Stepping into a reservoir state restores the corresponding meter (say it
goes up by 10 for every timestep spent in that square).
- there could be fire grids, which give a negative reward for every timestep spent in them.
- there are rewards scattered around the complicated
area of the world, which disappear once consumed and stay gone until the
agent returns to the reservoir (which acts as a reset for the world)
- to start with, we may only have one reservoir and consumable rewards
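A minimal sketch of the meter and reward dynamics described above, to make the numbers concrete. The constants reuse the "say ..." values from the notes. The grid encoding ('F', 'W', 'X', '#', '.'), the per-step meter decay, the fire penalty magnitude, and the names `AgentState` and `step` are all assumptions made for illustration. Consumable rewards are left out.

```python
from dataclasses import dataclass

# Illustrative constants; the notes only give them as "say ..." values.
METER_MIN, METER_MAX = 1, 100
LOW_THRESHOLD = 10     # a meter below this costs -1 per timestep
REFILL_PER_STEP = 10   # gain per timestep spent on a reservoir
DECAY_PER_STEP = 1     # assumption: the notes do not say how fast meters drain
FIRE_PENALTY = -1      # assumption: the notes only say "negative reward"

MOVES = {"U": (-1, 0), "D": (1, 0), "R": (0, 1), "L": (0, -1)}


@dataclass
class AgentState:
    row: int
    col: int
    food: int = METER_MAX
    water: int = METER_MAX


def step(state, action, grid):
    """Apply one action and return (new_state, reward).

    grid is a list of equal-length strings (hypothetical encoding):
    'F' food reservoir, 'W' water reservoir, 'X' fire, '#' wall, '.' empty.
    """
    dr, dc = MOVES[action]
    r, c = state.row + dr, state.col + dc
    # The agent stays put if the move leaves the grid or hits a wall.
    if not (0 <= r < len(grid) and 0 <= c < len(grid[0])) or grid[r][c] == "#":
        r, c = state.row, state.col
    cell = grid[r][c]

    # A reservoir refills its own meter; otherwise the meter drains.
    if cell == "F":
        food = min(METER_MAX, state.food + REFILL_PER_STEP)
    else:
        food = max(METER_MIN, state.food - DECAY_PER_STEP)
    if cell == "W":
        water = min(METER_MAX, state.water + REFILL_PER_STEP)
    else:
        water = max(METER_MIN, state.water - DECAY_PER_STEP)

    # -1 for each meter below the threshold, so -2 when both are low.
    reward = -int(food < LOW_THRESHOLD) - int(water < LOW_THRESHOLD)
    if cell == "X":
        reward += FIRE_PENALTY
    return AgentState(r, c, food, water), reward
```

For example, with `grid = ["F.X"]`, an agent at column 1 with both meters at 5 that moves right lands in the fire cell with both meters still low, so under these assumptions it collects -2 for the meters plus the fire penalty.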
IDEAS:
- get the agent to learn a model without reward and
then give it tasks to test the strength of the model it learned (Cosmin)
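One possible reading of this idea, as a rough sketch only: the agent wanders randomly with no reward signal, records every transition it sees in a table, and is then tested on a task (reach a goal cell) by planning over that table alone. This assumes a deterministic world in which the agent observes its grid position, which the notes do not specify. The names `true_step`, `explore`, and `plan`, and the use of breadth-first search, are illustrative choices rather than anything proposed in the notes.

```python
import random
from collections import deque

MOVES = {"U": (-1, 0), "D": (1, 0), "R": (0, 1), "L": (0, -1)}


def true_step(pos, action, grid):
    """Ground-truth dynamics: move unless blocked by a wall ('#') or the edge."""
    dr, dc = MOVES[action]
    r, c = pos[0] + dr, pos[1] + dc
    if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != "#":
        return (r, c)
    return pos


def explore(grid, start, n_steps, seed=0):
    """Reward-free phase: random walk that records (state, action) -> next state."""
    rng = random.Random(seed)
    model = {}
    pos = start
    for _ in range(n_steps):
        action = rng.choice("UDRL")
        nxt = true_step(pos, action, grid)
        model[(pos, action)] = nxt
        pos = nxt
    return model


def plan(model, start, goal):
    """Test phase: breadth-first search using only the learned transitions.

    Returns a list of actions, or None if the model knows no route to goal.
    """
    frontier = deque([start])
    parent = {start: None}  # state -> (previous state, action taken)
    while frontier:
        pos = frontier.popleft()
        if pos == goal:
            actions = []
            while parent[pos] is not None:
                pos, action = parent[pos]
                actions.append(action)
            return actions[::-1]
        for action in MOVES:
            nxt = model.get((pos, action))
            if nxt is not None and nxt not in parent:
                parent[nxt] = (pos, action)
                frontier.append(nxt)
    return None
```

The strength of the learned model can then be probed by asking for plans to goals the agent was never rewarded for, and checking whether replaying the plan in the real world actually reaches them.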
board1
board2
board3
board4
board5