Reinforcement Learning and Artificial Intelligence (RLAI)
Price, B. and Boutilier, C. "Accelerating Reinforcement Learning through Implicit Imitation",  Journal of Artificial Intelligence Research  (JAIR) 2003, Volume 19, pages 569-629
 
Author: Anna October, 2004


Abstract:

      
Imitation can be viewed as a means of enhancing learning in multi-agent environments. It augments an agent's ability to learn useful behaviors by making intelligent use of the knowledge implicit in behaviors demonstrated by cooperative teachers or other more experienced agents. We propose and study a formal model of implicit imitation that can accelerate reinforcement learning dramatically in certain cases. Roughly, by observing a mentor, a reinforcement-learning agent can extract information about its own capabilities in, and the relative value of, unvisited parts of the state space. We study two specific instantiations of this model, one in which the learning agent and the mentor have identical abilities, and one designed to deal with agents and mentors with different action sets. We illustrate the benefits of implicit imitation by integrating it with prioritized sweeping, and demonstrating improved performance and convergence through observation of single and multiple mentors. Though we make some stringent assumptions regarding observability and possible interactions, we briefly comment on extensions of the model that relax these restrictions.

Keywords:
Reinforcement Learning, Implicit imitation, Markov Decision Processes, Approximate Policy Iteration.
 

Bibtex:

@article{IIBob03,
    Author = {Bob Price and Craig Boutilier},
    Journal = {Journal of Artificial Intelligence Research},
    Pages = {569--629},
    Title = {Accelerating Reinforcement Learning Through Implicit Imitation},
    Volume = {19},
    Year = {2003}}


Comments:

* I think having access to the mentor's trajectories is more reasonable and practical than having access to the values the mentor has learned.

* Assuming the agent already has the reward function is a huge assumption. I would like to see the part where this constraint is relaxed. Why is this more natural?

* The reward function could also be learned, but I think the reason they did not do so is their use of the Dirichlet model.

* I agree that requiring the mentor's actions to be provided is a stronger assumption, but in practice it is not that hard to record them.

* Equ. 7 can easily be extended to the case where the mentor's actions are available.

* What is an example of a case where the mentor causes the agent to overestimate values?
Answer: A stochastic environment in which there is a slight chance of getting something really good from one state, and the mentor, by luck, experienced only that transition.

* Using confidence intervals makes the algorithm less likely to be misled by mistaken advice from mentors, but it also adds computational cost for the agent, especially in the case where the reward function is not given. It also introduces a new parameter, c, that needs to be tuned!

* In Equ. 10, what happens if the mentor never visited certain successor states from a given state?

* It is important to not use optimistic initialization when using imitation methods!

* What is the epsilon-greedy definition in this paper?

* Basically, the agent either has the same action as the mentor or it does not, but a smooth decision criterion might help in cases where the actions are similar but not exactly the same ... (Alborz)
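
The overestimation example and the confidence-interval comment above can be tied together with a small numerical sketch. This is not the paper's exact procedure: the function names, the symmetric Dirichlet pseudo-count `prior`, the state values (100 and 0), and the observation counts are all assumptions chosen for illustration. The sketch compares a lower bound (posterior mean minus c standard deviations) on the mentor-derived value against the same bound for the agent's own well-sampled action.

```python
import math

def ml_value(counts, values):
    """Maximum-likelihood estimate of sum_t Pr(t) * V(t) from raw counts."""
    total = sum(counts)
    return sum(n * v for n, v in zip(counts, values)) / total

def dirichlet_value_stats(counts, values, prior=1.0):
    """Posterior mean and std of sum_t Pr(t) * V(t) under a Dirichlet.

    counts[t] is the number of observed transitions to successor t;
    prior is a symmetric pseudo-count (an assumption of this sketch).
    """
    alphas = [n + prior for n in counts]
    total = sum(alphas)
    mean = sum(a * v for a, v in zip(alphas, values)) / total
    second = sum(a * v * v for a, v in zip(alphas, values)) / total
    # Variance of a linear function of a Dirichlet-distributed vector.
    var = max(second - mean * mean, 0.0) / (total + 1.0)
    return mean, math.sqrt(var)

def lower_bound(counts, values, c, prior=1.0):
    """Pessimistic value estimate: posterior mean minus c standard deviations."""
    mean, std = dirichlet_value_stats(counts, values, prior)
    return mean - c * std

def prefer_mentor(mentor_counts, own_counts, values, c, prior=1.0):
    """Trust the mentor-derived backup only if its lower bound is higher."""
    return (lower_bound(mentor_counts, values, c, prior)
            > lower_bound(own_counts, values, c, prior))

# Successor values (hypothetical): a rare jackpot state and an ordinary state.
values = [100.0, 0.0]
# The mentor was observed once, and that one transition hit the jackpot.
mentor_counts = [1, 0]
# The agent's own action has been tried 100 times (5 jackpots).
own_counts = [5, 95]

if __name__ == "__main__":
    print("mentor ML estimate:", ml_value(mentor_counts, values))
    print("own ML estimate:   ", ml_value(own_counts, values))
    for c in (1.0, 2.0, 3.0):
        print("c =", c, "prefer mentor:",
              prefer_mentor(mentor_counts, own_counts, values, c))
```

With these assumed numbers, the single lucky observation makes the maximum-likelihood estimate of the mentor-derived value far larger than the agent's own estimate. Whether the lower-bound test discounts it depends on c: a small c still follows the mentor and a larger c does not, which is the tuning burden raised in the comment on confidence intervals.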