Download
Here
First Author HomePage
Abstract:
We propose a new approach to the problem of searching a space of
policies for a Markov decision process (MDP) or a partially observable
Markov decision process (POMDP), given a model. Our approach is based
on the following observation: Any (PO)MDP can be transformed into an
“equivalent” POMDP in which all state transitions (given the current
state and action) are deterministic. This reduces the general problem
of policy search to one in which we need only consider POMDPs with
deterministic transitions. We give a natural way of estimating the
value of all policies in these transformed POMDPs. Policy search is
then simply performed by searching for a policy with high estimated
value. We also establish conditions under which our value estimates
will be good, recovering theoretical results similar to those of
Kearns, Mansour and Ng [7], but with “sample complexity” bounds that
have only a polynomial rather than exponential dependence on the
horizon time. Our method applies to arbitrary POMDPs, including ones
with infinite state and action spaces. We also present empirical
results for our approach on a small discrete problem, and on a complex
continuous state/continuous action problem involving learning to ride a
bicycle.
Keywords: reinforcement learning, policy
search, MDP, POMDP