Hybrid POMDP Planner -- Combining Offline and Online Techniques.
Eddy C. Borera
Our planner takes advantage of both offline and online techniques. First,
a point-based technique is used offline to compute value functions for a
limited number of belief states; the resulting value functions are then
used to guide online sample-based techniques. Our technique also learns
values for sampled belief states over time to improve these value
functions. We apply Symbolic Perseus by Poupart, which is based on the
original point-based Perseus technique (Spaan and Vlassis 2005), to
compute initial approximate action values.
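To make the offline step concrete, the following minimal Python sketch shows how a point-based value function is typically consumed: a solver such as Perseus or Symbolic Perseus returns a set of alpha-vectors, and the value of any belief is the maximum of their dot products with it. The numpy representation and the function names are illustrative assumptions, not the planner's actual interface.

    import numpy as np

    def point_based_value(belief, alpha_vectors):
        # Each alpha-vector is a |S|-dimensional hyperplane; the value
        # function is their upper envelope, so the value of a belief is
        # the maximum dot product over the set.
        return max(float(alpha @ belief) for alpha in alpha_vectors)

    def guiding_action(belief, alpha_vectors, alpha_actions):
        # The action attached to the maximizing alpha-vector can seed
        # the online search (alpha_actions[i] belongs to alpha_vectors[i]).
        i = max(range(len(alpha_vectors)),
                key=lambda j: float(alpha_vectors[j] @ belief))
        return alpha_actions[i]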
Our technique is similar to the online RTDP-Bel technique (Bonet and
Geffner 1998), except that belief states are discretized differently. We
use a k-nearest-neighbor search to find a suitable stored belief state,
and its value is updated accordingly. During a trial, values for
encountered belief states are stored in hash tables and can be reused in
future trials, so this learning should improve the action values over
time.
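As a rough sketch of this bookkeeping, assuming flat beliefs stored as numpy arrays, grid-rounded hash keys, and Euclidean-distance neighbors (all of these, along with the names value_table, discretize, and knn_estimate, are our illustrative choices rather than the paper's exact scheme):

    import numpy as np

    RESOLUTION = 20
    value_table = {}  # hash table: discretized belief -> learned value

    def discretize(belief):
        # Map a belief to a hashable key so nearby beliefs share an
        # entry; RTDP-Bel rounds to a fixed grid like this, whereas we
        # fall back on a k-nearest-neighbor search for unseen beliefs.
        return tuple(np.round(np.asarray(belief) * RESOLUTION).astype(int))

    def knn_estimate(belief, k=3, default=0.0):
        # Average the values of the k stored beliefs closest to this one.
        if not value_table:
            return default
        def dist(key):
            return np.linalg.norm(np.asarray(key) / RESOLUTION - belief)
        nearest = sorted(value_table, key=dist)[:k]
        return sum(value_table[key] for key in nearest) / len(nearest)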
Also, instead of computing P(z | b, a) exactly at each time step,
observation sampling is used to compute approximate values, which reduces
the online computation time, especially for large problems.
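The sampling shortcut might look like the sketch below: rather than enumerating every observation z weighted by P(z | b, a), we draw (state, successor, observation) triples from the model and average the values of the resulting updated beliefs. The flat-array model layout (T, O, R) and the name sampled_q are assumptions made for illustration.

    rng = np.random.default_rng(0)

    def sampled_q(b, a, T, O, R, V, gamma=0.95, n_samples=30):
        # Monte Carlo estimate of Q(b, a).  T[a][s, s'] is the transition
        # model, O[a][s', z] the observation model, R[s, a] the reward,
        # and V any belief -> value function (the offline alpha-vectors
        # or the learned hash table).
        expected_reward = float(b @ R[:, a])
        total = 0.0
        for _ in range(n_samples):
            s = rng.choice(len(b), p=b)                 # s  ~ b
            s2 = rng.choice(len(b), p=T[a][s])          # s' ~ T(s, a, .)
            z = rng.choice(O[a].shape[1], p=O[a][s2])   # z  ~ O(s', a, .)
            b2 = (b @ T[a]) * O[a][:, z]                # Bayes update for z
            b2 /= b2.sum()
            total += V(b2)
        return expected_reward + gamma * total / n_samples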
During the computation of a Q-value, if a belief state has been visited
before, its stored value is used. If it has not been seen before, it is
initialized with the average value of the k closest stored belief states.
For large problems, computing a policy offline, even for a small set of
belief states, is infeasible. Instead, our technique learns belief state
values online through trials, and the resulting best action is used for
the given current belief state.
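Putting the pieces together, one trial in this style could be sketched as below, reusing the illustrative value_table, discretize, knn_estimate, and sampled_q helpers from the earlier sketches; the greedy backup follows RTDP-Bel, while the step budget, discount, and simulator details are assumptions.

    def learned_value(b):
        # Stored value if this belief has been visited; otherwise the
        # average value of its k closest stored neighbors.
        return value_table.get(discretize(b), knn_estimate(b, k=3))

    def run_trial(b0, n_actions, T, O, R, gamma=0.95, max_steps=50):
        # One online trial: act greedily and back up the best Q-value
        # into the hash table at every visited belief.  The table
        # persists, so later trials start better informed.
        b = np.asarray(b0, dtype=float)
        for _ in range(max_steps):
            qs = [sampled_q(b, a, T, O, R, learned_value, gamma)
                  for a in range(n_actions)]
            a_star = int(np.argmax(qs))
            value_table[discretize(b)] = qs[a_star]     # Bellman backup
            # Simulate the greedy action and advance the belief.
            s = rng.choice(len(b), p=b)
            s2 = rng.choice(len(b), p=T[a_star][s])
            z = rng.choice(O[a_star].shape[1], p=O[a_star][s2])
            b = (b @ T[a_star]) * O[a_star][:, z]
            b /= b.sum()

    def choose_action(b, n_actions, T, O, R, gamma=0.95):
        # At execution time, act greedily with respect to learned values.
        qs = [sampled_q(b, a, T, O, R, learned_value, gamma)
              for a in range(n_actions)]
        return int(np.argmax(qs))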