1 OR II GSLM 52800


3 Policy and Action
- policy: the rules specifying what to do in every state
- action: what to do at a given state, as dictated by the policy
- examples:
  - policy "replacement only at state 3": do nothing at states 0, 1, and 2; replace at state 3
  - policy "overhaul at state 2 and replacement at state 3": do nothing at states 0 and 1; overhaul at state 2; replace at state 3
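The two example policies above can be written down as plain state-to-action mappings. A minimal sketch; the numeric action codes are an illustrative choice, not taken from the slides:

```python
# Action codes are hypothetical labels, chosen only for this sketch.
DO_NOTHING, OVERHAUL, REPLACE = 0, 1, 2

# policy: replacement only at state 3
policy_a = {0: DO_NOTHING, 1: DO_NOTHING, 2: DO_NOTHING, 3: REPLACE}
# policy: overhaul at state 2 and replacement at state 3
policy_b = {0: DO_NOTHING, 1: DO_NOTHING, 2: OVERHAUL, 3: REPLACE}

# the action at a state is whatever the policy dictates
assert policy_b[2] == OVERHAUL
```

A policy is thus a complete rule covering all states, while an action is just the single entry the policy prescribes at one state.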

4 Expected Reward
- p_ij(k) = the probability of moving from state i to state j when action k is taken
- q_ij(k) = the expected cost incurred at state i when action k is taken and the state changes to j
- C_ik = the expected cost at state i when action k is taken
(diagram: transition from state i to state j with probability p_ij(k))
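C_ik follows from p_ij(k) and q_ij(k) by the standard identity C_ik = Σ_j p_ij(k) q_ij(k), which the slide implies but does not write out. A small sketch with hypothetical numbers:

```python
# C_ik = sum_j p_ij(k) * q_ij(k): the expected one-step cost of taking
# action k at state i, averaged over the next state j.
def expected_cost(p_row, q_row):
    """p_row[j] = p_ij(k), q_row[j] = q_ij(k), for fixed i and k."""
    return sum(p * q for p, q in zip(p_row, q_row))

# hypothetical data: two reachable states j with costs 4 and 8
assert expected_cost([0.25, 0.75], [4.0, 8.0]) == 7.0
```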

5 Definition of Variables
- R: a policy
- g(R) = the long-term average cost per unit time under policy R
- objective: find the policy R that minimizes g(R)
- v_i(R) = the effect on the total expected cost of adopting policy R and starting at state i

7 Key Result in Policy Improvement
- for a fixed policy R taking action k at state i, the values satisfy
    g(R) = C_ik + Σ_j p_ij(k) v_j(R) - v_i(R),  i = 0, 1, ..., M
- M+1 equations, M+2 unknowns
- g(R) = the long-term average cost of policy R
- v_i(R) = the effect on the total expected cost when adopting policy R and starting at state i

8 Idea of Policy Improvement
- the equations determine the v_i(R) only up to an additive constant: if {v_i} is a solution, so is {v_i + c}
- the set of equations can therefore be solved by arbitrarily setting v_M(R) = 0
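The invariance above can be checked directly: adding a constant c to every v_j leaves the quantity C_ik + Σ_j p_ij(k) v_j - v_i unchanged, because the p_ij(k) sum to 1 over j. A sketch with hypothetical numbers:

```python
# relative cost of action k at state i, given values v
def relative_cost(C_ik, p_row, v, i):
    return C_ik + sum(p * vj for p, vj in zip(p_row, v)) - v[i]

v = [2.0, -1.0, 3.0]
shifted = [vj + 10.0 for vj in v]          # v_j + c, with c = 10
base = relative_cost(5.0, [0.2, 0.3, 0.5], v, 0)
# the shift cancels: +c enters once via sum_j p_ij(k) (v_j + c) and
# leaves once via -(v_i + c)
assert abs(relative_cost(5.0, [0.2, 0.3, 0.5], shifted, 0) - base) < 1e-9
```

This is exactly why fixing v_M(R) = 0 is harmless: it merely selects one representative from the family of solutions.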

9 Idea of Policy Improvement
- given policy R taking action k at state i, suppose there exists a policy R_o taking actions k_o such that
    C_{i,k_o} + Σ_j p_ij(k_o) v_j(R) - v_i(R) ≤ g(R) for every state i,
  with strict inequality for at least one state
- then it can be shown that g(R_o) < g(R)

10 Policy Improvement
1. Value Determination: fix policy R; set v_M(R) = 0 and solve
     g(R) = C_ik + Σ_j p_ij(k) v_j(R) - v_i(R),  i = 0, 1, ..., M
2. Policy Improvement: for each state i, find the action k that minimizes
     C_ik + Σ_j p_ij(k) v_j(R) - v_i(R)
3. Form a new policy from the actions found in step 2. Stop if this policy is the same as R; otherwise go to step 1.
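The three steps above can be sketched as code (Howard's policy-iteration algorithm for long-run average cost). The 2-state transition matrices P and cost table C at the bottom are hypothetical illustration data, not the machine-replacement example from the slides:

```python
import numpy as np

def value_determination(policy, P, C):
    """Step 1: with v_M(R) fixed at 0, solve
       g(R) = C_ik + sum_j p_ij(k) v_j(R) - v_i(R),  i = 0..M.
       Unknowns: v_0, ..., v_{M-1} and g.  Returns (g, v)."""
    n = len(policy)                    # number of states, M + 1
    A = np.zeros((n, n))
    b = np.zeros(n)
    for i, k in enumerate(policy):
        # equation i:  v_i + g - sum_{j<M} p_ij(k) v_j = C_ik
        if i < n - 1:
            A[i, i] += 1.0
        A[i, :n-1] -= P[k][i, :n-1]    # terms with j = M vanish (v_M = 0)
        A[i, n-1] = 1.0                # coefficient of g
        b[i] = C[i, k]
    x = np.linalg.solve(A, b)
    return x[n-1], np.append(x[:n-1], 0.0)

def policy_iteration(P, C):
    n_states, n_actions = C.shape
    policy = [0] * n_states            # arbitrary starting policy
    while True:
        g, v = value_determination(policy, P, C)
        # step 2: for each state, pick the action minimizing
        #         C_ik + sum_j p_ij(k) v_j - v_i
        new_policy = []
        for i in range(n_states):
            costs = [C[i, k] + P[k][i] @ v - v[i] for k in range(n_actions)]
            best = int(np.argmin(costs))
            if costs[policy[i]] <= costs[best] + 1e-12:
                best = policy[i]       # keep the current action on ties
            new_policy.append(best)
        # step 3: stop when the policy repeats, else iterate
        if new_policy == policy:
            return policy, g
        policy = new_policy

# hypothetical data: 2 states, 2 actions (0 = do nothing, 1 = replace)
P = [np.array([[0.9, 0.1],            # action 0: the machine drifts
               [0.1, 0.9]]),
     np.array([[1.0, 0.0],            # action 1: always back to state 0
               [1.0, 0.0]])]
C = np.array([[1.0, 6.0],             # rows: states, columns: actions
              [8.0, 6.0]])
policy, g = policy_iteration(P, C)    # -> policy [0, 1], g = 16/11
```

Keeping the current action on ties in step 2 is what guarantees termination; without it the policy could cycle between equally good alternatives.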

11 Idea of Policy Improvement
- it can be proven that:
  - g is non-increasing from iteration to iteration
  - R is optimal if the policy does not change
  - the algorithm stops after a finite number of iterations

17 Example
- Iteration 2: value determination yields no change in policy, so the policy "do nothing at states 0 and 1, overhaul at state 2, replace at state 3" is optimal