Solving an MDP with a Known Model
Planning by DP
When the model is fully known, we can use DP to evaluate a policy and then improve it.
DP:
Dynamic Programming is a very general solution method for problems which have two properties:
Optimal substructure
Principle of optimality applies
Optimal solution can be decomposed into subproblems
Overlapping subproblems
Subproblems recur many times
Solutions can be cached and reused
Policy Evaluation:
Using synchronous backups (a backup means an update):
At each iteration k + 1
For all states s ∈ S
Update $v_{k+1}(s)$ from $v_k(s')$:
$$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( R_s^a + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a \, v_k(s') \right)$$
where $s'$ is a successor state of $s$.
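As a rough illustration (not from the original notes), here is a minimal NumPy sketch of iterative policy evaluation; the array layout `P[s, a, s']` (transition probabilities), `R[s, a]` (expected immediate rewards) and `pi[s, a]` (action probabilities) is an assumption made for the example.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation with synchronous backups.

    P[s, a, s'] : transition probabilities, shape (S, A, S)  -- assumed layout
    R[s, a]     : expected immediate reward, shape (S, A)    -- assumed layout
    pi[s, a]    : probability of taking action a in state s, shape (S, A)
    """
    v = np.zeros(P.shape[0])               # v_0 initialised to zero
    while True:
        q = R + gamma * P @ v              # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] v_k(s')
        v_new = (pi * q).sum(axis=1)       # expectation over actions under pi -> v_{k+1}
        if np.max(np.abs(v_new - v)) < theta:
            return v_new                   # approximately v_pi
        v = v_new                          # synchronous: swap in the whole new copy at once
```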
Policy Improvement
Based on the value function $v_\pi$, policy improvement generates a better policy $\pi' \ge \pi$ by acting greedily.
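Continuing the same sketch, a hedged illustration of greedy policy improvement (the array layout is still an assumption):

```python
def policy_improvement(P, R, v, gamma=0.9):
    """Greedy improvement: pi'(s) = argmax_a q(s, a) under the given v."""
    q = R + gamma * P @ v                  # one-step lookahead action values
    greedy = np.argmax(q, axis=1)          # deterministic greedy action per state
    return np.eye(R.shape[1])[greedy]      # encode as a one-hot (S, A) policy matrix
```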
Policy Iteration
Generalized Policy Iteration (GPI) refers to the iterative procedure of interleaving policy evaluation and policy improvement to improve the policy.
Given a policy $\pi$:
Evaluate the policy using policy evaluation
Then improve the policy by acting greedily with respect to $v_\pi$ (policy improvement).
In the case of a deterministic policy, we can improve it by
$$\pi'(s) = \operatorname*{argmax}_{a \in \mathcal{A}} q_\pi(s, a)$$
This improves the value from any state $s$ over one step:
$$q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) \ge q_\pi(s, \pi(s)) = v_\pi(s)$$
This will always lead to the optimal policy: if improvement stops ($\pi' = \pi$), then $v_\pi$ satisfies the Bellman optimality equation, so $\pi$ is optimal.
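Putting the two steps together, a minimal policy-iteration sketch built on the helper functions above (illustrative only):

```python
def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    pi = np.ones((n_states, n_actions)) / n_actions   # start from the uniform random policy
    while True:
        v = policy_evaluation(P, R, pi, gamma)         # evaluate the current policy
        pi_new = policy_improvement(P, R, v, gamma)    # act greedily w.r.t. v_pi
        if np.array_equal(pi_new, pi):                 # greedy policy stopped changing
            return pi, v                               # pi is optimal at this point
        pi = pi_new
```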
Value Iteration
Any optimal policy can be subdivided into two components:
An optimal first action
Followed by an optimal policy from the successor state $s'$
Deterministic Value Iteration
If we know the solution to the subproblems $v_*(s')$,
then the solution $v_*(s)$ can be found by a one-step lookahead:
$$v_*(s) = \max_{a \in \mathcal{A}} \left( R_s^a + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a \, v_*(s') \right)$$
We apply this update to every state until convergence.
Start with the final rewards and work backwards
Using synchronous backups
At each iteration k + 1
For all states s ∈ S
Update $v_{k+1}(s)$ from $v_k(s')$:
$$v_{k+1}(s) = \max_{a \in \mathcal{A}} \left( R_s^a + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a \, v_k(s') \right)$$
Once we have the optimal value function $v_*$, we can extract a deterministic optimal policy using:
$$\pi_*(s) = \operatorname*{argmax}_{a \in \mathcal{A}} \left( R_s^a + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a \, v_*(s') \right)$$
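A corresponding value-iteration sketch, again assuming the same hypothetical `P` and `R` arrays:

```python
def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Apply the Bellman optimality backup until convergence, then extract pi*."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v              # one-step lookahead for every (s, a)
        v_new = q.max(axis=1)              # v_{k+1}(s) = max_a q(s, a)
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    pi_star = np.argmax(R + gamma * P @ v, axis=1)   # deterministic optimal policy
    return v, pi_star
```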
Extensions to DP:
Synchronous: synchronous value iteration stores two copies of the value function, $v_k$ and $v_{k+1}$
Asynchronous
In-place DP: stores only one copy of the value function and updates it directly (a sketch follows after this list)
Prioritised sweeping
Real-time dynamic programming
Sample backups: use sample rewards and sample transitions $\langle S, A, R, S' \rangle$ instead of the reward function $R$ and the transition dynamics $P$. Sample is the keyword here.
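As mentioned for in-place DP above, here is a sketch of asynchronous, in-place value iteration that keeps a single value array (same assumed array layout):

```python
def in_place_value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Asynchronous (in-place) backups: one value array, updated state by state."""
    v = np.zeros(P.shape[0])
    while True:
        delta = 0.0
        for s in range(P.shape[0]):        # sweep states, reusing the freshest values immediately
            v_s = np.max(R[s] + gamma * P[s] @ v)
            delta = max(delta, abs(v_s - v[s]))
            v[s] = v_s                     # overwrite in place; no second copy of v
        if delta < theta:
            return v
```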
Convergence of Policy and Value Iteration
Convergence of both algorithms can be proved using the contraction mapping theorem.
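The key step, sketched here for the Bellman optimality backup $T^*$ (the argument for the expectation backup $T^\pi$ is analogous), is that the backup is a $\gamma$-contraction in the $\infty$-norm:
$$\| T^* u - T^* v \|_\infty
= \max_{s} \Big| \max_{a} \big( R_s^a + \gamma \sum_{s'} P_{ss'}^a u(s') \big)
            - \max_{a} \big( R_s^a + \gamma \sum_{s'} P_{ss'}^a v(s') \big) \Big|
\le \gamma \max_{s,a} \sum_{s'} P_{ss'}^a \, \big| u(s') - v(s') \big|
\le \gamma \, \| u - v \|_\infty$$
By the contraction mapping (Banach fixed-point) theorem, repeated application of the backup therefore converges to a unique fixed point: $v_*$ for the optimality backup, or $v_\pi$ for the expectation backup.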