Solving an MDP with a Known Model
Planning by DP
When the model is fully known, we can use DP to evaluate a policy and then improve it.
DP:
Dynamic Programming is a very general solution method for problems which have two properties:
Optimal substructure
Principle of optimality applies
Optimal solution can be decomposed into subproblems
Overlapping subproblems
Subproblems recur many times
Solutions can be cached and reused
Policy Evaluation:
Using synchronous backups (a backup means an update):
At each iteration k + 1
For all states s ∈ S
Update $v_{k+1}(s)$ from $v_k(s')$, where s′ is a successor state of s, using the Bellman expectation backup:
$v_{k+1}(s) = \sum_{a \in A} \pi(a \mid s) \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_k(s') \right)$
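As a concrete illustration, here is a minimal sketch of iterative policy evaluation with synchronous backups. It assumes the model is given as NumPy arrays P[s, a, s'] (transition probabilities) and R[s, a] (expected rewards); these names and shapes are assumptions for the example, not part of the notes.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation with synchronous backups.

    Assumed shapes: P[s, a, s'] transition probabilities, R[s, a] expected
    rewards, policy[s, a] probability of taking action a in state s.
    """
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        # Bellman expectation backup:
        # v_{k+1}(s) = sum_a pi(a|s) * ( R_s^a + gamma * sum_s' P_ss'^a * v_k(s') )
        V_new = np.einsum('sa,sa->s', policy, R + gamma * P @ V)
        if np.max(np.abs(V_new - V)) < theta:   # stop when the value function has converged
            return V_new
        V = V_new
```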
Policy Improvement
Based on the value function, Policy Improvement generates a better policy π′ ≥ π by acting greedily.
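A minimal sketch of the greedy improvement step, reusing the assumed P and R arrays from the evaluation sketch above; the greedy policy simply takes the argmax of the one-step action values.

```python
import numpy as np

def greedy_improvement(P, R, V, gamma=0.9):
    """Act greedily w.r.t. V: pi'(s) = argmax_a ( R_s^a + gamma * sum_s' P_ss'^a * V(s') )."""
    Q = R + gamma * P @ V          # action values, shape (n_states, n_actions)
    return np.argmax(Q, axis=1)    # deterministic greedy action for each state
```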
Policy Iteration
Generalized Policy Iteration (GPI) refers to the iterative procedure of improving a policy by combining policy evaluation and policy improvement.
Given a policy π :
Evaluate the policy π using policy evaluation
Then perform policy improvement by acting greedily: $\pi' = \text{greedy}(v_\pi)$
For a deterministic policy this means $\pi'(s) = \arg\max_{a \in A} q_\pi(s, a)$
This improves the value from any state s over one step: $q_\pi(s, \pi'(s)) = \max_{a \in A} q_\pi(s, a) \geq q_\pi(s, \pi(s)) = v_\pi(s)$
Repeating evaluation and greedy improvement will always lead to the optimal policy π∗
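Putting the two steps together gives a sketch of policy iteration, reusing the policy_evaluation and greedy_improvement helpers from the sketches above (all names are illustrative assumptions, not from the notes).

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)         # arbitrary initial deterministic policy
    while True:
        # represent the deterministic policy as a (states x actions) probability table
        policy_probs = np.eye(n_actions)[policy]
        V = policy_evaluation(P, R, policy_probs, gamma)
        new_policy = greedy_improvement(P, R, V, gamma)
        if np.array_equal(new_policy, policy):     # greedy policy unchanged => optimal
            return policy, V
        policy = new_policy
```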
Value Iteration
Any optimal policy can be subdivided into two components:
An optimal first action A∗
Followed by an optimal policy from successor state S'
Deterministic Value Iteration
If we know the solution to the subproblems $v_*(s')$
Then the solution $v_*(s)$ can be found by a one-step lookahead: $v_*(s) = \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_*(s') \right)$
We apply this update to every state and repeat until convergence
Start with the final rewards and work backwards
Using synchronous backups
At each iteration k + 1
For all states s ∈ S
Update $v_{k+1}(s)$ from $v_k(s')$
Once we have the optimal value function, we can extract a deterministic optimal policy with $\pi_*(s) = \arg\max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_*(s') \right)$
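A minimal sketch of synchronous value iteration with the greedy policy extraction described above, again assuming the model arrays P[s, a, s'] and R[s, a] used in the earlier sketches.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Synchronous value iteration: v_{k+1}(s) = max_a ( R_s^a + gamma * sum_s' P_ss'^a * v_k(s') )."""
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * P @ V              # one-step lookahead, shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    # extract a deterministic optimal policy from the converged value function
    Q = R + gamma * P @ V_new
    return V_new, np.argmax(Q, axis=1)
```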
Extensions to DP:
Synchronous: synchronous value iteration stores two copies of the value function: $v_{\text{new}}(s) = \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_{\text{old}}(s') \right)$, then $v_{\text{old}} \leftarrow v_{\text{new}}$
Asynchronous
In-place DP: stores just one copy of the value function and updates it directly: $v_*(s) \leftarrow \max_{a \in A} \left( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_*(s') \right)$ (a code sketch appears just after this list)
Prioritised sweeping
Real-time dynamic programming
Sample Backups: use sample rewards and sample transitions ⟨S, A, R, S′⟩ instead of the reward function R and transition dynamics P. Sample is the keyword here.
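To make the synchronous vs. in-place distinction concrete, here is a small sketch of in-place value iteration: a single value array is overwritten state by state, so later updates within the same sweep already see the newer values (same assumed P and R arrays as above).

```python
import numpy as np

def in_place_value_iteration(P, R, gamma=0.9, theta=1e-8):
    """In-place DP: keep a single value array and update it state by state."""
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # the one-step lookahead for state s uses the current, partially updated V
            v_new = np.max(R[s] + gamma * P[s] @ V)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                   # overwrite immediately, no v_old copy kept
        if delta < theta:
            return V
```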
Convergence of Policy or Value Iteration.
This can be proved using the contraction mapping theorem: the Bellman backup operators are γ-contractions in the ∞-norm, so repeated application converges to a unique fixed point.
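As a brief sketch of the argument (using the same notation as the backups above): the Bellman optimality backup $T_*$ is a γ-contraction in the ∞-norm, so by the Banach fixed-point theorem repeated application converges to its unique fixed point $v_*$; the Bellman expectation backup contracts in the same way, which covers policy evaluation.

$$\| T_* u - T_* v \|_\infty = \max_s \Big| \max_{a} \big( R_s^a + \gamma \sum_{s'} P_{ss'}^a u(s') \big) - \max_{a} \big( R_s^a + \gamma \sum_{s'} P_{ss'}^a v(s') \big) \Big| \le \gamma \max_{s,a} \sum_{s'} P_{ss'}^a \big| u(s') - v(s') \big| \le \gamma \, \| u - v \|_\infty$$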