> For the complete documentation index, see [llms.txt](https://theshank.gitbook.io/ai/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://theshank.gitbook.io/ai/reinforcement-learning/solving-mdp-with-known-model.md).

# Solving MDP with known Model

When model is fully known, we use DP to evaluate the model and then improve the policy.&#x20;

### DP:

Dynamic Programming is a very general solution method for problems which have two properties:&#x20;

* Optimal substructure
  * &#x20;Principle of optimality applies&#x20;
  * Optimal solution can be decomposed into subproblems&#x20;
* Overlapping subproblems&#x20;
  * Subproblems recur many times&#x20;
  * Solutions can be cached and reused

### Policy Evaluation:

* Using **synchronous backups**, (**backup means update**)
  * At each iteration k + 1&#x20;
  * For all states s ∈ S&#x20;
  * Update $$V\_{k+1}(s)$$ from $$V\_k(s')$$&#x20;
  * where $$s'$$ is a successor state of s

$$
V\_{k+1}(s) = E\[r+\gamma V\_k(s') | S\_t=s] = \sum\_a\pi(a|s)\sum\_{s',r}P(s',r|s,a)(r+\gamma V\_k(s'))
$$

### **Policy Improvement**

Based on the value functions, Policy Improvement generates a better policy π′≥ππ′≥π by acting greedily.

$$
Q\_π(s,a)=𝔼\[R\_{t+1}+γV\_π(S\_{t+1})|S\_t=s,A\_t=a]=∑\_{s′,r}P(s',r|s,a)(r+γV\_π(s'))
$$

$$
\pi' = greedy(v\_{\pi}) \\
\pi'(s) = arg \max\_{a \in A} q\_\pi(s,a)
$$

### Policy Iteration

The Generalized Policy Iteration (GPI) algorithm refers to an iterative procedure to improve the policy when combining policy evaluation and improvement.

Given a policy $$\pi$$ :

* Evaluate the policy  $$\pi$$ using **policy evaluation**
* Then perform policy improvement to improve the policy.\
  &#x20;$$\pi' = greedy(V\_\pi)$$&#x20;
  * In case of deterministic policy we can do, $$\pi'(s) = arg \max\_{a \in A} q\_\pi(s,a)$$&#x20;
  * This improves the value from any state s over one step,\
    &#x20;$$q\_\pi(s, \pi'(s)) = \max\_{a\in A} q\_\pi(s,a) \geq q\_\pi(s, \pi(s)) = v\_\pi(s)$$&#x20;

**This will always lead to the optimal policy**  $$\pi^\*$$

### **Value Iteration**

Any optimal policy can be subdivided into two components:&#x20;

* An optimal first action $$A\_\*$$&#x20;
* Followed by an optimal policy from successor state S'

#### Deterministic Value Iteration

* If we know the solution to subproblems $$v\_\*(s')$$&#x20;
* Then solution to can be found by one-step lookahead\
  &#x20;$$v\_*(s) = \max\_{a \in A}(R\_s^a + \gamma \sum\_{s' \in S}P\_{ss'}^aV\_*(s'))$$
* we do this for every state until convergence&#x20;
* Start with the final rewards and work backwards

Using **synchronous backups**&#x20;

* At each iteration k + 1&#x20;
* For all states s ∈ S&#x20;
* Update $$v\_{k+1}(s)$$ from $$v\_k(s')$$&#x20;

Once we get the optimal value function, we can find the deterministic optimal policy using following:

$$
\pi^\*(s) = arg \max\_{a \in A} \sum\_{s'\in S}P\_{ss'}^a
$$

### Extension to DP:

* Synchronous\
  Synchronous value iteration stores two copies of value function\
  $$v\_{new}(s) = \max\_{a \in A}(R\_s^a + \gamma \sum\_{s' \in S}P\_{ss'}^av\_{old}(s')) \v\_{old} = v\_{new}$$
* Asynchronous
  * In-place DP\
    This one just one value\
    $$v\_{*}(s) = \max\_{a \in A}(R\_s^a + \gamma \sum\_{s' \in S}P\_{ss'}^av\_{*}(s'))$$
  * Prioritised sweeping
  * Real-time dynamic programming
* Sample Backups\
  Using **sample rewards and sample transitions** {S, A, R, S\`}Instead of reward function R and transition dynamics P. **Sample is the keyword here.**&#x20;

### Convergence of Policy or Value Iteration.&#x20;

This can be proved using contraction mapping theorem.&#x20;


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://theshank.gitbook.io/ai/reinforcement-learning/solving-mdp-with-known-model.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
