MDP Equations

About the value and action-value equations

Relation between state, observation and action in temporal domain

Step of an episode: (s, a, s', r). When we are in state s, we take action a, arrive in the next state s', and obtain reward r.

Model: Transition and Reward

P(s', r\;|\;s,a) = P[S_{t+1}=s', R_{t+1}=r\;|\;S_t = s, A_t=a]\\
P_{ss'}^a = P[S_{t+1}=s'\;|\; S_t=s, A_t=a] = \sum_{r \in R}P(s', r\;|\;s,a)\\
R(s,a) = E[R_{t+1}\;|\;S_t=s, A_t=a] = \sum_{r \in R}\sum_{s' \in S}P(s', r\;|\;s,a)\times r = \sum_{s' \in S}P_{ss'}^a\times R(s,a,s')

Here, the first equation is the model: the joint probability distribution of the reward and the next state, given the current state and the action taken. The expected reward given the state and action is calculated by summing over all possible values of r and s', weighting each by its probability of occurring. Note: we can drop the summation over r in R(s,a) because R: SxAxS -> R, i.e. the reward is completely determined by (s, a, s'); once s, a, s' are fixed there is only a single value of r to sum over, which gives the last expression.
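As a rough illustration (the array layout, names, and the Dirichlet-sampled toy model below are assumptions, not part of the text), the two marginalizations above can be written directly with NumPy:

```python
import numpy as np

# Hypothetical layout: P_joint[s, a, s2, i] = P(S_{t+1}=s2, R_{t+1}=rewards[i] | S_t=s, A_t=a)
n_states, n_actions = 3, 2
rewards = np.array([0.0, 1.0])                       # the possible reward values r
P_joint = np.random.dirichlet(
    np.ones(n_states * len(rewards)), size=(n_states, n_actions)
).reshape(n_states, n_actions, n_states, len(rewards))

# P_{ss'}^a = sum_r P(s', r | s, a): marginalize the reward out
P_trans = P_joint.sum(axis=3)                        # shape (S, A, S')

# R(s, a) = sum_{s'} sum_r P(s', r | s, a) * r
R_sa = (P_joint * rewards).sum(axis=(2, 3))          # shape (S, A)
```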

\pi(a|s) = P[A_t = a\;|\; S_t=s]\\
R_\pi (s) = \sum_{a\in A}\pi (a|s)R(s,a)\\
P_{ss'}^\pi = \sum_{a\in A}\pi(a|s)P_{ss'}^a

The above equations follow from the law of total probability.
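Continuing the same toy setup (the uniform placeholder arrays below are assumptions), the policy-averaged reward and transition matrix are just weighted sums over actions:

```python
import numpy as np

# Placeholder model and policy arrays; shapes: P_trans (S, A, S'), R_sa (S, A), pi (S, A)
n_states, n_actions = 3, 2
P_trans = np.full((n_states, n_actions, n_states), 1.0 / n_states)
R_sa = np.ones((n_states, n_actions))
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # pi(a|s)

# R_pi(s) = sum_a pi(a|s) R(s,a)
R_pi = (pi * R_sa).sum(axis=1)                        # shape (S,)

# P^pi_{ss'} = sum_a pi(a|s) P(s'|s,a)
P_pi = np.einsum('sa,sax->sx', pi, P_trans)           # shape (S, S')
```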

Let's start with the value and action-value functions for RL.

G_t = R_{t+1} + \gamma R_{t+2} + \dots \quad (1)\\
V_\pi (s) = E[G_t\;|\;S_t = s] \quad (2)\\
Q_\pi (s,a) = E[G_t\;|\;S_t = s, A_t = a] \quad (3)

First, ask why the expectation is needed when the policy is fixed. Answer: even a fixed policy only gives the probability of each action in the current state, and the environment's model gives the probability of each next state given the state and action taken. So, to calculate the return G_t, we have to weight the reward from each state according to the transition probabilities and the action probabilities determined by the policy. Taking the expectation does exactly this weighting.
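To make the expectation concrete, here is a minimal sketch assuming a made-up two-state, two-action MDP: even under a fixed policy, sampled returns differ across episodes, and V_\pi(s) is estimated by averaging them.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
# Made-up 2-state, 2-action MDP
P_trans = np.array([[[0.8, 0.2], [0.2, 0.8]],         # P(s'|s,a)
                    [[0.5, 0.5], [0.9, 0.1]]])
R_sa = np.array([[1.0, 0.0], [0.0, 2.0]])             # R(s,a)
pi = np.array([[0.5, 0.5], [0.5, 0.5]])               # fixed stochastic policy pi(a|s)

def sample_return(s, horizon=200):
    """Sample one (truncated) discounted return G_t starting from state s under pi."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(2, p=pi[s])                    # action drawn from the policy
        g += discount * R_sa[s, a]
        discount *= gamma
        s = rng.choice(2, p=P_trans[s, a])            # next state drawn from the model
    return g

# Different episodes give different returns; V_pi(0) is approximately their average
returns = [sample_return(0) for _ in range(2000)]
v0_estimate = np.mean(returns)
```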

Bellman Equation

V_\pi(s) = \sum_{a\in A}\pi(a|s)Q_\pi (s,a)\\
Q_\pi (s,a) = R_s^a + \gamma \sum_{s'\in S}P_{ss'}^aV_\pi(s')

These equations are derived using the law of total expectation (also called the law of iterated expectation); see the derivation in the Law of Iterative Expectation section below.

On solving

V_\pi(s) = \sum_{a\in A}\pi(a|s)Q_\pi (s,a)\\
= \sum_{a\in A}\pi(a|s)\left(R_s^a + \gamma \sum_{s'\in S}P_{ss'}^aV_\pi(s')\right)\\
= \sum_{a\in A}\pi(a|s)R_s^a + \gamma\sum_{s'\in S}\sum_{a\in A}\pi(a|s) P_{ss'}^aV_\pi(s')\\
= R_s^\pi + \gamma\sum_{s'\in S}P_{ss'}^\pi V_\pi(s')\\
= R_s^\pi + \gamma E[V_\pi(s')]

Note: when E[V_\pi(s')] = \sum_{s'\in S}P_{ss'}^\pi V_\pi(s') is used, s' is treated as the only random variable. But when E[V_\pi(s')] = \sum_{a \in A}\sum_{s'\in S}\pi(a|s) P_{ss'}^a V_\pi(s') is used, both a and s' are random variables and the summation runs over both.

Q_\pi(s,a) = R_s^a + \gamma\sum_{s' \in S}P_{ss'}^aV_\pi(s')\\
= R_s^a + \gamma\sum_{s' \in S}P_{ss'}^a\sum_{a' \in A}\pi(a'|s')Q_\pi(s', a')\\
= R_s^a + \gamma\sum_{s' \in S}\sum_{a' \in A}P_{ss'}^a\times\pi(a'|s')\times Q_\pi(s', a')\\
= R_s^a + \gamma E[Q_\pi(s',a')]
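Because the Bellman expectation equation is linear in V_\pi, a small tabular MDP can be evaluated exactly. A sketch under the same made-up model numbers as before: V_\pi from a linear solve, then Q_\pi from V_\pi.

```python
import numpy as np

gamma = 0.9
# Same made-up 2-state, 2-action model as before
P_trans = np.array([[[0.8, 0.2], [0.2, 0.8]],
                    [[0.5, 0.5], [0.9, 0.1]]])
R_sa = np.array([[1.0, 0.0], [0.0, 2.0]])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])

# Policy-averaged reward and transition matrix
R_pi = (pi * R_sa).sum(axis=1)
P_pi = np.einsum('sa,sax->sx', pi, P_trans)

# V_pi = R_pi + gamma * P_pi @ V_pi  =>  (I - gamma * P_pi) V_pi = R_pi
V_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)

# Q_pi(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) V_pi(s')
Q_pi = R_sa + gamma * np.einsum('sax,x->sa', P_trans, V_pi)
```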

Optimality

\pi^* = \arg\max_\pi V_\pi(s) = \arg\max_\pi Q_\pi(s,a) \;\;\forall s,a\\
V_{\pi^*}(s) = \max_\pi V_\pi(s) \;\;\forall s\\
Q_{\pi^*}(s,a) = \max_\pi Q_\pi(s,a) \;\;\forall s,a

\pi^* is the optimal policy. This policy has the maximum value function for all states. There always exists a deterministic optimal policy for an MDP, which we can find using the optimal action-value function: in each state, the action that provides the maximum value gets probability 1 and all other actions get probability 0.
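A minimal sketch of extracting this deterministic greedy policy, assuming a hypothetical table Q_star[s, a] of optimal action-values is already available:

```python
import numpy as np

# Hypothetical optimal action-value table Q_star[s, a]
Q_star = np.array([[1.2, 3.4],
                   [2.0, 0.5]])

# Deterministic greedy policy: probability 1 on the argmax action, 0 elsewhere
best_actions = Q_star.argmax(axis=1)
pi_star = np.zeros_like(Q_star)
pi_star[np.arange(Q_star.shape[0]), best_actions] = 1.0
```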

Bellman Optimality equations

V_*(s) = \max_a Q_*(s,a)\\
Q_*(s,a) = R_s^a + \gamma \sum_{s'\in S}P_{ss'}^a V_*(s')

The Bellman optimality equations are non-linear, and unlike the Bellman expectation equations they have no closed-form solution. Hence it is difficult to solve for optimality exactly, and we make use of other iterative methods, such as value iteration or policy iteration, to find the optimal state-value and action-value functions.
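One such method is value iteration, which repeatedly applies the Bellman optimality backup until the values stop changing. A minimal sketch with the same made-up model numbers as before:

```python
import numpy as np

gamma = 0.9
# Same made-up 2-state, 2-action model as before
P_trans = np.array([[[0.8, 0.2], [0.2, 0.8]],
                    [[0.5, 0.5], [0.9, 0.1]]])
R_sa = np.array([[1.0, 0.0], [0.0, 2.0]])

V = np.zeros(2)
for _ in range(10_000):
    # Q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')
    Q = R_sa + gamma * np.einsum('sax,x->sa', P_trans, V)
    V_new = Q.max(axis=1)                             # V(s) = max_a Q(s,a)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
```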

Law of Iterative Expectation

E_Y[E[X|Y]] = \sum_{y\in Y}E[X|Y=y]P_Y(y) = \sum_{y\in Y}\left(\sum_{x\in X}x\,P_X(x|Y=y)\right)P_Y(y)\\
= \sum_{x\in X}x\left(\sum_{y\in Y}P_X(x|Y=y)P_Y(y)\right) = \sum_{x\in X}x\,P_X(x)\\
= E[X]
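A quick numerical sanity check of this identity (the joint distribution of X and Y below is arbitrary, chosen only for illustration):

```python
import numpy as np

x_vals = np.array([0.0, 1.0, 2.0])
P_xy = np.array([[0.10, 0.20],                        # arbitrary joint P(X=x, Y=y)
                 [0.25, 0.15],
                 [0.05, 0.25]])                       # rows index x, columns index y

P_y = P_xy.sum(axis=0)                                # P_Y(y)
E_x_given_y = (x_vals[:, None] * P_xy).sum(axis=0) / P_y   # E[X | Y=y]

lhs = (E_x_given_y * P_y).sum()                       # E_Y[ E[X|Y] ]
rhs = (x_vals[:, None] * P_xy).sum()                  # E[X]
assert np.isclose(lhs, rhs)
```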

Now let's see how this is used to establish the relationship between the state-value and action-value functions.

V(s) = E[G_t|S_t=s]\\
Q(s,a) = E[G_t|S_t=s, A_t=a]

V(s) = \sum_{a \in A}\pi(a|s)Q_\pi(s,a)\\
E[G_t|S_t=s] = \sum_{a\in A}P(a|S_t=s)E[G_t|S_t=s, A_t=a]\\
E[X] = \sum_{y \in Y}P_Y(y)E[X|Y=y]\\
E[X] = E_Y[E[X|Y]]

So, replacing X by G_t\,|\,S_t and Y by A_t gives the same equation as the law of total expectation.

Make sure you understand the V and Q relation even without the law; it is just common sense, the same idea as in the law of total probability: you split an event into mutually exclusive conditioned events and weight each according to its probability of occurring.

Policy

Stationary: the policy depends only on the current state and doesn't change with time. Non-stationary: the policy may change with time, so a different action may be chosen at different time steps given the same current state.
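As a rough illustration (the functions below are invented), a stationary policy can be written as a function of the state alone, while a non-stationary policy also depends on the time step:

```python
def stationary_policy(state):
    # Depends only on the current state; the same state always maps to the same action
    return 0 if state == 0 else 1

def non_stationary_policy(state, t):
    # Also depends on the time step t, so the same state may get different actions over time
    return 0 if t % 2 == 0 else 1
```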
