In this, the first equation represents the model: the joint probability distribution of the reward and the next state given the current state and the action taken. The expected reward given a state and an action, R(s,a), is calculated by summing over all possible values of r and s' and weighting each reward by its probability of occurring. Note: we can collapse the double summation when calculating R(s,a) because R: S×A×S → R, i.e. the reward is completely determined by s, a, s'. Once s, a, s' are fixed there is only one value of r to sum over, so the summation over r can be removed.
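A minimal sketch of this computation, assuming the model is stored as a dictionary mapping (s, a) to a list of (s', r, probability) triples (the names model and expected_reward are made up for illustration):

```python
# Hypothetical tabular model: model[(s, a)] is a list of (s_next, r, prob)
# triples, i.e. the joint distribution p(s', r | s, a).
model = {
    ("s0", "a0"): [("s1", 1.0, 0.7), ("s2", 0.0, 0.3)],
    ("s0", "a1"): [("s1", 0.5, 1.0)],
}

def expected_reward(model, s, a):
    """R(s, a): sum over (s', r) of p(s', r | s, a) * r.

    Because r is fully determined by (s, a, s'), each s' appears with a
    single r, so the 'double' sum is effectively a single sum over s'."""
    return sum(prob * r for (_s_next, r, prob) in model[(s, a)])

print(expected_reward(model, "s0", "a0"))  # 0.7 * 1.0 + 0.3 * 0.0 = 0.7
```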
In this, first ask: why is the expectation needed when the policy is fixed?
Ans: Even if the policy is fixed, the transition between states given the current state and action is still probabilistic, and the policy itself may choose actions stochastically. If we want to calculate the return Gt, we have to weight the reward from each state according to the probability of transition and the probability of each action, which is determined by the policy. Hence, to weight the rewards by both the transition probabilities and the policy, we have to take the expectation (see the sketch below).
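A small illustration of this weighting, assuming a stochastic policy table pi, transition table P and reward table R (all names and numbers here are hypothetical): even with the policy fixed, the one-step reward is a weighted average over actions and next states.

```python
# Hypothetical one-state example: fixed stochastic policy, stochastic transitions.
pi = {"s0": {"a0": 0.6, "a1": 0.4}}                  # pi(a | s)
P  = {("s0", "a0"): {"s1": 0.9, "s2": 0.1},          # P(s' | s, a)
      ("s0", "a1"): {"s1": 0.2, "s2": 0.8}}
R  = {("s0", "a0", "s1"): 1.0, ("s0", "a0", "s2"): 0.0,
      ("s0", "a1", "s1"): 1.0, ("s0", "a1", "s2"): 0.0}

def expected_one_step_reward(s):
    """Weight each reward by pi(a | s) * P(s' | s, a) -- this weighting over
    both the policy and the transitions is exactly why the expectation is needed."""
    return sum(pi[s][a] * p_next * R[(s, a, s_next)]
               for a in pi[s]
               for s_next, p_next in P[(s, a)].items())

print(expected_one_step_reward("s0"))  # 0.6*0.9*1 + 0.4*0.2*1 = 0.62
```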
Note: when $\mathbb{E}[V_\pi(s')] = \sum_{s' \in S} P^{\pi}_{ss'} V_\pi(s')$ is used, only $s'$ is treated as a random variable (the policy has already been averaged into $P^{\pi}_{ss'}$). But when $\mathbb{E}[V_\pi(s')] = \sum_{a \in A} \sum_{s' \in S} \pi(a \mid s) P^{a}_{ss'} V_\pi(s')$ is used, both $a$ and $s'$ are random variables and the summation is taken over both.
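A quick numerical check of this equivalence, using a made-up 2-state, 2-action MDP (all arrays below are illustrative, not taken from the notes):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP.
pi = np.array([[0.5, 0.5],            # pi(a | s): rows = states, cols = actions
               [0.9, 0.1]])
P = np.array([[[0.8, 0.2],            # P[a, s, s'] = P(s' | s, a)
               [0.3, 0.7]],
              [[0.1, 0.9],
               [0.6, 0.4]]])
V = np.array([1.0, 2.0])              # some value function V_pi(s')

# Form 1: average the actions out first to get P^pi_{ss'}, then sum over s' only.
P_pi = np.einsum("sa,ast->st", pi, P)     # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
form1 = P_pi @ V

# Form 2: treat both a and s' as random and sum over both.
form2 = np.einsum("sa,ast,t->s", pi, P, V)

print(np.allclose(form1, form2))  # True: both forms give the same expectation
```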
π∗ is the optimal policy. This policy has the maximum value function for all states. For any MDP there always exists a deterministic optimal policy, which we can obtain from the optimal action-value function: the action that gives the maximum value in a state gets probability 1 and all other actions get probability 0.
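For example, assuming a made-up optimal action-value table Q*, the deterministic greedy policy can be read off like this:

```python
import numpy as np

# Hypothetical optimal action-value table Q*[s, a] (numbers made up).
Q_star = np.array([[1.0, 3.0],
                   [2.5, 0.5]])

# Deterministic optimal policy: the argmax action gets probability 1,
# every other action gets probability 0.
pi_star = np.zeros_like(Q_star)
pi_star[np.arange(Q_star.shape[0]), Q_star.argmax(axis=1)] = 1.0

print(pi_star)
# [[0. 1.]
#  [1. 0.]]
```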
The Bellman optimality equations are non-linear, and unlike the Bellman expectation equations no closed-form solution exists. Hence it is difficult to solve for optimality exactly, so we make use of other, iterative methods to find the optimal state-value and action-value functions.
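One such iterative method is value iteration; here is a minimal sketch (the tensor layout P[a, s, s'] and R[a, s, s'] and the function name value_iteration are assumptions of this snippet, not notation used above):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Minimal value-iteration sketch for the Bellman optimality equation.

    P[a, s, s'] = P(s' | s, a), R[a, s, s'] = reward for that transition.
    Returns an approximation of the optimal state-value function V*."""
    V = np.zeros(P.shape[1])
    while True:
        # Q(s, a) = sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
        Q = np.einsum("ast,ast->as", P, R) + gamma * np.einsum("ast,t->as", P, V)
        V_new = Q.max(axis=0)          # the non-linear max is why no closed form exists
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Example usage with made-up 2-action, 2-state tensors:
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])
R = np.ones_like(P)                    # reward of 1 for every transition
print(value_iteration(P, R))           # converges to 1 / (1 - gamma) = 10 for all states
```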
So, replacing X by Gt (conditioned on St) and Y by At gives the same equation as the law of total expectation, written out below.
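Written out, this is just the law of total expectation restated for the value function (no new result here):

$$\mathbb{E}[X] = \mathbb{E}_Y\big[\,\mathbb{E}[X \mid Y]\,\big]$$

$$V_\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \sum_{a \in A} \pi(a \mid s)\,\mathbb{E}[G_t \mid S_t = s, A_t = a] = \sum_{a \in A} \pi(a \mid s)\, Q_\pi(s, a)$$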
Make sure that you understand the V and Q equations even without the law; it is just common sense, the same idea as the law of total probability: you divide an event into mutually exclusive conditioned events and weight them according to their probability of occurring.
Policy
Stationary: These policies depend only upon the current state and do not change with time.
Non-Stationary: Here the policy may change with time, so a different action may be chosen at different times given the same current state (a minimal sketch of both follows).
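A minimal sketch of the distinction; both policies here are made up for illustration:

```python
def stationary_policy(state):
    """Depends only on the current state; the mapping never changes with time."""
    return "left" if state == "s0" else "right"

def non_stationary_policy(state, t):
    """Depends on the time step as well: the same state can map to different
    actions at different times."""
    return "left" if (state == "s0" and t < 10) else "right"

print(stationary_policy("s0"))          # always 'left'
print(non_stationary_policy("s0", 3))   # 'left'
print(non_stationary_policy("s0", 20))  # 'right' for the same state later on
```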