Deep RL policies have parameters θ, so our objective is to find the θ that maximizes the total reward collected over an episode.
Note: here we are not considering discounted rewards.
Finite & Infinite Horizon Objective Function
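Concretely, the standard undiscounted objectives can be written as below (notation introduced here for clarity: p_θ(τ) is the distribution over trajectories induced by the policy, and p_θ(s, a) is the stationary state-action distribution discussed in the next subsection):

θ* = argmax_θ E_{τ ~ p_θ(τ)} [ Σ_{t=1}^{T} r(s_t, a_t) ]    (finite horizon)

θ* = argmax_θ E_{(s,a) ~ p_θ(s,a)} [ r(s, a) ]    (infinite horizon, average reward under the stationary distribution)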
Stationary Distribution: Markov Chains
A stationary distribution of a Markov chain is a probability distribution that remains unchanged as the chain progresses in time. Typically, it is represented as a row vector π whose entries are probabilities summing to 1, and given a transition matrix P, it satisfies
π = πP
Once the chain is in a stationary distribution, it stays in that distribution. The stationary distribution is a vector containing, for each state, the probability of being in that state.
From the equation above, a stationary distribution is a left eigenvector of the transition matrix P with eigenvalue 1.
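To make this concrete, here is a minimal NumPy sketch (with a made-up 3-state transition matrix) that recovers the stationary distribution as the eigenvector of Pᵀ with eigenvalue 1:

```python
import numpy as np

# Hypothetical 3-state transition matrix; each row sums to 1.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
])

# π = πP  <=>  πᵀ is an eigenvector of Pᵀ with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, idx])
pi /= pi.sum()                # normalize so the probabilities sum to 1

print(pi)                     # the stationary distribution
print(pi @ P)                 # equals pi: once stationary, always stationary
```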
Expectation makes objective function smooth
Taking the expectation of a function makes it smoother, allowing it to be differentiated with respect to the parameters. Look at the example above: the reward function itself is non-differentiable, but its expectation under the policy is differentiable, which makes gradient-based learning feasible for RL.
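As a toy illustration of this point (a made-up example, not the one from the notes above): with a binary action and a step-function reward, the reward itself has no useful gradient, but its expectation under a Bernoulli(θ) policy is a smooth function of θ:

```python
def reward(a):
    # Step-function reward: non-differentiable in the action.
    return 1.0 if a == 1 else -1.0

def expected_reward(theta):
    # E_{a ~ Bernoulli(theta)}[reward(a)] = theta*1 + (1 - theta)*(-1)
    # This is linear in theta, hence differentiable.
    return theta * reward(1) + (1 - theta) * reward(0)

def grad_expected_reward(theta):
    # d/dtheta E[reward] = reward(1) - reward(0) = 2, a well-defined gradient
    # (constant here because the expectation is linear in theta).
    return reward(1) - reward(0)

print([round(expected_reward(t), 2) for t in (0.0, 0.25, 0.5, 0.75, 1.0)])
# Varies smoothly from -1.0 to 1.0
```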
Types of RL Algos
Model-Based RL Algo
Value-Based RL Algo
Direct Policy Gradients
Actor-Critic
Trade-Offs
Sample Efficiency
Off-Policy: Able to improve the policy without generating new samples from that policy
On-Policy: each time the policy is changed, even a little bit, we need to generate new samples
Conventional policy gradient methods are on-policy: new samples are generated with the updated policy after every change (see the sketch at the end of this subsection).
Actor-critic methods can be either on-policy or off-policy, depending on the details.
Model-based methods are more sample-efficient, which is intuitive: having a model reduces the number of environment samples needed.
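To illustrate the on-policy point above, here is a rough REINFORCE-style sketch (a made-up one-step, two-action problem with a softmax policy): after every parameter update, fresh samples must be drawn from the current policy, and old samples are thrown away:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)               # logits of a 2-action softmax policy (toy setup)
alpha = 0.1                       # learning rate

def policy_probs(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def sample_step(theta):
    # One-step "episode": sample an action, observe a reward.
    p = policy_probs(theta)
    a = rng.choice(2, p=p)
    r = 1.0 if a == 1 else 0.0    # assumed reward: action 1 is better
    return a, r

for _ in range(100):
    # On-policy: these samples are only valid for the CURRENT theta;
    # after the update below they must be regenerated.
    grads = []
    for _ in range(16):
        a, r = sample_step(theta)
        p = policy_probs(theta)
        grad_logp = -p
        grad_logp[a] += 1.0       # gradient of log softmax probability of action a
        grads.append(grad_logp * r)
    theta += alpha * np.mean(grads, axis=0)

print(policy_probs(theta))        # probability mass shifts toward action 1
```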
Convergence and Stability
Model-based RL methods use gradient descent to fit the best model, i.e., we are fitting the model, but nowhere are we directly maximizing the reward.
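As a rough illustration of that point (made-up data, assumed linear dynamics model): a typical model-fitting objective regresses next states from (state, action) pairs, and the reward never appears in the loss, so driving this loss to zero does not by itself maximize return:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(100, 4))        # states        (made-up data)
A = rng.normal(size=(100, 2))        # actions       (made-up data)
S_next = rng.normal(size=(100, 4))   # next states   (made-up data)

X = np.hstack([S, A])                            # model input: (s, a)
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)   # fit s' ≈ [s, a] W

model_loss = np.mean((X @ W - S_next) ** 2)      # pure prediction error; no reward term
print(model_loss)
```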