Deep Reinforcement Learning
A deep RL policy has parameters $\theta$, so our objective is to find the $\theta$ such that the total reward collected over the states of an episode is maximized.
Note: Here we are not considering the discounted reward.
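For reference, the undiscounted finite-horizon objective is commonly written as follows (the notation $\tau$, $p_\theta(\tau)$, $r(s_t, a_t)$ is the standard one and is assumed here, since the formula itself is not spelled out above):

```latex
% Undiscounted, finite-horizon RL objective (standard notation, assumed)
\theta^{*} = \arg\max_{\theta} \;
  \mathbb{E}_{\tau \sim p_{\theta}(\tau)}
  \left[ \sum_{t=1}^{T} r(s_t, a_t) \right]
```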
Once you are in a stationary distribution, you remain in the stationary distribution. A stationary distribution is a vector whose entries give the probability of being in each state. From the equation above, a stationary distribution is an eigenvector of the transition matrix P with eigenvalue 1, normalized so that its entries sum to 1.
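A small numerical check of both claims, using a hypothetical 3-state transition matrix (the matrix values and variable names are illustrative, not from the notes):

```python
import numpy as np

# Hypothetical row-stochastic transition matrix: P[i, j] = Pr(next state = j | current state = i)
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
])

# Stationary distribution = left eigenvector of P with eigenvalue 1
# (left eigenvectors of P are the eigenvectors of P^T).
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))   # pick the eigenvalue closest to 1
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()                        # normalize so probabilities sum to 1

print("stationary pi:", pi)
print("pi @ P       :", pi @ P)           # unchanged: once stationary, it stays stationary
```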
Taking the expectation of a function makes it smoother, allowing it to be differentiated with respect to the parameters. Look at the example above: the reward function is non-differentiable, but its expectation is differentiable, which makes gradient-based learning feasible for RL.
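As a concrete one-dimensional illustration (this example is assumed, not taken from the notes): with a Bernoulli policy and a step-function reward, the reward itself is not differentiable, but its expectation is a smooth function of $\theta$:

```latex
% Illustrative example (assumed): Bernoulli policy, step-function reward
\pi_\theta(a = 1) = \sigma(\theta), \qquad r(a) = \mathbb{1}[a = 1]
% The reward is a step function of a, but its expectation is smooth in \theta:
\mathbb{E}_{a \sim \pi_\theta}[r(a)] = \sigma(\theta), \qquad
\frac{d}{d\theta}\,\mathbb{E}_{a \sim \pi_\theta}[r(a)] = \sigma(\theta)\bigl(1 - \sigma(\theta)\bigr)
```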
Off-policy: able to improve the policy without generating new samples from that policy.
On-policy: each time the policy is changed, even a little bit, we need to generate new samples.
Conventional policy gradient methods are on-policy: new samples are generated each time with an updated policy.
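For reference, the standard REINFORCE-style estimator below (notation assumed, as it is not written out above) shows why: the expectation is taken over trajectories from the current policy $\pi_\theta$, so reusing samples from an older policy would require an importance-weight correction.

```latex
% On-policy policy gradient (REINFORCE form); trajectories must come from the current \pi_\theta
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim p_\theta(\tau)}
    \left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right)
           \left( \sum_{t=1}^{T} r(s_t, a_t) \right) \right]
```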
Actor-critic methods can be either on-policy or off-policy, depending on the details.
Model-based methods are more sample-efficient, which is intuitive: having a model of the dynamics reduces the number of samples needed.
Model-based RL methods use gradient descent to get the best model, i.e., we are fitting the dynamics model to the data, but nowhere are we directly maximizing the reward.
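A minimal sketch of this idea in PyTorch, fitting a dynamics model $f_\phi(s, a) \approx s'$ by supervised regression on transitions (the network shape, synthetic data, and hyperparameters are assumptions for illustration; note that the loss is a prediction error, not a reward):

```python
import torch
import torch.nn as nn

# Hypothetical dimensions and synthetic transitions (s, a, s') for illustration.
state_dim, action_dim, n = 4, 2, 1024
states = torch.randn(n, state_dim)
actions = torch.randn(n, action_dim)
next_states = torch.randn(n, state_dim)   # in practice, collected from the environment

# Dynamics model f(s, a) -> predicted next state.
model = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64),
    nn.ReLU(),
    nn.Linear(64, state_dim),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    pred = model(torch.cat([states, actions], dim=-1))
    loss = nn.functional.mse_loss(pred, next_states)   # model-fitting loss, not a reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```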
A stationary distribution of a Markov chain is a probability distribution that remains unchanged as the Markov chain progresses in time. Typically, it is represented as a row vector $\pi$ whose entries are probabilities summing to 1, and given the transition matrix $P$, it satisfies $\pi P = \pi$.