Deep Reinforcement Learning

Goal of Deep RL

In deep RL the policy has parameters θ, so our objective is to find θ such that the total reward collected over the states of an episode is maximized.

Note: Here we are not considering discounted reward
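Written out, this is the standard undiscounted, episodic objective (notation assumed here, since the original formula is not reproduced in these notes): trajectories τ are sampled by running the policy π_θ in the environment.

```latex
% Undiscounted episodic objective (standard notation, assumed):
% tau = (s_1, a_1, \dots, s_T, a_T) is a trajectory sampled by running \pi_\theta.
\theta^{\star} = \arg\max_{\theta} J(\theta),
\qquad
J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\!\left[\sum_{t=1}^{T} r(s_t, a_t)\right]
```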

Finite & Infinite Horizon Objective Function
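As a rough sketch (standard forms, not taken verbatim from these notes): the finite-horizon objective is the episodic sum above, while in the infinite-horizon case the sum is replaced by an expectation under the stationary state-action distribution of the policy, which is what the next section introduces.

```latex
% Infinite-horizon (undiscounted) case: as T grows, the state-action marginal
% converges to a stationary distribution mu_theta(s, a), and the objective
% becomes an expectation under it.
J(\theta) = \mathbb{E}_{(s, a) \sim \mu_{\theta}(s, a)}\left[\, r(s, a) \,\right]
```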

Stationary Distribution: Markov Chains

A stationary distribution of a Markov chain is a probability distribution that remains unchanged as the chain progresses in time. Typically, it is represented as a row vector π whose entries are probabilities summing to 1, and given a transition matrix P, it satisfies

π = πP

Once the chain is in a stationary distribution, it remains in that distribution. A stationary distribution is a vector whose entries give the probability of being in each corresponding state. From the equation above, a stationary distribution is a left eigenvector of the transition matrix P with eigenvalue 1, normalized so its entries sum to 1.
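As a concrete illustration (the transition matrix below is made up for the example), the stationary distribution can be computed as the eigenvalue-1 left eigenvector of P:

```python
import numpy as np

# Hypothetical 3-state transition matrix (rows sum to 1); purely illustrative.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
])

# A stationary distribution pi satisfies pi = pi P, i.e. it is a left
# eigenvector of P with eigenvalue 1. Left eigenvectors of P are right
# eigenvectors of P^T.
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))   # pick the eigenvalue closest to 1
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()                       # normalize to a probability vector

print("stationary distribution:", pi)
print("check pi P ≈ pi:", np.allclose(pi @ P, pi))
```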

Expectation makes objective function smooth

Taking the expectation of a function makes it smoother, allowing it to be differentiated with respect to the parameters. In the example above, the reward function itself is non-differentiable, but its expectation is differentiable, which makes gradient-based learning feasible for RL.
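A small illustration of this point (toy example, not from the notes): a step-function reward is non-differentiable in the action, yet its expectation under a Gaussian policy is a smooth function of the policy mean θ.

```python
import numpy as np

# A binary (step) reward is not differentiable in the action, but its
# expectation under a Gaussian policy is smooth in the policy mean theta.
def reward(action):
    return 1.0 if action > 0.0 else 0.0          # non-differentiable step

def expected_reward(theta, sigma=1.0, n_samples=20_000, seed=0):
    rng = np.random.default_rng(seed)
    actions = rng.normal(loc=theta, scale=sigma, size=n_samples)
    return np.mean([reward(a) for a in actions]) # Monte Carlo estimate of E[r]

for theta in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    # E[r] = P(action > 0) = Phi(theta / sigma): a smooth function of theta.
    print(f"theta={theta:+.1f}  E[reward]≈{expected_reward(theta):.3f}")
```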

Types of RL Algos

Model-Based RL Algo

Value-Based RL Algo

Direct Policy Gradients

Actor-Critic

Trade-Offs

Sample Efficiency

Off-policy: able to improve the policy without generating new samples from that policy.
On-policy: each time the policy is changed, even a little bit, we need to generate new samples. (A minimal sketch of this difference follows the bullet list below.)

  • Conventional policy gradient methods are on-policy: new samples are generated each time with the updated policy.

  • Actor-critic methods can be either on-policy or off-policy, depending on the details.

  • Model-based methods are the most sample-efficient, which is intuitive: having a model of the environment reduces the need for real samples.
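A minimal sketch of the on-policy vs. off-policy data flow (toy stand-ins for the environment, policy, and update; none of this is a specific library's API):

```python
import random
from collections import deque

# Toy stand-ins so the sketch runs; a real implementation would use an actual
# environment, a policy network, and a gradient-based update (all hypothetical here).
def collect_rollout(policy, n=32):
    # Pretend each "transition" is an (action, reward) pair sampled from the current policy.
    return [(random.gauss(policy["theta"], 1.0), 0.0) for _ in range(n)]

def update_policy(policy, batch):
    policy["theta"] += 0.01                    # dummy update standing in for a gradient step

def on_policy_training(policy, n_iters=10):
    for _ in range(n_iters):
        batch = collect_rollout(policy)        # fresh samples from the *current* policy
        update_policy(policy, batch)           # old samples are discarded after each update

def off_policy_training(policy, n_iters=10, buffer_size=10_000):
    replay = deque(maxlen=buffer_size)         # replay buffer keeps old transitions around
    for _ in range(n_iters):
        replay.extend(collect_rollout(policy))
        batch = random.sample(list(replay), k=min(64, len(replay)))
        update_policy(policy, batch)           # updates can reuse samples from older policies

on_policy_training({"theta": 0.0})
off_policy_training({"theta": 0.0})
```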

Convergence and Stability

  • Model-based RL methods use gradient descent to fit the model (e.g., to minimize prediction error), i.e., we are fitting the model but nowhere directly maximizing the reward, so a better model fit does not by itself guarantee a better policy (see the sketch below).
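To make that concrete, a toy sketch (made-up linear dynamics and data): the model is fit by gradient descent on prediction error, and the reward never appears in the loss.

```python
import numpy as np

# Model-based RL fits a dynamics model f(s, a) ≈ s' by minimizing prediction
# error. The reward never enters this loss, so driving it to zero says nothing
# directly about maximizing reward.
rng = np.random.default_rng(0)
states      = rng.normal(size=(256, 3))
actions     = rng.normal(size=(256, 1))
next_states = states + 0.1 * actions + 0.01 * rng.normal(size=(256, 3))  # toy "true" dynamics

X = np.hstack([states, actions])          # model input: (s, a)
W = np.zeros((4, 3))                      # linear dynamics model: s' ≈ X W

for _ in range(500):                      # plain gradient descent on the MSE
    pred = X @ W
    grad = 2.0 * X.T @ (pred - next_states) / len(X)
    W -= 0.05 * grad

mse = np.mean((X @ W - next_states) ** 2)
print(f"model-fitting loss (MSE): {mse:.5f}   # note: no reward term anywhere")
```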
