
Policy Gradient

Directly parametrise the policy, instead of the value or action-value function, and optimize it.



Here we directly optimize the policy itself, without relying on a value function, using the policy gradient.

$$\pi_\theta(s,a) = P[a \mid s, \theta]$$
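As a concrete illustration (my own minimal sketch, not from the source), such a parametrised stochastic policy for a discrete action space can be a small network that maps a state to a categorical distribution over actions; the layer sizes and names below are arbitrary:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class DiscretePolicy(nn.Module):
    """pi_theta(s, a) = P[a | s, theta], parametrised by a small MLP (sizes are illustrative)."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, num_actions),  # unnormalised scores (logits) for each action
        )

    def forward(self, state: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(state))  # stochastic policy over actions

# Sample an action and keep its log-probability for later policy-gradient updates.
policy = DiscretePolicy(state_dim=4, num_actions=2)
dist = policy(torch.randn(4))
action = dist.sample()
log_prob = dist.log_prob(action)  # log pi_theta(a | s)
```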

Advantages:

  • Effective in high-dimensional or continuous action spaces

  • Can learn stochastic policies

Disadvantages:

  • Typically converges to a local rather than a global optimum.

  • Evaluating a policy is typically inefficient and high-variance.

Policy Objective Function

We optimise the parameters of the policy with respect to some objective function $J(\theta)$; common choices are:

  • Average value: $J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)$, where $d^{\pi_\theta}(s)$ is the stationary distribution of states under $\pi_\theta$

  • Average reward per time-step: $J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\, R(s,a)$

  • Start value (episodic case): $J_1(\theta) = E_{s_1 \sim p(s_1)}[V^{\pi_\theta}(s_1)]$

Policy Gradient Algo

Policy Differentiation (Policy Gradient Theorem)

Finally, the gradient of the objective function is written as an expectation of the gradient of the log of the policy, multiplied by the total reward.

Note: the term inside the gradient is zero for a deterministic policy, so for now this works only for stochastic policies.

Policy Gradient Vs Maximum Likelihood

Here you can see that the log term in the policy gradient is similar to the maximum-likelihood loss used in supervised learning.

So we can compute the policy gradient in much the same way that we compute the maximum-likelihood gradient.

One thing to note here is that in the policy gradient the log term is weighted by the reward. This weights the log term according to the goodness or badness of the trajectory, as measured by its reward, so we increase the likelihood of each trajectory in proportion to its reward. The reward lets us raise the probability of some trajectories more than others, whereas plain maximum likelihood would raise the probability of all observed trajectories equally. Hence the updated policy makes high-reward trajectories more probable than low-reward ones.
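A minimal PyTorch sketch of this comparison (assuming per-step log-probabilities `log_probs` and per-trajectory returns `returns` have already been collected by rolling out the current policy; the variable names are mine): the only difference between the two surrogate losses is the reward weighting.

```python
import torch

# log_probs: (N, T) tensor of log pi_theta(a_t^i | s_t^i) for N trajectories of length T
# returns:   (N,)   tensor of total rewards r(tau_i), treated as fixed data (no gradient)

def maximum_likelihood_loss(log_probs: torch.Tensor) -> torch.Tensor:
    # Supervised-style objective: raise the probability of every sampled action equally.
    return -log_probs.sum(dim=1).mean()

def policy_gradient_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # Same log term, but weighted by each trajectory's total reward, so high-reward
    # trajectories become more probable than low-reward ones after the update.
    return -(log_probs.sum(dim=1) * returns).mean()
```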

Problems with Policy Gradient: High Variance

  • This leads to poor convergence.

  • The video explains that this method is very sensitive to how the reward is formulated: just adding a constant to the reward function can change the policy gradient substantially.

Reducing the Variance

Instead of using the original formula above, we change it to the one below.

Why does doing this reduce variance? Simply because we are now adding fewer numbers in the sum of rewards, and summing smaller quantities leads to smaller variance.

  • Using a baseline

The formula above is an unbiased estimator of the gradient.

Analysing the Variance in Gradient

Policy Objective in terms of Action-Value Function

Note: here we are not considering discounted rewards.

Policy Optimization

Monte-Carlo Policy Gradient

This has high variance.

Actor-Critic Policy Gradient

Actor-critic algorithms maintain two sets of parameters:

  • Critic: updates the action-value function parameters $w$

  • Actor: updates the policy parameters $\theta$, in the direction suggested by the critic

The algorithm follows an approximate policy gradient:

The added critic is just solving the familiar problem of policy evaluation, i.e. how good is the current policy learnt by the actor.

Action-Value(Q) Actor-Critic Algorithm

Algorithm:

Compatible Function Approximation

Advantage Function Actor Critic

Subtracting a baseline function $B(s)$ from the policy gradient reduces variance without changing the expectation.

Estimating the Advantage Function:

Eligibility Traces

Natural Policy Gradient

Here $\theta$ parametrises the policy. The green box contains a model that can estimate the return; the blue box uses a gradient step to update the policy using the estimated return. The updated policy is then run to generate more samples.

In the image above we calculate the gradient of the objective function. Note that the gradient is expressed in terms of the gradient of the log-probability of a trajectory and the total reward of that trajectory. Note: here $\pi_\theta(\tau)$ does not denote the policy but the probability distribution over trajectories, $\pi_\theta(\tau) = p_\theta(\tau)$.

$$\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\big] = E_{\tau \sim p_\theta(\tau)}\Big[\Big(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_t r(s_t, a_t)\Big)\Big]$$

The gradient above, estimated by $E_{\tau \sim \pi_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, r(\tau)]$, has high variance. That is, if we estimate this quantity with a finite number of samples and repeat the process many times, every re-evaluation of the gradient gives a different estimate, and these estimates differ substantially from one another.

Why do we have high variance? Explanation:

The policy at time $t'$ should not affect the reward at time $t$ if $t' > t$. To make sure our objective respects this causality, we make some changes to the policy-gradient formula.

$$\nabla_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\Big(\sum_{t=1}^{T} r(s_t^i, a_t^i)\Big)$$

$$\nabla_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\Big(\sum_{t'=t}^{T} r(s_{t'}^i, a_{t'}^i)\Big)$$

This ensures that the policy at time $t$ affects only the rewards that come from time $t$ onwards in every collected trajectory, i.e. the decision I take now can only change my future rewards, not past ones. Hence, when computing the gradient, we only use the rewards from time $t$ onwards to update the policy at time $t$. The above equation reduces to:

$$\nabla_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, Q_{i,t}$$
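Here $Q_{i,t}$ is the reward-to-go $\sum_{t'=t}^{T} r(s_{t'}^i, a_{t'}^i)$ from the previous equation. A small NumPy helper (my own sketch; the optional discount factor is an extra, since the formulas above are undiscounted) to compute it for a single trajectory:

```python
import numpy as np

def reward_to_go(rewards: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Q_hat[t] = sum_{t' >= t} gamma^(t'-t) * r[t'] for one trajectory.

    With gamma = 1.0 this is exactly the undiscounted reward-to-go used above.
    """
    q = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q[t] = running
    return q

# Example: rewards [1, 0, 2] -> reward-to-go [3, 2, 2]
print(reward_to_go(np.array([1.0, 0.0, 2.0])))
```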

A problem with the formula above: if all rewards are positive, then the probability of every sampled trajectory goes up. So instead of multiplying $\log \pi$ by the sum of rewards, we should multiply it by the reward minus the average reward. This makes trajectories that are more rewarding than usual go up in probability and those less rewarding than usual go down: something better than expected becomes more likely, and something worse than expected becomes less likely.

$$\nabla_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(\tau_i)\,\big(r(\tau_i) - b\big), \qquad b = \frac{1}{N}\sum_{i=1}^{N} r(\tau_i)$$
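A sketch of this baseline correction in PyTorch (reusing the hypothetical `log_probs`/`returns` tensors from the earlier example): subtracting the batch-average return before weighting the log-probabilities leaves the expected gradient unchanged but reduces its variance.

```python
import torch

def policy_gradient_loss_with_baseline(log_probs: torch.Tensor,
                                        returns: torch.Tensor) -> torch.Tensor:
    # log_probs: (N, T) log pi_theta(a_t^i | s_t^i); returns: (N,) total reward r(tau_i)
    baseline = returns.mean()        # b = (1/N) * sum_i r(tau_i)
    centred = returns - baseline     # r(tau_i) - b
    return -(log_probs.sum(dim=1) * centred).mean()
```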

Link: the section above states that in a finite-horizon problem the optimal policy is time-variant, i.e. it changes with time. But when using a neural network we restrict our policy class to time-invariant policies, since the same network is used at every time step.

Here we write the policy-optimisation objective in terms of the action-value at the initial state $s_1$.

In simpler terms, our RL objective is $E_{s_1 \sim p(s_1)}[V^\pi(s_1)]$.

Find $\theta$ which maximises $J(\theta)$.

$$\Delta\theta = \alpha \nabla_\theta J(\theta)$$

where $\nabla_\theta J(\theta)$ is the policy gradient.

Use a critic to estimate the action-value function, $Q_w(s,a) \approx Q^{\pi_\theta}(s,a)$.

$$\nabla_\theta J(\theta) \approx E_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)\big], \qquad \Delta\theta = \alpha \nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a)$$

Using a linear value function approximation $Q_w(s,a) = \phi(s,a)^T w$:

  • Critic: updates $w$ by linear TD(0)

  • Actor: updates $\theta$ by the policy gradient
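A rough sketch of one update step of this Q actor-critic scheme (the feature map `phi`, the score function `grad_log_pi`, and the step sizes are placeholders I am assuming for illustration):

```python
def qac_step(theta, w, phi, grad_log_pi, s, a, r, s_next, a_next,
             alpha=1e-2, beta=1e-2, gamma=0.99):
    """One Q actor-critic update from a single transition (s, a, r, s_next, a_next).

    phi(s, a)                -> feature vector, so Q_w(s, a) = phi(s, a) @ w
    grad_log_pi(theta, s, a) -> gradient of log pi_theta(a | s) w.r.t. theta
    Both callables and the step sizes are placeholders for this sketch.
    """
    q = phi(s, a) @ w
    q_next = phi(s_next, a_next) @ w
    td_error = r + gamma * q_next - q                       # TD(0) error

    w = w + beta * td_error * phi(s, a)                     # critic: linear TD(0) update of w
    theta = theta + alpha * grad_log_pi(theta, s, a) * q    # actor: step suggested by critic
    return theta, w
```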

A good baseline is the state-value function, $B(s) = V^{\pi_\theta}(s)$. Hence we can rewrite the policy gradient using the advantage function $A^{\pi_\theta}(s,a)$:

$$A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s), \qquad \nabla_\theta J(\theta) = E_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\, A^{\pi_\theta}(s,a)\big]$$
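One common way to estimate this advantage without learning a separate $Q$ function (in the spirit of the "Estimating the Advantage Function" heading above) is to use the TD error $\delta = r + \gamma V(s') - V(s)$, whose expectation given $(s,a)$ equals $A^{\pi_\theta}(s,a)$. A minimal sketch, assuming a learned critic `value_fn` with the interface shown:

```python
import torch

def td_advantage_actor_loss(value_fn, log_prob, r, s, s_next, done, gamma=0.99):
    """Advantage estimate via the TD error, and the corresponding actor loss term.

    value_fn(s) -> V(s), a learned state-value critic (interface assumed for this sketch).
    log_prob    -> log pi_theta(a | s) for the action actually taken (requires grad).
    """
    with torch.no_grad():
        v_next = torch.zeros(()) if done else value_fn(s_next)
        advantage = r + gamma * v_next - value_fn(s)   # delta = r + gamma*V(s') - V(s)
    actor_loss = -log_prob * advantage                 # grad of log pi weighted by advantage
    return advantage, actor_loss
```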
https://youtu.be/XGmd3wcyDg8?list=PLkFD6_40KJIxJMR-j5A1mkxK26gh_qg37&t=1690
https://youtu.be/XGmd3wcyDg8?list=PLkFD6_40KJIxJMR-j5A1mkxK26gh_qg37