Value Function Approximation

So far we have represented the value function using a look-up table:

- Every state s has an entry V(s)
- Every state-action pair (s, a) has an entry Q(s, a)

It becomes infeasible to store a separate V(s) entry when the number of states is large. To solve this we use a value function approximator: a function that maps a state to an estimate of its value, i.e. $\hat{v}: S \rightarrow \mathbb{R}$.

Stochastic Gradient Descent

$$J(w) = E_\pi\big[(v_\pi(s) - \hat{v}(s,w))^2\big]$$

Here $v_\pi$ is the true target and $\hat{v}$ is the value approximation function with parameters $w$. We update $w$ by gradient descent to minimize the cost $J(w)$. Hence

$$\Delta w = \alpha\,(v_\pi(s) - \hat{v}(s,w))\,\nabla_w \hat{v}(s,w)$$

The problem is that in reinforcement learning we do not know the target value function $v_\pi(s)$ beforehand. We will see below how to handle this.
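To make the step from $J(w)$ to the update explicit: differentiating the squared error gives

$$\nabla_w J(w) = -2\,E_\pi\big[(v_\pi(s) - \hat{v}(s,w))\,\nabla_w \hat{v}(s,w)\big]$$

Stochastic gradient descent samples this expectation one state at a time and absorbs the constant 2 into the step size $\alpha$, which yields the $\Delta w$ update above.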

Different Value Approximation

Linear Function Approximation

Feature Vector: We represent a state by a feature vector $x(s) = (x_1(s), \dots, x_n(s))^T$.

Using this state feature vector and the weights $w$, the linear approximation is:

$$\hat{v}(s,w) = x(s)^T w = \sum_j x_j(s)\,w_j$$

Hence in this case the update becomes:

$$\Delta w = \alpha\,(v_\pi(s) - \hat{v}(s,w))\,x(s)$$

Table Lookup Features: a special case of linear function approximation in which the feature vector contains one indicator feature per state, so $x(s)$ is a one-hot vector and each weight is simply that state's table entry (see the sketch below).
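A minimal numpy sketch of the two equations above with table-lookup (one-hot) features; the state-space size, step size and helper names (`x`, `v_hat`, `update`) are illustrative assumptions, not anything fixed by the lecture:

```python
import numpy as np

n_states = 5
alpha = 0.1

def x(s):
    """Table-lookup features: a one-hot indicator vector per state."""
    phi = np.zeros(n_states)
    phi[s] = 1.0
    return phi

def v_hat(s, w):
    # Linear approximation: v_hat(s, w) = x(s)^T w
    return x(s) @ w

def update(w, s, target):
    # Delta w = alpha * (target - v_hat(s, w)) * x(s)
    return w + alpha * (target - v_hat(s, w)) * x(s)

w = np.zeros(n_states)
w = update(w, 2, 1.0)   # push the value of state 2 toward a (made-up) target of 1.0
```

With one-hot features the update only touches the weight of the visited state, which recovers exact table lookup as a special case.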

Incremental Prediction Algorithms

Here we discuss the ways in which we can substitute a target for the unknown $v_\pi(s)$ in the update above.

Monte-Carlo with Value Function Approx.

Use the return $G_t$ computed from episodes in place of $v_\pi(s)$. The training data is then $\langle (S_1, G_1), (S_2, G_2), \dots, (S_T, G_T) \rangle$.

Hence, using linear Monte-Carlo policy evaluation, the update is:

$$\Delta w = \alpha\,(G_t - \hat{v}(s_t,w))\,x(s_t)$$

Monte-Carlo evaluation converges to a local optimum, even with non-linear value function approximation.
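A short sketch of linear Monte-Carlo policy evaluation under these definitions; the episode format `[(s, r), ...]` (each state paired with the reward received on leaving it) and the function name are assumptions made for illustration:

```python
def mc_prediction(episodes, x, w, alpha=0.05, gamma=1.0):
    """Linear Monte-Carlo policy evaluation.

    episodes: list of trajectories [(s_0, r_1), (s_1, r_2), ...] generated by the policy.
    x: feature function, x(s) -> numpy feature vector; w: numpy weight vector.
    """
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards so G accumulates the return from each state onward.
        for s, r in reversed(episode):
            G = r + gamma * G
            w += alpha * (G - x(s) @ w) * x(s)   # Delta w = alpha (G_t - v_hat(s_t, w)) x(s_t)
    return w
```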

TD Learning with value function approx.

The TD target $R_{t+1} + \gamma\,\hat{v}(S_{t+1}, w)$ is a biased sample of the true value $v_\pi(s_t)$.

Supervised learning can be applied using the training data $\langle (S_1, R_2 + \gamma\,\hat{v}(S_2, w)),\ (S_2, R_3 + \gamma\,\hat{v}(S_3, w)), \dots, (S_{T-1}, R_T) \rangle$:

$$\Delta w = \alpha\,(R_{t+1} + \gamma\,\hat{v}(S_{t+1}, w) - \hat{v}(s_t,w))\,x(s_t)$$

Linear TD(0) converges (close) to the global optimum.
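A matching sketch for linear TD(0), which bootstraps from the current estimate of the next state's value; the `(s, r, s_next, done)` transition format is an assumed convention:

```python
def td0_prediction(transitions, x, w, alpha=0.05, gamma=1.0):
    """Linear TD(0) policy evaluation.

    transitions: iterable of (s, r, s_next, done) tuples generated by the policy.
    """
    for s, r, s_next, done in transitions:
        target = r if done else r + gamma * (x(s_next) @ w)   # TD target R + gamma * v_hat(S', w)
        w += alpha * (target - x(s) @ w) * x(s)
    return w
```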

TD(λ) with value function approximation

The λ-return $G_t^\lambda$ is also a biased sample of the true value $v_\pi(s_t)$.

Forward view linear TD(λ): apply supervised learning to the data $\langle (S_1, G_1^\lambda), (S_2, G_2^\lambda), \dots, (S_T, G_T^\lambda) \rangle$:

$$\Delta w = \alpha\,(G_t^\lambda - \hat{v}(s_t,w))\,x(s_t)$$

Backward view linear TD(λ), using an eligibility trace $E_t$:

$$\delta_t = R_{t+1} + \gamma\,\hat{v}(S_{t+1}, w) - \hat{v}(s_t,w)$$
$$E_t = \gamma\lambda\,E_{t-1} + x(S_t)$$
$$\Delta w = \alpha\,\delta_t\,E_t$$

Forward and backward view linear TD(λ) are equivalent.
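A sketch of the backward view with an accumulating eligibility trace, under the same assumed transition format as above (one call per episode, so the trace starts at zero):

```python
import numpy as np

def td_lambda_prediction(episode, x, w, alpha=0.05, gamma=1.0, lam=0.9):
    """Backward-view linear TD(lambda) over a single episode of (s, r, s_next, done) tuples."""
    E = np.zeros_like(w)                        # eligibility trace
    for s, r, s_next, done in episode:
        v_next = 0.0 if done else x(s_next) @ w
        delta = r + gamma * v_next - x(s) @ w   # delta_t
        E = gamma * lam * E + x(s)              # E_t = gamma*lambda*E_{t-1} + x(S_t)
        w += alpha * delta * E                  # Delta w = alpha * delta_t * E_t
    return w
```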

Incremental Control Algorithms

Policy evaluation: approximate policy evaluation, $\hat{q}(\cdot,\cdot,w) \approx q_\pi$. Policy improvement: $\epsilon$-greedy policy improvement.

Here

$$J(w) = E_\pi\big[(q_\pi(S,A) - \hat{q}(S,A,w))^2\big]$$
$$\Delta w = \alpha\,(q_\pi(S,A) - \hat{q}(S,A,w))\,\nabla_w \hat{q}(S,A,w)$$

Linear approximation:

$$\hat{q}(s,a,w) = x(s,a)^T w = \sum_j x_j(s,a)\,w_j$$

Algorithms

Now we have to substitute a target for $q_\pi(S,A)$.

MC: the target is the return $G_t$

$$\Delta w = \alpha\,(G_t - \hat{q}(S,A,w))\,\nabla_w \hat{q}(S,A,w)$$

TD(0): the target is the TD target $R_{t+1} + \gamma\,\hat{q}(S_{t+1}, A_{t+1}, w)$

$$\Delta w = \alpha\,(R_{t+1} + \gamma\,\hat{q}(S_{t+1}, A_{t+1}, w) - \hat{q}(S,A,w))\,\nabla_w \hat{q}(S,A,w)$$

The same goes for the forward and backward view TD(λ) targets.
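Putting evaluation and ε-greedy improvement together, a sketch of Sarsa control with linear action-value approximation; the `env.reset()`/`env.step()` interface returning `(next_state, reward, done)` and the joint feature function `x(s, a)` are assumptions for illustration only:

```python
import numpy as np

def sarsa_linear(env, x, n_actions, n_features, episodes=500,
                 alpha=0.05, gamma=0.99, eps=0.1):
    """Sarsa control with q_hat(s, a, w) = x(s, a)^T w and epsilon-greedy improvement."""
    w = np.zeros(n_features)

    def q_hat(s, a):
        return x(s, a) @ w

    def eps_greedy(s):
        if np.random.rand() < eps:
            return np.random.randint(n_actions)
        return int(np.argmax([q_hat(s, a) for a in range(n_actions)]))

    for _ in range(episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r
            else:
                a_next = eps_greedy(s_next)
                target = r + gamma * q_hat(s_next, a_next)   # Sarsa TD target
            w += alpha * (target - q_hat(s, a)) * x(s, a)    # update along x(s, a)
            if not done:
                s, a = s_next, a_next
    return w
```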

Gradient Temporal-Difference Learning

Batch Methods

Least Square Prediction

Given a value function approximation $\hat{v}(s,w)$ and experience $D$ consisting of <state, value> pairs $D = \{\langle s_1, v_1^\pi \rangle, \langle s_2, v_2^\pi \rangle, \dots, \langle s_t, v_t^\pi \rangle\}$, learn the parameters $w$ of the best-fitting value function $\hat{v}(s,w)$.

Least-squares algorithm: find $w$ minimizing the least-squares error between the approximate function and the target values:

$$LS(w) = \sum_{t=1}^T (v_t^\pi - \hat{v}(s_t,w))^2 = E_D\big[(v^\pi - \hat{v}(s,w))^2\big]$$

SGD with experience replay: repeatedly sample a state-value pair from the experience, $\langle s, v^\pi \rangle \sim D$, and apply the SGD update

$$\Delta w = \alpha\,(v^\pi - \hat{v}(s,w))\,\nabla_w \hat{v}(s,w)$$

This converges to the least-squares solution

$$w^\pi = \arg\min_w LS(w)$$

Experience Replay in Deep Q-Networks: a deep Q-network is trained in the same way, using experience replay.
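A sketch of both routes to the least-squares solution: SGD over replayed <state, value> pairs, and, for linear features, the direct closed-form minimiser of $LS(w)$. The data layout `D = [(state, target_value), ...]` and the function names are assumptions:

```python
import random
import numpy as np

def sgd_with_experience_replay(D, x, n_features, iters=10_000, alpha=0.01):
    """Repeatedly sample <s, v_pi> ~ D and apply the usual gradient step."""
    w = np.zeros(n_features)
    for _ in range(iters):
        s, v_target = random.choice(D)
        w += alpha * (v_target - x(s) @ w) * x(s)   # Delta w = alpha (v_pi - v_hat) x(s)
    return w

def least_squares_direct(D, x, n_features, reg=1e-6):
    """Closed-form minimiser of LS(w) for linear features (small feature dimension assumed)."""
    A = reg * np.eye(n_features)   # small ridge term keeps A invertible
    b = np.zeros(n_features)
    for s, v_target in D:
        phi = x(s)
        A += np.outer(phi, phi)
        b += phi * v_target
    return np.linalg.solve(A, b)
```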

Least Square Control

Least-squares action-value function approximation: approximate the action-value function $\hat{q}(s,a,w)$ and minimize the least-squares error between $\hat{q}(s,a,w)$ and $q_\pi(s,a)$, using experience generated by policy $\pi$ consisting of <(state, action), value> pairs $D = \{\langle (s_1,a_1), v_1^\pi \rangle, \langle (s_2,a_2), v_2^\pi \rangle, \dots, \langle (s_t,a_t), v_t^\pi \rangle\}$.

Control then proceeds as in Sarsa control with action-value function approximation, with $\hat{q}$ fitted by least squares.
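The action-value case can be fitted in the same closed form; a sketch assuming experience stored as `((state, action), target)` pairs and a joint feature function `x(s, a)`:

```python
import numpy as np

def least_squares_q(D, x, n_features, reg=1e-6):
    """Least-squares fit of q_hat(s, a, w) = x(s, a)^T w to experience gathered under policy pi.

    D: list of ((s, a), q_target) pairs; the targets come from returns or TD estimates,
    as in the incremental control case above.
    """
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for (s, a), q_target in D:
        phi = x(s, a)
        A += np.outer(phi, phi)
        b += phi * q_target
    return np.linalg.solve(A, b)
```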