Solving MDP with known Model

Planning by DP

When the model is fully known, we can use dynamic programming to evaluate a policy and then improve it.

DP:

Dynamic Programming is a very general solution method for problems which have two properties:

  • Optimal substructure

    • Principle of optimality applies

    • Optimal solution can be decomposed into subproblems

  • Overlapping subproblems

    • Subproblems recur many times

    • Solutions can be cached and reused

Policy Evaluation:

  • Using synchronous backups (a backup here simply means an update):

    • At each iteration $k + 1$

    • For all states $s \in S$

    • Update $V_{k+1}(s)$ from $V_k(s')$

    • where $s'$ is a successor state of $s$

$$V_{k+1}(s) = \mathbb{E}\big[r + \gamma V_k(s') \mid S_t = s\big] = \sum_a \pi(a \mid s) \sum_{s', r} P(s', r \mid s, a)\,\big(r + \gamma V_k(s')\big)$$
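As a concrete illustration, here is a minimal NumPy sketch of iterative policy evaluation with synchronous backups. It assumes a tabular MDP stored as arrays `P` of shape `(S, A, S)` with `P[s, a, s']` $= P^a_{ss'}$ and `R` of shape `(S, A)` with `R[s, a]` $= R^a_s$, plus a stochastic policy `pi` of shape `(S, A)`; the array names and shapes are assumptions made for the sketch, not part of these notes.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation with synchronous backups.

    P:  (S, A, S) transition probabilities, P[s, a, s'] = P^a_{ss'}
    R:  (S, A)    expected immediate rewards, R[s, a] = R^a_s
    pi: (S, A)    policy, pi[s, a] = pi(a | s)
    """
    V = np.zeros(P.shape[0])
    while True:
        # Bellman expectation backup for all states at once:
        # V_{k+1}(s) = sum_a pi(a|s) * (R^a_s + gamma * sum_{s'} P^a_{ss'} V_k(s'))
        V_new = np.sum(pi * (R + gamma * (P @ V)), axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```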

Policy Improvement

Based on the value function, policy improvement generates a better policy $\pi' \geq \pi$ by acting greedily.

$$Q_\pi(s, a) = \mathbb{E}\big[R_{t+1} + \gamma V_\pi(S_{t+1}) \mid S_t = s, A_t = a\big] = \sum_{s', r} P(s', r \mid s, a)\,\big(r + \gamma V_\pi(s')\big)$$

$$\pi' = \text{greedy}(v_\pi), \qquad \pi'(s) = \arg\max_{a \in A} q_\pi(s, a)$$
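Continuing the same sketch (same assumed `P` and `R` arrays), greedy improvement just computes $Q_\pi$ from $V_\pi$ and takes the arg-max in every state:

```python
import numpy as np

def greedy_improvement(P, R, V, gamma=0.9):
    """Greedy policy improvement: pi'(s) = argmax_a Q_pi(s, a)."""
    # Q_pi(s, a) = R^a_s + gamma * sum_{s'} P^a_{ss'} V_pi(s')
    Q = R + gamma * (P @ V)                                  # shape (S, A)
    pi_new = np.zeros(R.shape)
    pi_new[np.arange(R.shape[0]), Q.argmax(axis=1)] = 1.0    # deterministic greedy policy
    return pi_new
```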

Policy Iteration

Generalized Policy Iteration (GPI) refers to the iterative procedure of improving a policy by alternating policy evaluation and policy improvement.

Given a policy $\pi$:

  • Evaluate the policy $\pi$ using policy evaluation

  • Then perform policy improvement to obtain a better policy $\pi' = \text{greedy}(V_\pi)$

    • In the case of a deterministic policy we can take $\pi'(s) = \arg\max_{a \in A} q_\pi(s, a)$

    • This improves the value from any state $s$ over one step: $q_\pi(s, \pi'(s)) = \max_{a \in A} q_\pi(s, a) \geq q_\pi(s, \pi(s)) = v_\pi(s)$

This will always lead to the optimal policy $\pi^*$.
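Putting the two sketches above together gives an illustrative policy-iteration loop (again under the assumed array layout, not a prescribed implementation):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)          # start from the uniform random policy
    while True:
        V = policy_evaluation(P, R, pi, gamma)
        pi_new = greedy_improvement(P, R, V, gamma)
        if np.array_equal(pi_new, pi):     # stable greedy policy => (pi*, v*)
            return pi, V
        pi = pi_new
```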

Value Iteration

Any optimal policy can be subdivided into two components:

  • An optimal first action $A_*$

  • Followed by an optimal policy from the successor state $S'$

Deterministic Value Iteration

  • If we know the solution to the subproblems $v_*(s')$

  • Then the solution $v_*(s)$ can be found by a one-step lookahead: $v_*(s) = \max_{a \in A}\big(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, v_*(s')\big)$

  • We do this for every state until convergence

  • Start with the final rewards and work backwards

Using synchronous backups:

  • At each iteration $k + 1$

  • For all states $s \in S$

  • Update $v_{k+1}(s)$ from $v_k(s')$

Once we have the optimal value function, we can extract a deterministic optimal policy as follows:

$$\pi^*(s) = \arg\max_{a \in A}\Big(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, v_*(s')\Big)$$
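Under the same assumed array layout as the earlier sketches, value iteration followed by greedy policy extraction could look like this:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration with synchronous backups, then greedy policy extraction."""
    V = np.zeros(P.shape[0])
    while True:
        # one-step lookahead: v_{k+1}(s) = max_a (R^a_s + gamma * sum_{s'} P^a_{ss'} v_k(s'))
        V_new = np.max(R + gamma * (P @ V), axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # pi*(s) = argmax_a (R^a_s + gamma * sum_{s'} P^a_{ss'} v*(s'))
    pi_star = np.argmax(R + gamma * (P @ V), axis=1)
    return V, pi_star
```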

Extensions to DP:

  • Synchronous: synchronous value iteration stores two copies of the value function: $v_{new}(s) = \max_{a \in A}\big(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, v_{old}(s')\big)$, then $v_{old} \leftarrow v_{new}$

  • Asynchronous

    • In-place DP: stores just one copy of the value function, $v(s) \leftarrow \max_{a \in A}\big(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, v(s')\big)$ (see the sketch after this list)

    • Prioritised sweeping

    • Real-time dynamic programming

  • Sample backups: use sampled rewards and sampled transitions $\langle S, A, R, S' \rangle$ instead of the reward function $R$ and transition dynamics $P$. "Sample" is the keyword here.
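For contrast with the synchronous version, here is a sketch of the in-place (asynchronous) variant referenced in the list above, again under the same assumed array layout; the fixed number of sweeps is an arbitrary choice for illustration:

```python
import numpy as np

def in_place_value_iteration(P, R, gamma=0.9, sweeps=1000):
    """Asynchronous, in-place value iteration: one value array, updated state by state."""
    V = np.zeros(P.shape[0])
    for _ in range(sweeps):
        for s in range(P.shape[0]):
            # each update immediately reuses the freshest values of the other states
            V[s] = np.max(R[s] + gamma * (P[s] @ V))
    return V
```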

Convergence of Policy or Value Iteration

This can be proved using the contraction mapping theorem.
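For reference, the standard argument (not spelled out in these notes) goes through the Bellman optimality backup viewed as an operator:

$$(T v)(s) = \max_{a \in A}\Big(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, v(s')\Big), \qquad \|T v - T u\|_\infty \leq \gamma\, \|v - u\|_\infty$$

Since $0 \leq \gamma < 1$, $T$ is a $\gamma$-contraction in the max-norm, so by the Banach fixed-point theorem it has a unique fixed point $v_*$ and repeated application (value iteration) converges to it. The Bellman expectation operator contracts in the same way, which gives convergence of policy evaluation; combined with the one-step improvement inequality shown earlier and the finite number of policies, policy iteration converges as well.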
