
MDP Equations

About the value and action-value equations

Episode: (s, a, s', r). Let's say we are in state s, we take action a, arrive in the next state s', and obtain reward r.

Model: Transition and Reward

$$
P(s', r \mid s, a) = P[S_{t+1} = s',\, R_{t+1} = r \mid S_t = s,\, A_t = a]
$$

$$
P_{ss'}^a = P[S_{t+1} = s' \mid S_t = s,\, A_t = a] = \sum_{r \in R} P(s', r \mid s, a)
$$

$$
R(s,a) = E[R_{t+1} \mid S_t = s,\, A_t = a] = \sum_{r \in R}\sum_{s' \in S} P(s', r \mid s, a)\, r = \sum_{s' \in S} P(s', r \mid s, a)\, R(s, a, s')
$$

Here, the first equation is the model: the joint probability distribution of the reward and the next state, given the current state and the action taken. The expected reward $R(s,a)$ is computed by summing over all possible values of $r$, each weighted by its probability of occurring. Note: the double summation in $R(s,a)$ can be collapsed because $R: S \times A \times S \to \mathbb{R}$, i.e. the reward is completely determined by $(s, a, s')$; once $s$, $a$ and $s'$ are fixed, there is only one value of $r$ to sum over.

The second equation follows from the law of total probability: summing the joint distribution over all rewards leaves the transition probability.

Given a policy $\pi(a \mid s) = P[A_t = a \mid S_t = s]$, the policy-averaged reward and transition probabilities follow directly:

$$
R_\pi(s) = \sum_{a \in A} \pi(a \mid s)\, R(s,a), \qquad P_{ss'}^\pi = \sum_{a \in A} \pi(a \mid s)\, P_{ss'}^a
$$
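To make these definitions concrete, here is a minimal sketch (not from the original notes) that stores the joint model $P(s', r \mid s, a)$ of a made-up two-state MDP as a plain dictionary and recovers $P_{ss'}^a$ and $R(s,a)$ by exactly the summations above; the names `model`, `transition_prob` and `expected_reward` are illustrative.

```python
from collections import defaultdict

# Hypothetical tabular model: model[(s, a)] is a list of (s_next, r, prob) entries,
# i.e. the joint distribution P(s', r | s, a) for a tiny two-state MDP.
model = {
    ("s0", "a0"): [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],
    ("s0", "a1"): [("s1", 1.0, 1.0)],
    ("s1", "a0"): [("s0", 0.0, 0.4), ("s1", 2.0, 0.6)],
    ("s1", "a1"): [("s1", 2.0, 1.0)],
}

def transition_prob(s, a):
    """P_{ss'}^a = sum_r P(s', r | s, a): marginalise the reward out of the joint model."""
    p = defaultdict(float)
    for s_next, _r, prob in model[(s, a)]:
        p[s_next] += prob
    return dict(p)

def expected_reward(s, a):
    """R(s, a) = sum_{s', r} P(s', r | s, a) * r."""
    return sum(prob * r for _s_next, r, prob in model[(s, a)])

print(transition_prob("s0", "a0"))   # {'s0': 0.7, 's1': 0.3}
print(expected_reward("s0", "a0"))   # 0.7*0.0 + 0.3*1.0 = 0.3
```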

Let's start with the value and action-value functions for RL

$$
G_t = R_{t+1} + \gamma R_{t+2} + \dots \qquad (1)
$$

$$
V_\pi(s) = E[G_t \mid S_t = s] \qquad (2)
$$

$$
Q_\pi(s,a) = E[G_t \mid S_t = s,\, A_t = a] \qquad (3)
$$

First, ask why the expectation is needed even though the policy is fixed. Answer: even with a fixed policy, the environment determines the probability of transitioning between states given the current state and the action taken, and the policy determines the probability of each action. So to compute the return $G_t$, we must weight the reward from each state by the transition probabilities and by the action probabilities given by the policy. Taking the expectation performs exactly this weighting.
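As a small illustration of equation (1) (my own example, not from the notes), the discounted return of a finite reward sequence can be computed by working backwards through the rewards; `rewards` and `gamma` below are arbitrary example values.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... for a finite episode."""
    g = 0.0
    # Work backwards so each step is g = r + gamma * (return from the next step onwards).
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```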

Bellman Equation

$$
V_\pi(s) = \sum_{a \in A} \pi(a \mid s)\, Q_\pi(s,a)
$$

$$
Q_\pi(s,a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, V_\pi(s')
$$

These equations are derived using the law of total expectation (also called the law of iterated expectation); see the Law of Iterative Expectation section below for the derivation.

On solving (substituting the second equation into the first and simplifying):

$$
\begin{aligned}
V_\pi(s) &= \sum_{a \in A} \pi(a \mid s)\, Q_\pi(s,a)\\
&= \sum_{a \in A} \pi(a \mid s)\left(R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a V_\pi(s')\right)\\
&= \sum_{a \in A} \pi(a \mid s)\, R_s^a + \gamma \sum_{s' \in S}\sum_{a \in A} \pi(a \mid s)\, P_{ss'}^a\, V_\pi(s')\\
&= R_s^\pi + \gamma \sum_{s' \in S} P_{ss'}^\pi\, V_\pi(s')\\
&= R_s^\pi + \gamma E[V_\pi(s')]
\end{aligned}
$$

Note: in $E[V_\pi(s')] = \sum_{s' \in S} P_{ss'}^\pi V_\pi(s')$ we treat $s'$ as the random variable. But when $E[V_\pi(s')] = \sum_{a \in A}\sum_{s' \in S} \pi(a \mid s)\, P_{ss'}^a\, V_\pi(s')$ is used, both $a$ and $s'$ are random variables and the summation runs over both.

Similarly, for the action-value function:

$$
\begin{aligned}
Q_\pi(s,a) &= R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, V_\pi(s')\\
&= R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \sum_{a' \in A} \pi(a' \mid s')\, Q_\pi(s', a')\\
&= R_s^a + \gamma \sum_{s' \in S}\sum_{a' \in A} P_{ss'}^a\, \pi(a' \mid s')\, Q_\pi(s', a')\\
&= R_s^a + \gamma E[Q_\pi(s', a')]
\end{aligned}
$$
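Since the Bellman expectation equation expresses $V_\pi$ in terms of itself, it can be used as an update rule. The sketch below is a minimal iterative policy evaluation on a hypothetical two-state, two-action MDP; the tables `P`, `R` and `policy` are made-up examples, not anything from the notes.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP.
# P[a][s][s'] = P_{ss'}^a, R[a][s] = R_s^a, policy[s][a] = pi(a|s).
P = np.array([[[0.7, 0.3], [0.4, 0.6]],     # action 0
              [[0.0, 1.0], [0.0, 1.0]]])    # action 1
R = np.array([[0.3, 1.2],                   # action 0
              [1.0, 2.0]])                  # action 1
policy = np.array([[0.5, 0.5], [0.9, 0.1]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Q_pi(s,a) = R_s^a + gamma * sum_s' P_{ss'}^a V_pi(s')
    Q = R.T + gamma * np.einsum("ast,t->sa", P, V)   # Q[s, a]
    # V_pi(s) = sum_a pi(a|s) Q_pi(s,a)
    V_new = np.sum(policy * Q, axis=1)
    done = np.max(np.abs(V_new - V)) < 1e-10
    V = V_new
    if done:
        break

print(V)  # converged state values under this policy
```

Because the Bellman expectation equation is linear in $V_\pi$, the same small example could instead be solved exactly as the linear system $V_\pi = R_\pi + \gamma P^\pi V_\pi$.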

Optimality

$$
\pi^* = \arg\max_\pi V_\pi(s) = \arg\max_\pi Q_\pi(s,a) \quad \forall\, s, a
$$

$$
V_{\pi^*}(s) = \max_\pi V_\pi(s) \quad \forall\, s, \qquad Q_{\pi^*}(s,a) = \max_\pi Q_\pi(s,a) \quad \forall\, s, a
$$

$\pi^*$ is the optimal policy: it achieves the maximum value function for every state. For an MDP there always exists a deterministic optimal policy, which we can obtain from the optimal action-value function: in each state, the action with the maximum action-value gets probability 1 and all other actions get probability 0.

Bellman Optimality equations:

$$
V_*(s) = \max_a Q_*(s,a), \qquad Q_*(s,a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a\, V_*(s')
$$

The Bellman optimality equations are non-linear, and unlike the Bellman expectation equations no closed-form solution exists. Hence it is difficult to solve for optimality exactly, and we instead use iterative methods (such as value iteration or policy iteration) to find the optimal state-value and action-value functions.
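One such method is value iteration, which repeatedly applies the Bellman optimality equation as an update and then reads off the deterministic greedy policy from $Q_*$. This is my own sketch on the same hypothetical two-state MDP used above (tables redefined here so the snippet runs on its own).

```python
import numpy as np

# Same hypothetical 2-state, 2-action MDP as in the policy-evaluation sketch.
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.0, 1.0], [0.0, 1.0]]])   # P[a][s][s']
R = np.array([[0.3, 1.2],
              [1.0, 2.0]])                 # R[a][s]
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Q_*(s,a) = R_s^a + gamma * sum_s' P_{ss'}^a V_*(s')
    Q = R.T + gamma * np.einsum("ast,t->sa", P, V)
    # V_*(s) = max_a Q_*(s,a)
    V_new = Q.max(axis=1)
    done = np.max(np.abs(V_new - V)) < 1e-10
    V = V_new
    if done:
        break

greedy_policy = Q.argmax(axis=1)   # deterministic optimal policy: best action per state
print(V, greedy_policy)
```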

Law of Iterative Expectation

The law of iterated expectation states that $E[X] = E_Y[E[X \mid Y]]$. Proof:

$$
\begin{aligned}
E_Y[E[X \mid Y]] &= \sum_{y \in Y} E[X \mid Y=y]\, P_Y(y) = \sum_{y \in Y}\left(\sum_{x \in X} x\, P_X(x \mid Y=y)\right) P_Y(y)\\
&= \sum_{x \in X} x \left(\sum_{y \in Y} P_X(x \mid Y=y)\, P_Y(y)\right) = \sum_{x \in X} x\, P_X(x) = E[X]
\end{aligned}
$$

Now see how this law is used to establish the relationship between the state-value and action-value functions.

$$
V(s) = E[G_t \mid S_t = s], \qquad Q(s,a) = E[G_t \mid S_t = s,\, A_t = a]
$$

$$
\begin{aligned}
V(s) &= \sum_{a \in A} \pi(a \mid s)\, Q_\pi(s,a)\\
E[G_t \mid S_t = s] &= \sum_{a \in A} P(a \mid S_t = s)\, E[G_t \mid S_t = s,\, A_t = a]\\
E[X] &= \sum_{y \in Y} P_Y(y)\, E[X \mid Y = y]\\
E[X] &= E_Y[E[X \mid Y]]
\end{aligned}
$$

So, replacing $X$ by $G_t \mid S_t$ and $Y$ by $A_t$ gives exactly the law of total expectation.

Make sure you also understand the $V$ and $Q$ relationship without invoking the law; it is just common sense, the same reasoning as in the law of total probability: you split an event into conditioned cases and weight each case by its probability of occurring.
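A quick numerical sanity check of this reasoning (made-up numbers, my own illustration): fix one state, choose arbitrary action probabilities $\pi(a \mid s)$ and action-values $Q(s,a)$, and compare $\sum_a \pi(a \mid s)\, Q(s,a)$ with a direct Monte Carlo estimate of the overall expectation.

```python
import random

random.seed(0)

# Hypothetical numbers for a single state s.
pi = {"a0": 0.25, "a1": 0.75}      # P_Y(y): action probabilities under the policy
Q = {"a0": 4.0, "a1": 8.0}         # E[X | Y=y]: expected return given each action

# Outer expectation over actions: V(s) = sum_a pi(a|s) Q(s,a) = E_Y[E[X|Y]].
V = sum(pi[a] * Q[a] for a in pi)

# Monte Carlo estimate of E[X]: sample an action from pi, then a noisy return with mean Q[a].
samples = []
for _ in range(200_000):
    a = "a0" if random.random() < pi["a0"] else "a1"
    samples.append(random.gauss(Q[a], 1.0))   # any noise with mean Q[a] works
est = sum(samples) / len(samples)

print(V, est)   # both close to 7.0, illustrating E[X] = E_Y[E[X|Y]]
```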

Policy

Stationary: the policy depends only on the current state and does not change with time. Non-stationary: the policy may change with time, so a different action may be chosen at different time steps even in the same state.
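For a stationary stochastic policy stored as a table, the policy-averaged quantities $R_\pi(s)$ and $P_{ss'}^\pi$ from the model section can be computed in a couple of lines; this is a minimal sketch using the same hypothetical tables as the earlier examples.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP and a stationary stochastic policy.
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.0, 1.0], [0.0, 1.0]]])    # P[a][s][s'] = P_{ss'}^a
R = np.array([[0.3, 1.2],
              [1.0, 2.0]])                  # R[a][s]   = R_s^a
policy = np.array([[0.5, 0.5], [0.9, 0.1]]) # policy[s][a] = pi(a|s), same at every time step

# R_pi(s)    = sum_a pi(a|s) R(s,a)
R_pi = np.sum(policy * R.T, axis=1)
# P^pi_{ss'} = sum_a pi(a|s) P_{ss'}^a
P_pi = np.einsum("sa,ast->st", policy, P)

print(R_pi)   # policy-averaged expected reward per state
print(P_pi)   # rows sum to 1: the Markov chain induced by the policy
```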


Figure: relation between state, observation and action in the temporal domain.
Figure: diagram explaining the Bellman equation.