Variational Inference


Summary

Variational Inference allows us to rewrite statistical inference problems (i.e. infer the value of a random variable given the value of another random variable) as optimization problems (i.e. find the parameter values that minimize some objective function).

When you have an intractable probability distribution $p$, variational techniques try to solve an optimization problem over a class of tractable distributions $Q$, in order to find a $q \in Q$ that is most similar to $p$.

For example, we can use the KL divergence between $p$ and $q$, so that $q$ becomes an approximation to the intractable $p$; we can then use $q$ in place of $p$.

$$KL(q\,||\,p) = \sum_x q(x)\log \frac{q(x)}{p(x)}$$
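
As a concrete illustration, here is a minimal sketch (a toy of my own, not from the text or any particular library) that evaluates this KL divergence for two discrete distributions:

```python
import numpy as np

# KL(q || p) for two discrete distributions on the same support,
# straight from the formula KL(q||p) = sum_x q(x) log(q(x)/p(x)).
def kl_divergence(q: np.ndarray, p: np.ndarray) -> float:
    q = q / q.sum()
    p = p / p.sum()
    mask = q > 0  # terms with q(x) = 0 contribute 0 by convention
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

p = np.array([0.1, 0.2, 0.3, 0.4])      # the "intractable" target (pretend)
q = np.array([0.25, 0.25, 0.25, 0.25])  # a candidate approximation
print(kl_divergence(q, p))              # >= 0, and 0 iff q == p
```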

Probabilistic Model

Let's say you have some data denoted by a random variable $X$. $X$ is basically your observed variable (I will explain what's meant by observed). Now, we believe that there is some random variable $Z$ which is responsible for generating $X$, but $Z$ is hidden/latent, i.e. we don't have any information about $Z$ directly. The idea of a latent variable is basically our belief that there is a hidden factor $Z$ behind the generation of our data $X$.

Let's take the example of a factory that manufactures boxes. We have a dataset of the number of boxes produced by the factory each day over a year. This is our observed variable $X$. Why is it called an observed variable? Because we have observed it, i.e. it's given. Now, if $X$ is observed, then something else must be unobserved by contrast; otherwise, why would we specifically call it observed?

Now think of factors that can affect $X$, i.e. the number of boxes produced in a day. There can be multiple factors that influence $X$, such as the number of workers that day, the number of machines operating that day, the workers' morale that day, etc. Let's take the random variable $Z$ to be the number of machines working on a day. One thing to understand is that $Z$ influences $X$: the more machines working, the more boxes produced. But the thing is, we don't have any data regarding $Z$; that's the reason $Z$ is called a hidden/latent variable.

What do we want to achieve?

We want to infer the latent/hidden variable $Z$ given the observed variable $X$: we want to find how many machines were likely to have been working on some day, given the number of boxes produced that day.

How do we do the inference then?

Thanks to Bayes, we have the answer to this. We want to infer $Z$ given the evidence, i.e. the data $X$. This is done using the posterior probability.

Definition of the posterior distribution: it is the probability distribution of a random variable conditioned on some evidence, denoted by $P(Z|X)$.

Using Bayes' theorem:

$$P(Z|X) = \frac{P(X|Z)P(Z)}{P(X)} = \frac{P(X|Z)P(Z)}{\int_Z P(X|Z)P(Z)\,dZ}$$

Calculating the posterior

We have basically three terms:

$P(X|Z)$: Likelihood. This is basically a model of observing $X$ if we had known the latent variable $Z$. How do we get $P(X|Z)$? We assume a model for $X$ given $Z$, i.e. we would know what distribution $X$ follows if we knew $Z$. For example, I can say that $X$ follows a Gaussian with $Z$ as its mean and some constant variance. Or, if you don't know the exact model between $X$ and $Z$, you can parametrize $P(X|Z)$ with some parameter $\theta$ as $P(X|Z,\theta)$ and learn that $\theta$ through different learning techniques. I don't know concretely how this is done and have to see exactly how it works; but let's just assume we know this, and it isn't a problem for now.

$P(Z)$: Prior. This simply denotes your prior belief about $Z$. It is mostly taken to be non-informative; or, if you have some prior belief, use its density function.

$P(X) = \int_Z P(X|Z)P(Z)\,dZ$: Evidence (marginal likelihood). This is the total probability of observing $X$. Think of $X = x$ here as taking some particular value. We can get this by integrating over all possible $Z$, using the likelihood of $X$ for each $Z$. This is basically marginalization over $Z$; think of it in those terms and you will understand what it means.
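
To make these three terms concrete, here is a minimal sketch of the factory example on a discrete grid. The Poisson likelihood and all the numbers are my own assumptions, not anything the text specifies; the point is that with a one-dimensional, discrete $Z$, the evidence is just a sum and the posterior is easy to compute.

```python
import numpy as np
from scipy.stats import poisson

# Z = number of machines running, X = number of boxes produced that day.
# Assumed model (my choice): X | Z ~ Poisson(20 * Z).
z_values = np.arange(1, 11)              # 1..10 machines
prior = np.full(len(z_values), 1 / 10)   # non-informative prior P(Z)
rate_per_machine = 20                    # assumed boxes/machine/day

x_observed = 130                                                       # boxes counted today
likelihood = poisson.pmf(x_observed, mu=rate_per_machine * z_values)   # P(X|Z)
evidence = np.sum(likelihood * prior)                                  # P(X), a plain sum here
posterior = likelihood * prior / evidence                              # P(Z|X) by Bayes' theorem

for z, pz in zip(z_values, posterior):
    print(f"P(Z={z:2d} | X={x_observed}) = {pz:.3f}")  # peaks around Z = 6-7
```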

Problem of the Intractable Posterior

The problem here is an intractable integral: calculating the integral for $P(X)$ is, many times, intractable. For example, in our case $Z$ is the number of machines running on a day. Here $Z$ is one-dimensional, hence calculating the integral is possible. But consider $Z$ being a $d$-dimensional vector; in this case the integral becomes intractable, as we have to integrate over all $d$ dimensions: $\int_{z_1}\int_{z_2}\cdots\int_{z_d} P(X|z_1,z_2,\dots,z_d)\,P(z_1,z_2,\dots,z_d)$. Calculating this is very difficult. For example, say you have $n$ facial images of $K$ different persons, but the data contains only the images; the identities are not given. Here $X$ is a given image and $Z$ is the identity of the person the image belongs to. What does it mean to calculate the posterior $P(Z|X)$? It simply means finding the identity of the person given the image. And what does it mean to calculate $P(X) = \int_Z P(X|Z)P(Z)$? It means integrating over the identities $Z$, which are $d$-dimensional vectors; in words, taking each person, finding the probability of getting that facial image from that person, and summing over all persons. This is intractable because of the large dimension of the identity.

So, in most cases where $Z$ is a high-dimensional vector, it becomes intractable to calculate the posterior distribution.

Solution: Variational Inference

The idea is really simple: if we can't get a tractable closed-form solution for $P(Z|X)$, we'll approximate it.

Let the approximation be $Q(Z;\phi)$; we can now pose this as an optimization problem:

$$\phi^* = \arg\min_{\phi}\, KL[\,Q(Z;\phi)\,||\,P(Z|X)\,]$$

By choosing a family of distributions $Q(Z;\phi)$ flexible enough to model $P(Z|X)$ and optimizing over $\phi$, we can push the approximation towards the real posterior.

Now let's expand the KL-divergence term:

$$\begin{aligned} KL[\,Q(Z;\phi)\,||\,P(Z|X)\,] &= E_Q[\log Q(Z;\phi)] - E_Q[\log P(Z|X)] \\ &= E_Q[\log Q(Z;\phi)] - E_Q\left[\log \frac{P(X,Z)}{P(X)}\right] \\ &= E_Q[\log Q(Z;\phi)] - E_Q[\log P(X,Z)] + \log P(X) \end{aligned}$$

We can compute the first two terms in the above expansion, but oh lord! The third term is the same annoying (intractable) integral we were avoiding before. What do we do now? This seems to be a deadlock!

The Evidence Lower BOund (ELBO)

Please recall that our original objective was a minimization problem over $Q(\cdot;\phi)$. We can pull a little trick here: we optimize only the first two terms and ignore the third term. How?

Because the third term is independent of $Q(\cdot;\phi)$. So, we just need to minimize

$$E_Q[\log Q(Z;\phi)] - E_Q[\log P(X,Z)]$$

Or equivalently, maximize (just flip the two terms)

$$ELBO(Q) \triangleq E_Q[\log P(X,Z)] - E_Q[\log Q(Z;\phi)]$$

This term, usually defined as the ELBO, is quite famous in the VI literature, and you have just witnessed what it looks like and where it came from. Taking a deeper look into $ELBO(\cdot)$:

$$\begin{aligned} ELBO(Q) &= E_Q[\log P(X|Z)] + E_Q[\log P(Z)] - E_Q[\log Q(Z;\phi)] \\ &= E_Q[\log P(X|Z)] - KL[\,Q(Z;\phi)\,||\,P(Z)\,] \end{aligned}$$

Now, please consider looking at the last equation for a while, because it is what all our efforts have led us to. The last equation is totally tractable and also solves our problem. What it basically says is that maximizing $ELBO(\cdot)$ (which is a proxy objective for our original optimization problem) is equivalent to maximizing the conditional data likelihood (which we can choose in our graphical-model design) while simultaneously pushing our approximate posterior $Q(Z;\phi)$ towards a prior over $Z$. The prior $P(Z)$ is basically how the true latent space is organized. The immediate question might arise: "Where do we get $P(Z)$ from?" The answer is that we can just choose any distribution as a hypothesis; it will be our belief of how the $Z$ space is organized.

There is one more interpretation of the KL-divergence expansion that is interesting to us (see Fig. 5). Rewriting the KL expansion and substituting the $ELBO(\cdot)$ definition, we get

$$\log P(X) = ELBO(Q) + KL[\,Q(Z;\phi)\,||\,P(Z|X)\,]$$

Fig. 5: Interpretation of ELBO

As we know that $KL(\cdot\,||\,\cdot) \ge 0$ for any two distributions, the following inequality holds:

$$\log P(X) \ge ELBO(Q)$$

So, the $ELBO(\cdot)$ that we vowed to maximize is a lower bound on the observed-data log-likelihood. That's amazing, isn't it! Just by maximizing the $ELBO(\cdot)$, we implicitly get closer to our dream of maximum (log-)likelihood estimation: the tighter the bound, the better the approximation.
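
As a quick sanity check, here is a tiny numeric illustration of the bound on a two-state latent variable (all numbers are made up purely for illustration): the ELBO stays below the exact $\log P(X)$ for an arbitrary $Q$, and matches it exactly when $Q$ is the true posterior.

```python
import numpy as np

# Tiny numeric check of log P(X) >= ELBO(Q) on a two-state latent variable.
p_z = np.array([0.3, 0.7])            # prior P(Z)
p_x_given_z = np.array([0.9, 0.2])    # likelihood P(X = x_obs | Z)
log_px = np.log(np.sum(p_x_given_z * p_z))          # exact log-evidence

q = np.array([0.5, 0.5])              # an arbitrary approximate posterior
elbo = np.sum(q * (np.log(p_x_given_z * p_z) - np.log(q)))
print(log_px, elbo)                   # elbo < log_px: the bound holds

q_true = p_x_given_z * p_z / np.exp(log_px)         # exact posterior P(Z|X)
elbo_tight = np.sum(q_true * (np.log(p_x_given_z * p_z) - np.log(q_true)))
print(elbo_tight)                     # equals log_px: the bound is tight
```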

Okay! Way too much math for today. This is, overall, what Variational Inference looks like.
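
To make the whole pipeline concrete, here is a minimal end-to-end VI sketch with a toy model of my own choosing (none of it is prescribed by the text): a Gaussian prior on a latent mean $Z$, a Gaussian likelihood, and a Gaussian $Q(Z;\phi)$, fitted by maximizing a Monte Carlo estimate of the ELBO with PyTorch's reparameterized sampling.

```python
import torch

# Toy model: prior Z ~ N(0, 1), likelihood x_i | Z ~ N(Z, 1).
# We fit Q(Z; phi) = N(mu, sigma^2) by maximizing a Monte Carlo estimate of
# ELBO(Q) = E_Q[log P(X|Z)] - KL[Q(Z;phi) || P(Z)].
torch.manual_seed(0)
x = torch.randn(50) + 2.0                  # data generated around Z = 2

mu = torch.zeros(1, requires_grad=True)    # phi = (mu, log_sigma)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)
prior = torch.distributions.Normal(0.0, 1.0)

for step in range(500):
    q = torch.distributions.Normal(mu, log_sigma.exp())
    z = q.rsample((32,))                   # reparameterized samples keep gradients
    log_lik = torch.distributions.Normal(z, 1.0).log_prob(x).sum(-1).mean()
    kl = torch.distributions.kl_divergence(q, prior).sum()
    loss = -(log_lik - kl)                 # maximize ELBO = minimize -ELBO
    opt.zero_grad()
    loss.backward()
    opt.step()

# This toy model is conjugate: the exact posterior is N(sum(x)/51, 1/51),
# so mu and sigma should land close to it.
print(mu.item(), log_sigma.exp().item())
```

Since the toy model is conjugate, the exact posterior is available in closed form, which makes it easy to verify that the optimization actually converged to the right place.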

Resources


https://ayandas.me/blogs/2019-11-20-inference-in-pgm.html
https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
https://blog.evjang.com/2016/08/variational-bayes.html