Bayesian/Probabilistic Inference

Notes on Bayesian Inference

Last updated 5 years ago

Bayesian inference refers to statistical inference where uncertainty in inferences is quantified using probability. In classical frequentist inference, model parameters and hypotheses are considered to be fixed. Probabilities are not assigned to parameters or hypotheses in frequentist inference. For example, it would not make sense in frequentist inference to directly assign a probability to an event that can only happen once, such as the result of the next flip of a fair coin. However, it would make sense to state that the proportion of heads approaches one-half as the number of coin flips increases.

Statistical models specify a set of statistical assumptions and processes that represent how the sample data is generated. Statistical models have a number of parameters that can be modified. For example, a coin can be represented as samples from a Bernoulli distribution, which models two possible outcomes. The Bernoulli distribution has a single parameter equal to the probability of one outcome, which in most cases is the probability of landing on heads. Devising a good model for the data is central in Bayesian inference. In most cases, models only approximate the true process, and may not take into account certain factors influencing the data. In Bayesian inference, probabilities can be assigned to model parameters. Parameters can be represented as random variables. Bayesian inference uses Bayes' theorem to update probabilities after more evidence is obtained or known.

What is Bayesian Inference?

Bayesian inference is about inferring the probability of some event $A$ given another event $B$, using Bayes' theorem:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

This is the usual way of stating Bayes' theorem for two events. Now let us see it in terms of a hypothesis (belief) and observed data.

$$P(h|D) = \frac{P(D|h)P(h)}{P(D)}$$

where $h$ denotes a hypothesis and $D$ denotes the observed data. Using Bayes' theorem, we want to find the distribution over hypotheses given the observed data, i.e. the probability of each hypothesis being true given the data.

Explanation: Here $P(h)$ is our prior, i.e. our belief about some statistic before seeing any data. $P(D|h)$ is the likelihood: it represents how probable the actual observed data are under the hypothesis, and is also called the sampling distribution for this reason. $P(h|D)$ is the posterior: it represents our updated belief about the statistic after observing the data $D$.

Note: Here $P(D)$ denotes the distribution of the data. As $P(D)$ doesn't depend upon $h$, it does not provide any information to update the hypothesis; it only normalises the posterior so that it sums to one.

- Make sure to understand the graph shown between 16:50-19:13 in the video. After 19:20 the presenter explains how to use the prior $P(h)$, which ultimately affects the calculation of $P(h|D)$.
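As a rough illustration of the update above, here is a minimal sketch for a coin with three candidate hypotheses about its heads-probability. The hypothesis values, the uniform prior, the observed counts, and the function name `posterior_over_hypotheses` are all assumptions made up for this example; the likelihood is taken to be binomial.

```python
from math import comb

def posterior_over_hypotheses(hypotheses, prior, heads, flips):
    """Return P(h|D) for each candidate heads-probability h."""
    # Likelihood P(D|h): binomial probability of the observed head count.
    likelihood = [comb(flips, heads) * h**heads * (1 - h)**(flips - heads)
                  for h in hypotheses]
    # Numerator of Bayes' theorem: P(D|h) * P(h).
    unnormalised = [l * p for l, p in zip(likelihood, prior)]
    # Evidence P(D): the sum over all hypotheses; it does not depend on h.
    evidence = sum(unnormalised)
    return [u / evidence for u in unnormalised]

hypotheses = [0.25, 0.5, 0.75]   # hypothetical candidate coin biases
prior = [1 / 3, 1 / 3, 1 / 3]    # uniform prior belief (assumed)
post = posterior_over_hypotheses(hypotheses, prior, heads=8, flips=10)
for h, p in zip(hypotheses, post):
    print(f"P(h={h} | D) = {p:.3f}")
```

With 8 heads in 10 flips, most of the posterior mass should move to the hypothesis closest to the observed frequency, while the evidence term only rescales the three numbers.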

Bayesian vs Non-Bayesian Estimates

Generally, when using the maximum likelihood estimate we implicitly assume a uniform prior, i.e. we do not assume any belief about the hypothesis $h$: all hypotheses in the hypothesis space are equally likely. Under a uniform prior the posterior is proportional to the likelihood, so the most probable $h$ found using Bayes' theorem is the $h$ for which the likelihood is maximum. This $h$ is called the maximum likelihood estimate, and the method is called maximum likelihood estimation (MLE). MLE is a non-Bayesian estimate, as it doesn't make use of any prior.

But many times we have some bias towards specific hypotheses, i.e. we have a prior, such as knowing that the value of the hypothesis will always be in some range $[a,b]$, which means $P(h<a) = 0$ and $P(h>b) = 0$. When we use a prior and don't assume it to be uniform, we use both the likelihood and the prior to choose the hypothesis $h$; this is called Maximum A Posteriori (MAP) estimation. This is a Bayesian estimate.

Bayesian estimates are narrower than non-Bayesian ones because the prior makes the distribution more confined. When the prior is well chosen, they generally give a tighter and more accurate distribution: using beliefs allows us to get a more confident answer. But a Bayesian estimate can also be wrong, for example when the prior is wrong. If you said some values are impossible, the Bayesian estimate will never consider those values, even though they may be possible and should not have probability zero.
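The difference between the two estimates can be sketched for a coin. This example assumes a Bernoulli likelihood and a conjugate Beta(a, b) prior, neither of which is fixed by the notes above; the function names and the numbers are hypothetical. Under these assumptions the MAP estimate is the mode of the Beta posterior.

```python
def mle(heads, flips):
    """Maximum likelihood estimate of the heads-probability."""
    return heads / flips

def map_estimate(heads, flips, a, b):
    """Mode of the Beta(heads + a, flips - heads + b) posterior.

    Assumes a Beta(a, b) prior with a, b >= 1 and flips + a + b > 2.
    """
    return (heads + a - 1) / (flips + a + b - 2)

heads, flips = 3, 3
print("MLE:", mle(heads, flips))                                   # likelihood only
print("MAP, Beta(2, 2) prior:", map_estimate(heads, flips, 2, 2))  # prior pulls towards 0.5
print("MAP, Beta(1, 1) prior:", map_estimate(heads, flips, 1, 1))  # uniform prior
```

After three heads in three flips the MLE commits to a heads-probability of 1, while the Beta(2, 2) prior pulls the MAP estimate back towards 0.5. With the uniform Beta(1, 1) prior the two estimates coincide.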

Note: Let's say you are using Bayesian inference, but your prior is uninformative, i.e. it is uniform or doesn't really favour any parameter/hypothesis. Is Bayesian inference then any better than non-Bayesian estimation? It seems that Bayesian inference is mainly useful when you actually have some prior information, and that an uninformative prior adds little.

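Bayes factor: the ratio of the likelihoods of the observed data under two competing hypotheses, $P(D|h_1)/P(D|h_2)$. It measures how strongly the observed data favour one hypothesis over the other, i.e. how much we learned about the hypotheses from the data.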
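A minimal sketch of a Bayes factor, taken here in its usual form as the ratio of the likelihoods of the data under two hypotheses. The coin hypotheses, the observed counts, and the function names are assumptions for illustration only.

```python
from math import comb

def binomial_likelihood(h, heads, flips):
    """P(D|h): probability of the observed head count if P(heads) = h."""
    return comb(flips, heads) * h**heads * (1 - h)**(flips - heads)

def bayes_factor(h1, h2, heads, flips):
    """Ratio P(D|h1) / P(D|h2); values above 1 favour h1."""
    return binomial_likelihood(h1, heads, flips) / binomial_likelihood(h2, heads, flips)

# Hypothetical comparison: biased coin (0.75) versus fair coin (0.5).
bf = bayes_factor(0.75, 0.5, heads=8, flips=10)
print(f"Bayes factor (h=0.75 vs h=0.5): {bf:.2f}")
```

A value above 1 means the data support the first hypothesis more than the second; a value of exactly 1 means the data do not distinguish between them.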

Bayesian Inference for Parameters given Data

Instead of $h$, you can have $\boldsymbol{\theta}$, which is the parameter vector of some model/distribution that you think might have generated the data $D$. So we get

$$p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)}$$

Let's say you have $D = \{y_1, y_2, \ldots, y_n\}$, and this data is generated by some model/distribution whose parameters you don't know. You generally make an assumption about the family of distributions from which the data might have been sampled, e.g. a Gaussian distribution or a Poisson distribution. In the Gaussian case, $\theta = [\mu, \sigma]$ will be your unknown parameters.

Now your parameter $\theta$ is a random variable whose distribution you want to estimate given the data $D$, which you can do using the Bayesian inference equation above. $p(\theta)$ is the prior distribution, i.e. your belief about the random variable $\theta$ before seeing the data $D$.

Treating the $y_i$ as i.i.d., the likelihood $p(D|\theta)$ factorises as

$$p(D|\theta) = \prod_{i=1}^n p(y_i|\theta)$$

This is a likelihood and not a pdf because the $y_i$ are fixed observations rather than random variables, while $\theta$ is the random variable. You just evaluate the likelihood of each $y_i$ under $\theta$.
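To make the "function of $\theta$" view concrete, here is a small sketch that evaluates the i.i.d. Gaussian log-likelihood of a fixed data set on a grid of candidate means. The data values, the known `sigma`, and the grid are assumptions chosen for illustration; logs are used so the product over $y_i$ becomes a sum.

```python
import math

def gaussian_log_likelihood(data, mu, sigma):
    """log p(D|theta) for i.i.d. Gaussian data with theta = [mu, sigma]."""
    # The product over y_i becomes a sum of log-densities.
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (y - mu)**2 / (2 * sigma**2)
               for y in data)

data = [4.8, 5.1, 5.3, 4.9, 5.4]             # fixed observations (hypothetical)
mu_grid = [4.0 + 0.1 * i for i in range(21)]  # candidate means from 4.0 to 6.0

# The data stay fixed; only theta varies, so this is a function of mu.
log_liks = [gaussian_log_likelihood(data, mu, sigma=0.5) for mu in mu_grid]
best_mu = mu_grid[log_liks.index(max(log_liks))]
print(f"grid value of mu with highest likelihood: {best_mu:.1f}")
```

The grid point with the highest likelihood is the one closest to the sample mean, which is the maximum likelihood estimate of $\mu$ for a Gaussian.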

Calculating posterior as weighted average of priors

Let's look at an interesting way to understand the posterior.

Here you already have several priors (candidate hypotheses) which you think can explain your data. Instead of choosing a single one, which is the case in point estimation, you assign a score to each of them that denotes how likely it is, i.e. what its probability is. This score is nothing but the likelihood of the data calculated under that candidate. To turn the scores into a probability distribution, you normalise them.
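The scoring-and-normalising idea can be sketched as follows, assuming Gaussian data with a known `sigma`, three hypothetical candidate means, and equal prior weight on each candidate. Scores are computed in log space and shifted by the maximum before exponentiating, which is only a numerical-stability choice.

```python
import math

def gaussian_log_likelihood(data, mu, sigma):
    """log p(D|mu) for i.i.d. Gaussian observations with known sigma."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (y - mu)**2 / (2 * sigma**2)
               for y in data)

def normalised_weights(data, candidate_mus, sigma):
    """Score each candidate by its likelihood, then normalise to sum to 1."""
    log_scores = [gaussian_log_likelihood(data, mu, sigma) for mu in candidate_mus]
    top = max(log_scores)                             # shift to avoid underflow
    scores = [math.exp(s - top) for s in log_scores]
    total = sum(scores)
    return [s / total for s in scores]

data = [4.8, 5.1, 5.3, 4.9, 5.4]   # hypothetical observations
candidates = [4.0, 5.0, 6.0]       # hypothetical candidate means
weights = normalised_weights(data, candidates, sigma=0.5)

# Weighted average of the candidates under the normalised scores.
posterior_mean = sum(w * mu for w, mu in zip(weights, candidates))
print("weights:", [round(w, 4) for w in weights])
print("weighted average of candidates:", round(posterior_mean, 3))
```

Rather than committing to one candidate, every candidate contributes to the final answer in proportion to how well it explains the data.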

How to use Bayesian Learning principles in Deep Learning?

Resources

We can use the Bayes factor to tell which hypothesis is more likely to be true given the data.

$p(D|\theta)$ is your likelihood, because it tells you how likely it is to observe the data given the parameters $\theta$. One thing to note here is that this likelihood is not a pdf but only a function of $\theta$: in $p(D|\theta)$, $D$ is fixed observed data and not a random variable. Still, the value of $p(D|\theta)$ gives you the likelihood (probability) of observing $D$ as a function of $\theta$. Treating the $y_i$ as i.i.d., it can be written as $p(D|\theta) = \prod_{i=1}^n p(y_i|\theta)$.

- Complete Bayesian Inference Notes.

https://youtu.be/5NMxiOGL39M
https://youtu.be/9TDjifpGj-k?list=PL8dPuuaLjXtNM_Y-bUAhblSAdWRnmBUcr
https://brohrer.github.io/how_bayesian_inference_works.html
https://vioshyvo.github.io/Bayesian_inference/index.html
How Bayesian inference works.pdf (4MB): slides for the video