🧠
AI
  • Artificial Intelligence
  • Intuitive Maths behind AI
    • Probability
    • Information Theory
    • Linear Algebra
    • Calculus
  • Overview
  • Research Ideas and Philosophy
  • Basic Principles
  • Information Theory
    • Entropy
    • Log Probability
  • Probability & Statistics
    • Random Variables
    • Probability
      • Probablistic Equations
      • Bayes Theorem
      • Probability Distributions & Processes
    • Statistics
      • Measures
      • Z-Scores
      • Covariance and Correlation
      • Correlation vs Dependance
    • Mahalanobis vs Chi-Squared
    • Uncertainty
    • Statistical Inference
      • Graphical Models
      • Estimator vs Parameter
      • Estimation
      • Bayesian/Probabilistic Inference
        • Probabilistic Modelling
        • Problems of Bayesian Inference
        • Conjugate Priors
        • Dirichlet Distribution/Process
        • Posterior Predictive Distribution
      • Sampling-Based Inference
    • Sampling
      • Rejection Sampling
      • Reservoir Sampling
      • Thompson Sampling
    • Bayesian Inference
    • Regression
    • Markov
    • Monte Carlo
      • Monte Carlo Estimators
      • Importance Sampling
    • Kernel Density Estimation
    • Gaussian Processes
    • Gaussian Soap Bubble
  • Linear Algebra
    • Vector Space and Matrices
    • Geometry of System of Linear Equations
    • Determinants
    • Transformations
    • Geometrical Representation
    • Positive (Semi)Definite Matrices
    • Matrix Interpretation
    • Dot Product as Linear Transformation and Duality of Vector-Linear Transformation
    • Norms
    • Linear Least Square
    • Matrix Decomposition
      • QR Decomposition
      • Cholesky Decomposition
      • Eigen Value Decomposition
      • SVD - Singular Value Decomposition
    • Matrix Inversion
    • Matrix Calculus
    • Matrix Cookbook
    • Distributed Matrix Algebra
    • High Dimensional Spaces
  • Optimization
    • Derivatives
      • Partial Derivative
      • Directional Derivative
      • Gradient
      • Jacobian
    • Regularization
    • Gradient Descent
    • Newton's Method
    • Gauss-Newton
    • Levenberg–Marquardt
    • Conjugate Gradient
    • Implicit Function Theorem for optimization
    • Lagrange Multiplier
    • Powell's dog leg
    • Laplace Approximation
    • Cross Entropy Method
  • Statistical Learning Theory
    • Expectation Maximization
  • Machine Learning
    • Clustering
    • Bias Variance Trade-off
  • Deep Learning
    • PreProcessing
    • Convolution Arithmetic
    • Regularization
    • Optimizers
    • Loss function
    • Activation Functions
    • Automatic Differentiation
    • Softmax Classifier and Cross Entropy
    • Normalization
    • Batch Normalization
    • Variational Inference
    • VAE: Variational Auto-Encoders
    • Generative vs Discriminative
      • Generative Modelling
    • Making GANs train
    • Dimensionality of Layer Vs Number of Layers
    • Deep learning techniques
    • Dilated Convolutions
    • Non-Maximum Suppression
    • Hard Negative Mining
    • Mean Average Precision
    • Fine Tuning or Transfer Learning
    • Hyper-parameter Tuning
  • Bayesian Deep Learning
    • Probabilistic View
    • Uncertainty
    • Variational Inference for Bayesian Neural Network
  • Reinforcement Learning
    • General
    • Multi-armed Bandit
    • Imitation Learning
    • MDP Equations
    • Solving MDP with known Model
    • Value Iteration
    • Model Free Prediction and Control
    • Off Policy vs On Policy
    • Control & Planning from RL perspective
    • Deep Reinforcement Learning
      • Value Function Approximation
      • Policy Gradient
        • Algorithms
    • Multi Agent Reinforcement Learning
    • Reinforcement Learning - Sutton and Barto
      • Chapter 3: Finite Markov Decision Processes
      • Chapter 4: Dynamic Programming
    • MBRL
    • Representation and Policy Learning for Robotic Control
      • TDMPC
  • Transformers
    • Tokenziation
    • Embedding
      • Word Embedding
      • Positional Encoding
    • Encoder
    • Decoder
    • Multi-head Attention Block
    • Block Causal Attention
    • Time Complexities of Self-Attention
    • KV Cache
    • Multi-head Latent Attention
    • Speculative Decoding
    • Flash Attention
    • Metrics
  • LLMs
    • LLM Techniques
    • LLM Post-training
    • Inference/Test Time Scaling
    • Reasoning Models
    • Reward Hacking
  • SSL, ViT, Latest vision learning spring
    • Images vs Text
    • Vision Transformers ViT
    • CLIP
    • Masked Autoencoders
    • DALL-E
    • V-JEPA / V-JEPA 2
  • Diffusion Models
    • ImageGen
    • Denoising Autoencoders
  • Distributed Training
  • State Space Models
  • RLHF
  • Robotics
    • Kalman Filter
    • Unscented Kalman Filter
  • Game Theory and ML
    • 1st Lecture - 19/01
    • Lecture 2 - 22/01
    • Lecture 4: Optimization
  • Continual Learning
    • Lecture - 21/01
    • iCaRL: Incremental Classifier and Representation Learning
    • Variational Continual Learning
  • Computer Vision
    • Hough Transform
    • Projective Geometry
      • Extrinsic and Intrinsic Parameters
      • Image Rectification
    • Tracking
    • Optical Flow
    • Harris Corner
    • Others
  • Papers
    • To Be Read
    • Probabilistic Object Detection and Uncertainty Estimation
      • BayesOD
      • Leveraging Heteroscedastic Aleatoric Uncertainties for Robust Real-Time LiDAR 3D Object Detection
      • Gaussian YOLOv3
      • Dropout Sampling for Robust Object Detection in Open-Set Condition
      • *Sampling Free Epistemic Uncertainty Estimation using Approximated Variance Propagation
      • Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics
      • Can We Trust You? On Calibration of Probabilistic Object Detector for Autonomous Driving
    • Object Detection
    • Temporal Fusion in Object Detection/ Video Object Detection
    • An intriguing failing of convolutional neural networks and the CoordConv solution
    • A Neural Algorithm of Artistic Style - A.Gatys
  • Deep Learning Book
    • Chapter 4: Optimization
    • Chapter 5: Machine Learning Basics
    • Chapter 6: Deep FeedForward Networks
  • Project Euler
  • Python
    • Decorators
    • Packages
      • Pip
    • Gotchas
    • Async functions
  • Computer Science
  • TensorFlow
  • Pytorch
    • RNN/LSTM in Pytorch
    • Dataset/ Data loader
    • Resuming/Loading Saved model
  • Programming
    • Unit Testing
    • How to write code
  • General Software Engineering
    • SSH tunneling and Ngrok
  • How To Do Research
  • Resources
  • ROS for python3
  • Kitti
Powered by GitBook
On this page
  1. Probability & Statistics
  2. Statistical Inference
  3. Bayesian/Probabilistic Inference

Conjugate Priors

PreviousProblems of Bayesian InferenceNextDirichlet Distribution/Process

Last updated 5 years ago

CtrlK
  • Definition
  • Pseudo-observations/Priors Hyperparameters.
  • Resources

Definition

In Bayesian probability theory, if the posterior distributions p(θ | x) are in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. For example, the Gaussian family is conjugate to itself (or self-conjugate) with respect to a Gaussian likelihood function: if the likelihood function is Gaussian, choosing a Gaussian prior over the mean will ensure that the posterior distribution is also Gaussian. This means that the Gaussian distribution is a conjugate prior for the likelihood that is also Gaussian.

A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior; otherwise numerical integration may be necessary. Further, conjugate priors may give intuition, by more transparently showing how a likelihood function updates a prior distribution.

Consider the general problem of inferring a (continuous) distribution for a parameter θ given some datum or data x. From Bayes' theorem, the posterior distribution is equal to the product of the likelihood function and prior , normalized (divided) by the probability of the data :

Let the likelihood function be considered fixed; the likelihood function is usually well-determined from a statement of the data-generating process, hence this liklihood faunction is also called sampling distribution. It is clear that different choices of the prior distribution p(θ) may make the integral more or less difficult to calculate, and the product p(x|θ) × p(θ) may take one algebraic form or another. For certain choices of the prior, the posterior has the same algebraic form as the prior (generally with different parameter values). Such a choice is a conjugate prior.

Check above to see how conjugate prior helps in analytically solve the posterior.

Pseudo-observations/Priors Hyperparameters.

Priors Hyper parameters are the parameters of the prior distribution. For example, α\alphaαvector in dirichlet distribution, (α,β)(\alpha, \beta)(α,β)in Beta Distribution, etc.

It is often useful to think of the hyperparameters of a conjugate prior distribution as corresponding to having observed a certain number of pseudo-observations with properties specified by the parameters. In general, for nearly all conjugate prior distributions, the hyperparameters can be interpreted in terms of pseudo-observations. This can help both in providing an intuition behind the often messy update equations, as well as to help choose reasonable hyperparameters for a prior.

Look at the example here to understand: https://en.wikipedia.org/wiki/Conjugate_prior

In the first answer, see how α\alphaα represents Pseudo-count in Dirichlet Distribution. Basically, choosing the correct value of α\alphaαallows you to get the right prior about the parameters. Hence, those α\alphaαbasically encoded the prior information you have. In the above exmample, and guess of how many balls of each color are there in the bag decided your prior and represents by appropriate value of α\alphaα.

Resources

{\displaystyle {\begin{aligned}p(\theta \mid x)&={\frac {p(x\mid \theta )\,p(\theta )}{p(x)}}\\&={\frac {p(x\mid \theta )\,p(\theta )}{\int p(x\mid \theta ')\,p(\theta ')\,d\theta '}}\end{aligned}}}
\theta \mapsto p(x\mid \theta )\!
p(\theta )\!
p(x)\!
What exactly is the alpha in the Dirichlet distribution?Cross Validated
Analytical Bayesian Inference with Conjugate PriorsMedium
Logo
Logo