Dirichlet Distribution/Process


Dirichlet Distribution

Dirichlet distributions are commonly used as prior distributions in Bayesian statistics, and in fact the Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution.

The Dirichlet distribution of order $K \geq 2$ with parameters $\alpha_1, \ldots, \alpha_K > 0$ has a probability density function with respect to Lebesgue measure on the Euclidean space $\mathbb{R}^{K-1}$ given by

$$f\left(x_1, \ldots, x_K; \alpha_1, \ldots, \alpha_K\right) = \frac{1}{\mathrm{B}(\boldsymbol{\alpha})} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$$

where $\{x_k\}_{k=1}^{K}$ belong to the standard $(K-1)$-simplex, or in other words:

$$\sum_{i=1}^{K} x_i = 1 \quad \text{and} \quad x_i \geq 0 \quad \text{for all } i \in [1, K]$$

Here $\boldsymbol{\alpha}$ can be interpreted as a "prior observation count". See Pseudo-Observations.

The normalizing constant is the multivariate beta function, which can be expressed in terms of the gamma function:

$$\mathrm{B}(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}, \qquad \boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K).$$
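
As a quick check of this formula, here is a minimal sketch (not from these notes; it assumes NumPy and SciPy are available) that evaluates the density once directly from the expression above and once with `scipy.stats.dirichlet`:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import dirichlet

alpha = np.array([2.0, 4.0, 10.0])   # concentration parameters, alpha_i > 0
x = np.array([0.2, 0.3, 0.5])        # a point on the standard simplex (sums to 1)

# log of the multivariate beta function: B(alpha) = prod Gamma(alpha_i) / Gamma(sum alpha_i)
log_B = gammaln(alpha).sum() - gammaln(alpha.sum())

# log-density: -log B(alpha) + sum_i (alpha_i - 1) * log x_i
log_pdf_manual = -log_B + ((alpha - 1) * np.log(x)).sum()

print(np.exp(log_pdf_manual))   # density from the formula above
print(dirichlet.pdf(x, alpha))  # same value via scipy
```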

As Conjugate Prior

The Dirichlet distribution is the conjugate prior distribution of the categorical distribution (a generic discrete probability distribution with a given number of possible outcomes) and multinomial distribution (the distribution over observed counts of each possible category in a set of categorically distributed observations). This means that if a data point has either a categorical or multinomial distribution, and the prior distribution of the distribution's parameter (the vector of probabilities that generates the data point) is distributed as a Dirichlet, then the posterior distribution of the parameter is also a Dirichlet. Intuitively, in such a case, starting from what we know about the parameter prior to observing the data point, we then can update our knowledge based on the data point and end up with a new distribution of the same form as the old one. This means that we can successively update our knowledge of a parameter by incorporating new observations one at a time, without running into mathematical difficulties.

Formally, this can be expressed as follows. Given a model

$$\begin{aligned}
\boldsymbol{\alpha} &= (\alpha_1, \ldots, \alpha_K) &&= \text{concentration hyperparameter} \\
\mathbf{p} \mid \boldsymbol{\alpha} &= (p_1, \ldots, p_K) &&\sim \operatorname{Dir}(K, \boldsymbol{\alpha}) \\
\mathbb{X} \mid \mathbf{p} &= (\mathbf{x}_1, \ldots, \mathbf{x}_K) &&\sim \operatorname{Cat}(K, \mathbf{p})
\end{aligned}$$

then the following holds:

$$\begin{aligned}
\mathbf{c} &= (c_1, \ldots, c_K) &&= \text{number of occurrences of category } i \\
\mathbf{p} \mid \mathbb{X}, \boldsymbol{\alpha} &\sim \operatorname{Dir}(K, \mathbf{c} + \boldsymbol{\alpha}) &&= \operatorname{Dir}\left(K, c_1 + \alpha_1, \ldots, c_K + \alpha_K\right)
\end{aligned}$$

This relationship is used in Bayesian statistics to estimate the underlying parameter p of a categorical distribution given a collection of N samples. Intuitively, we can view the hyperprior vector α as pseudocounts, i.e. as representing the number of observations in each category that we have already seen (or guessed). Then we simply add in the counts for all the new observations (the vector c) in order to derive the posterior distribution.
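
A hedged sketch of this update in NumPy (the prior and the observations below are made up for illustration): the posterior pseudocounts are just the prior pseudocounts plus the observed counts.

```python
import numpy as np

K = 6
alpha_prior = np.ones(K)                  # Dir(1,...,1): a uniform prior over p
observations = [2, 2, 0, 5, 2, 1, 3, 2]   # observed categories (0-indexed faces)

counts = np.bincount(observations, minlength=K)   # c = occurrences per category
alpha_post = alpha_prior + counts                 # posterior is Dir(K, c + alpha)

# Posterior mean of p_i is (c_i + alpha_i) / sum_j (c_j + alpha_j)
posterior_mean = alpha_post / alpha_post.sum()
print(alpha_post, posterior_mean)
```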

In Bayesian mixture models and other hierarchical Bayesian models with mixture components, Dirichlet distributions are commonly used as the prior distributions for the categorical variables appearing in the models.

Dirichlet for Bayesian inference for Categorical/Multinomial Distribution

Intuition of Dirichlet Distribution

Let's say you have a biased die, i.e. the probability of each number (class) is not equal. Rolling it once gives a categorical distribution and rolling it multiple times gives a multinomial distribution, in both cases with unknown parameters.

Now you want to estimate the parameters of that categorical/multinomial distribution, i.e. what is the probability of each face of the die? The parameters of the multinomial distribution are given as

$$\boldsymbol{p} = [p_1, p_2, \ldots, p_k] \quad \text{where} \quad \sum_i p_i = 1$$

$p_i$ denotes the probability of the output belonging to class $i$.

So now, how would you estimate the parameters $p_i$, which are simply the probabilities of each class $i$?

Solution:

Roll the die many times, say $n = 30$ times, and record the frequency of each output (class). Suppose we got outputs with frequencies $\alpha_1=2, \alpha_2=4, \alpha_3=10, \alpha_4=4, \alpha_5=2, \alpha_6=8$. Now, what might the value of the parameter $p_3$ be? A natural answer is $\frac{\alpha_3}{\sum_j \alpha_j} = \frac{10}{30} = \frac{1}{3} \approx 0.33$. So we are estimating the parameters, which are random variables here, from these simulated rolls. But note that $p_3 = 0.33$ is just an estimate, i.e. $0.33$ is not the only possible value for $p_3$. Hence you can associate a probability distribution to each $p_i$ based on $\boldsymbol{\alpha}$. This probability distribution of $\boldsymbol{p}$ based on $\boldsymbol{\alpha}$ is nothing but the Dirichlet distribution.

To be precise, $0.33$ is the mean of the random variable $p_3$. So, for every $i$ we have $E[p_i] = \frac{\alpha_i}{\sum_j \alpha_j}$,

and the complete distribution is given by

$$\operatorname{Dir}(\boldsymbol{p} \mid \alpha_1, \alpha_2, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} p_i^{\alpha_i - 1}$$
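
A small illustrative sketch of this die example (assuming NumPy), treating the observed frequencies as Dirichlet parameters and sampling to see that $p_3$ is a random variable whose mean is about $1/3$:

```python
import numpy as np

alpha = np.array([2, 4, 10, 4, 2, 8], dtype=float)   # frequencies from n = 30 rolls

# Draw many plausible parameter vectors p ~ Dir(alpha)
samples = np.random.default_rng(0).dirichlet(alpha, size=100_000)

print(alpha[2] / alpha.sum())   # analytical mean E[p_3] = 10/30 = 0.333...
print(samples[:, 2].mean())     # empirical mean of p_3, close to 0.333
print(samples[:, 2].std())      # spread: p_3 is a random variable, not a point
```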

Dirichlet Process

A Dirichlet process is a probability distribution whose range is itself a set of probability distributions. It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables, that is, how likely it is that the random variables are distributed according to one or another particular distribution.

The Dirichlet process is specified by a base distribution $H$ and a positive real number $\alpha$ called the concentration parameter (also known as the scaling parameter). The base distribution is the expected value of the process, i.e., the Dirichlet process draws distributions "around" the base distribution the way a normal distribution draws real numbers around its mean. However, even if the base distribution is continuous, the distributions drawn from the Dirichlet process are almost surely discrete. The scaling parameter specifies how strong this discretization is: in the limit $\alpha \rightarrow 0$, the realizations are all concentrated at a single value, while in the limit $\alpha \rightarrow \infty$ the realizations become continuous. Between the two extremes the realizations are discrete distributions with less and less concentration as $\alpha$ increases.

The Dirichlet process can also be seen as the infinite-dimensional generalization of the Dirichlet distribution. In the same way as the Dirichlet distribution is the conjugate prior for the categorical distribution, the Dirichlet process is the conjugate prior for infinite, nonparametric discrete distributions. A particularly important application of Dirichlet processes is as a prior probability distribution in infinite mixture models.
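
One way to see this concretely is the stick-breaking construction. The sketch below (an illustration, not from these notes; the function name and truncation level are assumptions) draws an approximate sample $G \sim \mathrm{DP}(\alpha, H)$ with a continuous base distribution $H$ and shows that the draw is a discrete distribution over atoms sampled from $H$:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_stick_breaking(alpha, base_sampler, truncation=1000):
    """Atoms and weights of an (approximate, truncated) draw from DP(alpha, H)."""
    betas = rng.beta(1.0, alpha, size=truncation)            # stick-breaking proportions
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    weights = betas * remaining                               # pi_k = beta_k * prod_{j<k}(1 - beta_j)
    atoms = base_sampler(truncation)                          # atoms drawn i.i.d. from H
    return atoms, weights

# Base distribution H = standard normal (continuous), yet the draw G is discrete:
atoms, weights = dp_stick_breaking(alpha=5.0, base_sampler=lambda n: rng.normal(size=n))
print(weights[:5], weights.sum())   # a few dominant weights; total close to 1
```

Smaller $\alpha$ concentrates the weight on fewer atoms, matching the limits described above.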

Resources

  • Detailed PDF on Dirichlet Distribution: https://people.eecs.berkeley.edu/~stephentu/writeups/dirichlet-conjugate-prior.pdf
  • Visualizing Dirichlet Distributions with Matplotlib
