Dirichlet Distribution/Process

Dirichlet Distribution

Dirichlet distributions are commonly used as prior distributions in Bayesian statistics, and in fact the Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution.

The Dirichlet distribution of order $K \geq 2$ with parameters $\alpha_1, \ldots, \alpha_K > 0$ has a probability density function with respect to Lebesgue measure on the Euclidean space $\mathbb{R}^{K-1}$ given by

$f(x_1, \ldots, x_K; \alpha_1, \ldots, \alpha_K) = \frac{1}{\mathrm{B}(\boldsymbol{\alpha})} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$

where $\{x_k\}_{k=1}^{k=K}$ belong to the standard $(K-1)$-simplex, or in other words:

$\sum_{i=1}^{K} x_i = 1 \text{ and } x_i \geq 0 \text{ for all } i \in [1, K]$

Here $\boldsymbol{\alpha}$ can be interpreted as a "prior observation count"; see the discussion of pseudocounts in the conjugate-prior section below.

The normalizing constant is the multivariate beta function, which can be expressed in terms of the gamma function:

$\mathrm{B}(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}, \qquad \boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K).$
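
For a quick numerical sanity check (not part of the original notes; assumes NumPy and SciPy are available), the density and its normalizing constant can be coded directly from the formulas above and compared against scipy.stats.dirichlet:

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import dirichlet

def dirichlet_pdf(x, alpha):
    """Density of Dir(alpha) at a point x on the simplex, via the explicit formula."""
    x, alpha = np.asarray(x, dtype=float), np.asarray(alpha, dtype=float)
    B = np.prod(gamma(alpha)) / gamma(alpha.sum())   # multivariate beta function B(alpha)
    return np.prod(x ** (alpha - 1)) / B

alpha = np.array([2.0, 4.0, 8.0])   # example concentration parameters (made up)
x = np.array([0.2, 0.3, 0.5])       # a point on the simplex: non-negative, sums to 1

print(dirichlet_pdf(x, alpha))      # explicit formula above
print(dirichlet.pdf(x, alpha))      # SciPy's implementation; the two should agree
```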

As Conjugate Prior

The Dirichlet distribution is the conjugate prior distribution of the categorical distribution (a generic discrete probability distribution with a given number of possible outcomes) and of the multinomial distribution (the distribution over observed counts of each possible category in a set of categorically distributed observations). This means that if a data point has either a categorical or multinomial distribution, and the prior distribution of the distribution's parameter (the vector of probabilities that generates the data point) is a Dirichlet, then the posterior distribution of the parameter is also a Dirichlet. Intuitively, in such a case, starting from what we know about the parameter prior to observing the data point, we can then update our knowledge based on the data point and end up with a new distribution of the same form as the old one. This means that we can successively update our knowledge of a parameter by incorporating new observations one at a time, without running into mathematical difficulties.

Formally, this can be expressed as follows. Given a model

$\begin{array}{rcccl}
\boldsymbol{\alpha} &=& (\alpha_1, \ldots, \alpha_K) &=& \text{concentration hyperparameter} \\
\mathbf{p} \mid \boldsymbol{\alpha} &=& (p_1, \ldots, p_K) &\sim& \operatorname{Dir}(K, \boldsymbol{\alpha}) \\
\mathbb{X} \mid \mathbf{p} &=& (\mathbf{x}_1, \ldots, \mathbf{x}_N) &\sim& \operatorname{Cat}(K, \mathbf{p})
\end{array}$

then the following holds:

$\begin{array}{rcccl}
\mathbf{c} &=& (c_1, \ldots, c_K) &=& \text{number of occurrences of category } i \\
\mathbf{p} \mid \mathbb{X}, \boldsymbol{\alpha} &\sim& \operatorname{Dir}(K, \mathbf{c} + \boldsymbol{\alpha}) &=& \operatorname{Dir}(K, c_1 + \alpha_1, \ldots, c_K + \alpha_K)
\end{array}$

This relationship is used in Bayesian statistics to estimate the underlying parameter p of a categorical distribution given a collection of N samples. Intuitively, we can view the hyperprior vector α as pseudocounts, i.e. as representing the number of observations in each category that we have already seen (or guessed). Then we simply add in the counts for all the new observations (the vector c) in order to derive the posterior distribution.
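
A minimal sketch of this update (the prior and data below are made up for illustration): the posterior is obtained simply by adding the observed category counts to the prior pseudocounts.

```python
import numpy as np

alpha_prior = np.ones(6)                                  # symmetric Dir(1,...,1) prior over K = 6 categories

observations = np.array([2, 0, 2, 5, 2, 2, 3, 2])         # observed category indices (0-based)
counts = np.bincount(observations, minlength=6)           # c = number of occurrences of each category

alpha_posterior = alpha_prior + counts                    # p | X, alpha ~ Dir(K, c + alpha)
posterior_mean = alpha_posterior / alpha_posterior.sum()  # E[p_i | X] = (c_i + alpha_i) / sum_j (c_j + alpha_j)

print(alpha_posterior)   # [2. 1. 6. 2. 1. 2.]
print(posterior_mean)    # the usual "smoothed" frequency estimates
```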

In Bayesian mixture models and other hierarchical Bayesian models with mixture components, Dirichlet distributions are commonly used as the prior distributions for the categorical variables appearing in the models.

Dirichlet for Bayesian inference for Categorical/Multinomial Distribution

Intuition of Dirichlet Distribution

Let's say you have a biased die, i.e. the probability of each number (class) is not equal. So we have a categorical or multinomial distribution, depending on whether you roll it once or multiple times, with unknown parameters.

Now you want to estimate the parameters of that categorical/multinomial distribution, i.e. what is the probability of each face of the die? The parameters of the multinomial distribution are given by

$\boldsymbol{p} = [p_1, p_2, \ldots, p_k] \text{ where } \sum_i p_i = 1$

where $p_i$ denotes the probability of the outcome belonging to class $i$.

Solution: how would we estimate the parameters $p_i$, i.e. the probability of each class $i$?

Roll the die many times, say $n = 30$, and record the frequency of each outcome (class). Suppose we got the frequencies $\alpha_1 = 2, \alpha_2 = 4, \alpha_3 = 10, \alpha_4 = 4, \alpha_5 = 2, \alpha_6 = 8$. What would you guess the value of the parameter $p_3$ to be? A natural answer is $\frac{\alpha_3}{\sum_j \alpha_j} = \frac{10}{30} = \frac{1}{3} \approx 0.33$. So we are estimating the parameters, which are random variables here, from the simulation. But note that $p_3 = 0.33$ is just an estimate, i.e. 0.33 is not the only possible value for $p_3$. Hence we can associate a probability distribution with each $p_i$ based on $\boldsymbol{\alpha}$. This probability distribution of $\boldsymbol{p}$ given $\boldsymbol{\alpha}$ is nothing but the Dirichlet distribution.

To be precise, 0.33 is the mean of the random variable $p_3$; in general, for every $i$ we have $E[p_i] = \frac{\alpha_i}{\sum_j \alpha_j}$.

The complete distribution is given by

$\operatorname{Dir}(\boldsymbol{p} \mid \alpha_1, \alpha_2, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^K \alpha_i\right)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K p_i^{\alpha_i - 1}$
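
As a small illustration of the die example (not in the original notes; assumes NumPy), the observed frequencies give both the point estimate $\alpha_i / \sum_j \alpha_j$ and, through $\operatorname{Dir}(\boldsymbol{\alpha})$, a full distribution over plausible parameter vectors $\boldsymbol{p}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frequencies from the die example above (n = 30 rolls).
alpha = np.array([2.0, 4.0, 10.0, 4.0, 2.0, 8.0])

# Point estimate of each p_i, which equals the mean of Dir(alpha).
print(alpha / alpha.sum())        # [0.067 0.133 0.333 0.133 0.067 0.267]; the p_3 estimate is 1/3

# Sampling from Dir(alpha) shows the spread of plausible p vectors around that mean.
samples = rng.dirichlet(alpha, size=100_000)
print(samples.mean(axis=0))       # Monte Carlo mean, close to alpha / alpha.sum()
print(samples[:, 2].std())        # p_3 is uncertain: 0.33 is only its mean, not its value
```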

Dirichlet Process

A Dirichlet process is a probability distribution whose range is itself a set of probability distributions. It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables: how likely it is that the random variables are distributed according to one or another particular distribution.

The Dirichlet process is specified by a base distribution $H$ and a positive real number $\alpha$ called the concentration parameter (also known as the scaling parameter). The base distribution is the expected value of the process, i.e., the Dirichlet process draws distributions "around" the base distribution the way a normal distribution draws real numbers around its mean. However, even if the base distribution is continuous, the distributions drawn from the Dirichlet process are almost surely discrete. The scaling parameter specifies how strong this discretization is: in the limit $\alpha \rightarrow 0$, the realizations are all concentrated at a single value, while in the limit $\alpha \rightarrow \infty$ the realizations become continuous. Between the two extremes the realizations are discrete distributions with less and less concentration as $\alpha$ increases.

The Dirichlet process can also be seen as the infinite-dimensional generalization of the Dirichlet distribution. In the same way as the Dirichlet distribution is the conjugate prior for the categorical distribution, the Dirichlet process is the conjugate prior for infinite, nonparametric discrete distributions. A particularly important application of Dirichlet processes is as a prior probability distribution in infinite mixture models.
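
One standard way to draw from a Dirichlet process is the truncated stick-breaking construction; the sketch below (not from the original notes) assumes a standard-normal base distribution $H$ and a truncation at 1000 atoms, and illustrates how small versus large $\alpha$ changes how concentrated the realizations are:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_dp(alpha, base_sampler, n_atoms=1000):
    """Truncated stick-breaking draw G ~ DP(alpha, H): atom locations and weights."""
    betas = rng.beta(1.0, alpha, size=n_atoms)                      # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining                                     # pi_k = beta_k * prod_{j<k}(1 - beta_j)
    atoms = base_sampler(n_atoms)                                   # atom locations drawn i.i.d. from H
    return atoms, weights

# Small alpha: almost all mass sits on a few atoms (strong discretization).
# Large alpha: mass spreads over many atoms, so draws look closer to the continuous base H.
for alpha in (0.5, 50.0):
    atoms, w = stick_breaking_dp(alpha, lambda n: rng.normal(size=n))
    print(alpha, np.round(np.sort(w)[::-1][:5], 3))                 # the five largest weights
```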

Resources

  • Detailed PDF on Dirichlet Distribution: https://people.eecs.berkeley.edu/~stephentu/writeups/dirichlet-conjugate-prior.pdf
  • Visualizing Dirichlet Distributions with Matplotlib