Dirichlet Distribution/Process

Dirichlet Distribution

Dirichlet distributions are commonly used as prior distributions in Bayesian statistics, and in fact the Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution.

The Dirichlet distribution of order K ≥ 2 with parameters $\alpha_1, \ldots, \alpha_K > 0$ has a probability density function with respect to Lebesgue measure on the Euclidean space $\mathbb{R}^{K-1}$ given by

$$f(x_1, \ldots, x_K; \alpha_1, \ldots, \alpha_K) = \frac{1}{\mathrm{B}(\boldsymbol{\alpha})} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$$

where $\{x_k\}_{k=1}^{K}$ belong to the standard $K-1$ simplex, or in other words:

$$\sum_{i=1}^{K} x_i = 1 \text{ and } x_i \geq 0 \text{ for all } i \in \{1, \ldots, K\}$$

Here $\boldsymbol{\alpha}$ can be interpreted as a "prior observation count". See Pseudo-Observations.

The normalizing constant is the multivariate beta function, which can be expressed in terms of the gamma function:

$$\mathrm{B}(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}, \qquad \boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K).$$
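As a quick numerical sanity check, here is a minimal sketch using `scipy.stats.dirichlet`; the parameter vector and evaluation point below are arbitrary examples, not values from the text:

```python
import numpy as np
from scipy.stats import dirichlet
from scipy.special import gamma

alpha = np.array([2.0, 3.0, 4.0])  # example concentration parameters
x = np.array([0.2, 0.3, 0.5])      # an example point on the standard 2-simplex

# Density via scipy
print(dirichlet.pdf(x, alpha))

# Same density computed directly from the formula above
B = np.prod(gamma(alpha)) / gamma(alpha.sum())  # multivariate beta function
print(np.prod(x ** (alpha - 1)) / B)
```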

As Conjugate Prior

The Dirichlet distribution is the conjugate prior distribution of the categorical distribution (a generic discrete probability distribution with a given number of possible outcomes) and the multinomial distribution (the distribution over observed counts of each possible category in a set of categorically distributed observations). This means that if a data point has either a categorical or multinomial distribution, and the prior distribution of the distribution's parameter (the vector of probabilities that generates the data point) is distributed as a Dirichlet, then the posterior distribution of the parameter is also a Dirichlet. Intuitively, in such a case, starting from what we know about the parameter prior to observing the data point, we then can update our knowledge based on the data point and end up with a new distribution of the same form as the old one. This means that we can successively update our knowledge of a parameter by incorporating new observations one at a time, without running into mathematical difficulties.

Formally, this can be expressed as follows. Given a model

$$\begin{array}{rcccl}
\boldsymbol{\alpha} &=& (\alpha_1, \ldots, \alpha_K) &=& \text{concentration hyperparameter} \\
\mathbf{p} \mid \boldsymbol{\alpha} &=& (p_1, \ldots, p_K) &\sim& \operatorname{Dir}(K, \boldsymbol{\alpha}) \\
\mathbb{X} \mid \mathbf{p} &=& (\mathbf{x}_1, \ldots, \mathbf{x}_N) &\sim& \operatorname{Cat}(K, \mathbf{p})
\end{array}$$

then the following holds:

$$\begin{array}{rcccl}
\mathbf{c} &=& (c_1, \ldots, c_K) &=& \text{number of occurrences of category } i \\
\mathbf{p} \mid \mathbb{X}, {\boldsymbol {\alpha }} &\sim& \operatorname{Dir}(K, \mathbf{c} + \boldsymbol{\alpha}) &=& \operatorname{Dir}\left(K, c_1 + \alpha_1, \ldots, c_K + \alpha_K\right)
\end{array}$$

This relationship is used in Bayesian statistics to estimate the underlying parameter p of a categorical distribution given a collection of N samples. Intuitively, we can view the hyperprior vector α as pseudocounts, i.e. as representing the number of observations in each category that we have already seen (or guessed). Then we simply add in the counts for all the new observations (the vector c) in order to derive the posterior distribution.
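In code, this posterior update is just vector addition of pseudocounts and observed counts. A minimal sketch, where the prior and the count vector below are made-up numbers for illustration:

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([1.0, 1.0, 1.0])  # symmetric prior: one pseudocount per category
c = np.array([5, 2, 9])            # observed counts for each of the K=3 categories

posterior = dirichlet(alpha + c)   # p | X, alpha  ~  Dir(alpha + c)
print(posterior.mean())            # posterior mean estimate of p
print(posterior.rvs(3))            # a few draws from the posterior
```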

In Bayesian mixture models and other hierarchical Bayesian models with mixture components, Dirichlet distributions are commonly used as the prior distributions for the categorical variables appearing in the models.

Dirichlet for Bayesian Inference on the Categorical/Multinomial Distribution

Intuition for the Dirichlet Distribution

Let's say you have a biased die, i.e. the probability of each number (class) is not equal. We then have a categorical distribution (if you roll it once) or a multinomial distribution (if you roll it multiple times), with unknown parameters.

Now you want to estimate the parameters of that categorical/multinomial distribution, i.e. what is the probability of each face of the die? The parameters of the multinomial distribution are given as

$$\boldsymbol{p} = [p_1, p_2, \ldots, p_K] \text{ where } \sum_i p_i = 1$$

Here $p_i$ denotes the probability of the output belonging to class $i$.

So now, how would you estimate the parameters $p_i$, which are basically the probabilities of each class $i$?

Solution:

Roll the die many times, say 30, and record the frequency of each output (class). Suppose we rolled the die $n = 30$ times and got outputs with frequencies $\alpha_1=2, \alpha_2=4, \alpha_3=10, \alpha_4=4, \alpha_5=2, \alpha_6=8$. Now, what do you think the value of parameter $p_3$ might be? A natural estimate is $\frac{\alpha_3}{\sum_j \alpha_j} = \frac{10}{30} = \frac{1}{3} \approx 0.33$. So basically we are estimating the parameters, which are random variables here, using simulation. But note that $p_3 = 0.33$ is just an estimate, i.e. 0.33 is not the only possible value for $p_3$. Hence you can associate a probability distribution with each $p_i$ based on $\boldsymbol{\alpha}$. This probability distribution of $\boldsymbol{p}$ based on $\boldsymbol{\alpha}$ is nothing but the Dirichlet distribution.

To be precise, 0.33 is the mean of the random variable $p_3$. So, for every $i$ we have

$$E[p_i] = \frac{\alpha_i}{\sum_j \alpha_j}$$
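A minimal sketch to check this numerically, using the $\boldsymbol{\alpha}$ counts from the dice example above:

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2, 4, 10, 4, 2, 8])  # observed frequencies from the dice example

d = dirichlet(alpha)
print(d.mean())            # analytic means alpha_i / sum(alpha); p_3 ~ 0.33

samples = d.rvs(100_000)   # Monte Carlo draws of the parameter vector p
print(samples.mean(axis=0))  # empirical means agree with the formula
```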

The complete distribution is given by

$$\operatorname{Dir}(\boldsymbol{p} \mid \alpha_1, \alpha_2, \ldots, \alpha_K) = \frac{\Gamma\left(\sum_{i=1}^K \alpha_i\right)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K p_i^{\alpha_i - 1}$$
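In practice this density is usually evaluated on the log scale for numerical stability. A minimal sketch using `scipy.special.gammaln`; the point and parameters below are arbitrary examples:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import dirichlet

def dirichlet_logpdf(p, alpha):
    """Log of the Dir(p | alpha) density, computed via log-gamma for stability."""
    p, alpha = np.asarray(p, float), np.asarray(alpha, float)
    return (gammaln(alpha.sum()) - gammaln(alpha).sum()
            + np.sum((alpha - 1.0) * np.log(p)))

p = np.array([0.2, 0.3, 0.5])      # example point on the simplex
alpha = np.array([2.0, 3.0, 4.0])  # example concentration parameters

print(dirichlet_logpdf(p, alpha))
print(dirichlet.logpdf(p, alpha))  # scipy gives the same value
```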

Dirichlet Process

A Dirichlet process is a probability distribution whose range is itself a set of probability distributions. It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables, that is, how likely it is that the random variables are distributed according to one or another particular distribution.

The Dirichlet process is specified by a base distribution $H$ and a positive real number $\alpha$ called the concentration parameter (also known as the scaling parameter). The base distribution is the expected value of the process, i.e., the Dirichlet process draws distributions "around" the base distribution the way a normal distribution draws real numbers around its mean. However, even if the base distribution is continuous, the distributions drawn from the Dirichlet process are almost surely discrete. The scaling parameter specifies how strong this discretization is: in the limit $\alpha \rightarrow 0$, the realizations are all concentrated at a single value, while in the limit $\alpha \rightarrow \infty$ the realizations become continuous. Between the two extremes the realizations are discrete distributions with less and less concentration as $\alpha$ increases.

The Dirichlet process can also be seen as the infinite-dimensional generalization of the Dirichlet distribution. In the same way as the Dirichlet distribution is the conjugate prior for the categorical distribution, the Dirichlet process is the conjugate prior for infinite, nonparametric discrete distributions. A particularly important application of Dirichlet processes is as a prior probability distribution in infinite mixture models.
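One common way to draw a random distribution from a Dirichlet process is the stick-breaking construction. A minimal (truncated) sketch, assuming a standard normal base distribution $H$ and an arbitrary choice of $\alpha = 5$; the truncation level is also just an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_stick_breaking(alpha, base_sampler, n_atoms=1000):
    """Truncated stick-breaking draw from DP(alpha, H).

    Returns atom locations (drawn from the base distribution H)
    and their weights; the draw is a discrete distribution.
    """
    betas = rng.beta(1.0, alpha, size=n_atoms)  # stick-breaking proportions
    remaining = np.concatenate(([1.0], np.cumprod(1 - betas[:-1])))
    weights = betas * remaining                  # sum to ~1 for large n_atoms
    atoms = base_sampler(n_atoms)                # atom locations from H
    return atoms, weights

# Example: base distribution H = N(0, 1), concentration alpha = 5
atoms, weights = dp_stick_breaking(alpha=5.0,
                                   base_sampler=lambda n: rng.normal(size=n))
print(atoms[:5], weights[:5])
```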

Resources

  • Detailed PDF on Dirichlet Distribution
