Dirichlet Distribution/Process


Dirichlet Distribution

Dirichlet distributions are commonly used as prior distributions in Bayesian statistics, and in fact the Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution.

The Dirichlet distribution of order $K \geq 2$ with parameters $\alpha_1, \ldots, \alpha_K > 0$ has a probability density function with respect to Lebesgue measure on the Euclidean space $\mathbb{R}^{K-1}$ given by

$$f\left(x_1, \ldots, x_K; \alpha_1, \ldots, \alpha_K\right) = \frac{1}{\mathrm{B}(\boldsymbol{\alpha})} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$$

where $\{x_k\}_{k=1}^{K}$ belong to the standard $(K-1)$-simplex, or in other words:

$$\sum_{i=1}^{K} x_i = 1 \quad \text{and} \quad x_i \geq 0 \quad \text{for all } i \in [1, K]$$

Here $\boldsymbol{\alpha}$ can be interpreted as a "prior observation count". See Pseudo-Observations.

The normalizing constant is the multivariate beta function, which can be expressed in terms of the gamma function:

$$\mathrm{B}(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}, \qquad \boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K).$$
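
As a quick check of this formula, here is a minimal sketch (not from these notes; it assumes NumPy and SciPy are available) that evaluates the density once directly from the expression above and once with `scipy.stats.dirichlet`:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import dirichlet

alpha = np.array([2.0, 4.0, 10.0])   # concentration parameters, alpha_i > 0
x = np.array([0.2, 0.3, 0.5])        # a point on the standard simplex (sums to 1)

# log of the multivariate beta function: B(alpha) = prod Gamma(alpha_i) / Gamma(sum alpha_i)
log_B = gammaln(alpha).sum() - gammaln(alpha.sum())

# log-density: -log B(alpha) + sum_i (alpha_i - 1) * log x_i
log_pdf_manual = -log_B + ((alpha - 1) * np.log(x)).sum()

print(np.exp(log_pdf_manual))   # density from the formula above
print(dirichlet.pdf(x, alpha))  # same value via scipy
```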

As Conjugate Prior

The Dirichlet distribution is the conjugate prior distribution of the categorical distribution (a generic discrete probability distribution with a given number of possible outcomes) and multinomial distribution (the distribution over observed counts of each possible category in a set of categorically distributed observations). This means that if a data point has either a categorical or multinomial distribution, and the prior distribution of the distribution's parameter (the vector of probabilities that generates the data point) is distributed as a Dirichlet, then the posterior distribution of the parameter is also a Dirichlet. Intuitively, in such a case, starting from what we know about the parameter prior to observing the data point, we then can update our knowledge based on the data point and end up with a new distribution of the same form as the old one. This means that we can successively update our knowledge of a parameter by incorporating new observations one at a time, without running into mathematical difficulties.

Formally, this can be expressed as follows. Given a model

$$\begin{aligned}
\boldsymbol{\alpha} &= (\alpha_1, \ldots, \alpha_K) &&= \text{concentration hyperparameter} \\
\mathbf{p} \mid \boldsymbol{\alpha} &= (p_1, \ldots, p_K) &&\sim \operatorname{Dir}(K, \boldsymbol{\alpha}) \\
\mathbb{X} \mid \mathbf{p} &= (\mathbf{x}_1, \ldots, \mathbf{x}_K) &&\sim \operatorname{Cat}(K, \mathbf{p})
\end{aligned}$$

then the following holds:

$$\begin{aligned}
\mathbf{c} &= (c_1, \ldots, c_K) &&= \text{number of occurrences of category } i \\
\mathbf{p} \mid \mathbb{X}, \boldsymbol{\alpha} &\sim \operatorname{Dir}(K, \mathbf{c} + \boldsymbol{\alpha}) &&= \operatorname{Dir}\left(K, c_1 + \alpha_1, \ldots, c_K + \alpha_K\right)
\end{aligned}$$

This relationship is used in Bayesian statistics to estimate the underlying parameter p of a categorical distribution given a collection of N samples. Intuitively, we can view the hyperprior vector α as pseudocounts, i.e. as representing the number of observations in each category that we have already seen (or guessed). Then we simply add in the counts for all the new observations (the vector c) in order to derive the posterior distribution.
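
A hedged sketch of this update in NumPy (the prior and the observations below are made up for illustration): the posterior pseudocounts are just the prior pseudocounts plus the observed counts.

```python
import numpy as np

K = 6
alpha_prior = np.ones(K)                  # Dir(1,...,1): a uniform prior over p
observations = [2, 2, 0, 5, 2, 1, 3, 2]   # observed categories (0-indexed faces)

counts = np.bincount(observations, minlength=K)   # c = occurrences per category
alpha_post = alpha_prior + counts                 # posterior is Dir(K, c + alpha)

# Posterior mean of p_i is (c_i + alpha_i) / sum_j (c_j + alpha_j)
posterior_mean = alpha_post / alpha_post.sum()
print(alpha_post, posterior_mean)
```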

In Bayesian mixture models and other hierarchical Bayesian models with mixture components, Dirichlet distributions are commonly used as the prior distributions for the categorical variables appearing in the models.

Dirichlet for Bayesian inference for Categorical/Multinomial Distribution

Intuition of Dirichlet Distribution

Let's say you have a biased die, i.e. the probability of each number (class) is not equal. Rolling it once gives a categorical distribution and rolling it multiple times gives a multinomial distribution, in both cases with unknown parameters.

Now you want to estimate the parameters of that categorical/multinomial distribution, i.e. what is the probability of each face of the die? The parameters of the multinomial distribution are given as

$$\boldsymbol{p} = [p_1, p_2, \ldots, p_k] \quad \text{where} \quad \sum_i p_i = 1$$

$p_i$ denotes the probability of the output belonging to class $i$.

So now, how would you estimate the parameters $p_i$, which are simply the probabilities of each class $i$?

Solution:

Roll the die many times, say $n = 30$ times, and record the frequency of each output (class). Suppose we got outputs with frequencies $\alpha_1=2, \alpha_2=4, \alpha_3=10, \alpha_4=4, \alpha_5=2, \alpha_6=8$. Now, what might the value of the parameter $p_3$ be? A natural answer is $\frac{\alpha_3}{\sum_j \alpha_j} = \frac{10}{30} = \frac{1}{3} \approx 0.33$. So we are estimating the parameters, which are random variables here, from these simulated rolls. But note that $p_3 = 0.33$ is just an estimate, i.e. $0.33$ is not the only possible value for $p_3$. Hence you can associate a probability distribution to each $p_i$ based on $\boldsymbol{\alpha}$. This probability distribution of $\boldsymbol{p}$ based on $\boldsymbol{\alpha}$ is nothing but the Dirichlet distribution.

To be precise, $0.33$ is the mean of the random variable $p_3$. So, for every $i$ we have $E[p_i] = \frac{\alpha_i}{\sum_j \alpha_j}$,

and the complete distribution is given by

$$\operatorname{Dir}(\boldsymbol{p} \mid \alpha_1, \alpha_2, \ldots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} p_i^{\alpha_i - 1}$$
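
A small illustrative sketch of this die example (assuming NumPy), treating the observed frequencies as Dirichlet parameters and sampling to see that $p_3$ is a random variable whose mean is about $1/3$:

```python
import numpy as np

alpha = np.array([2, 4, 10, 4, 2, 8], dtype=float)   # frequencies from n = 30 rolls

# Draw many plausible parameter vectors p ~ Dir(alpha)
samples = np.random.default_rng(0).dirichlet(alpha, size=100_000)

print(alpha[2] / alpha.sum())   # analytical mean E[p_3] = 10/30 = 0.333...
print(samples[:, 2].mean())     # empirical mean of p_3, close to 0.333
print(samples[:, 2].std())      # spread: p_3 is a random variable, not a point
```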

Dirichlet Process

A Dirichlet process is a probability distribution whose range is itself a set of probability distributions. It is often used in Bayesian inference to describe the prior knowledge about the distribution of random variables, that is, how likely it is that the random variables are distributed according to one or another particular distribution.

The Dirichlet process is specified by a base distribution $H$ and a positive real number $\alpha$ called the concentration parameter (also known as the scaling parameter). The base distribution is the expected value of the process, i.e., the Dirichlet process draws distributions "around" the base distribution the way a normal distribution draws real numbers around its mean. However, even if the base distribution is continuous, the distributions drawn from the Dirichlet process are almost surely discrete. The scaling parameter specifies how strong this discretization is: in the limit $\alpha \rightarrow 0$, the realizations are all concentrated at a single value, while in the limit $\alpha \rightarrow \infty$ the realizations become continuous. Between the two extremes the realizations are discrete distributions with less and less concentration as $\alpha$ increases.

The Dirichlet process can also be seen as the infinite-dimensional generalization of the Dirichlet distribution. In the same way as the Dirichlet distribution is the conjugate prior for the categorical distribution, the Dirichlet process is the conjugate prior for infinite, nonparametric discrete distributions. A particularly important application of Dirichlet processes is as a prior probability distribution in infinite mixture models.
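
One way to see this concretely is the stick-breaking construction. The sketch below (an illustration, not from these notes; the function name and truncation level are assumptions) draws an approximate sample $G \sim \mathrm{DP}(\alpha, H)$ with a continuous base distribution $H$ and shows that the draw is a discrete distribution over atoms sampled from $H$:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_stick_breaking(alpha, base_sampler, truncation=1000):
    """Atoms and weights of an (approximate, truncated) draw from DP(alpha, H)."""
    betas = rng.beta(1.0, alpha, size=truncation)            # stick-breaking proportions
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    weights = betas * remaining                               # pi_k = beta_k * prod_{j<k}(1 - beta_j)
    atoms = base_sampler(truncation)                          # atoms drawn i.i.d. from H
    return atoms, weights

# Base distribution H = standard normal (continuous), yet the draw G is discrete:
atoms, weights = dp_stick_breaking(alpha=5.0, base_sampler=lambda n: rng.normal(size=n))
print(weights[:5], weights.sum())   # a few dominant weights; total close to 1
```

Smaller $\alpha$ concentrates the weight on fewer atoms, matching the limits described above.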

Resources

  • Detailed PDF on Dirichlet Distribution: https://people.eecs.berkeley.edu/~stephentu/writeups/dirichlet-conjugate-prior.pdf
  • Visualizing Dirichlet Distributions with Matplotlib
