Conjugate Priors

Definition

In Bayesian probability theory, if the posterior distributions p(θ | x) are in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. For example, the Gaussian family is conjugate to itself (or self-conjugate) with respect to a Gaussian likelihood function: if the likelihood function is Gaussian, choosing a Gaussian prior over the mean will ensure that the posterior distribution is also Gaussian. This means that the Gaussian distribution is a conjugate prior for the likelihood that is also Gaussian.
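The Gaussian case can be sketched in a few lines. This is a minimal illustration, assuming the likelihood variance is known and only the mean is uncertain; the function name `gaussian_posterior`, its parameter names, and the sample data are hypothetical and not part of the original notes.

```python
def gaussian_posterior(prior_mean, prior_var, noise_var, data):
    """Posterior over the mean of a Gaussian with known variance.

    Assumes prior mean ~ N(prior_mean, prior_var) and
    x_i | mean ~ N(mean, noise_var), observations independent.
    Returns (posterior_mean, posterior_var); the posterior is again Gaussian.
    """
    n = len(data)
    # Precisions (inverse variances) add under a Gaussian-Gaussian update.
    post_precision = 1.0 / prior_var + n / noise_var
    post_var = 1.0 / post_precision
    # Posterior mean is a precision-weighted blend of prior mean and data.
    post_mean = post_var * (prior_mean / prior_var + sum(data) / noise_var)
    return post_mean, post_var


# Illustrative (made-up) measurements with a broad prior centred at 0.
observations = [4.8, 5.1, 5.3, 4.9, 5.4]
mean, var = gaussian_posterior(0.0, 10.0, 1.0, observations)
print(f"posterior mean = {mean:.3f}, posterior variance = {var:.3f}")
```

Because the prior and posterior are both Gaussian, the posterior returned here could be fed back in as the prior for the next batch of data.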

A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior; otherwise numerical integration may be necessary. Further, conjugate priors may give intuition, by more transparently showing how a likelihood function updates a prior distribution.

Consider the general problem of inferring a (continuous) distribution for a parameter θ given some datum or data x. From Bayes' theorem, the posterior distribution is equal to the product of the likelihood function $\theta \mapsto p(x \mid \theta)$ and the prior $p(\theta)$, normalized (divided) by the probability of the data $p(x)$:

$$
\begin{aligned}
p(\theta \mid x) &= \frac{p(x \mid \theta)\, p(\theta)}{p(x)} \\
&= \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta')\, p(\theta')\, d\theta'}
\end{aligned}
$$

Let the likelihood function be considered fixed; it is usually well-determined from a statement of the data-generating process, which is why the likelihood function is also called the sampling distribution. It is clear that different choices of the prior distribution p(θ) may make the integral more or less difficult to calculate, and the product p(x|θ) × p(θ) may take one algebraic form or another. For certain choices of the prior, the posterior has the same algebraic form as the prior (generally with different parameter values). Such a choice is a conjugate prior.
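To make the contrast concrete, here is a small sketch using a Beta prior with a Bernoulli likelihood, a standard conjugate pair. The conjugate route only adds counts to the prior parameters, while the generic route approximates the integral in Bayes' theorem on a grid. The function names, the grid size, and the example counts are assumptions chosen for illustration.

```python
import math


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, for 0 < theta < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(
        log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    )


def conjugate_update(a, b, successes, failures):
    """Closed form: the posterior is Beta(a + successes, b + failures)."""
    return a + successes, b + failures


def numeric_posterior_mean(a, b, successes, failures, grid=20000):
    """Posterior mean via midpoint-rule integration of likelihood * prior."""
    numerator = 0.0
    evidence = 0.0
    for i in range(grid):
        theta = (i + 0.5) / grid  # midpoint of the i-th cell in (0, 1)
        weight = theta**successes * (1 - theta) ** failures * beta_pdf(theta, a, b)
        numerator += theta * weight
        evidence += weight
    # The cell width cancels in the ratio, so it is omitted.
    return numerator / evidence


# Made-up example: Beta(2, 2) prior, then 7 successes and 3 failures.
post_a, post_b = conjugate_update(2.0, 2.0, 7, 3)
closed_form_mean = post_a / (post_a + post_b)
grid_mean = numeric_posterior_mean(2.0, 2.0, 7, 3)
print(closed_form_mean, grid_mean)
```

The two means agree to numerical precision, but the conjugate update needs no integration at all.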

See the discussion above for how a conjugate prior helps in solving for the posterior analytically.

Pseudo-observations / Prior Hyperparameters

Prior hyperparameters are the parameters of the prior distribution. For example, the vector $\alpha$ in the Dirichlet distribution, the pair $(\alpha, \beta)$ in the Beta distribution, etc.

It is often useful to think of the hyperparameters of a conjugate prior distribution as corresponding to having observed a certain number of pseudo-observations with properties specified by the parameters. In general, for nearly all conjugate prior distributions, the hyperparameters can be interpreted in terms of pseudo-observations. This helps both in providing intuition behind the often messy update equations and in choosing reasonable hyperparameters for a prior.
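As a worked illustration of the pseudo-observation reading, consider a Beta prior with Bernoulli data. The symbols $k$ (number of successes) and $n$ (number of trials) are introduced here for the sketch and do not appear in the original notes.

```latex
\begin{aligned}
\theta &\sim \mathrm{Beta}(\alpha, \beta), \qquad k \text{ successes in } n \text{ Bernoulli trials} \\
\theta \mid x &\sim \mathrm{Beta}(\alpha + k,\ \beta + n - k) \\
\mathbb{E}[\theta \mid x] &= \frac{\alpha + k}{\alpha + \beta + n}
  = \frac{\alpha + \beta}{\alpha + \beta + n} \cdot \frac{\alpha}{\alpha + \beta}
  + \frac{n}{\alpha + \beta + n} \cdot \frac{k}{n}
\end{aligned}
```

The posterior mean is a weighted average of the prior mean $\alpha / (\alpha + \beta)$ and the observed frequency $k / n$, with $\alpha + \beta$ acting like the size of a pseudo-sample collected before any real data.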

See the example here to understand this: https://en.wikipedia.org/wiki/Conjugate_prior

In the first answer, see how $\alpha$ represents the pseudo-counts in the Dirichlet distribution. Basically, choosing the correct value of $\alpha$ lets you express the right prior over the parameters, so the $\alpha$ values encode the prior information you have. In the above example, your guess of how many balls of each color are in the bag determines your prior and is represented by an appropriate value of $\alpha$.
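The pseudo-count reading of the Dirichlet hyperparameters can be sketched as follows, assuming a categorical (multinomial) likelihood over ball colors. The function names, the colors, and the counts are hypothetical values chosen for illustration.

```python
from collections import Counter


def dirichlet_update(alpha, observations):
    """Add observed counts to the pseudo-counts in alpha.

    alpha maps each category to its Dirichlet hyperparameter.
    Observations whose category is missing from alpha are ignored.
    """
    counts = Counter(observations)
    return {color: a + counts.get(color, 0) for color, a in alpha.items()}


def dirichlet_mean(alpha):
    """Expected category probabilities under Dirichlet(alpha)."""
    total = sum(alpha.values())
    return {color: a / total for color, a in alpha.items()}


# Prior guess: roughly equal numbers of each color, worth 2 pseudo-draws each.
prior_alpha = {"red": 2.0, "green": 2.0, "blue": 2.0}
draws = ["red", "red", "blue"]  # made-up draws from the bag

posterior_alpha = dirichlet_update(prior_alpha, draws)
print(posterior_alpha)
print(dirichlet_mean(posterior_alpha))
```

Larger prior values of `alpha` behave like more pseudo-draws, so the same three real draws would shift the posterior mean less.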

Resources
