Conjugate Priors
In Bayesian probability theory, if the posterior distributions p(θ | x) are in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. For example, the Gaussian family is conjugate to itself (or self-conjugate) with respect to a Gaussian likelihood function: if the likelihood function is Gaussian, choosing a Gaussian prior over the mean will ensure that the posterior distribution is also Gaussian. In other words, the Gaussian distribution is a conjugate prior for a Gaussian likelihood.
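As a concrete illustration of the Gaussian case, here is a minimal sketch of the conjugate update for the mean of a Gaussian likelihood. It assumes the noise variance is known, which is the standard setting in which a Gaussian prior over the mean is conjugate. The function name `gaussian_posterior` and the numbers used are illustrative assumptions, not part of the original notes.

```python
def gaussian_posterior(prior_mean, prior_var, data, noise_var):
    """Posterior over the mean of a Gaussian likelihood with known variance.

    Assumes a Gaussian prior N(prior_mean, prior_var) over the mean and
    i.i.d. observations with known variance noise_var (an assumption of
    this sketch). Returns the posterior mean and variance.
    """
    n = len(data)
    # Precisions (inverse variances) add: prior precision + n * data precision.
    post_precision = 1.0 / prior_var + n / noise_var
    post_var = 1.0 / post_precision
    # Posterior mean is a precision-weighted combination of prior mean and data.
    post_mean = post_var * (prior_mean / prior_var + sum(data) / noise_var)
    return post_mean, post_var


# Hypothetical numbers: prior N(0, 1), three observations, unit noise variance.
post_mean, post_var = gaussian_posterior(
    prior_mean=0.0, prior_var=1.0, data=[1.0, 2.0, 3.0], noise_var=1.0
)
print(post_mean, post_var)
```

The posterior is again Gaussian, so the whole update reduces to computing two numbers.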
A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior; otherwise numerical integration may be necessary. Further, conjugate priors may give intuition by showing more transparently how a likelihood function updates a prior distribution.
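To make the contrast between a closed form and numerical integration concrete, the sketch below computes a posterior mean both ways for a Beta prior with a binomial likelihood (a standard conjugate pair, chosen here as an assumption for illustration). The helper names, the midpoint-rule grid, and the data are all hypothetical.

```python
import math


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(
        log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    )


def numeric_posterior_mean(a, b, heads, tails, grid_size=20000):
    """Posterior mean of theta by brute-force midpoint-rule integration."""
    step = 1.0 / grid_size
    numerator = 0.0
    normalizer = 0.0
    for i in range(grid_size):
        theta = (i + 0.5) * step  # midpoints avoid theta = 0 and theta = 1
        # Likelihood times prior; the binomial coefficient cancels in the ratio.
        weight = theta**heads * (1 - theta) ** tails * beta_pdf(theta, a, b)
        numerator += theta * weight
        normalizer += weight
    return numerator / normalizer


# Hypothetical setup: Beta(2, 2) prior, 7 heads and 3 tails observed.
a, b, heads, tails = 2.0, 2.0, 7, 3

# Conjugacy gives the posterior Beta(a + heads, b + tails) directly,
# whose mean is available in closed form.
closed_form_mean = (a + heads) / (a + b + heads + tails)
numeric_mean = numeric_posterior_mean(a, b, heads, tails)
print(closed_form_mean, numeric_mean)
```

The two values should agree up to the integration error, but the closed form needs one line of arithmetic while the numerical route needs a loop over the whole parameter range.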
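Consider the general problem of inferring a (continuous) distribution for a parameter θ given some datum or data x. From Bayes' theorem, the posterior distribution is equal to the product of the likelihood function p(x|θ) and the prior p(θ), normalized (divided) by the probability of the data p(x):

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta')\, p(\theta')\, d\theta'}$$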
Let the likelihood function be considered fixed; the likelihood function is usually well-determined from a statement of the data-generating process, hence this likelihood function is also called the sampling distribution. It is clear that different choices of the prior distribution p(θ) may make the normalizing integral ∫ p(x|θ) p(θ) dθ more or less difficult to calculate, and the product p(x|θ) × p(θ) may take one algebraic form or another. For certain choices of the prior, the posterior has the same algebraic form as the prior (generally with different parameter values). Such a choice is a conjugate prior.
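A short worked case may help show what "same algebraic form" means. The sketch below assumes a Beta(α, β) prior on θ and n independent Bernoulli trials with k successes; this particular pair is an illustrative choice, not one taken from the notes.

```latex
\begin{aligned}
p(\theta) &\propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}
  && \text{(Beta prior)} \\
p(x \mid \theta) &= \theta^{k} (1 - \theta)^{n - k}
  && \text{($k$ successes in $n$ Bernoulli trials)} \\
p(\theta \mid x) &\propto p(x \mid \theta)\, p(\theta)
  = \theta^{\alpha + k - 1} (1 - \theta)^{\beta + n - k - 1}
  && \Rightarrow\ \theta \mid x \sim \mathrm{Beta}(\alpha + k,\ \beta + n - k)
\end{aligned}
```

The product keeps the form θ^(a−1)(1−θ)^(b−1), so the normalizing integral is just the known Beta normalizer and never has to be computed explicitly.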
See above for how a conjugate prior helps in solving for the posterior analytically.
The prior's hyperparameters are the parameters of the prior distribution: for example, the vector α in the Dirichlet distribution, or α and β in the Beta distribution.
It is often useful to think of the hyperparameters of a conjugate prior distribution as corresponding to having observed a certain number of pseudo-observations with properties specified by the parameters. In general, for nearly all conjugate prior distributions, the hyperparameters can be interpreted in terms of pseudo-observations. This can help both in providing intuition behind the often messy update equations and in choosing reasonable hyperparameters for a prior.
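One way to see the pseudo-observation reading is through the posterior mean. Assuming, for illustration, a Beta(α, β) prior and k successes in n Bernoulli trials, the posterior mean splits into a weighted average of the prior mean and the observed frequency:

```latex
\mathbb{E}[\theta \mid x]
  = \frac{\alpha + k}{\alpha + \beta + n}
  = \underbrace{\frac{\alpha + \beta}{\alpha + \beta + n}}_{\text{weight of prior}}
    \cdot \frac{\alpha}{\alpha + \beta}
  \;+\;
    \underbrace{\frac{n}{\alpha + \beta + n}}_{\text{weight of data}}
    \cdot \frac{k}{n}
```

Here α + β plays the role of a prior sample size: the prior acts like α pseudo-successes and β pseudo-failures, and its influence shrinks as the real sample size n grows.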
In the first answer, see how α represents pseudo-counts in the Dirichlet distribution. Basically, choosing the right value of α lets you express the right prior about the parameters; the α values encode the prior information you have. In the example above, your guess of how many balls of each color are in the bag determines your prior, and is represented by an appropriate value of α.
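The pseudo-count reading of α can be sketched in a few lines. The example below assumes a Dirichlet prior over the color proportions with a multinomial (categorical) likelihood for the drawn balls; the colors, prior values, and observed counts are hypothetical, as are the helper names.

```python
def dirichlet_posterior(alpha, counts):
    """Conjugate update: add the observed counts to the prior pseudo-counts."""
    return [a + c for a, c in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Expected proportion of each category under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


colors = ["red", "green", "blue"]

# Hypothetical prior guess: as if we had already seen 2 red, 1 green, 1 blue.
prior_alpha = [2.0, 1.0, 1.0]

# Hypothetical draws from the bag: 3 red, 5 green, 2 blue.
observed_counts = [3, 5, 2]

posterior_alpha = dirichlet_posterior(prior_alpha, observed_counts)
prior_mean = dirichlet_mean(prior_alpha)
posterior_mean = dirichlet_mean(posterior_alpha)

for color, before, after in zip(colors, prior_mean, posterior_mean):
    print(f"{color}: prior {before:.3f} -> posterior {after:.3f}")
```

Larger α values correspond to a stronger prior belief, since more real draws are needed to outweigh the pseudo-counts.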
Look at the example here to understand: