Conjugate Priors
In Bayesian probability theory, if the posterior distribution p(θ | x) is in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. For example, the Gaussian family is conjugate to itself (or self-conjugate) with respect to a Gaussian likelihood function: if the likelihood function is Gaussian with known variance, choosing a Gaussian prior over the mean ensures that the posterior distribution over the mean is also Gaussian.
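The Gaussian case can be sketched in a few lines of Python. This is a minimal illustration, assuming the noise variance is known and only the mean is uncertain; the function name `gaussian_posterior` and the sample numbers are made up for the example.

```python
def gaussian_posterior(prior_mean, prior_var, data, noise_var):
    """Posterior over the mean of a Gaussian whose variance noise_var is known."""
    n = len(data)
    # Precisions (inverse variances) add under the conjugate Gaussian update.
    post_precision = 1.0 / prior_var + n / noise_var
    post_var = 1.0 / post_precision
    # The posterior mean is a precision-weighted average of prior mean and data.
    post_mean = post_var * (prior_mean / prior_var + sum(data) / noise_var)
    return post_mean, post_var


# Hypothetical numbers: prior N(0, 1), three observations, unit noise variance.
post_mean, post_var = gaussian_posterior(
    prior_mean=0.0, prior_var=1.0, data=[1.0, 2.0, 3.0], noise_var=1.0
)
print(post_mean, post_var)
```

The result is again described by a mean and a variance, so the posterior stays in the Gaussian family and can serve as the prior for the next batch of data.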
A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior; otherwise numerical integration may be necessary. Further, conjugate priors may give intuition, by more transparently showing how a likelihood function updates a prior distribution.
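To make the contrast with numerical integration concrete, the sketch below compares the two routes for one conjugate pair: a Gamma prior on the rate of a Poisson likelihood. The counts, the prior parameters (shape 2, rate 1), and the integration grid are illustrative assumptions, not values from the text.

```python
import math


def gamma_pdf(lam, shape, rate):
    """Density of a Gamma(shape, rate) distribution at lam > 0."""
    log_pdf = (shape * math.log(rate) - math.lgamma(shape)
               + (shape - 1.0) * math.log(lam) - rate * lam)
    return math.exp(log_pdf)


def poisson_likelihood(lam, counts):
    """Probability of the observed counts under a Poisson with rate lam."""
    log_lik = sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in counts)
    return math.exp(log_lik)


counts = [2, 4, 3]                   # hypothetical observed counts
prior_shape, prior_rate = 2.0, 1.0   # hypothetical Gamma prior

# Closed form: the posterior is Gamma(shape + sum of counts, rate + n).
post_shape = prior_shape + sum(counts)
post_rate = prior_rate + len(counts)
closed_form_mean = post_shape / post_rate

# Numerical route: midpoint rule on a grid over (0, 30), assumed wide enough.
num_points, upper = 20000, 30.0
step = upper / num_points
grid = [(i + 0.5) * step for i in range(num_points)]
unnormalized = [poisson_likelihood(lam, counts) * gamma_pdf(lam, prior_shape, prior_rate)
                for lam in grid]
evidence = sum(unnormalized) * step  # approximates p(x)
numeric_mean = sum(lam * u for lam, u in zip(grid, unnormalized)) * step / evidence

print(closed_form_mean, numeric_mean)
```

Both routes should agree closely, but the closed form needs only two additions, while the grid approach needs a choice of range and resolution and scales poorly with the number of parameters.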
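Consider the general problem of inferring a (continuous) distribution for a parameter θ given some datum or data x. From Bayes' theorem, the posterior distribution is equal to the product of the likelihood function p(x | θ) and the prior p(θ), normalized (divided) by the probability of the data p(x):

$$
p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta')\, p(\theta')\, d\theta'}
$$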
Let the likelihood function be considered fixed; it is usually well-determined from a statement of the data-generating process, which is why the likelihood function is also called the sampling distribution. Different choices of the prior distribution p(θ) may make the integral p(x) = ∫ p(x | θ) p(θ) dθ more or less difficult to calculate, and the product p(x | θ) × p(θ) may take one algebraic form or another. For certain choices of the prior, the posterior has the same algebraic form as the prior (generally with different parameter values). Such a choice is a conjugate prior.
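As a worked instance of "same algebraic form", take a Beta prior with a Bernoulli likelihood. The symbols α, β (prior parameters), s (number of successes), and f (number of failures) are introduced here for illustration. Dropping every factor that does not depend on θ:

$$
p(\theta \mid x) \;\propto\; p(x \mid \theta)\, p(\theta)
\;\propto\; \theta^{s}(1-\theta)^{f} \cdot \theta^{\alpha-1}(1-\theta)^{\beta-1}
\;=\; \theta^{\alpha+s-1}(1-\theta)^{\beta+f-1}
$$

The right-hand side is the kernel of a Beta(α + s, β + f) density, so the normalizing integral never has to be evaluated explicitly: it is the known Beta normalizing constant with the updated parameters.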
The discussion above shows how a conjugate prior lets the posterior be obtained analytically.
It is often useful to think of the hyperparameters of a conjugate prior distribution as corresponding to having observed a certain number of pseudo-observations with properties specified by the parameters. For nearly all conjugate prior distributions, the hyperparameters can be interpreted in terms of pseudo-observations. This helps both in providing intuition behind the often messy update equations and in choosing reasonable hyperparameters for a prior.
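The pseudo-observation reading can be sketched with a Beta prior on a coin's success probability. Under the posterior-mean interpretation (an assumption; other conventions subtract one from each parameter), Beta(α, β) behaves like α prior successes and β prior failures. The helper names and the numbers are illustrative.

```python
def beta_posterior(alpha, beta, successes, failures):
    """Conjugate update: add observed counts to the prior pseudo-counts."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


successes, failures = 7, 3  # hypothetical data: 7 heads in 10 flips

# Weak prior: 1 pseudo-success and 1 pseudo-failure.
weak = beta_posterior(1.0, 1.0, successes, failures)
# Strong prior: 50 pseudo-successes and 50 pseudo-failures.
strong = beta_posterior(50.0, 50.0, successes, failures)

weak_mean = beta_mean(*weak)      # 8 / 12
strong_mean = beta_mean(*strong)  # 57 / 110
print(weak_mean, strong_mean)
```

Both priors have mean 0.5, but the strong prior counts as 100 earlier flips and so barely moves after 10 real ones, while the weak prior is pulled most of the way toward the observed frequency of 0.7.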
For a worked example, see: https://en.wikipedia.org/wiki/Conjugate_prior
Prior hyperparameters are the parameters of the prior distribution: for example, the vector α in the Dirichlet distribution, or α and β in the Beta distribution.
In the first answer, see how α represents pseudo-counts in the Dirichlet distribution. Choosing the value of α appropriately lets you encode the right prior over the parameters; the entries of α carry the prior information you have. In the example above, your guess of how many balls of each color are in the bag determines your prior, which is represented by an appropriate value of α.
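The balls-in-a-bag idea can be sketched as a Dirichlet-multinomial update, with the Dirichlet concentration vector (written α here) holding one pseudo-count per color. The colors, the prior guess, and the observed draws below are all hypothetical.

```python
def dirichlet_posterior(alpha, counts):
    """Conjugate update: add observed counts to the pseudo-counts, per category."""
    return {color: alpha[color] + counts.get(color, 0) for color in alpha}


def dirichlet_mean(alpha):
    """Expected category probabilities under a Dirichlet with parameters alpha."""
    total = sum(alpha.values())
    return {color: a / total for color, a in alpha.items()}


# Prior guess: red is believed to be twice as common as green or blue.
prior_alpha = {"red": 2.0, "green": 1.0, "blue": 1.0}
# Hypothetical draws from the bag (with replacement).
observed = {"red": 1, "green": 5, "blue": 0}

posterior_alpha = dirichlet_posterior(prior_alpha, observed)
posterior_mean = dirichlet_mean(posterior_alpha)
print(posterior_alpha)
print(posterior_mean)
```

Blue was never drawn, yet its posterior probability is not zero, because its pseudo-count of 1 still counts as one imagined earlier draw. Scaling every entry of α up would express the same guess about proportions with more confidence.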