Variational Inference
Variational Inference allows us to re-write statistical inference problems (i.e. infer the value of a random variable given the value of another random variable) as optimization problems (i.e. find the parameter values that minimize some objective function).
When you have an intractable probability distribution p, variational techniques try to solve an optimization problem over a class of tractable distributions Q in order to find a q ∈ Q that is most similar to p.
For example, we can use the KL divergence between q and p: we look for the q ∈ Q that minimizes it, so that q approximates the intractable p, and we can then use q in place of p.
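As a toy illustration of "pick the q ∈ Q closest to p" (the target distribution, the family Q, and the grid search below are entirely my own made-up choices), here is a minimal sketch in Python:

```python
import numpy as np
from scipy.stats import binom

# A "target" discrete distribution p over 5 states (made-up numbers).
p = np.array([0.05, 0.2, 0.5, 0.2, 0.05])

def kl(q, p):
    """KL(q || p) for discrete distributions on the same support."""
    return np.sum(q * np.log(q / p))

# A tiny tractable family Q: Binomial(4, theta) distributions on the same 5 states.
thetas = np.linspace(0.05, 0.95, 19)
candidates = {t: binom.pmf(np.arange(5), 4, t) for t in thetas}

# Pick the member of Q with the smallest KL(q || p).
best_theta = min(candidates, key=lambda t: kl(candidates[t], p))
print("best theta:", best_theta, "KL:", kl(candidates[best_theta], p))
```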
Let's say you have some data denoted by a random variable X. X is basically your observed variable (I will explain what is meant by observed). Now we believe that there is some random variable Z which is responsible for generating X, but Z is hidden/latent, i.e. we don't have any information about Z directly. The idea of a latent variable is basically our belief that there is a hidden factor behind the generation of our data X.
Let's take the example of a factory that manufactures boxes. We have a dataset of the number of boxes produced by the factory each day over a year. This is our observed variable X. Why is this called an observed variable? Because we have observed it, i.e. it's given. Now if X is observed, then something must be unobserved in contrast, otherwise why would we specifically call it observed?
Now think of factors that can affect X, i.e. the number of boxes produced in a day. There can be multiple factors which influence X, such as the number of workers that day, the number of machines operating that day, the morale of the workers that day, etc. Let's take the random variable Z to be the number of machines working on a day. One thing to understand is that Z influences X: the more machines working, the more boxes produced. But the thing is that we don't have any data regarding Z, and that's the reason Z is called a hidden/latent variable.
We want to infer the latent/hidden variable Z given the variable X. We want to find how many machines would have been working on some day if we know the number of boxes produced that day.
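To make the factory story concrete, here is a tiny simulation of the assumed generative process (the specific distributions and numbers are my own illustrative choices, not part of the original example):

```python
import numpy as np

rng = np.random.default_rng(42)

# Latent Z: number of machines running each day (never recorded in our dataset).
# For illustration, assume Z ~ Poisson(8).
z = rng.poisson(lam=8, size=365)

# Observed X: boxes produced each day. Assume each running machine yields
# roughly Poisson(50) boxes, so X | Z = z  ~  Poisson(50 * z).
x = rng.poisson(lam=50 * z)

# We only keep x; z is discarded. Inference asks: given a day's x, what was z?
print("first five days of observed box counts:", x[:5])
```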
Thanks to Bayes, we have the answer to this. We want to infer Z given the evidence X, i.e. the data. This is done using the posterior probability.
Definition of posterior distribution: it's the probability distribution of the random variable Z conditioned on some evidence, denoted by p(Z | X).
Using Bayes' theorem:

$$p(Z \mid X) = \frac{p(X \mid Z)\, p(Z)}{p(X)}$$
We have basically three terms:
p(X | Z): Likelihood. This is basically the model of observing X if we had known the latent variable Z. How do we get p(X | Z)? We assume a model for X given Z, i.e. we decide what distribution X follows if we know Z. For example, I can say that X follows a Gaussian with Z as its mean and some constant variance. Or, if you don't know the exact model between X and Z, you can parametrize it with some parameter θ as p_θ(X | Z) and learn θ through different learning techniques. I don't know more about this concretely and still have to see how exactly it is done. But let's just assume we know this and it isn't a problem for now.
p(Z): Prior. This simply denotes your prior belief about Z. It is mostly taken to be non-informative; or, if you do have a prior belief, use its density function.
p(X): Evidence likelihood. This is the probability of observing X overall. Think of X here as having some particular value. We can get this by integrating over all possible Z, weighting the likelihood of X for each Z by the prior, i.e. by marginalizing over Z:

$$p(X) = \int p(X \mid Z)\, p(Z)\, dZ$$
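In the one-dimensional factory example, all three terms are cheap to compute by brute force, because Z only ranges over a modest set of integers. A sketch, reusing the made-up Poisson assumptions from the simulation above:

```python
import numpy as np
from scipy.stats import poisson

x_obs = 420                     # boxes observed on some particular day
z_grid = np.arange(0, 40)       # plausible numbers of running machines

prior = poisson.pmf(z_grid, 8)                 # p(Z): prior belief about machine counts
likelihood = poisson.pmf(x_obs, 50 * z_grid)   # p(X = x_obs | Z): boxes given machines

evidence = np.sum(likelihood * prior)          # p(X = x_obs): marginalize over Z
posterior = likelihood * prior / evidence      # p(Z | X = x_obs): Bayes' theorem

print("most probable machine count:", z_grid[np.argmax(posterior)])
```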
Now the problem here is the intractable integral: calculating the evidence integral is, in many cases, intractable. In our example, Z is the number of machines running on a day. Here Z is one-dimensional, so calculating the integral (here a sum) is possible. But consider the case where Z is a d-dimensional vector; the integral then becomes intractable, as we have to integrate over all d dimensions,

$$p(X) = \int \cdots \int p(X \mid Z)\, p(Z)\, dZ_1 \cdots dZ_d,$$

and calculating this is very difficult. For example, say you have facial images of different persons, but in the data you only have the images and the identities are not given. Here X is a given image and Z is the identity of the person the image belongs to. What does it mean to calculate the posterior p(Z | X)? It simply means finding the identity of a person given the image. And what would it mean to calculate p(X)? It means integrating over identities, which are high-dimensional vectors; in words, it means taking each possible identity, finding the probability of getting that facial image from that identity, and summing over all of them. This is intractable because of the large dimensionality of the identity.
So, in most cases where Z is a high-dimensional vector, it becomes intractable to calculate the posterior distribution.
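A quick back-of-the-envelope way to see why: if we naively discretized each latent dimension with, say, N = 50 grid points, the number of terms in the marginalization sum grows as N^d (the numbers below are purely illustrative):

```python
# Cost of a naive grid-based marginalization with N points per latent dimension.
N = 50
for d in (1, 2, 10, 100):
    print(f"d = {d:3d}: about {float(N**d):.2e} evaluations of p(X|Z) p(Z)")
```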
The idea is really simple: if we can't get a tractable closed-form solution for p(Z | X), we'll approximate it.
Let the approximation be q(Z), and we can now frame this as an optimization problem:

$$q^{*}(Z) = \arg\min_{q \in Q} \; \mathrm{KL}\big(q(Z) \,\|\, p(Z \mid X)\big)$$
By choosing a family of distributions Q flexible enough to model p(Z | X) and optimizing over q ∈ Q, we can push the approximation towards the real posterior.
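A minimal sketch of this optimization, with everything made up for illustration: the "true posterior" here is a two-component Gaussian mixture that we cheat and evaluate directly on a grid (something we cannot do in a real problem, which is exactly why the ELBO derived below is needed), and Q is the family of Gaussians parametrized by a mean and a log standard deviation:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Pretend "true posterior" p(Z|X): a Gaussian mixture evaluated on a dense grid.
z = np.linspace(-8, 8, 2001)
dz = z[1] - z[0]
p = 0.3 * norm.pdf(z, -2.0, 0.7) + 0.7 * norm.pdf(z, 2.0, 1.0)

def kl_q_to_p(params):
    """Riemann-sum approximation of KL(q || p) for q = N(mu, sigma^2)."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    q = norm.pdf(z, mu, sigma)
    return np.sum(q * (norm.logpdf(z, mu, sigma) - np.log(p)) * dz)

res = minimize(kl_q_to_p, x0=np.array([0.0, 0.0]))
print("fitted q: mean = %.2f, std = %.2f" % (res.x[0], np.exp(res.x[1])))
```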
Now let's expand the KL-divergence term:

$$\mathrm{KL}\big(q(Z) \,\|\, p(Z \mid X)\big) = \mathbb{E}_{q}\big[\log q(Z)\big] - \mathbb{E}_{q}\big[\log p(X, Z)\big] + \log p(X)$$
We can compute the first two terms in the above expansion, but, oh lord! the third term is the same annoying (intractable) integral we were avoiding before. What do we do now? This seems to be a deadlock!
Please recall that our original objective was a minimization problem over q. We can pull a little trick here: we can optimize only the first two terms and ignore the third term. How?
Because the third term, log p(X), is independent of q. So, we just need to minimize

$$\mathbb{E}_{q}\big[\log q(Z)\big] - \mathbb{E}_{q}\big[\log p(X, Z)\big]$$
Or equivalently, maximize (just flip the two terms)

$$\mathrm{ELBO}(q) = \mathbb{E}_{q}\big[\log p(X, Z)\big] - \mathbb{E}_{q}\big[\log q(Z)\big]$$
This term, usually called the ELBO (Evidence Lower BOund), is quite famous in the VI literature, and you have just witnessed how it looks and where it came from. Taking a deeper look into the ELBO(·):

$$\mathrm{ELBO}(q) = \mathbb{E}_{q}\big[\log p(X \mid Z)\big] + \mathbb{E}_{q}\big[\log p(Z)\big] - \mathbb{E}_{q}\big[\log q(Z)\big] = \mathbb{E}_{q}\big[\log p(X \mid Z)\big] - \mathrm{KL}\big(q(Z) \,\|\, p(Z)\big)$$
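To make the ELBO tangible, here is a minimal Monte Carlo sketch for a made-up one-dimensional model (prior Z ~ N(0, 1), likelihood X | Z ~ N(Z, 1), Gaussian q; all of these choices are mine, purely for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x_obs = 2.5  # a single observed data point (made up)

def elbo(mu_q, sigma_q, n_samples=100_000):
    """Monte Carlo estimate of E_q[log p(X, Z)] - E_q[log q(Z)]."""
    z = rng.normal(mu_q, sigma_q, size=n_samples)                  # z ~ q(Z)
    log_joint = norm.logpdf(z, 0, 1) + norm.logpdf(x_obs, z, 1)    # log p(Z) + log p(X|Z)
    log_q = norm.logpdf(z, mu_q, sigma_q)
    return np.mean(log_joint - log_q)

print("ELBO of a poor q  :", elbo(-2.0, 3.0))
print("ELBO of a better q:", elbo(1.25, 0.7))  # close to the true posterior N(1.25, 0.707)
```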
There is one more interpretation (see Fig. 5) of the KL-divergence expansion that is interesting to us. Rewriting the KL expansion and substituting the ELBO(·) definition, we get

$$\log p(X) = \mathrm{ELBO}(q) + \mathrm{KL}\big(q(Z) \,\|\, p(Z \mid X)\big)$$
As we know, for any two distributions the following inequality holds:

$$\mathrm{KL}\big(q \,\|\, p\big) \geq 0 \quad \Longrightarrow \quad \log p(X) \geq \mathrm{ELBO}(q)$$
So, the ELBO(·) that we vowed to maximize is a lower bound on the observed-data log-likelihood. That's amazing, isn't it! Just by maximizing the ELBO(·), we can implicitly get closer to our dream of estimating the maximum (log-)likelihood: the tighter the bound, the better the approximation.
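We can check the bound numerically on a toy model where log p(X) is available in closed form (the same made-up conjugate Gaussian model as in the sketch above: Z ~ N(0, 1), X | Z ~ N(Z, 1), so marginally X ~ N(0, 2) and the exact posterior is N(x/2, sqrt(1/2))):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x_obs = 2.5

def elbo(mu_q, sigma_q, n_samples=200_000):
    """Monte Carlo ELBO for prior N(0,1), likelihood N(Z,1), Gaussian q."""
    z = rng.normal(mu_q, sigma_q, size=n_samples)
    log_joint = norm.logpdf(z, 0, 1) + norm.logpdf(x_obs, z, 1)
    return np.mean(log_joint - norm.logpdf(z, mu_q, sigma_q))

log_px = norm.logpdf(x_obs, 0, np.sqrt(2))  # exact log p(X), since X ~ N(0, 2)

print("log p(x)                  :", log_px)
print("ELBO at an arbitrary q    :", elbo(0.0, 2.0))                 # strictly below log p(x)
print("ELBO at the true posterior:", elbo(x_obs / 2, np.sqrt(0.5)))  # gap (the KL term) ~ 0
```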
Okay! Way too much math for today. This is overall how Variational Inference looks. There are numerous details and variants I have not covered here.
Fig. 5: Interpretation of ELBO

Now, please consider looking at the ELBO decomposition above for a while, because that is what all our efforts led us to. It is totally tractable and also solves our problem. What it basically says is that maximizing ELBO(·) (which is a proxy objective for our original optimization problem) is equivalent to maximizing the conditional data likelihood p(X | Z) (whose form we can choose in our graphical model design) and simultaneously pushing our approximate posterior q(Z) towards a prior over Z. The prior p(Z) is basically how the true latent space is organized. Now the immediate question might arise: "Where do we get p(Z) from?" The answer is: we can just choose any distribution as a hypothesis. It will be our belief of how the latent space is organized.
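As a final sketch (same made-up Gaussian toy model as before, plus the standard closed-form KL between two univariate Gaussians), here is the ELBO written exactly in the decomposed form: an expected reconstruction term minus a KL term that pulls q(Z) towards the prior p(Z):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x_obs = 2.5

def elbo(mu_q, sigma_q, n_samples=100_000):
    """E_q[log p(X|Z)]  -  KL(q(Z) || p(Z)),  with prior p(Z) = N(0, 1)."""
    z = rng.normal(mu_q, sigma_q, size=n_samples)
    reconstruction = np.mean(norm.logpdf(x_obs, z, 1.0))   # E_q[log p(X|Z)], by Monte Carlo
    # Closed-form KL between q = N(mu_q, sigma_q^2) and the prior N(0, 1):
    kl_to_prior = np.log(1.0 / sigma_q) + (sigma_q**2 + mu_q**2) / 2.0 - 0.5
    return reconstruction - kl_to_prior

print(elbo(1.25, 0.7))  # same value (up to Monte Carlo noise) as the earlier ELBO estimate
```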