Variational Inference

Summary

Variational Inference allows us to rewrite statistical inference problems (i.e. inferring the value of a random variable given the value of another random variable) as optimization problems (i.e. finding the parameter values that minimize some objective function).

When we have an intractable probability distribution p, variational techniques try to solve an optimization problem over a class of tractable distributions Q in order to find a q ∈ Q that is most similar to p.

For example, we can measure similarity with the KL divergence between q and p, so that q becomes an approximation to the intractable p and we can use q in place of p:

KL(q||p) = \sum_x q(x)\log \frac{q(x)}{p(x)}
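For a discrete distribution the sum above can be computed directly. Below is a minimal sketch in Python (the particular arrays p and q are made up purely for illustration):

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions given as arrays of probabilities."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0                      # 0 * log(0) is treated as 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

p = np.array([0.1, 0.2, 0.3, 0.4])      # "true" (possibly intractable) distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # candidate approximation
print(kl_divergence(q, p))              # >= 0, and 0 only when q == p everywhere
```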

Probabilistic Model

Let's say you have some data denoted by a random variable X. X is your observed variable (I will explain what "observed" means shortly). Now, we believe there is some random variable Z which is responsible for generating X, but Z is hidden/latent, i.e. we don't have any direct information about Z. The idea of a latent variable is simply our belief that there is a hidden factor Z behind the generation of our data X.

Let's take the example of a factory that manufactures boxes. We have a dataset of the number of boxes produced by the factory each day over a year. This is our observed variable X. Why is it called an observed variable? Because we have observed it, i.e. it is given. And if X is observed, then by contrast something must be unobserved, otherwise why would we specifically call X observed?

Now think of the factors that can affect X, i.e. the number of boxes produced in a day. There can be multiple factors that influence X, such as the number of workers that day, the number of machines operating that day, the morale of the workers, etc. Let's take the random variable Z to be the number of machines working on a given day. One thing to understand is that Z influences X: the more machines working, the more boxes produced. But the thing is that we don't have any data about Z, and that is why Z is called a hidden/latent variable. A toy version of this generative story is sketched below.
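Purely to make the setup concrete, here is a minimal sketch of such a generative process in Python. The specific distributions (a Poisson number of machines, and a Poisson number of boxes per machine) are my own assumptions for illustration, not part of the original problem:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_day(mean_machines=5.0, boxes_per_machine=20.0):
    """Hypothetical generative story: latent Z -> observed X."""
    z = rng.poisson(mean_machines)          # Z: machines running today (latent)
    x = rng.poisson(boxes_per_machine * z)  # X | Z: boxes produced (observed)
    return z, x

# In the dataset we only ever keep x; z is discarded, i.e. it stays latent.
data = [sample_day()[1] for _ in range(365)]
print(data[:5])
```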

What do we want to achieve?

We want to infer the latent/hidden variable Z given the observed variable X. In our example, we want to find out how many machines were likely working on a given day if we know the number of boxes produced that day.

How do we do the inference then?

Thanks to Bayes, we have the answer to this. We want to infer Z given the evidence, i.e. the data X. This is done using the posterior probability.

Definition of posterior distribution: it is the probability distribution of a random variable conditioned on some evidence, denoted by P(Z|X).

Using Bayes' theorem:

P(Z|X) = \frac{P(X|Z)P(Z)}{P(X)} = \frac{P(X|Z)P(Z)}{\int_Z P(X|Z)P(Z)\,dZ}

Calculating the posterior

We have basically three terms:

P(X|Z): Likelihood. This is a model of observing X if we knew the latent variable Z. How do we get P(X|Z)? We assume a model for X given Z, i.e. we state which distribution X follows once Z is known. For example, we could say that X follows a Gaussian with Z as its mean and some constant variance. Or, if we don't know the exact relationship between X and Z, we can parametrize P(X|Z) with some parameter θ as P(X|Z, θ) and learn θ through standard learning techniques. I don't know the details of that concretely yet and still have to see how exactly it is done. But let's just assume we know this term, so it isn't a problem for now.

P(Z): Prior. This simply denotes your prior belief about Z. It is often chosen to be non-informative; if you do have a prior belief, use its density function.

P(X) = \int_Z P(X|Z)P(Z)\,dZ: Evidence (marginal likelihood). This is the overall probability of observing X. Think of X = x as taking some particular value. We obtain P(X) by integrating over all possible values of Z, weighting the likelihood of X for each Z by the prior. This is simply marginalization over Z; think of it in those terms and its meaning becomes clear. The sketch below computes all three terms for a small discrete version of the factory example.
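Here is a minimal sketch that makes the three terms concrete, assuming a discrete version of the factory example: Z is the number of machines (1 to 10) with a Poisson prior, and X given Z is Poisson with mean proportional to Z. All numbers and distributional choices are illustrative assumptions, not part of the original text:

```python
import numpy as np
from scipy.stats import poisson

z_values = np.arange(1, 11)                     # possible numbers of machines
prior = poisson.pmf(z_values, mu=5.0)           # P(Z): prior belief about Z
prior /= prior.sum()                            # renormalize after truncating at 10

x_observed = 80                                 # boxes observed today
likelihood = poisson.pmf(x_observed, mu=20.0 * z_values)   # P(X|Z) for each Z

evidence = np.sum(likelihood * prior)           # P(X) = sum_Z P(X|Z) P(Z)
posterior = likelihood * prior / evidence       # P(Z|X) by Bayes' theorem

print(posterior.round(3))                       # mass concentrates near Z = 4
```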

The Problem of the Intractable Posterior

The problem here is an intractable integral. What happens is that the integral for P(X) is very often intractable. In our example, Z is the number of machines running on a day. Here Z is one-dimensional, so calculating the integral is feasible. But if Z is a d-dimensional vector, the integral becomes intractable, because we have to integrate over all d dimensions: \int_{z_1}\int_{z_2}\dots\int_{z_d} P(X|z_1, z_2, \dots, z_d)\,P(z_1, z_2, \dots, z_d). Calculating this is very difficult. For example, suppose you have n facial images of K different people, but the dataset contains only the images and the identities are not given. Here X is a given image and Z is the identity of the person the image belongs to. What does it mean to calculate the posterior P(Z|X)? It simply means finding the identity of the person given the image. And what does it mean to calculate P(X) = \int_Z P(X|Z)P(Z)? It means integrating over the identity representation Z, which is a d-dimensional vector; in words, taking each possible person, finding the probability of getting that facial image from that person, and summing over all persons. This is intractable because of the high dimensionality of the identity representation.

So, in most cases where Z is a high-dimensional vector, it becomes intractable to calculate the posterior distribution. The quick count below shows how fast exact marginalization blows up.
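As a rough illustration, suppose each of the d latent dimensions could take one of K discrete values (both numbers are made up here); then the evidence sum already has K^d terms:

```python
# Back-of-the-envelope count of the terms in the evidence sum
# sum_{z_1} ... sum_{z_d} P(X | z_1, ..., z_d) P(z_1, ..., z_d)
K = 10  # assumed number of values per latent dimension
for d in (1, 5, 10, 20):
    print(f"d = {d:2d}  ->  {K ** d:.3e} terms")
```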

Solution: Variational Inference

The idea is really simple: if we can't get a tractable closed-form solution for P(Z|X), we'll approximate it.

Let the approximation be Q(Z;ϕ). We can now frame this as an optimization problem:

\phi^* = \arg\min_{\phi} KL[Q(Z;\phi) \,||\, P(Z|X)]

By choosing a family of distributions Q(Z;ϕ) flexible enough to model P(Z|X) and optimizing over ϕ, we can push the approximation towards the real posterior.

Now let's expand the KL-divergence term:

KL[Q(Z;\phi) \,||\, P(Z|X)] = E_Q[\log Q(Z;\phi)] - E_Q[\log P(Z|X)] \\ = E_Q[\log Q(Z;\phi)] - E_Q\left[\log \frac{P(X,Z)}{P(X)}\right] \\ = E_Q[\log Q(Z;\phi)] - E_Q[\log P(X,Z)] + \log P(X)

We can compute the first two terms in the above expansion, but oh lord! The third term is the same annoying (intractable) integral we were trying to avoid. What do we do now? This seems to be a deadlock!

The Evidence Lower BOund (ELBO)

Recall that our original objective was a minimization problem over Q(⋅;ϕ). We can pull a little trick here: we can optimize only the first two terms and ignore the third. How?

Because the third term, log P(X), is independent of Q(⋅;ϕ). So we just need to minimize

E_Q[\log Q(Z;\phi)] - E_Q[\log P(X,Z)]

Or equivalently, maximize (just flip the two terms)

ELBO(Q) \triangleq E_Q[\log P(X,Z)] - E_Q[\log Q(Z;\phi)]

This term, usually called the ELBO, is quite famous in the VI literature, and you have just seen what it looks like and where it comes from. Taking a deeper look at ELBO(⋅):

ELBO(Q) = E_Q[\log P(X|Z)] + E_Q[\log P(Z)] - E_Q[\log Q(Z;\phi)] \\ = E_Q[\log P(X|Z)] - KL[Q(Z;\phi) \,||\, P(Z)]

Now, please look at the last equation for a while, because it is what all our efforts have led us to. The last equation is completely tractable and it also solves our problem. What it says is that maximizing ELBO(⋅) (which is a proxy objective for our original optimization problem) is equivalent to maximizing the conditional data likelihood (whose form we choose in our graphical model design) while simultaneously pushing our approximate posterior Q(Z;ϕ) towards a prior over Z. The prior P(Z) is basically how the true latent space is organized. The immediate question might be: "Where do we get P(Z) from?" The answer is that we can simply choose any distribution as a hypothesis; it encodes our belief about how the Z space is organized.

Fig. 5: Interpretation of ELBO
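To see this objective in action, here is a minimal sketch that maximizes the ELBO in exactly the last form above, continuing the hypothetical discrete factory model from earlier. Choosing Q(Z;ϕ) to be a categorical distribution parameterized by softmax logits ϕ, and using scipy's generic optimizer, are my own illustrative choices:

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import minimize

# Hypothetical discrete factory model (same assumptions as the earlier sketch).
z_values = np.arange(1, 11)
prior = poisson.pmf(z_values, mu=5.0)
prior /= prior.sum()                                      # P(Z)
x_observed = 80
log_lik = poisson.logpmf(x_observed, mu=20.0 * z_values)  # log P(X|Z)

def softmax(phi):
    e = np.exp(phi - phi.max())
    return e / e.sum()

def neg_elbo(phi):
    q = softmax(phi)                                # Q(Z; phi): categorical over Z
    kl_q_prior = np.sum(q * np.log(q / prior))      # KL[Q(Z; phi) || P(Z)]
    return -(np.sum(q * log_lik) - kl_q_prior)      # -(E_Q[log P(X|Z)] - KL)

result = minimize(neg_elbo, x0=np.zeros(len(z_values)))
q_star = softmax(result.x)
print(q_star.round(3))   # concentrates near Z = 4, close to the exact posterior
```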

There is one more interpretation of the KL-divergence expansion (see figure 5) that is interesting to us. Rewriting the KL expansion and substituting the ELBO(⋅) definition, we get

\log P(X) = ELBO(Q) + KL[Q(Z;\phi) \,||\, P(Z|X)]

Since KL(⋅||⋅) ≥ 0 for any two distributions, the following inequality holds:

\log P(X) \geq ELBO(Q)

So the ELBO(⋅) that we vowed to maximize is a lower bound on the observed-data log-likelihood. That's amazing, isn't it! Just by maximizing the ELBO(⋅), we implicitly get closer to our dream of maximum (log-)likelihood estimation: the tighter the bound, the better the approximation. The sketch below checks this identity numerically.
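As a sanity check, here is a minimal numerical sketch (again on the hypothetical discrete factory model, with the same made-up numbers) verifying that log P(X) = ELBO(Q) + KL[Q || P(Z|X)], so the ELBO is indeed a lower bound and becomes tight when Q equals the true posterior:

```python
import numpy as np
from scipy.stats import poisson

# Hypothetical discrete factory model (same assumptions as before).
z_values = np.arange(1, 11)
prior = poisson.pmf(z_values, mu=5.0)
prior /= prior.sum()
x_observed = 80
lik = poisson.pmf(x_observed, mu=20.0 * z_values)

evidence = np.sum(lik * prior)                 # P(X), tractable here since Z is tiny
posterior = lik * prior / evidence             # exact P(Z|X)

def elbo(q):
    log_joint = np.log(lik) + np.log(prior)    # log P(X, Z)
    return np.sum(q * log_joint) - np.sum(q * np.log(q))

q = np.full(len(z_values), 1.0 / len(z_values))        # a crude approximation
kl_q_post = np.sum(q * np.log(q / posterior))          # KL[Q || P(Z|X)]

print(np.log(evidence), elbo(q) + kl_q_post)           # equal: log P(X) = ELBO + KL
print(np.log(evidence) >= elbo(q))                     # True: ELBO is a lower bound
print(np.isclose(np.log(evidence), elbo(posterior)))   # tight when Q is the posterior
```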

Okay! Way too much math for today. This is, overall, what Variational Inference looks like. There are numerous directions to take it from here, but the core recipe is what you have just seen.
