Variational Inference

Summary

Variational Inference allows us to rewrite statistical inference problems (i.e. inferring the value of a random variable given the value of another random variable) as optimization problems (i.e. finding the parameter values that minimize some objective function).

When we have an intractable probability distribution p, variational techniques try to solve an optimization problem over a class of tractable distributions Q in order to find a q ∈ Q that is most similar to p.

For example, we can measure similarity with the KL divergence between q and p, so that q becomes an approximation to the intractable p and we can use q in place of p:

KL(q||p) = \sum_x q(x)\log \frac{q(x)}{p(x)}
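For a discrete distribution the sum above can be computed directly. Below is a minimal sketch in Python (the particular arrays p and q are made up purely for illustration):

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions given as arrays of probabilities."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0                      # 0 * log(0) is treated as 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

p = np.array([0.1, 0.2, 0.3, 0.4])      # "true" (possibly intractable) distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # candidate approximation
print(kl_divergence(q, p))              # >= 0, and 0 only when q == p everywhere
```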

Probabilistic Model

Let's say you have some data denoted by a random variable X. X is your observed variable (I will explain what "observed" means shortly). Now, we believe there is some random variable Z which is responsible for generating X, but Z is hidden/latent, i.e. we don't have any direct information about Z. The idea of a latent variable is simply our belief that there is a hidden factor Z behind the generation of our data X.

Let's take the example of a factory that manufactures boxes. We have a dataset of the number of boxes produced by the factory each day over a year. This is our observed variable X. Why is it called an observed variable? Because we have observed it, i.e. it is given. And if X is observed, then by contrast something must be unobserved, otherwise why would we specifically call X observed?

Now think of the factors that can affect X, i.e. the number of boxes produced in a day. There can be multiple factors that influence X, such as the number of workers that day, the number of machines operating that day, the morale of the workers, etc. Let's take the random variable Z to be the number of machines working on a given day. One thing to understand is that Z influences X: the more machines working, the more boxes produced. But the thing is that we don't have any data about Z, and that is why Z is called a hidden/latent variable. A toy version of this generative story is sketched below.
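Purely to make the setup concrete, here is a minimal sketch of such a generative process in Python. The specific distributions (a Poisson number of machines, and a Poisson number of boxes per machine) are my own assumptions for illustration, not part of the original problem:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_day(mean_machines=5.0, boxes_per_machine=20.0):
    """Hypothetical generative story: latent Z -> observed X."""
    z = rng.poisson(mean_machines)          # Z: machines running today (latent)
    x = rng.poisson(boxes_per_machine * z)  # X | Z: boxes produced (observed)
    return z, x

# In the dataset we only ever keep x; z is discarded, i.e. it stays latent.
data = [sample_day()[1] for _ in range(365)]
print(data[:5])
```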

What do we want to achieve?

We want to infer the latent/hidden variable Z given the observed variable X. In our example, we want to find out how many machines were likely working on a given day if we know the number of boxes produced that day.

How do we do the inference then?

Thanks to Bayes, we have the answer to this. We want to infer Z given the evidence, i.e. the data X. This is done using the posterior probability.

Definition of posterior distribution: it is the probability distribution of a random variable conditioned on some evidence, denoted by P(Z|X).

Using Bayes' theorem:

P(Z|X) = \frac{P(X|Z)P(Z)}{P(X)} = \frac{P(X|Z)P(Z)}{\int_Z P(X|Z)P(Z)\,dZ}

Calculating the posterior

We have basically three terms:

P(X|Z): Likelihood. This is a model of observing X if we knew the latent variable Z. How do we get P(X|Z)? We assume a model for X given Z, i.e. we state which distribution X follows once Z is known. For example, we could say that X follows a Gaussian with Z as its mean and some constant variance. Or, if we don't know the exact relationship between X and Z, we can parametrize P(X|Z) with some parameter θ as P(X|Z, θ) and learn θ through standard learning techniques. I don't know the details of that concretely yet and still have to see how exactly it is done. But let's just assume we know this term, so it isn't a problem for now.

P(Z): Prior. This simply denotes your prior belief about Z. It is often chosen to be non-informative; if you do have a prior belief, use its density function.

P(X) = \int_Z P(X|Z)P(Z)\,dZ: Evidence (marginal likelihood). This is the overall probability of observing X. Think of X = x as taking some particular value. We obtain P(X) by integrating over all possible values of Z, weighting the likelihood of X for each Z by the prior. This is simply marginalization over Z; think of it in those terms and its meaning becomes clear. The sketch below computes all three terms for a small discrete version of the factory example.
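Here is a minimal sketch that makes the three terms concrete, assuming a discrete version of the factory example: Z is the number of machines (1 to 10) with a Poisson prior, and X given Z is Poisson with mean proportional to Z. All numbers and distributional choices are illustrative assumptions, not part of the original text:

```python
import numpy as np
from scipy.stats import poisson

z_values = np.arange(1, 11)                     # possible numbers of machines
prior = poisson.pmf(z_values, mu=5.0)           # P(Z): prior belief about Z
prior /= prior.sum()                            # renormalize after truncating at 10

x_observed = 80                                 # boxes observed today
likelihood = poisson.pmf(x_observed, mu=20.0 * z_values)   # P(X|Z) for each Z

evidence = np.sum(likelihood * prior)           # P(X) = sum_Z P(X|Z) P(Z)
posterior = likelihood * prior / evidence       # P(Z|X) by Bayes' theorem

print(posterior.round(3))                       # mass concentrates near Z = 4
```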

The Problem of the Intractable Posterior

The problem here is an intractable integral. What happens is that the integral for P(X) is very often intractable. In our example, Z is the number of machines running on a day. Here Z is one-dimensional, so calculating the integral is feasible. But if Z is a d-dimensional vector, the integral becomes intractable, because we have to integrate over all d dimensions: \int_{z_1}\int_{z_2}\dots\int_{z_d} P(X|z_1, z_2, \dots, z_d)\,P(z_1, z_2, \dots, z_d). Calculating this is very difficult. For example, suppose you have n facial images of K different people, but the dataset contains only the images and the identities are not given. Here X is a given image and Z is the identity of the person the image belongs to. What does it mean to calculate the posterior P(Z|X)? It simply means finding the identity of the person given the image. And what does it mean to calculate P(X) = \int_Z P(X|Z)P(Z)? It means integrating over the identity representation Z, which is a d-dimensional vector; in words, taking each possible person, finding the probability of getting that facial image from that person, and summing over all persons. This is intractable because of the high dimensionality of the identity representation.

So, in most cases where Z is a high-dimensional vector, it becomes intractable to calculate the posterior distribution. The quick count below shows how fast exact marginalization blows up.
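As a rough illustration, suppose each of the d latent dimensions could take one of K discrete values (both numbers are made up here); then the evidence sum already has K^d terms:

```python
# Back-of-the-envelope count of the terms in the evidence sum
# sum_{z_1} ... sum_{z_d} P(X | z_1, ..., z_d) P(z_1, ..., z_d)
K = 10  # assumed number of values per latent dimension
for d in (1, 5, 10, 20):
    print(f"d = {d:2d}  ->  {K ** d:.3e} terms")
```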

Solution: Variational Inference

The idea is really simple: if we can't get a tractable closed-form solution for P(Z|X), we'll approximate it.

Let the approximation be Q(Z;ϕ). We can now frame this as an optimization problem:

\phi^* = \arg\min_{\phi} KL[Q(Z;\phi) \,||\, P(Z|X)]

By choosing a family of distributions Q(Z;ϕ) flexible enough to model P(Z|X) and optimizing over ϕ, we can push the approximation towards the real posterior.

Now let's expand the KL-divergence term:

KL[Q(Z;\phi) \,||\, P(Z|X)] = E_Q[\log Q(Z;\phi)] - E_Q[\log P(Z|X)] \\ = E_Q[\log Q(Z;\phi)] - E_Q\left[\log \frac{P(X,Z)}{P(X)}\right] \\ = E_Q[\log Q(Z;\phi)] - E_Q[\log P(X,Z)] + \log P(X)

We can compute the first two terms in the above expansion, but oh lord! The third term is the same annoying (intractable) integral we were trying to avoid. What do we do now? This seems to be a deadlock!

The Evidence Lower BOund (ELBO)

Recall that our original objective was a minimization problem over Q(⋅;ϕ). We can pull a little trick here: we can optimize only the first two terms and ignore the third. How?

Because the third term, log P(X), is independent of Q(⋅;ϕ). So we just need to minimize

E_Q[\log Q(Z;\phi)] - E_Q[\log P(X,Z)]

Or equivalently, maximize (just flip the two terms)

ELBO(Q) \triangleq E_Q[\log P(X,Z)] - E_Q[\log Q(Z;\phi)]

This term, usually called the ELBO, is quite famous in the VI literature, and you have just seen what it looks like and where it comes from. Taking a deeper look at ELBO(⋅):

ELBO(Q) = E_Q[\log P(X|Z)] + E_Q[\log P(Z)] - E_Q[\log Q(Z;\phi)] \\ = E_Q[\log P(X|Z)] - KL[Q(Z;\phi) \,||\, P(Z)]

Now, please look at the last equation for a while, because it is what all our efforts have led us to. The last equation is completely tractable and it also solves our problem. What it says is that maximizing ELBO(⋅) (which is a proxy objective for our original optimization problem) is equivalent to maximizing the conditional data likelihood (whose form we choose in our graphical model design) while simultaneously pushing our approximate posterior Q(Z;ϕ) towards a prior over Z. The prior P(Z) is basically how the true latent space is organized. The immediate question might be: "Where do we get P(Z) from?" The answer is that we can simply choose any distribution as a hypothesis; it encodes our belief about how the Z space is organized.

Fig. 5: Interpretation of ELBO
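To see this objective in action, here is a minimal sketch that maximizes the ELBO in exactly the last form above, continuing the hypothetical discrete factory model from earlier. Choosing Q(Z;ϕ) to be a categorical distribution parameterized by softmax logits ϕ, and using scipy's generic optimizer, are my own illustrative choices:

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import minimize

# Hypothetical discrete factory model (same assumptions as the earlier sketch).
z_values = np.arange(1, 11)
prior = poisson.pmf(z_values, mu=5.0)
prior /= prior.sum()                                      # P(Z)
x_observed = 80
log_lik = poisson.logpmf(x_observed, mu=20.0 * z_values)  # log P(X|Z)

def softmax(phi):
    e = np.exp(phi - phi.max())
    return e / e.sum()

def neg_elbo(phi):
    q = softmax(phi)                                # Q(Z; phi): categorical over Z
    kl_q_prior = np.sum(q * np.log(q / prior))      # KL[Q(Z; phi) || P(Z)]
    return -(np.sum(q * log_lik) - kl_q_prior)      # -(E_Q[log P(X|Z)] - KL)

result = minimize(neg_elbo, x0=np.zeros(len(z_values)))
q_star = softmax(result.x)
print(q_star.round(3))   # concentrates near Z = 4, close to the exact posterior
```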

There is one more interpretation of the KL-divergence expansion (see figure 5) that is interesting to us. Rewriting the KL expansion and substituting the ELBO(⋅) definition, we get

\log P(X) = ELBO(Q) + KL[Q(Z;\phi) \,||\, P(Z|X)]

Since KL(⋅||⋅) ≥ 0 for any two distributions, the following inequality holds:

\log P(X) \geq ELBO(Q)

So the ELBO(⋅) that we vowed to maximize is a lower bound on the observed-data log-likelihood. That's amazing, isn't it! Just by maximizing the ELBO(⋅), we implicitly get closer to our dream of maximum (log-)likelihood estimation: the tighter the bound, the better the approximation. The sketch below checks this identity numerically.
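As a sanity check, here is a minimal numerical sketch (again on the hypothetical discrete factory model, with the same made-up numbers) verifying that log P(X) = ELBO(Q) + KL[Q || P(Z|X)], so the ELBO is indeed a lower bound and becomes tight when Q equals the true posterior:

```python
import numpy as np
from scipy.stats import poisson

# Hypothetical discrete factory model (same assumptions as before).
z_values = np.arange(1, 11)
prior = poisson.pmf(z_values, mu=5.0)
prior /= prior.sum()
x_observed = 80
lik = poisson.pmf(x_observed, mu=20.0 * z_values)

evidence = np.sum(lik * prior)                 # P(X), tractable here since Z is tiny
posterior = lik * prior / evidence             # exact P(Z|X)

def elbo(q):
    log_joint = np.log(lik) + np.log(prior)    # log P(X, Z)
    return np.sum(q * log_joint) - np.sum(q * np.log(q))

q = np.full(len(z_values), 1.0 / len(z_values))        # a crude approximation
kl_q_post = np.sum(q * np.log(q / posterior))          # KL[Q || P(Z|X)]

print(np.log(evidence), elbo(q) + kl_q_post)           # equal: log P(X) = ELBO + KL
print(np.log(evidence) >= elbo(q))                     # True: ELBO is a lower bound
print(np.isclose(np.log(evidence), elbo(posterior)))   # tight when Q is the posterior
```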

Okay! Way too much math for today. This is, overall, what Variational Inference looks like. There are numerous directions to take it from here, but the core recipe is what you have just seen.
