Conjugate Gradient

Background

The conjugate gradient method is often implemented as an iterative algorithm, applicable to sparse systems that are too large to be handled by a direct implementation or other direct methods such as the Cholesky decomposition.

Iterative methods like CG are suited for use with sparse matrices. If $A$ is dense, your best course of action is probably to factor $A$ and solve the equation by backsubstitution. The time spent factoring a dense $A$ is roughly equivalent to the time spent solving the system iteratively; and once $A$ is factored, the system can be backsolved quickly for multiple values of $b$. Compare this dense matrix with a sparse matrix of larger size that fills the same amount of memory. The triangular factors of a sparse $A$ usually have many more nonzero elements than $A$ itself. Factoring may be impossible due to limited memory, and will be time-consuming as well; even the backsolving step may be slower than an iterative solution. On the other hand, most iterative methods are memory-efficient and run quickly with sparse matrices.

NOTE: This method only works when the matrix A is symmetric positive definite.

Problem Setting

CG is used to solve the linear system of equations given by

$$Ax = b$$

where $A$ is an $n \times n$ symmetric positive-definite matrix, i.e. $x^TAx > 0$ for all nonzero $x$.

Another, rather different, way to see this is from the perspective of optimization, which is where the "gradient descent" in "conjugate gradient descent" comes from. Solving $Ax = b$ can be viewed as the minimization of the following problem:

$$f(x) = \frac{1}{2}x^TAx - b^Tx$$

Now, to minimize $f(x)$, we need $f'(x) = 0$. Hence

$$f'(x) = Ax - b = 0$$

which is exactly the equation we want to solve. And since $f''(x) = A$ and $A$ is positive definite, we can be sure that the solution is a minimum and not a maximum.
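To make the equivalence concrete, here is a minimal sketch in Python/NumPy (the values of $A$ and $b$ are hypothetical toy numbers, not from these notes): the gradient of $f$ vanishes exactly at the solution of $Ax = b$, and $f$ grows as we move away from it.

```python
import numpy as np

# Hypothetical toy system: a small symmetric positive-definite A and some b.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

def f(x):
    # Quadratic objective f(x) = 1/2 x^T A x - b^T x
    return 0.5 * x @ A @ x - b @ x

x_star = np.linalg.solve(A, b)       # solution of Ax = b
print(A @ x_star - b)                # f'(x*) = A x* - b  ~  [0, 0]
print(f(x_star) < f(x_star + 0.1))   # True: f increases away from x*
```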

Steepest Gradient Descent

Let's start with a version of gradient descent called steepest descent.

In steepest descent we take multiple steps to reach the optimal value, and many of those steps might be in the same direction, because at each point we simply move in the direction perpendicular to the level sets. What this means is that we are searching in the span defined by the gradients at each point, something like the following:

In steepest descent, the search is spanned by the gradient vectors, and consecutive gradient vectors are orthogonal to each other (shown below).

You solve it using iterative methods as follows:

$$x_{t+1} = x_t - \alpha_t f'(x_t)$$

Now, what makes steepest descent different from generic gradient descent is that here we try to find an optimal value for the step size $\alpha_t$ at every iteration. Writing $g_t = f'(x_t)$ for the gradient, we find it using

$$\alpha_t^\star = \arg\min_{\alpha_t} f(x_t - \alpha_t g_t)$$

which we obtain by setting the derivative with respect to $\alpha_t$ to zero:

$$\frac{\partial f(x_t - \alpha_t g_t)}{\partial \alpha_t} = 0 \quad\Longrightarrow\quad \alpha_t^\star = \frac{g_t^T g_t}{g_t^T A g_t}$$

From the above equations we also get the following:

$$f'(x_t - \alpha_t g_t)^T g_t = 0 \quad\Longrightarrow\quad g_{t+1}^T g_t = 0$$

i.e. the gradients at consecutive steps are orthogonal to each other.
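The derivation above translates directly into code. Below is a minimal sketch of steepest descent with the exact step size (Python/NumPy; the function name and the 2x2 system are illustrative, not from these notes), including a check that consecutive gradients are indeed orthogonal.

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=10_000):
    """Minimize f(x) = 1/2 x^T A x - b^T x (equivalently, solve Ax = b)
    using the exact step size alpha_t = (g^T g) / (g^T A g)."""
    x = x0.astype(float)
    g = A @ x - b                            # gradient f'(x_t)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        alpha = (g @ g) / (g @ (A @ g))      # optimal step along -g
        x = x - alpha * g
        g_new = A @ x - b
        # consecutive gradients are orthogonal: g_{t+1}^T g_t ~ 0
        assert abs(g_new @ g) < 1e-8 * (1 + g @ g)
        g = g_new
    return x

# Hypothetical 2x2 SPD system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(steepest_descent(A, b, np.zeros(2)))   # ~ np.linalg.solve(A, b)
```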

Conjugate Gradient Descent

In CG, we search in the space defined by direction vectors chosen so that we never take steps in the same direction, i.e. in each iteration we explore a new direction. So, for example, finding a point in $n$-dimensional space takes only $n$ steps.

So you iteratively find different directions that are conjugate orthogonal to each other (why do they need to be conjugate orthogonal?) and search in those directions.

If you think about it, since it's an $n$-dimensional space, you can express any vector as a combination of $n$ independent vectors that span that $n$-dimensional space. So, say you start from $x_0$; you should then be able to find the $n$ coefficients ($\alpha$), i.e. the proportions in which to combine the $n$ vectors to reach the solution.

So basically you find $n$ mutually conjugate directions in the space of the optimization problem (they act like a set of axes adapted to $A$, though they are not eigenvectors in general); in each step you find one direction and take exactly the step along it that is needed to reach the solution in that direction. In other words, at each step you are finding one correct component of the total vector.

Key question: How do we find those directions?

What this basically means is that the step directions are orthogonal to each other, not in Euclidean space but in the space defined by the optimization problem (i.e. with respect to $A$).

Two non-zero vectors $u, v$ are conjugate wrt matrix $A$ if

$$u^TAv = 0$$

Geometrically, what this means is that the matrix $A$ transforms $v$ such that the transformed vector $Av$ is orthogonal to the vector $u$.
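A quick numerical illustration (a sketch with a hypothetical 2x2 SPD matrix and hand-picked vectors): $u$ and $v$ below are not orthogonal in the Euclidean sense, yet they are conjugate with respect to $A$.

```python
import numpy as np

# Hypothetical SPD matrix
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
u = np.array([1.0, 0.0])
# Build v conjugate to u by removing u's A-component from another vector
w = np.array([0.0, 1.0])
v = w - (u @ (A @ w)) / (u @ (A @ u)) * u

print(u @ v)        # nonzero: u and v are not Euclidean-orthogonal
print(u @ (A @ v))  # ~ 0: u and v are conjugate wrt A
```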

Direct Method

The direct method relies on the condition of conjugacy: find a basis $P = \{p_1, p_2, \dots, p_n\}$ of $\mathbb{R}^n$ such that each pair $p_i, p_j$ ($i \neq j$) is conjugate, i.e. $p_i^TAp_j = 0$.

Now we write $x$ as a combination of the spanning vectors, i.e. $x = \sum_i \alpha_i p_i$. Then we have

$$Ax = b \quad\Longrightarrow\quad \sum_i \alpha_i Ap_i = b$$

Now the trick is to left-multiply this equation by $p_k^T$; then the terms on the left-hand side will be zero for all $i \neq k$.

$$\sum_i \alpha_i p_k^TAp_i = p_k^Tb$$

For all $i \neq k$ the terms will be zero, leaving only the term with $i = k$, so we get

$$\alpha_k = \frac{p_k^Tb}{p_k^TAp_k}$$

Repeat this for every $k$ to get all the $\alpha_k$.

Hence the algorithm consists of two steps:

  • Find a basis comprising vectors that are mutually conjugate wrt matrix $A$.

  • Compute $\alpha_i$ for every $i$ (a code sketch of both steps follows the Q&A below).

Why do $n$ mutually conjugate vectors form a basis for $\mathbb{R}^n$?

Ans: https://home.cc.umanitoba.ca/~lovetrij/cECE7670/2005/Slides/slides4.pdf
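Putting the two steps together, here is a minimal sketch of the direct method in Python/NumPy (function names and the toy system are illustrative, not from these notes): build a conjugate basis by Gram-Schmidt-style conjugation of the standard basis, then compute each $\alpha_k$ and sum up the components.

```python
import numpy as np

def conjugate_basis(A):
    """Turn the standard basis into vectors that are mutually conjugate
    wrt A (Gram-Schmidt-style conjugation; assumes A is SPD)."""
    n = A.shape[0]
    P = []
    for k in range(n):
        p = np.eye(n)[k].copy()
        for q in P:
            # remove the A-component of p along each previous direction
            p -= (q @ (A @ p)) / (q @ (A @ q)) * q
        P.append(p)
    return P

def direct_conjugate_solve(A, b):
    P = conjugate_basis(A)
    # alpha_k = p_k^T b / (p_k^T A p_k), then x = sum_k alpha_k p_k
    return sum((p @ b) / (p @ (A @ p)) * p for p in P)

# Hypothetical toy system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(np.allclose(direct_conjugate_solve(A, b), np.linalg.solve(A, b)))  # True
```

Note that building the full conjugate basis up front costs roughly as much as a direct factorization (about $O(n^3)$ work), which is why the iterative formulation below is what is used in practice.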

Iterative Method

Orthogonal gradient descents

Generally, to minimize $f(x)$ we can use gradient descent: start with an initial $x$ and move in the direction of the negative gradient to iterate to a new $x$.

Conjugate gradient descent is different in that the update directions are conjugate to each other, i.e. orthogonal to each other wrt the matrix $A$. To do this, we use a Gram-Schmidt-style conjugation: we subtract from the current gradient direction its components along the previous search directions.
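Below is a minimal sketch of the resulting iteration, the standard CG recurrence, in Python/NumPy (the function name and the toy system at the end are illustrative). The residual plays the role of the negative gradient, and each new direction is obtained by conjugating the residual against the previous direction.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    """Standard conjugate gradient iteration (a sketch; assumes A is
    symmetric positive definite)."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x                       # residual = -gradient
    d = r.copy()                        # first search direction
    rs_old = r @ r
    for _ in range(n):
        Ad = A @ d
        alpha = rs_old / (d @ Ad)       # exact step along d
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs_old) * d   # conjugation: new direction is A-orthogonal to d
        rs_old = rs_new
    return x

# Hypothetical toy system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(np.allclose(conjugate_gradient(A, b), np.linalg.solve(A, b)))  # True
```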

Properties of CGD

  • The matrix $A$ needs to be symmetric positive definite.

  • For a system of $n$ variables, CGD converges in at most $n$ iterations (in exact arithmetic); see the quick check below.
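As a quick check of the $n$-step property, the snippet below reuses the `conjugate_gradient` sketch from the previous section on a hypothetical random SPD system (in floating-point arithmetic, rounding can require a few extra iterations on ill-conditioned matrices):

```python
import numpy as np

# Hypothetical random SPD system; reuses conjugate_gradient defined above.
rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # symmetric positive definite by construction
b = rng.standard_normal(n)

x = conjugate_gradient(A, b)        # the loop runs at most n iterations
print(np.linalg.norm(A @ x - b))    # small residual after <= n steps
```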

Resources
