Conjugate Gradient
Background
The conjugate gradient method is often implemented as an iterative algorithm, applicable to sparse systems that are too large to be handled by a direct implementation or other direct methods such as the Cholesky decomposition.
Iterative methods like CG are suited for use with sparse matrices. If $A$ is dense, your best course of action is probably to factor $A$ and solve the equation by backsubstitution. The time spent factoring a dense $A$ is roughly equivalent to the time spent solving the system iteratively; and once $A$ is factored, the system can be backsolved quickly for multiple values of $b$. Compare this dense matrix with a sparse matrix of larger size that fills the same amount of memory. The triangular factors of a sparse $A$ usually have many more nonzero elements than $A$ itself. Factoring may be impossible due to limited memory, and will be time-consuming as well; even the backsolving step may be slower than iterative solution. On the other hand, most iterative methods are memory-efficient and run quickly with sparse matrices.
NOTE: This method only works when the matrix $A$ is symmetric positive definite.
Problem Setting
Used to solve the linear system of equations denoted by

$$Ax = b$$

where $A \in \mathbb{R}^{n \times n}$ is a symmetric positive-definite matrix, i.e. $A^T = A$ and $x^T A x > 0$ for all non-zero $x$, and $x, b \in \mathbb{R}^n$.
Another, quite different, way to see this is from the perspective of optimization, which is where the "gradient descent" part of "conjugate gradient descent" comes from. The way to look at solving $Ax = b$ is to see it as the minimization of the following problem

$$f(x) = \frac{1}{2} x^T A x - b^T x$$
Now to minimize $f(x)$, we need $\nabla f(x) = 0$. Hence

$$\nabla f(x) = Ax - b = 0$$
which is exactly the equation we want to solve. And since the Hessian $\nabla^2 f(x) = A$ is positive definite, we can be sure that the solution is a minimum and not a maximum.
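To make this concrete, here is a minimal sketch (assuming NumPy and a hand-picked $2 \times 2$ system; the names are illustrative) showing that the gradient of $f$ vanishes exactly at the solution of $Ax = b$, and that the solution is a minimum:

```python
# A minimal sketch: the minimizer of f(x) = 1/2 x^T A x - b^T x solves A x = b.
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])   # symmetric positive definite
b = np.array([1.0, 2.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad_f(x):
    return A @ x - b          # gradient of the quadratic

x_star = np.linalg.solve(A, b)        # direct solution of A x = b
print(grad_f(x_star))                 # ~ [0, 0]: the gradient vanishes at the solution
print(f(x_star) <= f(x_star + 0.1))   # perturbing x increases f, so it is a minimum
```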
Steepest Gradient Descent
Let's first start with a version of gradient descent called steepest descent.
So, in steepest descent we take multiple steps to reach the optimal value. Many of those steps might be in the same direction, because at each point we are just going in the direction perpendicular to the level sets, i.e. the negative gradient. What this means is that we are searching in the span defined by the gradients at each point.
In steepest descent, the search space is spanned by the gradient vectors, and successive gradient vectors are orthogonal to each other.
You solve it using iterative updates as follows:

$$r_k = b - A x_k = -\nabla f(x_k), \qquad x_{k+1} = x_k + \alpha_k r_k$$
Now, how steepest descent differs from general gradient descent is that here we try to find the optimal value for the step size $\alpha_k$. We find it by using

$$\alpha_k = \frac{r_k^T r_k}{r_k^T A r_k}$$

which we obtain by solving

$$\frac{d}{d\alpha} f(x_k + \alpha r_k) = \big(A(x_k + \alpha r_k) - b\big)^T r_k = -r_k^T r_k + \alpha\, r_k^T A r_k = 0$$
From the above equations we also get the following

$$r_{k+1}^T r_k = 0$$

i.e. successive gradients are orthogonal to each other at each step.
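A rough sketch of steepest descent under these update rules (assuming NumPy; the function name `steepest_descent` and its parameters are illustrative, not a standard API):

```python
# Steepest descent for the quadratic f(x) = 1/2 x^T A x - b^T x.
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
    x = x0.astype(float)
    for _ in range(max_iter):
        r = b - A @ x                   # residual = negative gradient at x
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ A @ r)   # exact line search along r
        x = x + alpha * r
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(steepest_descent(A, b, np.zeros(2)), np.linalg.solve(A, b))   # both should agree
```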
Conjugate Gradient Descent
In CG, we search in the space defined by direction vectors such that we don't take steps in the same direction, i.e. in each iteration we explore a new direction. So, for example, finding a point in $n$-dimensional space would take only $n$ steps.
So you iteratively find different directions, such that they are conjugate orthogonal to each other (why do they need to be conjugate orthogonal?), and search in those directions.
If you think about it, since it's an $n$-dimensional space, you can express each vector as a combination of $n$ independent vectors that span that $n$-dimensional space. So, let's say you start from $x_0$; you should then be able to find the components ($\alpha_1, \dots, \alpha_n$), i.e. the proportions in which you combine the $n$ vectors to get the solution.
So basically you find a set of directions that act like an orthogonal basis for the optimization problem (the eigenvectors of $A$ are one such conjugate set); in each step you find one direction and take exactly the step along it that's needed to reach the solution in that direction. I.e. at each step you are finding one correct component of the total vector.
Key question: How do we find those directions?
What this basically means is that the step directions will be orthogonal to each other, not in Euclidean space but in the space defined by the matrix $A$.
Two non-zero vectors $u$ and $v$ are conjugate with respect to a matrix $A$ if

$$u^T A v = 0$$
Geometrically, what this means is that the matrix $A$ transforms $v$ such that the transformed vector $Av$ is orthogonal to the vector $u$.
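A small illustrative check of $A$-conjugacy (assuming NumPy and a hand-picked matrix and vectors): $u$ and $v$ are $A$-conjugate yet not orthogonal in the Euclidean sense.

```python
# u and v are conjugate w.r.t. A when u^T A v = 0, i.e. A v is orthogonal to u.
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
u = np.array([1.0, 0.0])
v = np.array([1.0, -4.0])   # chosen so that u^T A v = 4*1 + 1*(-4) = 0

print(u @ A @ v)            # 0.0 -> u and v are A-conjugate
print(u @ v)                # 1.0 -> but they are NOT Euclidean-orthogonal
```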
Direct Method
The direct method depends upon the condition of conjugacy: find a basis $\{p_1, p_2, \dots, p_n\}$ of $\mathbb{R}^n$ such that each pair of vectors is conjugate, i.e. $p_i^T A p_j = 0$ for $i \neq j$.
Now we write the solution $x^*$ as a combination of the spanning vectors, i.e. $x^* = \sum_{i=1}^{n} \alpha_i p_i$. Now we have

$$A x^* = \sum_{i=1}^{n} \alpha_i A p_i = b$$
Now the trick is to multiply that equation on the left by $p_k^T$; then, by conjugacy, every term on the left-hand side with $i \neq k$ will be zero, leaving only the term with $i = k$:

$$p_k^T b = \sum_{i=1}^{n} \alpha_i\, p_k^T A p_i = \alpha_k\, p_k^T A p_k \quad\Rightarrow\quad \alpha_k = \frac{p_k^T b}{p_k^T A p_k}$$
Repeat this for $k = 1, \dots, n$ to get all the $\alpha_k$.
Hence the algorithm consists of two steps:
Finding a basis of $n$ mutually conjugate vectors with respect to the matrix $A$.
Computing the coefficients $\alpha_k$ as above.
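A sketch of the direct conjugate-directions method (assuming NumPy; the helper names `conjugate_basis` and `direct_solve` are illustrative). The conjugate basis here is built by Gram-Schmidt conjugation of the standard basis, which is one possible choice, not the only one:

```python
import numpy as np

def conjugate_basis(A):
    """Build n mutually A-conjugate vectors from the standard basis."""
    n = A.shape[0]
    P = []
    for i in range(n):
        p = np.eye(n)[i]
        # subtract components along previous directions in the A-inner product
        for q in P:
            p = p - (q @ A @ p) / (q @ A @ q) * q
        P.append(p)
    return P

def direct_solve(A, b):
    P = conjugate_basis(A)
    # alpha_k = p_k^T b / (p_k^T A p_k), then x = sum_k alpha_k p_k
    return sum((p @ b) / (p @ A @ p) * p for p in P)

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(direct_solve(A, b), np.linalg.solve(A, b))   # both should agree
```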
Why do $n$ mutually conjugate vectors form a basis for $\mathbb{R}^n$?
Ans: https://home.cc.umanitoba.ca/~lovetrij/cECE7670/2005/Slides/slides4.pdf
Iterative Method
Orthogonal gradient descents
Generally, to minimize $f(x)$, we can use gradient descent: start with an initial guess $x_0$, and move in the direction of the negative gradient to iterate to a new point $x_{k+1}$.
Conjugate gradient descent is different in that the search directions are conjugate to each other, i.e. orthogonal to each other with respect to the matrix $A$. To do this, we use a Gram-Schmidt-style orthogonalization in the $A$-inner product: basically, we subtract from the current gradient direction (the plain gradient descent direction) its components along the previous search directions.
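A sketch of the iterative conjugate gradient method (assuming NumPy; the function name and tolerance are illustrative). Each new search direction is the current residual made $A$-conjugate to the previous direction via the coefficient $\beta$:

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x                         # residual = negative gradient
    p = r.copy()                          # first search direction
    for _ in range(n):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)        # exact step length along p
        x = x + alpha * p
        r_new = r - alpha * Ap            # updated residual
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)  # Gram-Schmidt-like coefficient
        p = r_new + beta * p              # next direction, A-conjugate to p
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b), np.linalg.solve(A, b))   # both should agree
```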
Properties of CGD
The matrix $A$ needs to be symmetric positive definite.
For a system of $n$ variables, CGD converges in at most $n$ iterations (in exact arithmetic), as demonstrated in the sketch below.
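A small self-contained demo of the $n$-step convergence property (assuming NumPy; the random symmetric positive-definite system is constructed just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # symmetric positive definite by construction
b = rng.standard_normal(n)

x = np.zeros(n)
r = b - A @ x
p = r.copy()
for k in range(n):               # at most n iterations (in exact arithmetic)
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x += alpha * p
    r_new = r - alpha * Ap
    beta = (r_new @ r_new) / (r @ r)
    p, r = r_new + beta * p, r_new

print(np.linalg.norm(A @ x - b))   # residual is ~0 after n steps
```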
Resources