Conjugate Gradient

Background

The conjugate gradient method is often implemented as an iterative algorithm, applicable to systems that are too large to be handled by a direct implementation or other direct methods such as the Cholesky decomposition.

Iterative methods like CG are suited for use with sparse matrices. If $A$ is dense, your best course of action is probably to factor $A$ and solve the equation by backsubstitution. The time spent factoring a dense $A$ is roughly equivalent to the time spent solving the system iteratively; and once $A$ is factored, the system can be backsolved quickly for multiple values of $b$. Compare this dense matrix with a sparse matrix of larger size that fills the same amount of memory. The triangular factors of a sparse $A$ usually have many more nonzero elements than $A$ itself. Factoring may be impossible due to limited memory, and will be time-consuming as well; even the backsolving step may be slower than iterative solution. On the other hand, most iterative methods are memory-efficient and run quickly with sparse matrices.

NOTE: This method only works when the matrix $A$ is symmetric positive definite.
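A minimal sketch to illustrate the sparse use case above, using SciPy's sparse CG solver (the matrix, its size, and the helpers chosen here are arbitrary illustrative choices, not taken from these notes):

```python
# Minimal sketch: CG on a sparse SPD system vs a direct solve.
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg, spsolve

n = 1000
# 1-D Laplacian: tridiagonal, symmetric positive definite, very sparse.
A = diags([2.0 * np.ones(n), -np.ones(n - 1), -np.ones(n - 1)], [0, -1, 1], format="csc")
b = np.random.rand(n)

x_direct = spsolve(A, b)      # direct method: factor A, then backsolve
x_cg, info = cg(A, b)         # iterative conjugate gradient (info == 0 means it converged)

print(info)
print(np.linalg.norm(A @ x_cg - b) / np.linalg.norm(b))   # relative residual of the CG solution
```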

Problem Setting

The method is used to solve the linear system of equations denoted by

$$Ax = b$$

where $A$ is an $n \times n$ symmetric positive-definite matrix, i.e. $x^T A x > 0$ for every $x \neq 0$.

A rather different way to see this is from the perspective of optimization by gradient descent; this is where the "gradient descent" in "conjugate gradient descent" comes from. Solving $Ax = b$ can be viewed as the minimization of the following problem:

$$f(x) = \frac{1}{2}x^T A x - b^T x$$

Now, to minimize $f(x)$ we need $f'(x) = 0$. Hence

$$f'(x) = 0 \implies Ax - b = 0$$

which is exactly the equation we want to solve. And since $f''(x) = A$ and $A$ is positive definite, we can be sure that the solution is a minimum and not a maximum.
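A minimal NumPy sketch of this equivalence (the matrix size and random seed are arbitrary choices): at the solution of $Ax = b$ the gradient $Ax - b$ vanishes, and $f$ is lower there than at randomly perturbed points.

```python
# Minimal sketch: the solution of Ax = b minimizes f(x) = 1/2 x^T A x - b^T x.
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                 # symmetric positive definite
b = rng.standard_normal(n)

f = lambda x: 0.5 * x @ A @ x - b @ x
x_star = np.linalg.solve(A, b)              # solution of Ax = b

print(np.allclose(A @ x_star - b, 0))       # gradient f'(x*) = Ax* - b is (numerically) zero
print(all(f(x_star) < f(x_star + rng.standard_normal(n)) for _ in range(5)))   # x* is the minimizer
```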

Steepest Gradient Descent

Let's first start with a version of gradient descent called steepest descent.

So, in steepest descent we take multiple steps to reach the optimal value. Many of those steps might be in the same direction, because at each point we just move in the direction perpendicular to the level sets. What this means is that we are searching in the span defined by the gradients at each point, something like the figure below.

[Figure: zig-zag path of steepest descent. A lot of steps are in the same direction, but consecutive gradients are perpendicular to each other.]

In GD, the search is spanned by the gradient vectors, and these gradient vectors are orthogonal to each other.

You solve it using iterative methods as follows:

$$x_{t+1} = x_t - \alpha_t f'(x_t)$$

What makes steepest descent different from general gradient descent is that here we try to find the optimal value of the step size $\alpha_t$ at every iteration. Writing $g_t = f'(x_t)$ for the gradient, we choose

$$\alpha_t^\star = \arg\min_{\alpha_t} f(x_t - \alpha_t g_t)$$

which we find by setting the derivative to zero:

$$\frac{\partial f(x_t - \alpha_t g_t)}{\partial \alpha_t} = 0 \quad\Rightarrow\quad \alpha_t^\star = \frac{g_t^T g_t}{g_t^T A g_t}$$

From the equations above we also get

$$f'(x_t - \alpha_t g_t)^T g_t = 0 \quad\Rightarrow\quad g_{t+1}^T g_t = 0$$

i.e. consecutive gradients are orthogonal to each other at each step.
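A minimal NumPy sketch of steepest descent with the optimal step size derived above (the matrix, seed, and iteration cap are arbitrary choices). It also checks the orthogonality of consecutive gradients.

```python
# Minimal sketch: steepest descent with exact line search on f(x) = 1/2 x^T A x - b^T x.
import numpy as np

rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive definite
b = rng.standard_normal(n)

x = np.zeros(n)
for t in range(100):
    g = A @ x - b                    # gradient g_t = f'(x_t)
    if np.linalg.norm(g) < 1e-12:
        break
    alpha = (g @ g) / (g @ A @ g)    # optimal step size alpha_t*
    x = x - alpha * g                # x_{t+1} = x_t - alpha_t g_t
    g_next = A @ x - b
    assert abs(g_next @ g) < 1e-8    # consecutive gradients are orthogonal: g_{t+1}^T g_t = 0

print(np.allclose(x, np.linalg.solve(A, b)))
```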

Conjugate Gradient Descent

In CG, we search in a space defined by direction vectors chosen so that we never take steps in the same direction twice, i.e. in each iteration we explore a new direction. So, for example, finding a point in an $n$-dimensional space takes only $n$ steps.

[Figure: comparison of the convergence of gradient descent with optimal step size (in green) and conjugate direction steps (in red) for minimizing a quadratic function associated with a given linear system. Conjugate gradient, assuming exact arithmetic, converges in at most $n$ steps, where $n$ is the size of the matrix of the system (here $n = 2$).]

So you iteratively find different directions, such that they are conjugate orthogonal (why do they need to be conjugate orthogonal?) to each other, and search along those directions.

What this basically means is that the step directions are orthogonal to each other, not in Euclidean space but in the space defined by the optimization problem. Two non-zero vectors $u, v$ are conjugate with respect to the matrix $A$ if

$$u^T A v = 0$$

Geometrically, this means that the matrix $A$ transforms $v$ such that the transformed vector $Av$ is orthogonal to the vector $u$.

Key question: how do we find those directions?

If you think about it, since it's an $n$-dimensional space, you can express any vector as a combination of $n$ independent vectors that span that space. So, say you start from $x_0$; you should then be able to find the $n$ components ($\alpha$), i.e. the proportions in which you combine the $n$ vectors to get to the solution.

So basically you find $n$ mutually conjugate directions in the space of the optimization problem (the eigenvectors of $A$ are one such set); at each step you find one direction and take exactly the step along it that is needed to reach the solution in that direction, i.e. at each step you are finding one correct component of the total vector.
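A tiny numerical check of the conjugacy definition (an illustrative sketch, with arbitrary matrix and seed): the eigenvectors of a symmetric positive-definite matrix are mutually orthogonal, and therefore also mutually conjugate with respect to that matrix, since $u^T A v = \lambda_v \, u^T v = 0$.

```python
# Minimal sketch: eigenvectors of a symmetric PD matrix form one example of a conjugate set.
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((3, 3))
A = M @ M.T + 3 * np.eye(3)          # symmetric positive definite

_, V = np.linalg.eigh(A)             # columns of V are orthonormal eigenvectors of A
u, v = V[:, 0], V[:, 1]

print(abs(u @ A @ v) < 1e-10)        # u^T A v = 0: u and v are conjugate with respect to A
print(abs(u @ v) < 1e-10)            # for eigenvectors they are also orthogonal in the usual sense
```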

Direct Method

The direct method relies on the condition of conjugacy: find a basis $P = \{p_1, p_2, \dots, p_n\}$ of $\mathbb{R}^n$ such that every pair $p_i, p_j$ (with $i \neq j$) is conjugate, i.e. $p_i^T A p_j = 0$.

Now we write $x$ as a combination of the spanning vectors, i.e. $x = \sum_i \alpha_i p_i$. Then we have

$$Ax = b \quad\Rightarrow\quad \sum_i \alpha_i A p_i = b$$

Now the trick is to multiply this equation by $p_k^T$; the left-hand side terms then vanish for all $i \neq k$:

$$\sum_i \alpha_i \, p_k^T A p_i = p_k^T b$$

For all $i \neq k$ the terms are zero, leaving only the term with $i = k$, so we get

$$\alpha_k = \frac{p_k^T b}{p_k^T A p_k}$$

Repeat this for every $k$ to get all the $\alpha_i$.

Hence the algorithm consists of two steps:

  • Find a basis of vectors that are mutually conjugate with respect to the matrix $A$.
  • Find $\alpha_i$ for all $i$.

Why do n mutually conjugate vectors form a basis for R^n?

Ans: Mutually conjugate non-zero vectors are linearly independent: if $\sum_i c_i p_i = 0$, then multiplying by $p_k^T A$ gives $c_k \, p_k^T A p_k = 0$, and since $A$ is positive definite and $p_k \neq 0$ we have $p_k^T A p_k > 0$, so $c_k = 0$ for every $k$. Any $n$ linearly independent vectors in $\mathbb{R}^n$ form a basis.
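The direct method can be sketched in a few lines of NumPy (an illustrative sketch with arbitrary matrix and seed): build a conjugate basis by Gram-Schmidt conjugation of the standard basis with respect to the $A$-inner product, compute each $\alpha_k = p_k^T b / p_k^T A p_k$, and sum up the components.

```python
# Minimal sketch of the direct method: conjugate basis + per-direction coefficients alpha_k.
import numpy as np

rng = np.random.default_rng(3)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive definite
b = rng.standard_normal(n)

# Gram-Schmidt conjugation of the standard basis wrt the A-inner product <u, v>_A = u^T A v.
P = []
for e in np.eye(n):
    p = e - sum((p_j @ A @ e) / (p_j @ A @ p_j) * p_j for p_j in P)
    P.append(p)

# Every pair of distinct basis vectors is conjugate: p_i^T A p_j = 0.
assert all(abs(P[i] @ A @ P[j]) < 1e-8 for i in range(n) for j in range(i))

# x = sum_k alpha_k p_k with alpha_k = p_k^T b / p_k^T A p_k.
x = sum((p @ b) / (p @ A @ p) * p for p in P)
print(np.allclose(x, np.linalg.solve(A, b)))
```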

Iterative Method

Generally, to minimize $f(x)$ we can use gradient descent: start with an initial $x$ and move in the direction of the negative gradient to iterate to a new $x$.

Conjugate gradient descent is different in that the update directions are conjugate to each other, i.e. orthogonal with respect to the matrix $A$. To achieve this we use Gram-Schmidt orthonormalization (with respect to the $A$-inner product): from the current gradient-descent direction we subtract its components along the previous search directions, which orthogonalizes the successive gradient descents.

The matrix $A$ needs to be positive definite.
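Below is a compact NumPy sketch of the iterative method in its standard textbook form (an illustrative sketch; the function name, matrix, and tolerances are arbitrary choices). The new search direction is the current residual minus its component along the previous direction; for a quadratic objective this single Gram-Schmidt term is enough to keep the new direction conjugate to all earlier ones.

```python
# Minimal sketch of conjugate gradient descent for Ax = b, with A symmetric positive definite.
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    x = np.zeros_like(b)
    r = b - A @ x                  # residual, equal to the negative gradient of f at x
    p = r.copy()                   # first search direction = steepest-descent direction
    for _ in range(len(b)):        # in exact arithmetic at most n iterations are needed
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        Ap = A @ p
        alpha = rr / (p @ Ap)      # exact step length along p
        x = x + alpha * p
        r = r - alpha * Ap         # updated residual
        beta = (r @ r) / rr        # coefficient of the single Gram-Schmidt correction term
        p = r + beta * p           # new direction, conjugate to the previous ones wrt A
    return x

rng = np.random.default_rng(4)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # symmetric positive definite
b = rng.standard_normal(n)

print(np.allclose(conjugate_gradient(A, b), np.linalg.solve(A, b)))
```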

Properties of CGD

  • For a system of $n$ variables, CGD converges in at most $n$ iterations (assuming exact arithmetic).

Resources


  • https://home.cc.umanitoba.ca/~lovetrij/cECE7670/2005/Slides/slides4.pdf
  • Conjugate Gradient Descent
  • How to compute Hessian-vector products? | ICLR Blogposts 2024