Matrix Calculus

Gradients, Jacobians, etc. in matrix algebra

Gradient

$f: \mathcal{R}^d \rightarrow \mathcal{R}$, where $f$ is some function.

$$\hat y = f(\mathbf w)$$

Here $\mathbf w \in \mathcal{R}^d$. Now suppose we want the derivative of $f$ with respect to each element of $\mathbf w$; this vector of partial derivatives is called the gradient, represented as follows:

$$\nabla_\mathbf w f(\mathbf w) = \left[\frac{\partial f}{\partial w_1}, \dots, \frac{\partial f}{\partial w_d}\right]^T$$

The gradient $\nabla_\mathbf w f(\mathbf w)$ is a column vector of the same dimension as $\mathbf w$.

  • Note that $f(\mathbf w)$ is a scalar-valued function, but $\nabla_\mathbf w f(\mathbf w)$ is actually a vector-valued function.

  • The gradient is perpendicular to the contour lines (level sets) of $f$.

  • The gradient of $f$ points in the direction of steepest ascent. Why? Think in terms of directional derivatives: the directional derivative along a unit vector $\mathbf u$ is $\nabla_\mathbf w f \cdot \mathbf u$, which is maximized when $\mathbf u$ points along the gradient (see the numerical check below).
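
A minimal numerical sketch of that last bullet (plain NumPy; the scalar function `f` and its analytic gradient `grad_f` below are made up purely for illustration): among many random unit directions, the finite-difference directional derivative is largest along the normalized gradient.

```python
import numpy as np

def f(w):                      # example scalar-valued function f: R^2 -> R (hypothetical)
    return np.sin(w[0]) + w[1] ** 2 + w[0] * w[1]

def grad_f(w):                 # its analytic gradient
    return np.array([np.cos(w[0]) + w[1], 2 * w[1] + w[0]])

w = np.array([0.3, -1.2])
eps = 1e-6
g = grad_f(w)

def directional_derivative(u):
    # finite-difference approximation of the directional derivative of f at w along u
    return (f(w + eps * u) - f(w)) / eps

rng = np.random.default_rng(0)
random_dirs = rng.normal(size=(1000, 2))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)

best_random = max(directional_derivative(u) for u in random_dirs)
along_gradient = directional_derivative(g / np.linalg.norm(g))

print(best_random <= along_gradient + 1e-4)   # True: the gradient direction wins
```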

Gradient in Matrix, Vector forms

Let's say that $f(\mathbf w) = \mathbf w \cdot \mathbf x = \mathbf w^T\mathbf x = \mathbf x^T \mathbf w$, which is simply a linear function. Here $\mathbf x$ is some input vector of the same dimension as $\mathbf w$. Then we have

$$\nabla_{\mathbf w} f(\mathbf w) = \nabla_\mathbf w (\mathbf w^T \mathbf x) = \mathbf x$$
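
As a quick sanity check of this identity, here is a small sketch using PyTorch autograd (the dimension `d` and the random tensors are arbitrary): the gradient that `backward()` produces for $f(\mathbf w) = \mathbf w^T \mathbf x$ is exactly $\mathbf x$.

```python
import torch

d = 5
w = torch.randn(d, requires_grad=True)
x = torch.randn(d)

y = w @ x          # f(w) = w^T x, a scalar
y.backward()       # populates w.grad with df/dw

print(torch.allclose(w.grad, x))   # True: the gradient is x itself
```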

Jacobian

Now let $f: \mathcal{R}^n \rightarrow \mathcal{R}^m$. In this case we would have a Jacobian $J$:

$$J = \begin{bmatrix} \frac{\partial \mathbf f}{\partial x_1} & \dots & \frac{\partial \mathbf f}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \nabla^T f_1\\ \vdots\\ \nabla^T f_m \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_1}{\partial x_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial f_m}{\partial x_1} & \dots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

Note: the Jacobian and the gradient are transposes of each other. If you compute the derivative of a scalar-valued function with respect to a vector using the Jacobian convention, you get a row vector; to get the gradient, you have to transpose it.
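
A short sketch of both points using `torch.autograd.functional.jacobian` (the linear map `A` and the helpers `f`, `g` are made up for illustration): for $f(\mathbf x) = A\mathbf x$ the Jacobian is $A$ itself with shape $m \times n$, and for a scalar-valued function kept as a 1-dimensional output the Jacobian comes back as a row vector whose transpose is the gradient.

```python
import torch
from torch.autograd.functional import jacobian

n, m = 4, 3
A = torch.randn(m, n)

def f(x):                 # linear map f(x) = A x, so the Jacobian is A itself
    return A @ x

x = torch.randn(n)
J = jacobian(f, x)
print(J.shape)                    # torch.Size([3, 4]): one row per output f_i
print(torch.allclose(J, A))       # True

def g(x):                 # scalar-valued function, kept 1-dimensional on purpose
    return (x ** 2).sum().reshape(1)

Jg = jacobian(g, x)
print(Jg.shape)                   # torch.Size([1, 4]): a row vector; its transpose is the gradient
print(torch.allclose(Jg.squeeze(0), 2 * x))   # matches the analytic gradient 2x
```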

Chain Rule

Vector

Let $\mathbf f, \mathbf g$ be two vector-valued functions and $x$ be a scalar. Then

$$\nabla_x \mathbf f = \frac{\partial \mathbf f(\mathbf g(x))}{\partial x} = \frac{\partial \mathbf f}{\partial \mathbf g}\frac{\partial \mathbf g}{\partial x}$$

Now if there are multiple parameters, i.e. the input is a vector $\mathbf x$, then it's

$$\nabla_{\mathbf x} \mathbf f = \frac{\partial \mathbf f(\mathbf g(\mathbf x))}{\partial \mathbf x} = \frac{\partial \mathbf f}{\partial \mathbf g}\frac{\partial \mathbf g}{\partial \mathbf x}$$
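
A small sketch of the vector chain rule (again with made-up `f` and `g`): the Jacobian of the composition $\mathbf f(\mathbf g(\mathbf x))$ computed directly by autograd matches the matrix product of the two individual Jacobians.

```python
import torch
from torch.autograd.functional import jacobian

def g(x):                         # g: R^3 -> R^2
    return torch.stack([x[0] * x[1], torch.sin(x[2])])

def f(u):                         # f: R^2 -> R^2
    return torch.stack([u[0] + u[1] ** 2, u[0] * u[1]])

x = torch.randn(3)

J_fg = jacobian(lambda x_: f(g(x_)), x)   # Jacobian of the composition, shape 2x3
J_f = jacobian(f, g(x))                   # df/dg evaluated at g(x), shape 2x2
J_g = jacobian(g, x)                      # dg/dx, shape 2x3

print(torch.allclose(J_fg, J_f @ J_g, atol=1e-6))   # True: chain rule as a matrix product
```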

Resources


  • The Matrix Calculus You Need For Deep Learning (the_antlr_guy)

[Figure: dimension of the resulting Jacobian shape based on function and input shape]
[Figure: full Jacobian calculation]
[Figure: chain rule for different dimensions of the variables]