Matrix Calculus
Gradients, Jacobians, etc. in matrix algebra
Gradient
$f: \mathcal{R}^d \rightarrow \mathcal{R}$ where $f$ is some function.
$\hat y = f(\mathbf w)$, where $\mathbf w \in \mathcal{R}^d$. Now suppose we want the derivative of $f$ with respect to each element of $\mathbf w$; that is called the gradient. It is represented as follows:
$$\nabla_\mathbf w f(\mathbf w) = \left[\frac{\partial f}{\partial w_1}, \dots, \frac{\partial f}{\partial w_d}\right]^T$$
The gradient $\nabla_\mathbf w f(\mathbf w)$ is a column vector of the same dimension as $\mathbf w$.
Note that $f(\mathbf w)$ is a scalar-valued function, but $\nabla_\mathbf w f(\mathbf w)$ is actually a vector-valued function.
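A quick numerical sanity check for a hand-derived gradient is a central-difference approximation. The sketch below is illustrative only (plain NumPy, with a made-up quadratic $f(\mathbf w) = \mathbf w \cdot \mathbf w$ whose analytic gradient is $2\mathbf w$; the helper name is my own):

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar-valued f at w."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        grad[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return grad

# Made-up example: f(w) = w . w, whose analytic gradient is 2w.
f = lambda w: w @ w
w = np.array([1.0, -2.0, 3.0])

print(numerical_gradient(f, w))  # approx [ 2. -4.  6.]
print(2 * w)                     # analytic gradient, for comparison
```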
The gradient is perpendicular to the contour lines (level curves) of $f$.
The gradient of $f$ points in the direction of steepest ascent. Why? Think in terms of directional derivatives: along a unit vector $\mathbf u$, the directional derivative is $\nabla f \cdot \mathbf u = \|\nabla f\|\cos\theta$, which is largest when $\mathbf u$ is aligned with $\nabla f$.
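A small numerical illustration of that directional-derivative argument (a sketch reusing the same made-up quadratic $f(\mathbf w) = \mathbf w \cdot \mathbf w$ from above, not anything from the original notes): over random unit directions $\mathbf u$, the directional derivative $\nabla f \cdot \mathbf u$ never exceeds $\|\nabla f\|$, and equality holds along $\nabla f / \|\nabla f\|$.

```python
import numpy as np

# Same made-up quadratic as above: f(w) = w . w, so grad f(w0) = 2 * w0.
w0 = np.array([1.0, -2.0, 3.0])
grad = 2 * w0

rng = np.random.default_rng(0)
u = rng.normal(size=(1000, 3))
u /= np.linalg.norm(u, axis=1, keepdims=True)   # 1000 random unit directions

# Directional derivative along u is grad . u = ||grad|| cos(theta) <= ||grad||.
dir_derivs = u @ grad
print(dir_derivs.max())        # stays below ||grad||
print(np.linalg.norm(grad))    # the bound, attained only when u points along grad
```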
Gradient in Matrix, Vector forms
Let's say that $f(\mathbf w) = \mathbf w \cdot \mathbf x = \mathbf w^T \mathbf x = \mathbf x^T \mathbf w$, which is simply a linear function. Here $\mathbf x$ is some input vector of the same dimension as $\mathbf w$. Then we have
$$\nabla_{\mathbf w} f(\mathbf w) = \nabla_\mathbf w (\mathbf w^T \mathbf x) = \mathbf x$$
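A minimal check of this identity, assuming an arbitrary made-up $\mathbf x$ and evaluation point $\mathbf w$: the finite-difference gradient of $f(\mathbf w) = \mathbf w^T \mathbf x$ comes out equal to $\mathbf x$, independent of $\mathbf w$.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])
f = lambda w: w @ x            # f(w) = w^T x

w = np.array([1.0, 2.0, 3.0])  # arbitrary point; the gradient should not depend on it
eps = 1e-6
grad = np.array([(f(w + eps * np.eye(3)[i]) - f(w - eps * np.eye(3)[i])) / (2 * eps)
                 for i in range(3)])
print(grad)  # approx [ 0.5 -1.   2. ], i.e. equal to x
```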
Jacobian
Now let $f: \mathcal{R}^n \rightarrow \mathcal{R}^m$. In this case we would have a Jacobian $J$:
$$J = \begin{bmatrix}
\frac{\partial \mathbf f}{\partial x_1} & \dots & \frac{\partial \mathbf f}{\partial x_n}
\end{bmatrix}
= \begin{bmatrix}
\nabla^T f_1\\
\vdots\\
\nabla^T f_m\\
\end{bmatrix}
= \begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_1}{\partial x_n}\\
\vdots & \ddots & \vdots\\
\frac{\partial f_m}{\partial x_1} & \dots & \frac{\partial f_m}{\partial x_n}\\
\end{bmatrix}$$
Note: for a scalar-valued function, the Jacobian and the gradient are transposes of each other. So if you compute the derivative of a scalar-valued function with respect to a vector using the Jacobian convention, you get a row vector; to get the gradient you have to transpose it.
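The same central-difference idea extends, one input coordinate at a time, to a full Jacobian. A minimal sketch, with a made-up $f: \mathcal{R}^2 \rightarrow \mathcal{R}^3$ and a helper name (`numerical_jacobian`) chosen here purely for illustration:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference (m x n) Jacobian of a vector-valued f: R^n -> R^m at x."""
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (np.asarray(f(x + e)) - np.asarray(f(x - e))) / (2 * eps)
    return J

# Made-up example f: R^2 -> R^3; row i of J is the transposed gradient of f_i.
f = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])
x = np.array([1.0, 2.0])
print(numerical_jacobian(f, x))
# Analytic Jacobian at (1, 2): [[2, 1], [cos(1), 0], [0, 4]]
```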
Chain Rule
Vector
Let $\mathbf f, \mathbf g$ be two vector-valued functions and $x$ be a scalar. Then
$$\nabla_x \mathbf f = \frac{\partial \mathbf f(\mathbf g(x))}{\partial x} = \frac{\partial \mathbf f}{\partial \mathbf g}\frac{\partial \mathbf g}{\partial x}$$
Now if there are multiple parameters, i.e. it's a vector $\mathbf x$, then it's
$$\nabla_{\mathbf x} \mathbf f = \frac{\partial \mathbf f(\mathbf g(\mathbf x))}{\partial \mathbf x} = \frac{\partial \mathbf f}{\partial \mathbf g}\frac{\partial \mathbf g}{\partial \mathbf x}$$
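One way to convince yourself of the vector chain rule is to check numerically that the Jacobian of the composition equals the product of the Jacobians. A sketch with made-up $\mathbf f$ and $\mathbf g$ (the `numerical_jacobian` helper is the same illustrative one from the Jacobian section, repeated so the snippet runs on its own):

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian, as sketched in the Jacobian section."""
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (np.asarray(f(x + e)) - np.asarray(f(x - e))) / (2 * eps)
    return J

g = lambda x: np.array([x[0] + x[1], x[0] * x[1]])          # g: R^2 -> R^2
f = lambda u: np.array([np.sin(u[0]), u[0] * u[1], u[1]])   # f: R^2 -> R^3

x = np.array([0.5, -1.5])
J_direct = numerical_jacobian(lambda z: f(g(z)), x)                # d f(g(x)) / dx
J_chain = numerical_jacobian(f, g(x)) @ numerical_jacobian(g, x)   # (df/dg)(dg/dx)
print(np.allclose(J_direct, J_chain, atol=1e-5))                   # True
```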