Newton's Method

Finding the root of a non-linear function

Newton's method is used to find the roots of a function, i.e. to solve $f(x) = 0$. It is an iterative process based on a first-order Taylor series expansion. Let $x_0$ be the initial starting point, and suppose the next point $x_1 = x_0 + \Delta x$ is the root of $f(x) = 0$. Then $0 = f(x_0 + \Delta x) \approx f(x_0) + \Delta x\, f'(x_0)$, which means $\Delta x = -\frac{f(x_0)}{f'(x_0)}$.
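
As a quick sanity check, here is a minimal sketch of this root-finding iteration in Python; the example function $f(x) = x^2 - 2$, the starting point, and the tolerance are arbitrary choices for illustration.

```python
def newton_root(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Find a root of f(x) = 0 by iterating x <- x - f(x) / f'(x)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)   # Delta x = -f(x) / f'(x), applied below
        x = x - step
        if abs(step) < tol:        # stop once the update becomes tiny
            return x
    return x

# Example: root of f(x) = x^2 - 2, i.e. sqrt(2), starting from x0 = 1
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))  # ~1.4142135623730951
```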

Finding the minimum of a function

Say you have a function $f(x)$ whose minimum you want to find; at the minimum $f'(x) = 0$, so you can simply use Newton's method to find the root of $f'(x) = 0$. This finds a minimum of $f(x)$ provided $f''(x) > 0$.

So, expanding $f$ to second order around $x$:

$$f(x+t) \approx f(x) + t\,f'(x) + \frac{1}{2} t^2 f''(x)$$

Now we want

$$\frac{d\,f(x+t)}{dt} = 0 \;\Rightarrow\; f'(x) + t\,f''(x) = 0 \;\Rightarrow\; t = -\frac{f'(x)}{f''(x)}$$

Therefore $x' = x + t = x - \frac{f'(x)}{f''(x)}$.
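
Below is a minimal sketch of this minimization update in Python, assuming we can evaluate both $f'(x)$ and $f''(x)$; the test function and starting point are arbitrary examples.

```python
def newton_minimize(f_prime, f_double_prime, x0, tol=1e-10, max_iter=100):
    """Minimize f by iterating x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(max_iter):
        t = -f_prime(x) / f_double_prime(x)  # t = -f'(x) / f''(x)
        x = x + t
        if abs(t) < tol:                     # stop once the update becomes tiny
            return x
    return x

# Example: f(x) = (x - 3)^2 + 1, so f'(x) = 2(x - 3) and f''(x) = 2
print(newton_minimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=10.0))  # 3.0, reached in one step
```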

  • If the function $f$ is actually quadratic, then the above method gives you the minimum in exactly one iteration.

  • It has quadratic convergence near the solution.

  • Calculating the inverse of the Hessian is a computationally expensive operation, which makes Newton's method slow in high dimensions (see the sketch after this list).

  • It can be very sensitive to the initialization of the algorithm.
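
To make the cost concrete, here is a sketch of a single multivariate Newton update using NumPy; the quadratic test function is a made-up example. Solving the linear system with the Hessian is the roughly $O(n^3)$ part that dominates in high dimensions.

```python
import numpy as np

def newton_step(grad, hess, x):
    """One multivariate Newton update: x <- x - H(x)^{-1} grad(x)."""
    # Solve H d = grad instead of forming the explicit inverse; still O(n^3) per step.
    d = np.linalg.solve(hess(x), grad(x))
    return x - d

# Example: f(x, y) = x^2 + 10 y^2, a quadratic, so one step reaches the minimum
grad = lambda v: np.array([2.0 * v[0], 20.0 * v[1]])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 20.0]])
print(newton_step(grad, hess, np.array([5.0, -3.0])))  # [0. 0.]
```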

Connection with Gradient Descent

Gradient descent works by replacing the function at a point with its linear approximation at that point (using the gradient) and then moving along the negative gradient. But the gradient only gives the direction to move, not the magnitude of the movement; for that we use a step size, which determines how far to move. If the step size is too big, we might overshoot the actual minimum and oscillate; if it is too small, it might take too many iterations to reach the minimum.
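
For comparison, here is a minimal sketch of plain gradient descent with a fixed step size; the step size of 0.1 and the test function are arbitrary illustrative choices.

```python
def gradient_descent(f_prime, x0, step_size=0.1, tol=1e-10, max_iter=10_000):
    """Minimize f by iterating x <- x - step_size * f'(x) with a fixed step size."""
    x = x0
    for _ in range(max_iter):
        update = step_size * f_prime(x)
        x = x - update
        if abs(update) < tol:
            return x
    return x

# Same example as before: f(x) = (x - 3)^2 + 1; takes many iterations instead of one
print(gradient_descent(lambda x: 2 * (x - 3), x0=10.0))  # ~3.0
```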

So now, we can see $\frac{1}{f''(x)}$ (the inverse Hessian, or inverse of the curvature) as the step size. If the curvature is high, the gradient is changing quickly, so we need to take small steps; conversely, if the curvature is low, we can take big steps. Hence, $\frac{1}{f''(x)}$ is basically a smart/adaptive step size based on the curvature of the function.
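
To make the connection explicit, the sketch below writes the Newton update in exactly the gradient-descent form, with the fixed step size replaced by $1/f''(x)$; the test function $f(x) = x^4$ is an arbitrary example whose curvature varies with $x$.

```python
def newton_as_adaptive_gd(f_prime, f_double_prime, x0, tol=1e-10, max_iter=100):
    """Gradient descent whose step size is 1 / f''(x) -- i.e. Newton's method."""
    x = x0
    for _ in range(max_iter):
        step_size = 1.0 / f_double_prime(x)  # curvature-adaptive step size
        update = step_size * f_prime(x)      # same update form as gradient descent
        x = x - update
        if abs(update) < tol:
            return x
    return x

# Example: f(x) = x^4, so f'(x) = 4x^3 and f''(x) = 12x^2
print(newton_as_adaptive_gd(lambda x: 4 * x**3, lambda x: 12 * x**2, x0=2.0))  # ~0
```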

Resources

Thoughts
