Optimizers
Different optimizers used in deep learning
Second-order methods use second-order derivatives, collected in the Hessian matrix, to optimize.
Note: They only work when the Hessian is positive definite everywhere, i.e. f(x) is convex, i.e. it is a convex optimization problem.
Second-order methods are said to be better optimizers because they do not settle in saddle points.
Examples:
• Newton
• Gauss-Newton
• Quasi-Newton
• BFGS, (L)BFGS
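To make the idea concrete, here is a minimal sketch (not a library implementation) of a single Newton step on a toy quadratic; the names `newton_step`, `A`, and `b` are purely illustrative:

```python
import numpy as np

def newton_step(x, grad, hessian):
    # Newton update: x_new = x - H^{-1} g, computed by solving a linear system.
    # Assumes the Hessian is positive definite (convex problem).
    return x - np.linalg.solve(hessian, grad)

# Toy quadratic f(x) = 0.5 x^T A x - b^T x, whose Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # positive definite
b = np.array([1.0, 1.0])
x = np.zeros(2)
g = A @ x - b                 # gradient at x
x = newton_step(x, g, A)      # for a quadratic, one step reaches the minimum
print(x)                      # equals np.linalg.solve(A, b)
```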
Adagrad adapts the learning rate for each parameter according to its past gradient history. Parameters with a smaller accumulated gradient history get a higher learning rate.
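A minimal sketch of the Adagrad update (illustrative names, not a framework API):

```python
import numpy as np

def adagrad_update(w, grad, cache, lr=0.01, eps=1e-8):
    # Accumulate the squared gradients per parameter.
    cache = cache + grad ** 2
    # Divide the learning rate by the root of the accumulated history:
    # parameters with a small gradient history take larger steps.
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```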
Adadelta is an extension of Adagrad that also tries to reduce Adagrad's aggressive, monotonically decreasing learning rate.
It does this by restricting the window of accumulated past gradients to some fixed size w. The running average at time t then depends only on the previous average and the current gradient.
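A sketch of the Adadelta-style update, where the decay rate rho plays the role of the fixed window w (illustrative names, not a framework API):

```python
import numpy as np

def adadelta_update(w, grad, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    # Running average of squared gradients: depends only on the previous
    # average and the current gradient, bounding the effective window.
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    # Step scaled by the RMS of past parameter updates over the RMS of gradients.
    delta = -(np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps)) * grad
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta ** 2
    return w + delta, avg_sq_grad, avg_sq_delta
```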
RMSProp tries to resolve Adagrad's radically diminishing learning rates by using a moving average of the squared gradient. It uses the magnitude of recent gradients to normalize the current gradient.
RMSProp divides the learning rate by an exponentially decaying average of squared gradients.
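A minimal sketch of the RMSProp update (illustrative names, not a framework API):

```python
import numpy as np

def rmsprop_update(w, grad, avg_sq_grad, lr=0.001, decay=0.9, eps=1e-8):
    # Exponentially decaying average of squared gradients, instead of
    # Adagrad's ever-growing sum.
    avg_sq_grad = decay * avg_sq_grad + (1 - decay) * grad ** 2
    # Divide the learning rate by the root of that average.
    w = w - lr * grad / (np.sqrt(avg_sq_grad) + eps)
    return w, avg_sq_grad
```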
Adam is another method that computes an individual adaptive learning rate for each parameter from estimates of the first and second moments of the gradients.
It also avoids Adagrad's radically diminishing learning rates.
Adam can be viewed as a combination of Adagrad, which works well on sparse gradients, and RMSProp, which works well in online and non-stationary settings.
Adam scales the learning rate using an exponential moving average of the gradients, instead of the simple accumulated average used in Adagrad. It keeps an exponentially decaying average of past gradients.
Adam is computationally efficient and has low memory requirements.
Adam is one of the most popular gradient-descent optimization algorithms.
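A minimal sketch of the Adam update with bias correction (illustrative names; t is the step count starting at 1):

```python
import numpy as np

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponentially decaying averages of the gradient (first moment)
    # and of the squared gradient (second moment).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for the zero initialization in early steps.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```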
A learning-rate scheduler works as follows: if your loss reaches a plateau, say after 20 epochs, reduce the learning rate at epoch 20 and check how training proceeds. See the example below.
In the experiment shown by the loss curve above, the learning rate was decreased by a factor of 0.5 at epochs 20 and 30. Notice how the loss had saturated by epoch 20 but started decreasing again once the learning rate was reduced, with the same behavior at epoch 30. Small spikes are visible at epochs 20 and 30.
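One way to reproduce such a schedule is PyTorch's MultiStepLR; this is a minimal sketch, and the model and optimizer here are placeholders (the framework used in the original experiment is not stated):

```python
import torch

model = torch.nn.Linear(10, 1)                            # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Halve the learning rate at epochs 20 and 30, as in the experiment above.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20, 30], gamma=0.5)

for epoch in range(40):
    # ... run one training epoch here ...
    scheduler.step()  # step the scheduler once per epoch
```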
NOTE: Training with a low learning rate in the final stages of training can lead to over-fitting compared with training at a higher learning rate, because a small learning rate makes very small changes to the weights, which may end up tuning them to the training data only. On the other hand, too large a learning rate would not let the network reach its optimal error. Hence, maintain a balance when training with a smaller learning rate towards the end of training. In a nutshell, do not train with a small learning rate for longer than is required for optimal convergence.