Regularization

L1 vs L2

From the perspective of constraint

We should view the regularization term in the loss function as a Lagrange multiplier term, as in the method of Lagrange multipliers: the regularization term plays the role of the constraint. The L1 and L2 constraints have differently shaped boundaries, as shown in the articles below. In two dimensions the L2 constraint boundary is a circle, while the L1 constraint boundary is a diamond.
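As a sketch of this correspondence, the constrained problem can be written as follows. The notation is assumed here for illustration and is not taken from the linked articles: $L(w)$ is the unregularized loss, $t$ is the constraint budget, and $\lambda \ge 0$ is the Lagrange multiplier.

$$\min_w L(w) \quad \text{subject to} \quad R(w) \le t, \qquad R(w) = \sum_i |w_i| \ \text{(L1)} \quad \text{or} \quad R(w) = \sum_i w_i^2 \ \text{(L2)}$$

$$\mathcal{L}(w, \lambda) = L(w) + \lambda \left( R(w) - t \right)$$

Because $\lambda t$ does not depend on $w$, minimizing $\mathcal{L}$ over $w$ for a fixed $\lambda$ is the same as minimizing the familiar penalized loss $L(w) + \lambda R(w)$. For two weights, the boundary $|w_1| + |w_2| = t$ has its corners on the axes, while $w_1^2 + w_2^2 = t$ is a circle of radius $\sqrt{t}$.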

Dealing With Multicollinearity: A Brief Overview and Introduction to Tolerant Methods (Water Programming: A Collaborative Research Blog)

Regularization in simple math explained (Data Science Stack Exchange)

A visual explanation of linear model regularization (the_antlr_guy)

From the perspective of gradient penalty

The gradient of the L1 term $|w|$ is $\mathrm{sign}(w)$, so its magnitude is always $1$, irrespective of the magnitude of the weight (at $w = 0$ only a subgradient exists). Hence there is always a gradient penalty in the total loss pushing the weight toward 0, unless the gradient of the actual loss is strong enough to hold the weight at a nonzero value where the total loss is minimized. Weights corresponding to features that do not influence the output therefore naturally settle at zero.

For L2 regularization, by contrast, the gradient of the term $w^2$ is $2w$, which depends on the value of the weight. The gradient penalty shrinks as the weight shrinks, so for small weights ($|w| < 1/2$) the gradient penalty of L2 regularization is smaller than that of L1. Hence L2 regularization makes weights small but does not take them all the way to 0.
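The difference can be seen in a minimal sketch that applies gradient descent to the penalty term alone, with no data loss, so only the shrinking effect is visible. The function names, `lam`, `lr`, and `steps` are hypothetical choices made for this illustration, and the rule that stops a weight at zero instead of letting it cross zero is an assumption added to keep the L1 update from oscillating.

```python
def l1_penalty_grad(w, lam):
    """(Sub)gradient of lam * |w|: magnitude lam whenever w != 0."""
    sign = (w > 0) - (w < 0)  # +1, -1, or 0 at w == 0
    return lam * sign


def l2_penalty_grad(w, lam):
    """Gradient of lam * w**2: proportional to w, so it fades as w shrinks."""
    return 2.0 * lam * w


def shrink(w0, penalty_grad, lam=0.1, lr=0.1, steps=200):
    """Run gradient descent on the penalty only, starting from w0."""
    w = w0
    for _ in range(steps):
        step = lr * penalty_grad(w, lam)
        # Assumption for this sketch: if the step would reach or cross
        # zero, stop exactly at zero instead of overshooting.
        if abs(step) >= abs(w):
            w = 0.0
            break
        w = w - step
    return w


if __name__ == "__main__":
    print("L1:", shrink(1.0, l1_penalty_grad))
    print("L2:", shrink(1.0, l2_penalty_grad))
```

With these settings the L1 update removes a fixed amount from the weight at every step and lands on exactly zero, while the L2 update removes a fixed fraction, so the weight keeps getting smaller but stays nonzero.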

A visual explanation of linear model regularization (the_antlr_guy)

Why L1 norm creates Sparsity compared with L2 norm (Medium)


Last updated 2 years ago