What is regularization?
In machine learning, regularization is any modification made to a learning algorithm that is intended to reduce its generalization error.
It is most often used to combat overfitting by adding a penalty term to the cost function. For linear regression with an L2 penalty, this amounts to adding λI to XᵀX in the closed-form solution for θ.
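To see where the λI term comes from, here is a short derivation sketch. It assumes a squared-error loss with an L2 penalty and no separate intercept term; the exact scaling of the loss is an assumption, since the text does not state it.

```latex
\begin{aligned}
J(\theta) &= \lVert X\theta - y \rVert_2^2 + \lambda \lVert \theta \rVert_2^2 \\
\nabla_\theta J(\theta) &= 2X^\top (X\theta - y) + 2\lambda\theta = 0 \\
(X^\top X + \lambda I)\,\theta &= X^\top y \\
\theta &= (X^\top X + \lambda I)^{-1} X^\top y
\end{aligned}
```

For any λ > 0 the matrix XᵀX + λI is positive definite, so the inverse always exists.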
What is a regularizer?
A regularizer is the penalty term that we add to the cost function of our model. Its weight λ is a hyper-parameter called the regularization parameter. It controls the trade-off between fitting the training data well and keeping the parameters small to avoid overfitting.
Why add a regularizer to linear regression?
Take a linear regression problem in which we search for the θ that minimizes the cost function. The closed-form solution requires the inverse (XᵀX)⁻¹, and a problem arises when XᵀX is not invertible, for example when some features are perfectly correlated or when there are more features than samples. In this case, adding a regularizer solves the problem.
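The invertibility issue can be sketched in a few lines of NumPy. The toy data and the helper name `ridge_closed_form` below are hypothetical, chosen only for illustration; the second feature duplicates the first so that XᵀX is singular.

```python
import numpy as np

# Hypothetical toy data: the second column duplicates the first,
# so X^T X has rank 1 and cannot be inverted.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

gram = X.T @ X
rank = np.linalg.matrix_rank(gram)
print("rank of X^T X:", rank)  # lower than the number of columns (2)


def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y for theta."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # np.linalg.solve is preferred over forming the inverse explicitly.
    return np.linalg.solve(A, X.T @ y)


theta = ridge_closed_form(X, y, lam=0.1)
print("ridge solution:", theta)
```

With λ = 0.1 the system becomes solvable, and the two identical features receive the same coefficient.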
Techniques of Regularization
Two widely used regularization techniques are Ridge Regression and Lasso Regression. They differ in the way they assign a penalty to the coefficients θ.
- L2 Regularization or Ridge
With this technique, we add the sum of the squared weights to the loss function, which gives a new loss function:
J(θ) = Loss(θ) + λ2 · Σⱼ θⱼ²
- L1 Regularization or LASSO
Here we add the sum of the absolute values of the weights to the loss function instead:
J(θ) = Loss(θ) + λ1 · Σⱼ |θⱼ|
A regression model that uses the L1 regularization technique is called Lasso (Least Absolute Shrinkage and Selection Operator) Regression, and a model that uses L2 is called Ridge Regression.
With L2 regularization, the term λ2⋅‖θ‖₂² is added to the loss function; with L1 regularization, the term λ1⋅‖θ‖₁ is added instead. λ1 and λ2 are hyper-parameters.
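As a minimal sketch, the two penalty terms can be written directly in NumPy. The function names `l1_penalty` and `l2_penalty` and the example vector are hypothetical, introduced only for illustration.

```python
import numpy as np


def l1_penalty(theta, lam1):
    """lam1 * ||theta||_1: weighted sum of absolute values."""
    return lam1 * np.sum(np.abs(theta))


def l2_penalty(theta, lam2):
    """lam2 * ||theta||_2^2: weighted sum of squares."""
    return lam2 * np.sum(theta ** 2)


theta = np.array([3.0, -4.0, 0.0])
print(l1_penalty(theta, 0.5))  # 0.5 * (3 + 4 + 0) = 3.5
print(l2_penalty(theta, 0.5))  # 0.5 * (9 + 16 + 0) = 12.5
```

Either value is simply added to the unregularized loss before minimizing. Note how the squared penalty grows much faster for large weights.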
L2 regularization encourages the network to use all of its inputs a little, rather than some of the inputs a lot, while L1 regularization encourages the network to learn sparse weight vectors (which can be used, e.g., for feature selection tasks).
Ridge treats correlated variables as a group: they are assigned similar coefficients, and all coefficients are shrunk. Lasso, in contrast, performs variable selection and therefore reduces the number of dimensions.
In the geometric picture of L1 and L2 in the image above, the contours of the unconstrained model (the ellipses) expand until they touch the L1 or L2 constraint region. With L1, the contours typically touch the diamond-shaped region at a corner that lies on one feature axis (w2 in the image above). At that point w1 is exactly zero, so L1 keeps w2 and drops w1.
With L2, the constraint region is a circle with no corners, so the solution does not rely on one particular feature: the contact point generally lies between the w1 and w2 axes, and both weights stay non-zero.
Lasso regression differs from ridge regression in that its penalty function uses absolute values rather than squares. Penalizing the sum of the absolute values of the estimates (or, equivalently, constraining it) causes some of the parameter estimates to turn out exactly zero.
The stronger the penalty, the more the estimates shrink towards zero. This makes lasso useful for selecting variables out of a given set of n variables.
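The shrinkage behaviour can be illustrated with a small sketch under a simplifying assumption that is not part of the original text: the features are orthonormal (XᵀX = I), the lasso objective is ½‖y − Xθ‖² + λ‖θ‖₁, and the ridge objective is ½‖y − Xθ‖² + (λ/2)‖θ‖². In that special case both solutions have closed forms in terms of the unregularized estimates. The names `lasso_shrink` and `ridge_shrink` and the example values are hypothetical.

```python
import numpy as np


def lasso_shrink(theta_ols, lam):
    """Soft-thresholding: move each estimate lam towards zero, clipping at zero."""
    return np.sign(theta_ols) * np.maximum(np.abs(theta_ols) - lam, 0.0)


def ridge_shrink(theta_ols, lam):
    """Proportional shrinkage: non-zero estimates never become exactly zero."""
    return theta_ols / (1.0 + lam)


# Hypothetical unregularized estimates for three features.
theta_ols = np.array([3.0, -1.5, 0.4])

for lam in (0.5, 1.0, 2.0):
    print("lambda =", lam)
    print("  lasso:", lasso_shrink(theta_ols, lam))
    print("  ridge:", ridge_shrink(theta_ols, lam))
```

As λ grows, more lasso estimates are set exactly to zero, while the ridge estimates only get smaller. Outside the orthonormal case there is no such simple formula for lasso, and an iterative solver is needed.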
So we have discussed regularization and built some intuition for it. We saw that L1 leads to sparsity, while L2 keeps all features and spreads the weight across them with small coefficients.
Please leave comments, feedback, and suggestions if you have any.