Regression is a set of statistical approaches used for approximating the relationship between a dependent variable and one or more independent variables. The term “regression” was coined by Francis Galton to describe the phenomenon of the heights of descendants of tall ancestors regressing down towards the normal average, i.e., regression to the mean. The method itself, however, predates the term: Legendre and Gauss had already used the least-squares method to determine the orbits of celestial bodies around the Sun.
Today regression is mainly used for two purposes. First, it is used for prediction and forecasting problems. Second, it is used to map the causality of factors, that is, to infer the cause-and-effect relationship between the dependent and independent variables. But aren’t those two the same thing? No, not exactly. You see, regression on its own can only infer the relationships between the independent and dependent variables within a limited dataset. To use regression for prediction, data scientists need to show that relationships inferred from a sample (the dataset) have predictive power in a new context (the wider population). This can be accomplished with a series of statistical methods that test whether the dataset belongs to the population’s distribution.
The most widely used variant of regression is linear regression, which finds the line (or, with more than one independent variable, the hyperplane) that most closely fits the data. This is illustrated in the plot below.
This line allows us to estimate the value of the dependent variable when the independent variables take on a given set of values. It can be formulated as:
$$\hat{y} = w_0 + \sum_{i=1}^{n} w_i X_i$$
Where wi represents the coefficients of the independent variables Xi, w0 is the intercept, and ŷ is the predicted value of the dependent variable. But there are infinitely many possible lines, so how do we decide which is the best? For the purposes of modelling/training, the performance of machine learning methods is measured using loss functions. In the case of linear regression, the loss function indicates how well the line fits the data. Let’s take the example of the mean squared error (MSE), which averages the squared differences between the actual values and the predicted values over the N examples in the dataset:
$$\mathrm{MSE} = \frac{1}{N} \sum_{j=1}^{N} \left( y_j - \hat{y}_j \right)^2$$
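To make the loss function concrete, here is a minimal Python sketch of a linear model and its mean squared error. The function names `predict` and `mean_squared_error` and the toy data are assumptions made up for illustration; they are not taken from any particular library.

```python
def predict(x, w0, w):
    """Linear model: w0 + sum_i w[i] * x[i] for a single example x."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x))


def mean_squared_error(X, y, w0, w):
    """Average of the squared differences between actual and predicted values."""
    errors = [(yj - predict(xj, w0, w)) ** 2 for xj, yj in zip(X, y)]
    return sum(errors) / len(errors)


# Toy data generated from y = 1 + 2x (hypothetical, for illustration only).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]

good_fit = mean_squared_error(X, y, w0=1.0, w=[2.0])  # the generating line
poor_fit = mean_squared_error(X, y, w0=0.0, w=[1.0])  # a worse candidate line
```

The line that generated the data gets a loss of zero, while the other candidate gets a larger loss, which is how the loss function lets us rank competing lines.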
Okay, so now we have a way to formulate our objective, the linear function, and a means to check how well it performs, the loss function. But how do we find the optimal parameters, w0 and wi, that correspond to the best-fitted line? The brute-force approach would be to try out all possible combinations, but the coefficients are real numbers and can take an infinite number of values. To overcome this problem, we use the gradient descent algorithm, which uses the derivatives of the loss function to find its minimum.
Visually speaking, let’s say that the loss function plotted in 3D space looks like the plot above. Gradient descent starts at a random point on the loss function’s surface and uses the derivatives to determine the direction of the minimum and move towards it, one small step at a time. It can be thought of as a ball placed on the loss surface and rolling downhill.
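The update rule can be sketched in a few lines of Python. This is a minimal illustration for a single independent variable, assuming the mean squared error as the loss; the function name `gradient_descent`, the learning rate, the step count, and the toy data are all hypothetical choices rather than recommended settings.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y ~ w0 + w1 * x by gradient descent on the mean squared error."""
    # A fixed starting point keeps the sketch reproducible; a random one works too.
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to w0 and w1.
        grad_w0 = 2.0 / n * sum(residuals)
        grad_w1 = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        # Step against the gradient, i.e. downhill on the loss surface.
        w0 -= learning_rate * grad_w0
        w1 -= learning_rate * grad_w1
    return w0, w1


# Toy data generated from y = 1 + 2x (hypothetical).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w0, w1 = gradient_descent(xs, ys)  # approaches w0 = 1, w1 = 2
```

The learning rate controls the size of each step: too large and the ball overshoots the valley, too small and it takes a long time to reach the bottom.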
More often than not, the dependent and independent variables will have a non-linear relationship. In such use-cases, linear regression fails to fit the data. To overcome this problem, a non-linear function is used instead of a linear function. This variant of regression is called polynomial regression.
Polynomial regression improves the linear model by introducing extra predictors obtained by raising the linear predictors to a certain power. For instance, a quadratic regression would have two terms, X and X², as predictors for each independent variable. This method enables the model to learn non-linear relationships.
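Here is a rough Python sketch of that idea for one independent variable: expand X into powers of X, then fit an ordinary linear model on the expanded predictors. It solves the least-squares normal equations directly instead of using gradient descent, purely to keep the example short. The helpers `polynomial_features`, `solve`, and `fit_polynomial` are hypothetical names written for this illustration.

```python
def polynomial_features(x, degree):
    """Expand a single predictor x into [1, x, x**2, ..., x**degree]."""
    return [x ** p for p in range(degree + 1)]


def solve(A, b):
    """Solve the square system A w = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_polynomial(xs, ys, degree):
    """Least-squares fit via the normal equations (Phi^T Phi) w = Phi^T y."""
    Phi = [polynomial_features(x, degree) for x in xs]
    k = degree + 1
    A = [[sum(row[i] * row[j] for row in Phi) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * y for row, y in zip(Phi, ys)) for i in range(k)]
    return solve(A, b)


# Toy data generated from y = 3 - x + 0.5 * x**2 (hypothetical).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [3.0 - 1.0 * x + 0.5 * x ** 2 for x in xs]
w = fit_polynomial(xs, ys, degree=2)  # approximately [3.0, -1.0, 0.5]
```

Note that the model is still linear in its coefficients; only the predictors are non-linear, which is why the same fitting machinery carries over.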
Ridge and Lasso Regression
So far, we have only been concerned with fitting the data well, but there’s more to a good machine learning model than just that. The other issue we need to address is overfitting; this is when the model performs really well on the training dataset, the sample, and not as well on the test data, the actual population. Regularization techniques offer an easy workaround to this issue. As far as regression tasks are concerned, there are two prominent regularized variants of regression: Ridge and Lasso. Both of these impose a constraint on the coefficients of the independent variables; reducing the magnitude of the coefficients helps reduce model complexity and the effects of multi-collinearity.
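Ridge regression, also known as L2 regularization, modifies the cost function by adding a penalty proportional to the sum of the squared coefficients:
$$L_{\text{ridge}} = \frac{1}{N} \sum_{j=1}^{N} \left( y_j - \hat{y}_j \right)^2 + \lambda \sum_{i=1}^{n} w_i^2$$
Here y_j and ŷ_j are the actual and predicted values for example j, and λ controls the strength of the penalty. The lambda term regularizes the coefficients so that if the coefficients take large values, the loss function is penalized. Lasso (Least Absolute Shrinkage and Selection Operator) regression is very similar to ridge regression; the only difference is that it uses the magnitudes of the coefficients instead of their squares:
$$L_{\text{lasso}} = \frac{1}{N} \sum_{j=1}^{N} \left( y_j - \hat{y}_j \right)^2 + \lambda \sum_{i=1}^{n} \lvert w_i \rvert$$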
This type of regularization (L1) often leads to zero coefficients, i.e. some of the independent variables are completely ignored. Therefore, not only does lasso regression help reduce over-fitting, but it also doubles as a feature selection technique.
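The difference between the two penalties can be seen in a small Python sketch. It minimizes the mean squared error plus lambda times the penalty (sum of squares for ridge, sum of magnitudes for lasso) using coordinate descent, which is one common way to fit such models and is swapped in here for brevity. The function names, the value of `lam`, and the toy data are assumptions for illustration, and the data is assumed to be centred so that no intercept is needed.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values within [-t, t] become exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def coordinate_descent(X, y, lam, penalty, sweeps=100):
    """Minimise mean((y - Xw)^2) + lam * penalty(w), one coefficient at a time.

    penalty is "ridge" (sum of squares) or "lasso" (sum of magnitudes).
    Assumes centred data, so no intercept is fitted.
    """
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    for _ in range(sweeps):
        for k in range(n_features):
            # Residuals with feature k's own contribution left out.
            partial = [
                yj - sum(w[i] * xj[i] for i in range(n_features) if i != k)
                for xj, yj in zip(X, y)
            ]
            rho = sum(xj[k] * r for xj, r in zip(X, partial))
            z = sum(xj[k] ** 2 for xj in X)
            if penalty == "ridge":
                w[k] = rho / (z + n * lam)
            elif penalty == "lasso":
                w[k] = soft_threshold(rho, n * lam / 2.0) / z
            else:
                raise ValueError("penalty must be 'ridge' or 'lasso'")
    return w


# Toy data: y depends strongly on feature 0 and only weakly on feature 1.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [3.0 * a + 0.1 * b for a, b in X]

ridge_w = coordinate_descent(X, y, lam=0.25, penalty="ridge")
lasso_w = coordinate_descent(X, y, lam=0.25, penalty="lasso")
```

On this toy data, ridge shrinks both coefficients but keeps them non-zero, while lasso sets the weak second coefficient to exactly zero, which is the feature selection behaviour described above.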
In addition to using polynomial regression with regularization, there’s another approach for fitting non-linear data using regression – binning. Instead of considering the whole dataset at once, spline regression divides the dataset into bins and creates separate models for each bin.
Dividing the data into separate pieces allows the model to fit linear or low-degree polynomial functions to each piece. The points where the data is split are called knots, and the functions fitted to the individual bins, joined at the knots, form a piecewise function known as a spline.
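The binning idea can be sketched in Python as follows. This is a simplified illustration that fits an independent straight line in each bin; a proper spline additionally constrains the pieces to join smoothly at the knots, which this sketch does not do. The function names and the toy data are hypothetical, and each bin is assumed to contain at least two distinct x values.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line for one bin; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_piecewise(xs, ys, knots):
    """Fit one line per bin; bins are separated by the sorted knot positions."""
    edges = [float("-inf")] + sorted(knots) + [float("inf")]
    models = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        points = [(x, y) for x, y in zip(xs, ys) if lo <= x < hi]
        bin_x, bin_y = zip(*points)
        models.append((lo, hi, fit_line(bin_x, bin_y)))
    return models


def predict_piecewise(models, x):
    """Evaluate the line belonging to the bin that contains x."""
    for lo, hi, (intercept, slope) in models:
        if lo <= x < hi:
            return intercept + slope * x
    raise ValueError("x falls outside every bin")


# Toy V-shaped data (y = |x|) that a single straight line cannot fit.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0]
models = fit_piecewise(xs, ys, knots=[0.0])

left = predict_piecewise(models, -1.5)   # uses the line fitted to x < 0
right = predict_piecewise(models, 2.5)   # uses the line fitted to x >= 0
```

With a single knot at zero, each bin gets its own slope, so the two simple lines together capture a shape that one line could not.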
Alright, so regression can be used to estimate continuous dependent variables, but can it work in classification tasks? Yes, but not on its own. Instead of regressing to the best fit for the data, the line regresses to the optimal decision boundary for separating two classes. Logistic regression does this by taking the line equation and composing it with the logistic/sigmoid function g:
$$h = g(H) = \frac{1}{1 + e^{-H}}, \qquad H = w_0 + \sum_{i=1}^{n} w_i X_i$$
The goal remains the same, to find the best w0 and wi for the data. Logistic regression predicts the probability of the default class. If we consider the example of a binary classification problem, the output h is the predicted probability that an example x is a positive match in the classification task, given by:
$$h = P(y = 1 \mid x;\, w)$$
When this probability is at least 0.5, we can classify the example as the default class. The probability is at least 0.5 when g(H) ≥ 0.5, where g is the sigmoid function, and this is true when H = w0 + Σ wi Xi ≥ 0. This makes the hyperplane w0 + Σ wi Xi = 0 the decision boundary.
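As a rough illustration, the Python sketch below fits a logistic regression with one independent variable by gradient descent on the average log loss (cross-entropy), a loss commonly paired with the sigmoid. The function names, learning rate, step count, and toy data are assumptions for this example only.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w0, w):
    """Predicted probability of the positive class for one example x."""
    return sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))


def fit_logistic(X, y, learning_rate=0.1, steps=10000):
    """Gradient descent on the average log loss."""
    w0, w = 0.0, [0.0] * len(X[0])
    n = len(X)
    for _ in range(steps):
        # For the log loss, the gradient is driven by (prediction - label).
        errors = [predict_proba(xj, w0, w) - yj for xj, yj in zip(X, y)]
        new_w0 = w0 - learning_rate * sum(errors) / n
        w = [
            wi - learning_rate * sum(e * xj[i] for e, xj in zip(errors, X)) / n
            for i, wi in enumerate(w)
        ]
        w0 = new_w0
    return w0, w


# Toy data: larger x values tend to belong to class 1 (hypothetical).
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

w0, w = fit_logistic(X, y)
p_low = predict_proba([0.5], w0, w)    # probability of class 1 for a small x
p_high = predict_proba([4.0], w0, w)   # probability of class 1 for a large x
boundary = -w0 / w[0]                  # the x where w0 + w1 * x = 0
```

Examples on one side of `boundary` get a probability below 0.5 and examples on the other side get a probability above it, which is the decision boundary described above reduced to a single point on the x axis.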
All clear on the theory and itching to write some code? Here are a few more posts to help you get started with implementing regression models using various tools and languages:
- Hands-On Linear Regression Using Sklearn
- How To Do Linear Regression In Excel
- A Hands-On Guide To Regression With Fast.ai
- How To Code Linear Regression Models With R