Machine learning is the process of developing an algorithm that learns patterns from training data and performs inference on test data. If a machine learning process is meant to predict some output value, it is called supervised learning. On the other hand, if there is no output value to predict, it is called unsupervised learning.
Training data in supervised learning contains a set of features and a target. The machine learning algorithm learns a mapping from the features to the corresponding targets. Test data contains only features, and the model must predict the targets for them. Features and targets are also called independent variables and dependent variables, respectively. Training data in unsupervised learning contains only features and no target. Rather than mapping features to targets as in supervised learning, an unsupervised learning model groups (clusters) the input data based on the patterns among them.
Supervised learning is classified into two categories: regression and classification.
Supervised learning is called regression if the dependent variable (aka target) is continuous. Supervised learning is called classification if the dependent variable is discrete. In other words, a regression model outputs a numerical value (a real number), but a classification model outputs a class (among two or more classes).
In this article, we discuss linear regression and its implementation in Python. Regression analysis is specifically termed linear regression if the dependent variable (target) has a linear relationship with the independent variables (features).
The Math behind Linear Regression
Suppose a collection of data has two variables: one is the independent variable (X), and another is the dependent variable (Y).
If the relationship between Y and X can be expressed as:
Y = mX + c, this is called linear regression. Here, X is scaled by a weight m to determine the value of Y, and c is the bias, or y-intercept, which offsets the relationship. A machine learning model has to determine the most suitable values for the weight m and the bias c. If there is more than one independent variable, there will be a corresponding number of weights, w1, w2, w3, and so on.
Typically, a machine learning problem contains a large amount of data. An iteratively trained linear regression model assigns random values to the weights and bias at the beginning. When learning commences, the model is fed a data point at each step. It applies the current weights to the X values and predicts the target. Since the weights are randomly assigned initially, the predicted target will differ greatly from the actual target. The model calculates the difference between the actual target value and the predicted target value, which is called the loss. The model then adjusts the weights in the direction that reduces this loss (for example, by gradient descent). With each data point, the model iteratively moves toward the weights that yield minimum loss. Note that SciKit-Learn's LinearRegression, used later in this article, solves the least-squares problem directly rather than iterating, but it minimizes the same loss.
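The iterative update described above can be sketched with plain gradient descent. This is only an illustration under stated assumptions: the synthetic data (y = 3x + 2), the learning rate, and the number of steps are made up for the example, and the gradient is averaged over all points at each step (full-batch) for brevity instead of feeding one data point at a time.

```python
import numpy as np

# Synthetic, noise-free data following y = 3x + 2 (illustrative, not from the article)
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 2.0

rng = np.random.default_rng(0)
m, c = rng.normal(size=2)    # random initial weight and bias
learning_rate = 0.1          # assumed step size

for _ in range(5000):
    error = (m * x + c) - y              # predicted minus actual
    grad_m = 2.0 * np.mean(error * x)    # derivative of MSE with respect to m
    grad_c = 2.0 * np.mean(error)        # derivative of MSE with respect to c
    m -= learning_rate * grad_m          # step against the gradient
    c -= learning_rate * grad_c

final_mse = np.mean((m * x + c - y) ** 2)
print(m, c, final_mse)
```

After enough steps, m and c should settle close to the values used to generate the data.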
The most commonly used losses for regression are mean absolute error (MAE) and mean squared error (MSE). Mean absolute error is the mean of the absolute differences between predicted and actual target values over all data points. Mean squared error is the mean of the squared differences between predicted and actual target values over all data points. Linear regression employs mean squared error (MSE) as its loss function. When learning is finished, the loss value will be at its minimum. In other words, the predicted values will be as close as possible to the actual target values.
In what follows, we build a better understanding through a practical problem and a hands-on Python implementation.
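As a small worked example of the two losses, the sketch below computes MAE and MSE by hand with NumPy. The four actual and predicted values are invented purely for illustration.

```python
import numpy as np

# Toy values chosen for illustration only
actual = np.array([3.0, 5.0, 2.0, 7.0])
predicted = np.array([2.0, 5.0, 4.0, 8.0])

errors = predicted - actual        # [-1, 0, 2, 1]
mae = np.mean(np.abs(errors))      # (1 + 0 + 2 + 1) / 4
mse = np.mean(errors ** 2)         # (1 + 0 + 4 + 1) / 4
print(mae, mse)                    # 1.0 1.5
```

Squaring makes MSE penalize the single large error (2) more heavily than MAE does.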
Load a Regression Dataset
Import necessary libraries and modules.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes

Load a regression problem dataset from SciKit-Learn’s built-in datasets. The data is already preprocessed and normalized, and is ready to use.

data = load_diabetes()
data.keys()

Generate features and target. Display the first 5 rows of the features.

features = pd.DataFrame(data['data'], columns=data['feature_names'])
target = pd.Series(data['target'], name='target')
features.head()
Simple Linear Regression
Simple linear regression is performed with one dependent variable and one independent variable. In our data, we declare the feature ‘bmi’ to be the independent variable.
Prepare X and y.
X = features['bmi'].values.reshape(-1,1)
y = target.values.reshape(-1,1)

Perform linear regression.

simple = LinearRegression()
simple.fit(X,y)
The training is completed. We can explore the weight (coefficient) and bias (intercept) of the trained model.
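A minimal sketch of inspecting the fitted parameters follows. It rebuilds X and y with NumPy so that it runs on its own. `coef_` and `intercept_` are the attributes a fitted LinearRegression exposes; the names `weight` and `bias` are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes()
bmi_index = data['feature_names'].index('bmi')
X = data['data'][:, [bmi_index]]       # shape (n_samples, 1)
y = data['target'].reshape(-1, 1)

simple = LinearRegression()
simple.fit(X, y)

weight = simple.coef_        # shape (1, 1) because y is two-dimensional
bias = simple.intercept_     # shape (1,)
print('weight (coefficient):', weight)
print('bias (intercept):', bias)
```

Because the features in this dataset are mean-centered, the intercept should come out close to the mean of the target.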
Calculate the predictions following the formula, y = intercept + X*coefficient.
calc_pred = simple.intercept_ + (X*simple.coef_)
Predictions can also be calculated using the trained model.
pred = simple.predict(X)
We can check whether the calculated predictions and model’s predictions are identical.
(calc_pred == pred).all()
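Exact equality between floating-point arrays can be brittle, so a tolerance-based comparison may be a safer way to run the same check. The sketch below assumes the same single-feature setup as above, rebuilt so that it is self-contained, and uses `np.allclose`.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes()
X = data['data'][:, [data['feature_names'].index('bmi')]]
y = data['target'].reshape(-1, 1)

simple = LinearRegression().fit(X, y)

calc_pred = simple.intercept_ + (X * simple.coef_)   # manual formula
pred = simple.predict(X)                             # model's own prediction

same = np.allclose(calc_pred, pred)   # True when equal within a small tolerance
print(same)
```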
Plot the actual values and predicted values to get a better understanding.
# plot actual values
plt.scatter(X,y, label='Actual')
# plot predicted values
plt.plot(X,pred, '-r', label='Prediction')
plt.xlabel('Feature X')
plt.ylabel('Target y')
plt.title('Simple Linear Regression', color='orange', size=14)
plt.legend()
plt.show()

According to SciKit-Learn’s LinearRegression estimator, the red line above is the best possible straight-line fit, that is, the one with the minimum squared error.
We can calculate the mean squared error value for the above regression using the following code.
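One way to compute this, sketched under the assumption that `mean_squared_error` from `sklearn.metrics` is acceptable here, is shown below. The setup is repeated so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

data = load_diabetes()
X = data['data'][:, [data['feature_names'].index('bmi')]]
y = data['target'].reshape(-1, 1)

simple = LinearRegression().fit(X, y)
pred = simple.predict(X)

mse = mean_squared_error(y, pred)   # mean of squared differences
print(mse)
```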
This error value looks large because of the nature of the data. It can be observed from the above plot that the target takes many different values for nearly the same feature value. The data is highly scattered and cannot be fit completely with a straight line. Still, we would like to judge how good the fit is, and the MSE alone is hard to interpret because it depends on the scale of the target.
A metric called the coefficient of determination (CoD) is helpful in this case. CoD is the ratio of the regression sum of squares to the total sum of squares. The total sum of squares (SST) is the sum of squared deviations of each y value from the mean value of y. The regression sum of squares (SSR) is the difference between the total sum of squares and the sum of squared errors (SSE), so CoD = SSR / SST = 1 - SSE / SST. When there is no error (SSE = 0), CoD becomes unity. When the sum of squared errors equals the total sum of squares (SSE = SST), CoD becomes zero.
CoD = 1 refers to the best prediction
CoD = 0 refers to a prediction no better than always predicting the mean of y
For a least-squares fit with an intercept, evaluated on the data it was trained on, CoD lies in [0,1], which makes fits comparable. (On unseen data it can become negative if the model does worse than predicting the mean.) CoD is also called the R-squared value. It can be calculated using the following code.
Because the data is so scattered, an R-squared value of about 0.34 is the best that a straight-line fit on this single feature achieves.
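The sketch below computes R-squared in two ways: by hand from SST and SSE as defined above, and with the estimator's `score` method, which returns R-squared for regressors. The setup is rebuilt so the block is self-contained, and the names `ss_total` and `ss_error` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes()
X = data['data'][:, [data['feature_names'].index('bmi')]]
y = data['target'].reshape(-1, 1)

simple = LinearRegression().fit(X, y)
pred = simple.predict(X)

ss_total = np.sum((y - y.mean()) ** 2)   # SST: squared deviations from the mean
ss_error = np.sum((y - pred) ** 2)       # SSE: squared prediction errors
r2_manual = 1.0 - ss_error / ss_total    # CoD = 1 - SSE / SST

r2_sklearn = simple.score(X, y)          # R-squared as reported by scikit-learn
print(r2_manual, r2_sklearn)
```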
Multiple Linear Regression
Multiple linear regression is performed with more than one independent variable. We choose the following columns as our features.
columns = ['age', 'bmi', 'bp', 's3', 's5']
Let’s have a look at the data distribution by plotting it.
for i in columns:
    plt.scatter(features[i], y)
    plt.xlabel(str(i))
    plt.show()

Each individual feature is observed to be highly scattered against the target. However, the variation in target values for a single input feature value may be explained by some other features. In other words, a linear regression model may struggle to fit the target with a single feature, yet yield an improved fit with multiple features by capturing more of the true pattern in the data.
In the simple linear regression implementation, we used all our data to fit the model. But how can we test our model? How well will our model perform on unseen data? This is where the train-test split comes into play. We split our dataset into two sets: a training set and a validation set. We train our model with the training data only and evaluate it with the validation set.
Let’s split the dataset into training and validation sets.
from sklearn.model_selection import train_test_split

X = features[columns]
# 80% training data, 20% validation data (test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=6)

Build a linear regression model and fit the data.

multi = LinearRegression()
multi.fit(X_train, y_train)

What are the weights (coefficients) of our model? There should be five coefficients, one for each feature.
What is the intercept (bias) of our model?
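Both questions can be answered by reading the fitted estimator's `coef_` and `intercept_` attributes. The sketch below repeats the split and fit with NumPy column indexing so that it runs on its own. The helper name `col_idx` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = load_diabetes()
columns = ['age', 'bmi', 'bp', 's3', 's5']
col_idx = [data['feature_names'].index(c) for c in columns]
X = data['data'][:, col_idx]
y = data['target'].reshape(-1, 1)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=6)

multi = LinearRegression().fit(X_train, y_train)

# One coefficient per feature, in the same order as `columns`
for name, w in zip(columns, multi.coef_.ravel()):
    print(name, w)
print('intercept:', multi.intercept_)
```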
We have built and trained our model. Let’s predict the target values corresponding to the features in the validation data.
pred = multi.predict(X_val)
We can evaluate the model by calculating the error or R-squared value.
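For the error, a minimal sketch using `mean_squared_error` from `sklearn.metrics` on the validation set could look like the following. The data preparation is repeated so the block is self-contained, and the variable name `val_mse` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

data = load_diabetes()
columns = ['age', 'bmi', 'bp', 's3', 's5']
col_idx = [data['feature_names'].index(c) for c in columns]
X = data['data'][:, col_idx]
y = data['target'].reshape(-1, 1)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=6)

multi = LinearRegression().fit(X_train, y_train)
pred = multi.predict(X_val)

val_mse = mean_squared_error(y_val, pred)   # error on data the model has not seen
print(val_mse)
```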
Calculate the R-squared value for both training set and validation set.
multi.score(X_train, y_train), multi.score(X_val, y_val)
With more features, the model’s performance improves over the single-feature fit.
Using statsmodels Library
We have used the SciKit-Learn library so far to perform linear regression. We can also use the statsmodels library to perform the same task. Fit the training data with the OLS (Ordinary Least Squares) model available in the statsmodels library.

import statsmodels.api as sm

# add constant (intercept) manually
X_train = sm.add_constant(X_train)
# fit training data
model = sm.OLS(y_train, X_train).fit()
model.summary()
It can be observed that the model weights, intercept and the R-squared value are all identical to the Linear Regression method of the SciKit-Learn library.
The model can be used to make predictions on the validation data too.

from sklearn.metrics import mean_squared_error

# Constant (intercept) must be added manually
X_val = sm.add_constant(X_val)
preds = model.predict(X_val)
mean_squared_error(y_val, preds)
The errors are the same for both the methods!
This notebook contains the above code implementation.
In this article, we have discussed machine learning, its classification, and categorization of supervised learning based on the nature of dependent variables. Further, we explored simple linear regression and multiple linear regression with examples using the SciKit-Learn library. We performed the same task with the statsmodels library and obtained the same results.