# Beginner's Guide To Linear Regression In Python

Linear regression is a machine learning task that finds a linear relationship between a set of features and a continuous target variable.

Machine learning is the process of developing an algorithm that learns patterns from training data and performs inference on test data. If a machine learning process is meant to predict some output value, it is called supervised learning. If, on the other hand, there is no output value to predict, it is called unsupervised learning.

Training data in supervised learning contains a set of features and a target. The machine learning algorithm learns to map the features to their corresponding targets. Test data contains only features, and the model must predict the targets. Features and targets are also called independent variables and dependent variables, respectively. Training data in unsupervised learning contains only features and no target. Rather than mapping features to targets as in supervised learning, an unsupervised learning model typically clusters (groups) the input data based on the patterns among them.

Supervised learning is classified into two categories:

Supervised learning is called regression if the dependent variable (aka target) is continuous, and classification if the dependent variable is discrete. In other words, a regression model outputs a numerical value (a real, floating-point number), whereas a classification model outputs a class (one of two or more classes).

In this article, we discuss linear regression and its implementation in Python. Regression analysis can be specifically termed linear regression if the dependent variable (target) has a linear relationship with the independent variables (features).

## The Math behind Linear Regression

Suppose a collection of data has two variables: one is the independent variable (X), and another is the dependent variable (Y).

If the relationship between Y and X can be expressed as

Y = mX + c

then modeling it is called linear regression. Here, X is scaled by a weight m to determine the value of Y, and c is the bias (or y-intercept), which offsets the relationship. A machine learning model has to determine the most suitable values for the weight m and the bias c. If there is more than one independent variable, there is a corresponding number of weights: w1, w2, w3, and so on.
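As a small illustration of the formula above, the following NumPy sketch evaluates Y for a single feature and then for several features, where each feature gets its own weight. The values of `m`, `c` and the weight vector `w` are made up for illustration and do not come from any dataset.

```python
import numpy as np

# Simple case: one feature, one weight m, and a bias c (illustrative values)
m, c = 2.0, 1.0
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = m * X + c
print(Y)  # [1. 3. 5. 7.]

# Multiple features: one weight per feature, combined as a weighted sum plus bias
w = np.array([2.0, -1.0, 0.5])            # w1, w2, w3
X_multi = np.array([[1.0, 2.0, 4.0],
                    [0.0, 1.0, 2.0]])     # two data points, three features each
Y_multi = X_multi @ w + c
print(Y_multi)  # [3. 1.]
```

The matrix product `X_multi @ w` computes w1\*x1 + w2\*x2 + w3\*x3 for every row at once, which is the form the multiple-feature case takes in practice.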

Typically, a machine learning problem involves a large amount of data. In an iterative training scheme, a linear regression model assigns random values to the weights and bias at the beginning. When learning commences, the model is fed one data point in each step. It applies the current weights to the X values and predicts the target. Since the weights are randomly assigned initially, the predicted target will differ greatly from the actual target. The model calculates the difference between the actual target value and the predicted target value, which is called the loss. The model then systematically adjusts the weights to reduce this loss. With each data point, the model iteratively moves toward weights that yield the minimum loss.
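The iterative procedure described above can be sketched with plain NumPy. This is a minimal gradient-descent illustration on synthetic data, not what SciKit-Learn's `LinearRegression` does internally (it solves the least-squares problem directly). The learning rate, the number of steps and the data itself are assumptions chosen for the example, and for simplicity each step uses the whole dataset rather than a single data point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data following y = 3x + 2 plus a little noise (assumed for illustration)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=200)

# Start from a random weight and bias
m, c = rng.normal(size=2)
learning_rate = 0.1

for step in range(1000):
    pred = m * x + c
    error = pred - y
    # Gradients of the mean squared error with respect to m and c
    grad_m = 2 * np.mean(error * x)
    grad_c = 2 * np.mean(error)
    # Move the parameters a small step against the gradient
    m -= learning_rate * grad_m
    c -= learning_rate * grad_c

print(f"m = {m:.2f}, c = {c:.2f}")  # close to the true values 3 and 2
```

The random starting values do not matter much here: the squared-error loss of a linear model has a single minimum, so the updates settle near the same weight and bias regardless of where they begin.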

The most commonly used losses are mean absolute error (MAE) and mean squared error (MSE). Mean absolute error is the mean of the absolute differences between predicted and actual target values over all data points. Mean squared error is the mean of the squared differences between predicted and actual target values over all data points. Linear regression employs mean squared error (MSE) as its loss function. When learning is finished, the loss value will be at its minimum. In other words, the predicted values will be as close as possible to the actual target values.
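The two loss definitions can be checked on a handful of numbers. The arrays below are made-up values for illustration, not results from the dataset used later in this article.

```python
import numpy as np

# Illustrative actual and predicted targets (not from the diabetes data)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_pred - y_true          # [-0.5, 0.0, 2.0, 1.0]

mae = np.mean(np.abs(errors))     # mean of absolute differences
mse = np.mean(errors ** 2)        # mean of squared differences

print(mae)  # 0.875
print(mse)  # 1.3125
```

Note how the single error of 2.0 contributes 4.0 to the squared sum: MSE penalizes large errors more heavily than MAE does.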

In the following sections, we build a better understanding with a practical problem and a hands-on Python implementation.

## Load a Regression Dataset

Import necessary libraries and modules.

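```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes
# needed later to compute the mean squared error
from sklearn.metrics import mean_squared_error
```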

Load a regression dataset from SciKit-Learn’s built-in datasets. The data is already preprocessed and normalized, and is ready to use.

```python
data = load_diabetes()
data.keys()
```

Generate features and target. Visualize the top 5 rows of the data.

```python
features = pd.DataFrame(data['data'], columns=data['feature_names'])
target = pd.Series(data['target'], name='target')
# show the first 5 rows of the features
features.head()
```

## Simple Linear Regression

Simple linear regression is performed with one dependent variable and one independent variable. In our data, we declare the feature ‘bmi’ to be the independent variable.

Prepare X and y.

```python
X = features['bmi'].values.reshape(-1,1)
y = target.values.reshape(-1,1)
```

Perform linear regression.

```python
simple = LinearRegression()
simple.fit(X,y)
```

The training is completed. We can explore the weight (coefficient) and bias (intercept) of the trained model.

`simple.coef_`

`simple.intercept_`

Calculate the predictions following the formula, y = intercept + X*coefficient.

`calc_pred = simple.intercept_ + (X*simple.coef_)`

Predictions can also be calculated using the trained model.

`pred = simple.predict(X)`

We can check whether the calculated predictions match the model’s predictions. `np.allclose` is used rather than `==` because floating-point results may differ in their last digits.

`np.allclose(calc_pred, pred)`

Plot the actual values and predicted values to get a better understanding.

```python
# plot actual values
plt.scatter(X,y, label='Actual')
# plot predicted values
plt.plot(X,pred, '-r', label='Prediction')
plt.xlabel('Feature X')
plt.ylabel('Target y')
plt.title('Simple Linear Regression', color='orange', size=14)
plt.legend()
plt.show()
```

The red line in the plot is the line found by SciKit-Learn’s `LinearRegression` class: the straight line with the smallest mean squared error on this data.
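For a single feature, the least-squares line has a closed-form solution, which is one way to see in what sense the fitted line is the "best" one. The sketch below uses synthetic data (an assumption for illustration, not the diabetes data) and checks the textbook slope and intercept formulas against NumPy's `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic single-feature data (assumed for illustration)
x = rng.normal(size=100)
y = 4.0 * x - 1.0 + rng.normal(0, 0.5, size=100)

# Closed-form least-squares estimates for slope and intercept
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# np.polyfit with degree 1 solves the same least-squares problem
poly_slope, poly_intercept = np.polyfit(x, y, deg=1)

print(np.allclose([slope, intercept], [poly_slope, poly_intercept]))  # True
```

Any other straight line through the same points gives a larger sum of squared errors, which is what "minimum error" means in this context.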

We can calculate the mean squared error value for the above regression using the following code.

`mean_squared_error(y, pred)`

The error value looks large because of the nature of the data. The plot shows that many different target values correspond to nearly the same feature value. The data is highly scattered and cannot be fit well by a straight line. However, we may still want to judge how good the fit is, and the MSE alone does not tell us: it is expressed in squared units of the target, so it cannot be compared across problems.

A metric named the coefficient of determination (CoD) is helpful in this case. CoD is the ratio of the regression sum of squares to the total sum of squares. The total sum of squares (SST) is the sum of squared deviations of each y value from the mean value of y. The regression sum of squares (SSR) is the difference between the total sum of squares and the sum of squared errors (SSE), so CoD = SSR/SST = 1 - SSE/SST. When there is no error (SSE = 0), CoD becomes unity. When the sum of squared errors equals the total sum of squares (SSE = SST), CoD becomes zero.

CoD = 1 refers to the best prediction

CoD = 0 refers to a prediction no better than always predicting the mean of y

For a least-squares fit evaluated on its own training data, CoD lies in the range [0, 1], which makes fits comparable. On held-out data it can even be negative, when the model does worse than predicting the mean. CoD is also called the R-squared value. It can be calculated using the following code.

`simple.score(X,y)`

With such highly scattered data and ‘bmi’ as the only feature, an R-squared value of 0.34 is the best fit that linear regression can achieve.
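To make the definition of the coefficient of determination concrete, the following sketch computes SST, SSE and SSR by hand for a small made-up dataset and derives the R-squared value from them. The arrays are illustrative assumptions; for a fitted SciKit-Learn regressor, the `score` method reports this same quantity.

```python
import numpy as np

# Small made-up dataset (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Least-squares line and its predictions
slope, intercept = np.polyfit(x, y, deg=1)
pred = slope * x + intercept

sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
sse = np.sum((y - pred) ** 2)       # sum of squared errors
ssr = sst - sse                     # regression sum of squares

r_squared = ssr / sst               # same as 1 - sse / sst
print(round(r_squared, 2))  # 0.6
```

Here SST is 6 and SSE is 2.4, so the line accounts for 3.6 of the 6 units of variation in y.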

## Multiple Linear Regression

Multiple linear regression is performed with more than one independent variable. We choose the following columns as our features.

`columns = ['age', 'bmi', 'bp', 's3', 's5']`

Let’s have a look at the data distribution by plotting it.

```python
# one scatter plot per feature against the target
for i in columns:
    plt.scatter(features[i], y)
    plt.xlabel(str(i))
    plt.show()
```

Each individual feature shows considerable scatter against the target. However, the variation in target values at a single value of one feature may be explained by the other features. In other words, a linear regression model with a single feature may fit the target poorly, yet a model with multiple features may fit better because it captures more of the true pattern in the data.
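The idea that a second feature can explain scatter left over by the first can be illustrated on synthetic data. In this sketch the target depends on two made-up features, `x1` and `x2`. Fitting with `x1` alone leaves the contribution of `x2` looking like noise, while fitting with both recovers most of the variation. The data and the helper `r_squared` are assumptions for illustration, and `np.linalg.lstsq` stands in for `LinearRegression`.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Target depends on two features (synthetic data, assumed for illustration)
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 + 3.0 * x2 + rng.normal(0, 0.5, size=n)

def r_squared(features, target):
    """Fit least squares with an intercept and return R-squared on the same data."""
    A = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    residuals = target - A @ coef
    return 1 - np.sum(residuals ** 2) / np.sum((target - target.mean()) ** 2)

r2_one = r_squared(x1, y)                         # x1 only
r2_two = r_squared(np.column_stack([x1, x2]), y)  # x1 and x2

print(f"one feature: {r2_one:.2f}, two features: {r2_two:.2f}")
```

On training data, adding a feature can never lower R-squared for a least-squares fit, which is one reason the article evaluates the multi-feature model on a separate validation set.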

In the simple linear regression implementation, we used all our data to fit the model. But how can we test our model? How well will it perform on unseen data? This is where the train-test split comes into play. We split our dataset into two sets: a training set and a validation set. We train our model with the training data only and evaluate it with the validation set.

Let’s split the dataset into training and validation sets.

```python
from sklearn.model_selection import train_test_split
X = features[columns]
# 80% training data, 20% validation data (test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=6)
```

Build a linear regression model and fit the data.

```python
multi = LinearRegression()
multi.fit(X_train, y_train)
```

What are the weights (coefficients) of our model? There should be five coefficients, one for each feature.

`multi.coef_`

What is the intercept (bias) of our model?

`multi.intercept_`

We have built and trained our model. Let’s predict the target values corresponding to the features in the validation data.

`pred = multi.predict(X_val)`

We can evaluate the model by calculating the error or R-squared value.

`mean_squared_error(y_val, pred)`

Calculate the R-squared value for both training set and validation set.

`multi.score(X_train, y_train), multi.score(X_val, y_val)`

With more features, the model’s performance improves over that of the single-feature model.

## Using statsmodels Library

We have used the SciKit-Learn library so far to perform linear regression. However, we can also use the statsmodels library to perform the same task. Fit the OLS (Ordinary Least Squares) model available in the statsmodels library on the training data.

```python
import statsmodels.api as sm
# add constant (intercept) manually, since OLS does not include one by default
X_train_const = sm.add_constant(X_train)
# fit training data
model = sm.OLS(y_train, X_train_const).fit()
model.summary()
```

It can be observed that the model weights, intercept and R-squared value are all identical to those of the `LinearRegression` class of the SciKit-Learn library.

The model can be used to make predictions on validation data too.

```python
# Constant (intercept) must be added manually to the validation features too
X_val_const = sm.add_constant(X_val)
preds = model.predict(X_val_const)
mean_squared_error(y_val, preds)
```

The errors are the same for both the methods!

This notebook contains the above code implementation.

## Wrapping Up

In this article, we have discussed machine learning, its classification, and categorization of supervised learning based on the nature of dependent variables. Further, we explored simple linear regression and multiple linear regression with examples using the SciKit-Learn library. We performed the same task with the statsmodels library and obtained the same results.
