
How To Code Linear Regression Models With R


Regression is one of the most common problems in data science, and it finds wide application in artificial intelligence and machine learning. Regression techniques are used in machine learning to predict continuous values, for example salaries, ages or even profits. Linear regression is the type of regression in which the relationship between the dependent and independent factors can be represented in a linear fashion.

In this article, we will tailor a template for three commonly used linear regression models in ML:

  • Simple Linear Regression
  • Multiple Linear Regression
  • Support Vector Machine Regression


Simple Linear Regression

Simple linear regression is the simplest regression model of all. The model is used when there are only two factors, one dependent and one independent.

The model can predict the salary of an employee from his/her age or experience. Given a dataset consisting of two columns, age or experience in years and salary, the model can be trained to formulate a relationship between the two factors. Based on the derived formula, the model will be able to predict salaries for any given age or experience.
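Before fitting any model, the dataset is usually split into a training set and a test set. The original tutorial does not show this step, so here is a minimal sketch using base R, with the built-in `cars` dataset (speed vs. stopping distance) standing in for an age/salary table:

```r
# Hypothetical train/test split; 'cars' stands in for an age/salary dataset.
set.seed(123)                                  # make the split reproducible
split <- sample(nrow(cars), 0.8 * nrow(cars))  # indices of 80% of the rows
training_set <- cars[split, ]                  # 40 of 50 rows for training
test_set     <- cars[-split, ]                 # remaining 10 rows held out
```

Packages such as caTools offer `sample.split` for the same purpose; base R's `sample` keeps the sketch dependency-free.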

Here’s The Code:

Simple linear regression is handled by the built-in function lm in R.

Creating the Linear Regression Model and fitting it with the training set

regressor = lm(formula = Y ~ X, data = training_set)

This line creates a regressor and provides it with the data set to train.

*   formula : used to differentiate the independent variable(s) from the dependent variable. In the case of multiple independent variables, the variables are appended using the ‘+’ symbol, e.g. Y ~ X1 + X2 + X3 + …

*   X : independent variable or factor. The column label is specified.

*   Y : dependent variable. The column label is specified.

*   data : the data the model trains on, training_set.

Predicting the values for test set

predicted_Y = predict(regressor, newdata = test_set)

This line predicts the values of the dependent factor for new values of the independent factor.

*   regressor : the regressor model that was previously created for training.

*   newdata : the new set of observations that you want to predict Y for.
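To make the fit/predict pair concrete, here is the same pattern run end to end on R's built-in `cars` dataset (stopping distance as a function of speed); the dataset choice is only an illustration, not part of the original tutorial:

```r
# Fit Y ~ X on the built-in 'cars' data: dist (dependent) vs. speed (independent)
regressor <- lm(formula = dist ~ speed, data = cars)
new_obs   <- data.frame(speed = c(10, 20))          # new X values to score
predicted <- predict(regressor, newdata = new_obs)  # predicted stopping distances
```

The fitted slope is positive, so the prediction for speed 20 comes out larger than the one for speed 10.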

Visualizing training set predictions

install.packages('ggplot2')  # install once
library(ggplot2)             # importing the library
ggplot() +
  geom_point(aes(x = training_set$X, y = training_set$Y), colour = 'black') +
  geom_line(aes(x = training_set$X, y = predict(regressor, newdata = training_set)), colour = 'red') +
  ggtitle('Y vs X (Training Set)') +
  xlab('X') +
  ylab('Y')

Visualizing test set predictions

ggplot() +
  geom_point(aes(x = test_set$X, y = test_set$Y), colour = 'blue') +
  geom_line(aes(x = training_set$X, y = predict(regressor, newdata = training_set)), colour = 'red') +
  ggtitle('Y vs X (Test Set)') +
  xlab('X') +
  ylab('Y')

These two blocks of code plot the datasets. The ggplot2 library is used for plotting the data points and the regression line; note that every layer, including the axis labels, must be chained with ‘+’.

The first block plots the training_set predictions and the second block the test_set predictions.

*   geom_point() : scatter-plots all data points on a two-dimensional graph

*   geom_line() : draws the regression line on the graph

*   ggtitle() : assigns the title of the graph

*   xlab() : labels the X-axis

*   ylab() : labels the Y-axis

Replace all X and Y with the independent and dependent factors (column labels), respectively.


Multiple Linear Regression

Multiple linear regression is another simple regression model, used when more than one independent factor is involved. Unlike simple linear regression, several independent factors contribute to the dependent factor. It is used to explain the relationship between one continuous dependent variable and two or more independent variables. The independent variables can be continuous or categorical (dummy variables).

Unlike simple linear regression, where we only had one independent variable, having more independent variables leads to another challenge: identifying the ones that show the strongest correlation to the dependent variable. Backward elimination is one method that can help us identify the independent variables with the strongest relation to the dependent variable. In this method, a significance level is chosen, most commonly 0.05. The regressor model returns a p-value for each independent factor/variable. The variable with the highest p-value above the chosen significance level is removed and the p-values are recomputed. The process is iterated until only significant factors remain.

This model can be used to predict the salary of an employee against multiple factors like experience, employee_score etc.

Here’s The Code:

Multiple linear regression is also handled by the function lm.

Creating the Multiple Linear Regressor and fitting it with Training Set

regressor = lm(formula = Y ~ ., data = training_set)

The expression ‘Y ~ .’ takes all variables except Y in the training_set as independent variables.
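For illustration (using R's built-in `mtcars` dataset rather than the article's data), the same shorthand regresses `mpg` on every other column:

```r
# 'mpg ~ .' uses all remaining mtcars columns as independent variables
regressor <- lm(formula = mpg ~ ., data = mtcars)
n_coef <- length(coef(regressor))  # intercept + 10 predictors = 11 coefficients
```

mtcars has 11 columns, so taking `mpg` as the dependent variable leaves 10 predictors, plus the intercept.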

Predicting the values for test set

predicted_Y = predict(regressor, newdata = test_set)

Using Backward Elimination to Find the most significant Factors

backwardElimination <- function(x, sl) {
  numVars = length(x)
  for (i in c(1:numVars)) {
    regressor = lm(formula = Y ~ ., data = x)
    maxVar = max(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"])
    if (maxVar > sl) {
      j = which(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"] == maxVar)
      x = x[, -j]
    }
    numVars = numVars - 1
  }
  return(summary(regressor))
}

SL = 0.05
dataset = dataset[, c(indexes of independent factors separated by a comma)]
backwardElimination(dataset, SL)

This block identifies the most significant independent factors using the backward elimination method. The independent variable with the highest p-value above the chosen significance level is removed iteratively until only significant variables remain.
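A single elimination step can also be run by hand. As a sketch on the built-in `mtcars` data (not the article's dataset), the p-values are pulled from the fitted model and the weakest predictor is identified:

```r
# One iteration of backward elimination, sketched on mtcars
regressor <- lm(mpg ~ wt + hp + drat, data = mtcars)
pvals <- coef(summary(regressor))[-1, "Pr(>|t|)"]  # p-values, intercept dropped
worst <- names(which.max(pvals))                   # candidate for removal
# if pvals[worst] > 0.05, refit without that column and repeat
```

Repeating this fit/inspect/drop loop is exactly what the `backwardElimination` function above automates.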

Support Vector Regression

Support Vector Regression (SVR) is an adaptation of the Support Vector Machine (SVM), which is a classification model. Unlike an SVM used for predicting discrete categories, SVR applies the same principle to predict continuous values.

Here’s The Code:

The package e1071 handles Support Vector Regression in R.

Installing and Importing the Library

install.packages('e1071') #install once
library(e1071) #importing the library

Creating the Support Vector Regressor and fitting it with Training Set

svr_regressor = svm(formula = Y ~ ., data = training_set, type = 'eps-regression')

This line creates a Support Vector Regressor and provides the data to train.

*   type : the model type. ‘eps-regression’ tells svm() to perform (epsilon-insensitive) regression rather than classification

Predicting the values for test set

predicted_Y = predict(svr_regressor, newdata = test_set)
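Whichever of the three models is used, a simple way to judge the predictions is root-mean-square error (RMSE) on held-out data. This evaluation step is not in the original tutorial; here is a minimal sketch with the simple linear model on the built-in `cars` data:

```r
# RMSE of a simple linear model on the built-in 'cars' dataset
regressor <- lm(dist ~ speed, data = cars)
predicted <- predict(regressor, newdata = cars)
rmse <- sqrt(mean((cars$dist - predicted)^2))  # lower means a closer fit
```

Comparing RMSE across the simple, multiple, and SVR models on the same test set gives a quick, like-for-like ranking.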

Outlook

The R programming language has been gaining popularity in the ever-growing field of AI and machine learning. The language has extensive libraries and packages tailored to solve real-world problems and has thus proven to be as good as its competitor Python. Linear regression models are the perfect starter pack for machine learning enthusiasts. This tutorial gave you a template for creating the three most common linear regression models in R that you can apply to any regression dataset.


Amal Nair

A Computer Science Engineer turned Data Scientist who is passionate about AI and all related technologies. Contact: amal.nair@analyticsindiamag.com