Hands-on Linear Regression Using Sklearn

In this article, we will predict the rating of cereals with linear regression. Each column of the data set gives the exact figure for one ingredient, and the task is to predict the cereal rating from those figures. The link to the data set is given below.

We will prepare the data, fit a simple model to it, and then regularise the model to see how good it can become.

#import necessary libraries

import pandas as pd
import numpy as np

You can download the dataset from here.

It is advisable to read the description of the dataset before proceeding, as it will help you understand the problem better.

Extract the data and pass the path of the csv file to read_csv.

df=pd.read_csv(r'D:\Data Sets\cereal.csv') #reading the file; the raw string keeps the backslashes literal
df.head() #for printing the first five rows of the dataset

Since the rating column holds continuous values, this is a regression problem.

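#dropping the name column, which only identifies the cereal
data=df.drop(['name'],axis=1)
#to see if there's any missing data
data.isnull().sum() #no missing values
#encoding the data
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
#label encoding the first two columns, which hold text categories
for col in data.columns[:2]:
  data[col]=le.fit_transform(data[col])
#features (every column but the last) and target (rating, assumed to be the last column)
x=data.iloc[:,:-1].values
y=data.iloc[:,-1].values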
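
To make the encoding step concrete, here is a minimal sketch of what LabelEncoder does to a single text column. The values in mfr are hypothetical and used only for illustration; they are not read from the cereal file.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values, for illustration only
mfr = ["K", "G", "N", "K"]

le = LabelEncoder()
codes = le.fit_transform(mfr)

# classes_ holds the sorted unique labels; each label is replaced by its position
print(le.classes_.tolist())  # ['G', 'K', 'N']
print(codes.tolist())        # [1, 0, 2, 1]
```

Label encoding assigns integers in sorted order, so the numbers imply an ordering that the categories may not really have. For a linear model, one-hot encoding is a common alternative worth trying.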

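from scipy.stats import pearsonr
#Pearson correlation of every feature column with the rating
corelation=[]
for i in range(len(data.columns)-1):
  col_x=x[:,i].astype(float) #pearsonr needs numeric input
  col_y=y
  corr,_=pearsonr(col_x,col_y)
  corelation.append(corr)
  print(corr)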

Next, we take the index of every column whose correlation is greater than 0.29 or less than -0.29.

If you are not familiar with correlation, you can read about it here.
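
The thresholding step described above can be sketched on synthetic data. This is an illustration under stated assumptions, not the article's original code: the arrays features and target are made up, and only the 0.29 cut-off comes from the text.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data: columns a and c drive the target, column b is pure noise
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = rng.normal(size=n)
features = np.column_stack([a, b, c])
target = 2 * a - 2 * c + 0.1 * rng.normal(size=n)

threshold = 0.29  # cut-off taken from the article
selected = []
for i in range(features.shape[1]):
    corr, _ = pearsonr(features[:, i], target)
    # keep the index when the correlation is above 0.29 or below -0.29
    if abs(corr) > threshold:
        selected.append(i)

print(selected)
```

Using abs(corr) covers both the positive and the negative side of the condition in a single comparison.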

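#index of every column whose correlation is above 0.29 or below -0.29
index=[i for i,corr in enumerate(corelation) if abs(corr)>0.29]
drop_col=[]
#collecting the names of the columns whose index meets the given condition
for i in index:
  #print(data.columns[i])
  drop_col.append(data.columns[i])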

Now the independent variables.

x=data.iloc[:,:-1].values
#Splitting the dataset
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)

Here the test size is 0.2 and the train size is 0.8.

from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train,y_train)
regressor.score(x_test,y_test) #no regularization, returns R^2 on the test set

Output

0.9943613024056396

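This score is very high. A high test score on its own does not prove that the model is overfitted, but we will still regularize the model to see how the score changes.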
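
One common way to check for overfitting is to compare the training score with the test score and with cross-validated scores. The sketch below is an assumption-laden illustration: it uses synthetic data from make_regression rather than the cereal file, and the names X, train_r2, test_r2 and cv_r2 are introduced here only for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, nearly linear data standing in for the real features and ratings
X, y = make_regression(n_samples=80, n_features=12, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)  # R^2 on the data the model saw
test_r2 = model.score(X_test, y_test)     # R^2 on held-out data

# 5-fold cross-validation gives several held-out scores instead of one
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

print(train_r2, test_r2, cv_r2.mean())
```

A training score far above the test or cross-validated scores would point to overfitting. When all of them are close and high, the target may simply be close to a linear function of the features.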

You can read about regularisation here.

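y_pred=regressor.predict(x_test)
#regularizing the linear model
#Ridge no longer accepts normalize=True (removed in scikit-learn 1.2), so the features are scaled in a pipeline
#StandardScaler is not an exact equivalent of normalize=True, so the scores may differ from the original run
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
ridge_reg_1=make_pipeline(StandardScaler(),Ridge(alpha=1))
ridge_reg_1.fit(x_train,y_train)
ridge_reg_1.score(x_test,y_test)   #alpha =1
ridge_reg_05=make_pipeline(StandardScaler(),Ridge(alpha=0.5))
ridge_reg_05.fit(x_train,y_train)
ridge_reg_05.score(x_test,y_test)   #alpha =0.5
ridge_reg_2=make_pipeline(StandardScaler(),Ridge(alpha=2))
ridge_reg_2.fit(x_train,y_train)
ridge_reg_2.score(x_test,y_test)    #alpha =2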
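
The comparison of alpha values can also be written as a loop. This is a minimal sketch on synthetic data from make_regression, not the cereal data, and it scales the features in a pipeline because recent scikit-learn releases do not accept the normalize argument of Ridge. The scores dictionary is a name introduced only for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the real features and ratings
X, y = make_regression(n_samples=80, n_features=12, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scores = {}
for alpha in (0.5, 1, 2):
    # scale the features, then fit a ridge model with the given penalty
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    scores[alpha] = model.score(X_test, y_test)  # R^2 on the test set

for alpha, r2 in scores.items():
    print(alpha, round(r2, 4))
```

A larger alpha shrinks the coefficients more strongly, so the test score can go up or down depending on how noisy the data is.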

Conclusion

This article discussed the problem of predicting cereal ratings. We prepared the data for training by label encoding the text columns, trained a linear regression model on it, and then regularised the model. To understand the problem better, you can also try different algorithms on it, which may give you better results as well as a better understanding.

Hope you liked the article.

Bhavishya Pandit

Understanding and building fathomable approaches to problem statements is what I like the most. I love talking about conversations whose main plot is machine learning, computer vision, deep learning, data analysis and visualization. Apart from them, my interest also lies in listening to business podcasts, use cases and reading self help books.