In today’s article, we will look at how to predict the rating of cereals. The problem statement is to predict each cereal’s rating from columns that give the exact amounts of its ingredients. A link to the dataset is given below.
We will prepare the data, fit a simple model to it, and then regularise the model to see how good it can become.
#import necessary libraries
import pandas as pd
import numpy as np
Now you can download the dataset from here.
It is advisable to read the description of the dataset before proceeding, as it will help you understand the problem better.
Extract the data and pass the path of the CSV file to read_csv.
#reading the file (raw string, so the backslashes in the path are not treated as escapes)
df=pd.read_csv(r'D:\Data Sets\cereal.csv')
#for printing the first five rows of the dataset
df.head()
Output
Since the rating column holds continuous values, this is a regression problem.
#dropping the column that is redundant
data=df.drop(['name'],axis=1)
#to see if there’s any missing data
data.isnull().sum() #no missing values
#encoding the data
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
#label encoding the first two columns
for i in range(2):
    col=data.columns[i]
    data[col]=le.fit_transform(data[col])
#separating the features and the target (rating is the last column)
x=data.iloc[:,:-1].values
y=data.iloc[:,-1].values
Output
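Label encoding replaces each distinct category in a column with an integer. The following is a minimal sketch on a made-up frame. The column names `mfr` and `type` and all values in it are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; column names and values are made up for illustration
toy = pd.DataFrame({"mfr": ["K", "G", "K", "N"], "type": ["C", "C", "H", "C"]})

encoded = toy.copy()
for col in ["mfr", "type"]:
    # A fresh encoder per column, so each column gets its own mapping
    encoded[col] = LabelEncoder().fit_transform(toy[col])

print(encoded)
```

LabelEncoder assigns the integers in sorted order of the categories, so the numbers carry no meaning beyond identifying the category.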
from scipy.stats import pearsonr
corelation=[]
for i in range(len(data.columns)-1):
    col_x=x[:,i]
    col_y=y
    corr,_=pearsonr(col_x,col_y)
    corelation.append(corr)
    print(corr)
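To make clear what pearsonr returns, here is a small sketch that compares it with the Pearson formula written out by hand. The two arrays are made-up values, used only for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up sample values, for illustration only
col_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
col_y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Correlation coefficient as computed by SciPy (the p-value is ignored)
corr, _ = pearsonr(col_x, col_y)

# The same coefficient from the definition: covariance over the product of spreads
dx = col_x - col_x.mean()
dy = col_y - col_y.mean()
manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

print(corr, manual)
```

A value close to 1 or -1 means a strong linear relationship, while a value close to 0 means a weak one.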
We take the index values of the columns whose correlation is greater than 0.29 or less than -0.29.
If you are not familiar with correlation, you can read about it here.
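The article does not show how the index values are collected, so the following is only one possible sketch of that step. The correlation values in it are made up, and the list name `index` is chosen to match the name used in the next snippet.

```python
# Hypothetical correlation values, one per feature column (made up)
corelation = [0.69, -0.05, 0.47, -0.41, 0.12, -0.76]
threshold = 0.29

# Positions whose correlation is greater than 0.29 or less than -0.29
index = [i for i, c in enumerate(corelation) if c > threshold or c < -threshold]

print(index)  # → [0, 2, 3, 5]
```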
drop_col=[]
#collecting the names of the columns whose index meets the given condition
for i in index:
    #print(data.columns[i])
    drop_col.append(data.columns[i])
Now we take the independent variables.
x=data.iloc[:,:-1].values
#Splitting the dataset
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2)
Here the test size is 0.2 and the train size is 0.8.
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train,y_train)
regressor.score(x_test,y_test) #no regularization
Output
0.9943613024056396
This score is very high, which may hint at overfitting, so we will regularize the model and compare.
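A high score on the test set does not by itself show overfitting. One common check is to compare the score on the training set with the score on the test set: a large gap between them is the usual warning sign. The sketch below does this on synthetic data, not on the cereal dataset, and the `random_state=0` argument is an assumption added only to make the split reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the target is a linear function of the features plus small noise
rng = np.random.default_rng(0)
x = rng.normal(size=(80, 5))
y = x @ np.array([3.0, -2.0, 0.5, 1.0, 4.0]) + rng.normal(scale=0.1, size=80)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

regressor = LinearRegression().fit(x_train, y_train)
train_score = regressor.score(x_train, y_train)  # R^2 on the data the model has seen
test_score = regressor.score(x_test, y_test)     # R^2 on held-out data

print(train_score, test_score)
```

When the target really is close to a linear function of the features, both scores come out high and close to each other, which is not overfitting.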
You can read about regularisation here.
y_pred=regressor.predict(x_test)
#regularizing the linear model
#note: the normalize argument was removed in scikit-learn 1.2, so this code needs an older version
from sklearn.linear_model import Ridge
ridge_reg_1=Ridge(alpha=1,normalize=True)
ridge_reg_1.fit(x_train,y_train)
ridge_reg_1.score(x_test,y_test) #alpha =1
ridge_reg_05=Ridge(alpha=0.5,normalize=True)
ridge_reg_05.fit(x_train,y_train)
ridge_reg_05.score(x_test,y_test) #alpha =0.5
ridge_reg_2=Ridge(alpha=2,normalize=True)
ridge_reg_2.fit(x_train,y_train)
ridge_reg_2.score(x_test,y_test) #alpha =2
Output
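Recent scikit-learn releases no longer accept the normalize argument of Ridge. The sketch below shows one possible replacement: scaling the features with StandardScaler in a pipeline and looping over the same three alpha values. Note that StandardScaler is a swapped-in technique and does not scale the features in exactly the way normalize=True did, so the scores will not match the ones above. The data is synthetic, not the cereal dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: the target is a linear function of the features plus small noise
rng = np.random.default_rng(0)
x = rng.normal(size=(80, 5))
y = x @ np.array([3.0, -2.0, 0.5, 1.0, 4.0]) + rng.normal(scale=0.1, size=80)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

scores = {}
for alpha in (0.5, 1, 2):
    # Scale the features first, then fit ridge regression with the given penalty
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    model.fit(x_train, y_train)
    scores[alpha] = model.score(x_test, y_test)

print(scores)
```

Keeping the scaler inside the pipeline means it is fitted on the training data only, so no information from the test set leaks into the model.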
Conclusion
This article discussed the problem of predicting cereal ratings. We prepared the data for training by label encoding the first two columns, trained a linear regression model on it, and then regularised that model. To understand the problem better, you can also try different algorithms on it; this may give you better results as well as a deeper understanding.
Hope you liked the article.