Model Selection With K-fold Cross Validation — A Walkthrough with MachineHack’s Food Cost Prediction Hackathon

A Data Scientist goes through a lot of data that needs to be cleaned, pre-processed, modelled and visualised. These processes may not be as simple as they sound. Of all of them, at least some Data Science experts may agree that modelling is one of the easiest to get around, as long as you have built a model before or have a template to implement one easily.

There are lots of models available today, easy to implement thanks to extensive library support: anything from simple to complex mathematical calculations is performed by just calling a function. Gradient Boosting, Logistic Regression, SVMs: they are all there to help you, just call them. On the bright side, we have lots of options and no need to write complex code for the mathematics. But look again, with a big problem statement and a bigger dataset in hand, and you will see that more choices mean more headaches.

The choice of a model can be narrowed down to some extent by understanding the problem and the data you are presented with. But which model to choose, and which one will give the best result, is a Data Scientist’s most frequently self-asked question, and high variance in predictions from the same model is a Data Scientist’s worst nightmare.

This article is a direct dive into the implementation of K-Fold Cross-Validation and hence, readers are expected to have a basic idea about how K-Fold Cross Validation works. Use this free guide to understand K-Fold Cross-Validation.

K-Fold Cross Validation With MachineHack’s Food Cost Prediction Hackathon

If you are beginning your Data Science journey and find yourself asking that question early on, this article will help you a great deal. In this article, we will use K-Fold Cross-Validation to find out which model best fits a given data set and is most likely to score well on data it has not seen.

Where to get the Data Sets?

Head to MachineHack’s Predicting Restaurant Food Cost Hackathon by clicking here. Sign up and start the course. You will find the data set as PARTICIPANTS_DATA_FINAL in the Attachments section.

Having trouble finding the data set? Click here to go through the tutorial to help yourself.

Let us begin by stating a simple definition for K-Fold Cross Validation:

K-Fold Cross Validation involves splitting a limited dataset into K folds (samples), training a specific model on (K – 1) of the folds and testing it on the remaining one. This is repeated K times so that each fold is used for validation exactly once. For example, if K = 10, then in the first round the first fold is reserved for validating the model after it has been fitted with the remaining (10 – 1) = 9 folds.
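The definition above can be sketched in a few lines of plain Python. This is only an illustration of how the folds rotate, not the splitter scikit-learn uses internally. The helper name k_fold_indices and the toy size of 20 samples are assumptions made for the example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k rounds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# 20 samples, K = 10: each round holds out 2 samples and trains on the other 18
splits = list(k_fold_indices(n_samples=20, k=10))
for round_number, (train, validation) in enumerate(splits, start=1):
    print(f"Round {round_number}: validate on {validation}, train on {len(train)} samples")
```

Every sample lands in the validation set exactly once, which is why the K scores together describe performance over the whole training set.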

Let’s Code!

Loading And Cleaning the Data:

Since there is already a tutorial on solving the hackathon, I will jump directly to implementing the K-Fold Cross-validation.

Click here to check out the tutorial for cleaning and preprocessing the data. Follow the tutorial up to and including the Data Preprocessing stage.

Choosing The Right Model With K-Fold Cross Validation

After the Data Preprocessing Stage, the data is now ready to be fitted to a model, but which one?

We will pick three algorithms and employ K-Fold Cross Validation to determine which one is the best.

1. XGBoost

We will use the xgboost library. Import the XGBRegressor and fit the training data – X_train and Y_train.

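from xgboost import XGBRegressor
# Create the regressor with default hyperparameters and fit it on the training data
xgbr = XGBRegressor()
xgbr.fit(X_train, Y_train)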

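The model is now fitted with the data. All we need to do is perform cross-validation to estimate the average score we can expect from the xgbr model on different test sets. (cross_val_score fits a fresh copy of the model on each fold, so the fit above is not strictly required for this step.)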

The block below uses the cross_val_score function from scikit-learn’s model_selection module for K-Fold Cross-Validation.

from sklearn.model_selection import cross_val_score
XGB_accuracies = cross_val_score(estimator = xgbr, X = X_train, y = Y_train, cv = 10)
print("Mean_XGB_Acc : ", XGB_accuracies.mean())

cross_val_score takes the model to be validated (xgbr), X_train, Y_train and a parameter cv as arguments. cv = 10 means 10-fold cross validation: 10 folds are created and each is used once for validation. The function returns an array with the score the model achieved on each of the 10 folds. Since no scoring argument is passed, the score is the estimator’s default, which for a regressor such as XGBRegressor is the R² coefficient of determination rather than a classification accuracy.


Mean_XGB_Acc :  0.6974719315431506
This implies that the XGBRegressor achieves an average score of about 0.70 across the 10 validation folds. You can also find the standard deviation of the fold scores by calling XGB_accuracies.std().
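To make concrete what the array returned by cross_val_score contains, here is a minimal sketch that reproduces it with an explicit loop over the folds and then summarises it with the mean and standard deviation. It runs on synthetic data from make_regression with a LinearRegression model, both stand-ins chosen for the example so that it is quick and self-contained. X_demo, y_demo and the other variable names are assumptions, not part of the hackathon code.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X_train / Y_train (not the hackathon data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = LinearRegression()

# What cross_val_score reports for a regressor with cv=10
cv_scores = cross_val_score(estimator=model, X=X_demo, y=y_demo, cv=10)

# The same thing by hand: fit a fresh copy on 9 folds, score (R^2) on the held-out fold
manual_scores = []
for train_idx, val_idx in KFold(n_splits=10).split(X_demo):
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

print("Same scores as the manual loop:", np.allclose(cv_scores, manual_scores))
print("Mean score:", cv_scores.mean())
print("Std of scores:", cv_scores.std())
```

A small standard deviation means the model behaves similarly on every fold. A large one is a sign of the high variance mentioned at the start of the article. To rank models by an error metric instead of R², cross_val_score also accepts a scoring argument, for example scoring="neg_mean_squared_error".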


2. Random Forest

from sklearn.ensemble import RandomForestRegressor
# Random forest with 307 trees; random_state fixes the seed for reproducibility
rf = RandomForestRegressor(n_estimators=307, random_state=1)
rf.fit(X_train, Y_train)

from sklearn.model_selection import cross_val_score
RF_accuracies = cross_val_score(estimator = rf, X = X_train, y = Y_train, cv = 10)
print("Mean_RF_Acc : ", RF_accuracies.mean())


Mean_RF_Acc :  0.7129263668673727

3.  Gradient Boosting Regressor

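from sklearn.ensemble import GradientBoostingRegressor
# Gradient boosting with Huber loss, which is less sensitive to outliers than squared error
gbr = GradientBoostingRegressor(loss='huber', learning_rate=0.07, n_estimators=350, max_depth=6, subsample=1.0, verbose=0)
gbr.fit(X_train, Y_train)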

from sklearn.model_selection import cross_val_score
GB_accuracies = cross_val_score(estimator = gbr, X = X_train, y = Y_train, cv = 10)
print("Mean_GB_Acc : ", GB_accuracies.mean())


Mean_GB_Acc :  0.7256236594127805

The Better Model

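Comparing the outputs of the 3 models, the GradientBoostingRegressor has the highest mean cross-validation score (about 0.726, against 0.713 for Random Forest and 0.697 for XGBoost), so it is the most likely of the three to give better predictions on unseen data. The margins are small, though, so it is worth checking the spread of the fold scores before settling on a model.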
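The comparison can also be run in a single loop so that every model is scored on exactly the same folds. The sketch below is one possible way to do it, not the code used to produce the numbers above. It uses synthetic data and deliberately small ensembles so that it runs quickly, and it leaves XGBoost out to avoid the extra dependency. X_demo, y_demo, candidates and results are names invented for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X_train / Y_train (not the hackathon data)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=1)

# One shared, shuffled 10-fold split so every model sees identical folds
folds = KFold(n_splits=10, shuffle=True, random_state=1)

# Small ensembles keep the demo fast; swap in the article's settings for real use
candidates = {
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=1),
    "GradientBoosting": GradientBoostingRegressor(n_estimators=50, random_state=1),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(estimator=model, X=X_demo, y=y_demo, cv=folds)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean R^2 = {scores.mean():.4f} (+/- {scores.std():.4f})")

best_model = max(results, key=lambda name: results[name][0])
print("Highest mean score:", best_model)
```

Passing the same KFold object as cv removes one source of noise from the comparison. Printing the standard deviation next to the mean shows whether a gap between two models is larger than their fold-to-fold variation.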

Also, use the following links to our top tutorials to help you with MachineHack’s hackathons:

  1. Flight Ticket Price Prediction Hackathon: Use These Resources To Crack Our MachineHack Data Science Challenge
  2. Hands-on Tutorial On Data Pre-processing In Python
  3. Data Preprocessing With R: Hands-On Tutorial
  4. Getting started with Linear regression Models in R
  5. How To Create Your first Artificial Neural Network In Python
  6. Getting started with Non Linear regression Models in R
  7. Beginners Guide To Creating Artificial Neural Networks In R
Amal Nair
A Computer Science Engineer turned Data Scientist who is passionate about AI and all related technologies. Contact:
