Data Science has opened up a myriad of opportunities in the past couple of years. It quickly topped the list of most wanted jobs and has seen the younger generation flocking to courses and jobs. However, unlike most domains, data science is a field where an individual needs a particular set of skills. From knowledge of linear algebra to storytelling, from programming to business case studies, the role of a data scientist ranges between that of a statistician and that of an algorithm developer.
No matter how much domain knowledge one has gained, it eventually comes down to the hours of practice put into mastering data handling. This is usually achieved through workshops and hackathons.
Analytics India Magazine provides one such opportunity to beginners and to those looking towards a career transition, through its very relevant hackathons. Participants get to compete with high-level data science aspirants as well as practitioners. Not to forget the exciting prizes one gets to win!
“I was more interested in applied after gaining the theoretical knowledge but the lectures limited themselves to theory,”– Abhishek Thakur, world’s first Kaggle triple grandmaster
These hackathons cover a wide variety of domains from the classic Regression and Classification problems to Natural Language Processing and Image Classification and so on.
Data Science principles are applied to Finance, Entertainment, Healthcare, Defence, Communications, E-commerce, Business and many more.
Today, we are going to pick one from the very popular use case of food delivery. India has witnessed the rise of food delivery services in the form of Zomato and Swiggy. This multi-billion dollar industry keeps on benefiting with every optimisation.
In our latest hackathon “Predicting Food Delivery Time – Hackathon by IMS Proschool”, we challenged the aspirants and developers to bring out their best algorithms from their armoury.
In this article, we shall take the reader through the nuts and bolts of solving a data science problem in a stepwise fashion.
So how can anyone who wishes to pursue the path of a Data Scientist start the journey? Practising is the answer and hackathons are the solution. That is why we at MachineHack provide the young generation with an opportunity to apply everything they have learned to all kinds of problems.
We help thousands of students and expert Data Science practitioners by giving them an opportunity to sharpen their Data Science skills.
Solving Your First Ever Data Science Hackathon
When was the last time you ordered food online? And how long did it take to reach you?
In this hackathon, we are provided with data from thousands of restaurants in India and the time they take to deliver food for online orders. As data scientists, our goal is to predict the online order delivery time for the given test data based on the given factors.
In this tutorial, we will try to crack MachineHack’s latest Hackathon called ‘Predicting Food Delivery Time – Hackathon by IMS Proschool’.
The Hackathon is brought to you by IMS Proschool and MachineHack.
IMS, since 1977, has worked towards building a long-term successful career for its students. It emerged as the fourth most trusted education brand in an AC Nielsen and Brand Equity Survey. IMS Proschool is the extension of the same mission. Proschool helps individuals realize their potential by mentoring and imparting skills.
Problem Statement: Predict the delivery time for a restaurant based on the restaurant, its location, the cuisines etc. using the given data.
Downloading The Data
To download the data, head to www.MachineHack.com and sign up. Click on the Hackathons tab to go to the hackathons page.
Select ‘Predicting Food Delivery Time – Hackathon by IMS Proschool’ and start the course to download the datasets at the “Hackathon Dashboard”. See the image below.
Let us have a look at the features:
- Restaurant: A unique ID that represents a restaurant.
- Location: The location of the restaurant.
- Cuisines: The cuisines offered by the restaurant.
- Average_Cost: The average cost for one person/order.
- Minimum_Order: The minimum order amount.
- Rating: Customer rating for the restaurant.
- Votes: The total number of customer votes for the restaurant.
- Reviews: The number of customer reviews for the restaurant.
- Delivery_Time: The order delivery time of the restaurant. (Target Classes)
Size of training set: 11,094 records
Size of test set: 2,774 records
Understanding The Problem
If AI is the future, then data is the fuel of tomorrow. Without data, AI is only as good as a plain old ‘if-else’ statement. Within a large amount of data lies hidden, highly useful information that can determine an organization’s growth or warn us about an upcoming calamity.
Companies like Uber Eats, Swiggy, Zomato and many others in the food industry have been using data to optimize decisions such as which restaurant to recommend and what kinds of dishes to suggest.
In this hackathon, we need not optimize anything; the task is much simpler. Our objective is just to predict the delivery time for a restaurant based on the restaurant, its location, the cuisines it offers, the average cost of food, minimum order cost, the rating, the customer votes and reviews.
To approach any problem we must plan properly. We will create a pipeline and break down the solution into four simple stages:
- Exploring the data and its features
- Data Cleaning
- Data Preprocessing
- Modeling and Predicting
The following example was done on Google Colab, a very helpful tool by Google that is built for data scientists. It is similar to a Jupyter Notebook but is hosted and powered by Google. It has its own kernel, all code is executed in Google’s own cloud, and it supports GPUs.
To mount your Google Drive to access files in the Drive from Colab, execute the following piece of code in Colab before you begin:
from google.colab import drive
drive.mount("/GD")  # mount point matches the "/GD/My Drive/..." path used to load the data
Upload the participant’s data downloaded from MachineHack into your Google Drive directory.
To load the dataset, use the mounted directory followed by the path to the files in your Google Drive.
See the example below:
import pandas as pd
train = pd.read_excel("/GD/My Drive/Colab Notebooks/Food_delivery_Time_Prediction/DataSets/Data_Train.xlsx")
Exploring The Data
We already know what the data is about, the features that come with it and the objective, which is to predict the delivery time.
But we should still explore the data. This is a critical step in Data Science. Without exploration, the cleaning and preprocessing stages will become a chaos factory.
Looking at the data:
To explore the dataset thoroughly we will try to find answers to the following questions.
- What type of data does each column have?
- Does the table contain any missing or null values?
- Does any column contain multiple pieces of data that can be used to generate new features?
- Can new features be deduced from the existing columns?
- What are the categorical variables that need to be encoded?
- Does any column contain values that are irrelevant or have no significance to the context?
Although the approach may vary, these questions form the baseline for solving any problem.
Listed below are some very useful code snippets that can help us understand the data:
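A minimal sketch of such exploration calls is given here. It assumes a tiny, made-up DataFrame that only mimics the hackathon's column names; with the real file, the same calls would be run on the `train` DataFrame loaded earlier.

```python
import pandas as pd

# Made-up sample rows that mimic the hackathon columns (values are illustrative only)
train = pd.DataFrame({
    "Restaurant": ["ID_1", "ID_2", "ID_3"],
    "Location": ["Area A, City X", "Area B, City Y", "Area C, City X"],
    "Cuisines": ["Fast Food, Rolls, Burger", "Ice Cream, Desserts", "Italian, Street Food"],
    "Average_Cost": ["₹200", "₹100", "₹150"],
    "Minimum_Order": ["₹50", "₹50", "₹50"],
    "Rating": ["3.5", "NEW", "-"],
    "Votes": ["12", "-", "11"],
    "Reviews": ["4", "-", "4"],
    "Delivery_Time": ["30 minutes", "30 minutes", "65 minutes"],
})

print(train.head())                           # first few rows
print(train.shape)                            # (rows, columns)
print(train.dtypes)                           # data type of each column
print(train.isnull().sum())                   # missing values per column
print(train.nunique())                        # distinct values per column
print(train["Delivery_Time"].value_counts())  # how often each target class occurs
```

Note that `isnull()` only catches real nulls. Placeholder strings such as ‘-’ or ‘NEW’ count as valid text, so it is worth scanning the unique values of each column as well.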
Key Observations:
- The Location and Cuisines columns contain multiple values separated by commas.
- The Average_Cost and Minimum_Order columns contain symbols and are stored as strings.
- The Rating, Votes and Reviews columns contain invalid values such as ‘-’, ‘NEW’ etc.
- Restaurant, Location and Cuisines are categorical variables.
Data Cleaning
Data cleaning is a very important stage that directly affects the performance of a machine learning model. Based on the insights gathered from the exploration stage, we will clean the data.
This stage requires extensive coding. It is advisable to write functions that generalize well for similar kinds of data. Sometimes this is not possible and we will have to process and clean a column or feature separately.
Given below are two functions to clean the ‘Location’ and ‘Cuisines’ features of our dataset. We will split each cell of the Location and Cuisines columns into as many features as the maximum number of values found in a single cell.
To find the maximum number of values in a single cell, we must consider both training and test data, because the number of independent features must be the same in both.
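The splitting idea can be sketched as follows. The function names `max_features_in_cell` and `split_column` are hypothetical helpers written for illustration, not the original functions, and the sample rows are made up.

```python
import pandas as pd

def max_features_in_cell(column_values, sep=","):
    """Return the largest number of separated items found in any single cell."""
    return max(len(str(cell).split(sep)) for cell in column_values)

def split_column(df, column, n_features, sep=","):
    """Split `column` into `n_features` new columns, padding short cells with None."""
    rows = []
    for cell in df[column]:
        parts = [part.strip() for part in str(cell).split(sep)]
        parts += [None] * (n_features - len(parts))
        rows.append(parts)
    names = [f"{column}_{i + 1}" for i in range(n_features)]
    return pd.DataFrame(rows, columns=names, index=df.index)

# Made-up sample rows
train = pd.DataFrame({"Cuisines": ["Fast Food, Rolls, Burger", "Desserts"]})
test = pd.DataFrame({"Cuisines": ["Italian, Pizza", "North Indian, Chinese, Biryani, Mughlai"]})

# Look at both sets so that train and test end up with the same number of columns
n = max_features_in_cell(pd.concat([train["Cuisines"], test["Cuisines"]]))
train_cuisines = split_column(train, "Cuisines", n)
test_cuisines = split_column(test, "Cuisines", n)
print(train_cuisines)
```

Here the widest cell sits in the test set, so the training data also gets four cuisine columns, with the unused ones left empty for the preprocessing stage to handle.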
The two functions below clean the ‘Rating’, ‘Votes’ and ‘Reviews’ features of our dataset.
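One possible way to write such cleaners is sketched here. The helper names `clean_numeric_column` and `clean_currency_column` are hypothetical, the sample values are made up, and the second helper is an assumption about how the symbol-laden cost columns could be handled in the same pass.

```python
import pandas as pd

def clean_numeric_column(series):
    """Convert a string column to floats; non-numeric entries ('-', 'NEW', ...) become NaN."""
    return pd.to_numeric(series, errors="coerce")

def clean_currency_column(series):
    """Drop everything except digits and dots (currency symbols, commas), then convert to floats."""
    stripped = series.astype(str).str.replace(r"[^0-9.]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")

# Made-up sample rows
df = pd.DataFrame({
    "Rating": ["3.5", "NEW", "-"],
    "Votes": ["12", "-", "11"],
    "Average_Cost": ["₹200", "₹1,200", "₹50"],
})

df["Rating"] = clean_numeric_column(df["Rating"])
df["Votes"] = clean_numeric_column(df["Votes"])
df["Average_Cost"] = clean_currency_column(df["Average_Cost"])
print(df)
```

Turning invalid entries into NaN keeps the cleaning step separate from the decision of how to fill them, which belongs to preprocessing.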
Refer to the Complete Code section below for the detailed data cleaning.
Let’s have a look at the cleaned data
Once cleaned we can finally visualize the data. Let’s take a quick look at a pair-plot that shows the relationship between each of the numerical variables or columns in the training data. Pair-plot is one of the easiest ways to plot and identify the relationship between features in a dataset.
#Relation between the numeric features in the dataset
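A pair-plot of the numeric columns can be sketched as below. This version uses `scatter_matrix` from pandas, which draws the same kind of grid, and runs on made-up numbers rather than the hackathon data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped inside a notebook
import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up cleaned data (numeric columns only)
train = pd.DataFrame({
    "Average_Cost": [200.0, 100.0, 150.0, 250.0, 650.0, 300.0],
    "Rating": [3.5, 3.7, 4.0, 3.9, 4.3, 3.2],
    "Votes": [12.0, 40.0, 8.0, 150.0, 900.0, 23.0],
    "Reviews": [4.0, 15.0, 2.0, 60.0, 410.0, 9.0],
})

# One scatter plot per pair of columns, with histograms on the diagonal
numeric = train.select_dtypes(include="number")
axes = scatter_matrix(numeric, figsize=(8, 8), diagonal="hist")
```

With seaborn installed, `seaborn.pairplot(train)` produces an equivalent plot in a single call.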
Data Preprocessing
Data preprocessing, just like cleaning, has an impact on a model’s performance. The preprocessing stage consists mainly of the following steps:
- Dealing with NULLS/NaNs.
- Dealing with categorical values.
- Normalizing or scaling.
Listed below are some helpful packages and methods that help greatly with the preprocessing stage:
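A minimal sketch of the three steps, assuming pandas' `fillna` together with scikit-learn's `LabelEncoder` and `StandardScaler`. The fill strategy (median and a "NONE" placeholder), the column names and the sample values are illustrative choices, not taken from the original solution.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up cleaned data with gaps left over from the cleaning stage
df = pd.DataFrame({
    "Location_1": ["PUNE", "DELHI", None, "PUNE"],
    "Rating": [3.5, None, 4.0, 4.5],
    "Votes": [12.0, 40.0, None, 8.0],
})

# 1. NULLs/NaNs: fill numeric gaps with the column median, categorical gaps with a placeholder
for col in ["Rating", "Votes"]:
    df[col] = df[col].fillna(df[col].median())
df["Location_1"] = df["Location_1"].fillna("NONE")

# 2. Categorical values: map each label to an integer
encoder = LabelEncoder()
df["Location_1"] = encoder.fit_transform(df["Location_1"])

# 3. Scaling: zero mean and unit variance for the numeric columns
scaler = StandardScaler()
df[["Rating", "Votes"]] = scaler.fit_transform(df[["Rating", "Votes"]])
print(df)
```

In practice the encoder would be fitted on the combined labels of the training and test sets, so that a label seen only in the test data does not raise an error.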
Modeling and Predicting
Finally, we are on to building a simple classifier that can predict and evaluate on our sample data.
We will use a simple XGBoost classifier without any parameter tuning. This is a good starting point.
Before we begin to create a model, make sure we have a small dataset to test our model performance. The best approach is to split the training set into a training set and a validation set.
Also, it is important to separate out the independent and dependent variables from all the dataset samples.
To import the xgboost module, it must first be installed. Install it using:
!pip install xgboost
What the following module does:
- Splits the training data into a training set and a validation set
- Separates the dependent and independent features for the training set and validation set
- Initializes an XGBoost classifier
- Trains the classifier with the training data
- Evaluates the score on a validation set
- Predicts the classes for the test set.
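The steps above can be sketched as follows. This is an illustration under stated assumptions, not the original module: the data is synthetic, accuracy is assumed as the validation metric, and the target is label-encoded because recent XGBoost versions expect integer class labels. The fallback import is only there so the sketch still runs where xgboost is not installed.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

try:
    from xgboost import XGBClassifier as Classifier
except ImportError:  # fallback (assumption): a scikit-learn booster when xgboost is missing
    from sklearn.ensemble import GradientBoostingClassifier as Classifier

# Synthetic stand-in for the cleaned, fully numeric training and test sets
rng = np.random.default_rng(0)
features = ["Average_Cost", "Minimum_Order", "Rating", "Votes", "Reviews"]
train = pd.DataFrame(rng.random((200, 5)), columns=features)
train["Delivery_Time"] = np.where(train["Votes"] > 0.5, "45 minutes", "30 minutes")
test = pd.DataFrame(rng.random((20, 5)), columns=features)

# Separate the independent features (X) from the dependent target (y)
X = train[features]
target_encoder = LabelEncoder()  # string classes -> integers 0..n-1
y = target_encoder.fit_transform(train["Delivery_Time"])

# Hold out 20% of the training data as a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Initialize the classifier with default parameters and train it
model = Classifier()
model.fit(X_train, y_train)

# Evaluate on the validation set
val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"Validation accuracy: {val_accuracy:.3f}")

# Predict classes for the test set and map them back to the original labels
test_predictions = target_encoder.inverse_transform(model.predict(test[features]))
print(test_predictions[:5])
```

Keeping the fitted `target_encoder` around is what allows the predictions to be written back as the original class labels for the submission file.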
Execute the above code and upload your solution at MachineHack to see your score!