How to use XGBoost for time-series analysis?

Using XGBoost for time-series analysis can be considered an advanced approach to time series modelling. It can also improve both our results and the speed of modelling.

XGBoost is an efficient implementation of gradient boosting. When we talk about time series modelling, we generally refer to techniques like ARIMA and VAR models. XGBoost, as a gradient boosting technique, can be considered an advancement over these traditional modelling techniques. In this article, we will learn how we can apply gradient boosting with XGBoost for effective time series modelling. The major points to be discussed in this article are listed below.

Table of contents

  1. What is gradient boosting?
  2. What is XGBoost?
  3. Using XGBoost in time series
  4. The procedure
    1. Data analysis
    2. Data conversion 
    3. Model fitting 
    4. Forecasting 

Let’s start with having a brief introduction to gradient boosting.

What is gradient boosting?

In machine learning, gradient boosting is an algorithm used for regression and classification tasks. It makes predictions using an ensemble of weak prediction models, typically decision trees. Models ensembled from weak tree learners in this way are called gradient-boosted trees. Gradient-boosted trees are comparable to random forests, and can outperform them when finely tuned. Gradient boosting generalizes other boosting methods by optimizing an arbitrary differentiable loss function.
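To make this concrete, below is a minimal, self-contained sketch of gradient boosting using scikit-learn's GradientBoostingRegressor; the synthetic dataset and parameter values are purely illustrative and not part of this article's pipeline.

# Illustrative sketch of gradient boosting (synthetic data, hypothetical parameters)
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))              # one noisy input feature
y = np.sin(X).ravel() + rng.normal(0, 0.2, 500)    # noisy target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each shallow tree is fitted to the residual errors of the ensemble built so
# far -- this sequential correction of residuals is the core of gradient boosting
gbr = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
gbr.fit(X_train, y_train)
print('R^2 on held-out data:', gbr.score(X_test, y_test))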

What is XGBoost?

XGBoost stands for eXtreme Gradient Boosting. It is a library that implements regularized gradient boosting and is available in several languages, including Python, R, Julia, C++, and Java. As software, the main focus of XGBoost is to speed up and improve the performance of gradient-boosted decision trees. It provides scalable, portable, and distributed gradient boosting, and its functionality can be used both on a single machine and across distributed processing machines.
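As a quick, hypothetical usage example (not part of this article's walkthrough), the library's scikit-learn-style interface can be used as follows; the data below is random and only meant to show the API.

# Hypothetical usage sketch of the XGBoost scikit-learn interface
# (install the library first, e.g. with: pip install xgboost)
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 5)   # 100 synthetic samples with 5 features
y = np.random.rand(100)      # synthetic regression target

# XGBRegressor is the scikit-learn compatible wrapper around the booster
model = xgb.XGBRegressor(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X, y)
print(model.predict(X[:5]))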

Using XGBoost in time series

As we discussed in the above section, gradient boosting is mainly focused on improving the performance of machine learning models, and with gradient boosting through XGBoost we can speed up the procedure as well as obtain better results. In time series analysis and forecasting, we usually rely on traditional models such as ARIMA (autoregressive integrated moving average), whose main focus is regression analysis. If we perform this regression with software and techniques like XGBoost instead, we can achieve state-of-the-art performance in time series modelling. An ensemble of weak machine learning models combined through regularized gradient boosting can improve results in many areas of data science, and time series is one of them. In this article, we will see how we can make XGBoost perform in time series modelling.
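To make the idea concrete before we start, here is a small, hypothetical sketch of how a time series can be reframed as a supervised regression problem by deriving calendar features from the timestamps; this is essentially what we will do with the restaurant data later on (the series below is invented for illustration).

# Hypothetical sketch: reframing a daily series as a supervised regression problem
import pandas as pd

series = pd.Series(range(30),
                   index=pd.date_range('2019-01-01', periods=30, freq='D'))
frame = series.to_frame(name='target')

# calendar features derived from the timestamps become the regressors,
# while the original observations become the regression target
frame['dayofweek'] = frame.index.dayofweek
frame['month'] = frame.index.month
X, y = frame[['dayofweek', 'month']], frame['target']
print(X.head())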

The procedure

In the procedure, we are going to use the Take-Away Food Orders data from Kaggle; we can find the data here. The data contains details about orders, including the date of each order and the quantities ordered. Using this information, we will predict the order count for the next few dates. For the procedure, we will use the Python language and some basic libraries like NumPy, pandas, Matplotlib, and scikit-learn, along with the XGBoost software. Let’s start the procedure; as we go through it, we will see how to analyse the data and how to use XGBoost for forecasting.

Data analysis

Let’s start with loading the data.

import pandas as pd

# load the orders data and give the columns readable names
data = pd.read_csv('/content/drive/MyDrive/Yugesh/time series analysis with xgboost/restaurant-1-orders.csv')
data.columns = ['number', 'date', 'item', 'quantity', 'price', 'total_items']
data.head()

Output:

Here in the data, we can see that we have the order number, date, item name, price, quantity, and total items in the order. Now let’s convert the date values into date-time values.

# keep only the date part of the string and parse it as a datetime
data['date'] = pd.to_datetime(data['date'].str[:10])
data.head()

Output:

Here we can understand that to make predictions on the order count, we only need the order number, total items, and date. Let’s extract these columns from our data.

# keep only the columns needed to build the daily order-count series
order_data = data[['number','total_items', 'date']]
order_data

Output:

Here we can understand that we need data showing how many orders we have on each day. Let’s perform some more operations on the data to get the daily number of orders.

# group by date and count the unique order numbers per day; the date becomes the index
newdata = order_data.groupby('date')
res = newdata['number'].nunique()
res = res.to_frame()
res

Output:

Here, using the order number column, we have found the number of unique orders on each day. Our date column is now the index of the data, and this is how we have converted our data into a time series. Let’s plot the data.

res.plot()

Output:

Here in the plot, we can see that our time series is very scattered, and also that we are getting roughly 5 to 15 orders every day. Let’s make the visualization clearer.

import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt

# plot the daily order counts as points
color_pal = ["#F8766D", "#D39200", "#93AA00", "#00BA38", "#00C19F", "#00B9E3", "#619CFF", "#DB72FB"]
_ = res.plot(style='.', figsize=(15,5), color=color_pal[0], title='sale')

Output:

Things are much clearer now: we can see that the main density of order counts lies in the range of roughly 5 to 15. Now our data requires some conversions so that we can fit it into the model.

Data conversion

Let’s split our data.

# hold out everything after 1 January 2019 as the test set
split_date = '01-Jan-2019'
data_train = res.loc[res.index <= split_date].copy()
data_test = res.loc[res.index > split_date].copy()

Here we have split our data into train and test sets at the date 01-Jan-2019, so that we can evaluate the model on the period after 2018. Let’s see what our split datasets look like.

data_train

Output:

data_test

Output:

Let’s plot our split data.

# join the train and test sets back together just for plotting
_ = data_test \
    .rename(columns={'number': 'TEST SET'}) \
    .join(data_train.rename(columns={'number': 'TRAINING SET'}), how='outer') \
    .plot(figsize=(15,5), title='sale', style='.')

Output:

Here we can see the training and test sets plotted in different colours. After this, we can write a function that creates time-series features from our data.

def create_features(df, label=None):
    """Create time-series features from the datetime index of df."""
    df['date'] = df.index
    df['hour'] = df['date'].dt.hour
    df['dayofweek'] = df['date'].dt.dayofweek
    df['quarter'] = df['date'].dt.quarter
    df['month'] = df['date'].dt.month
    df['year'] = df['date'].dt.year
    df['dayofyear'] = df['date'].dt.dayofyear
    df['dayofmonth'] = df['date'].dt.day
    # dt.weekofyear is deprecated in recent pandas; isocalendar().week replaces it
    df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int)

    X = df[['hour','dayofweek','quarter','month','year',
            'dayofyear','dayofmonth','weekofyear']]
    if label:
        y = df[label]
        return X, y
    return X

Let’s use the function in our data.

X_train, y_train = create_features(data_train, label='number')
X_test, y_test = create_features(data_test, label='number')
X_train

Output:

In the above output, we can see the features we have created for training. Since XGBoost is a supervised learning method, we are required to frame our data as a supervised learning problem in order to work with it.

Model fitting 

Let’s import XGBoost and the other libraries we need for modelling and evaluation.

import xgboost as xgb
from xgboost import plot_importance, plot_tree
from sklearn.metrics import mean_squared_error, mean_absolute_error

Instantiating our model

reg = xgb.XGBRegressor(n_estimators=1000)

Fitting our data into the model.

# train with early stopping evaluated on both sets; note that in recent
# XGBoost versions (2.0 and later), early_stopping_rounds is passed to
# XGBRegressor() rather than to fit()
reg.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        early_stopping_rounds=50,
        verbose=False)

Before making predictions on the test data, we can also look at feature importance. This helps us find the features the model relies on most when making predictions. One more thing that is important here is that XGBoost works by splitting the data on informative features, so inspecting feature importance makes the underlying process much clearer.

_ = plot_importance(reg, height=0.9)

Output:

In the above output, we can see that the day of the year is our most important feature: the model has used it most often to split nodes. The month feature has the lowest importance.

Forecasting 

After fitting the model and inspecting feature importance, we can make predictions on the test data with our model. Using the below lines of code we can do this.

# predict on the test features and stitch the test and train sets back together for plotting
data_test['number_Prediction'] = reg.predict(X_test)
data_all = pd.concat([data_test, data_train], sort=False)

Now let’s plot our predictions.

_ = data_all[['number','number_Prediction']].plot(figsize=(15, 5))

Output:

Here we can see our predictions. As we have discussed, the order counts move within the range of roughly 5 to 15, and our model also predicts within this range. Using these predictions, we can tell the restaurant to prepare for roughly this range of orders, which helps reduce food wastage and maximize profit.
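Since we imported mean_squared_error and mean_absolute_error earlier but have not used them yet, we can optionally quantify the forecast error on the test period; this step is not part of the original walkthrough, but a minimal sketch would be:

# optional: quantify the forecast error on the test period
# (not part of the original walkthrough; uses the metrics imported earlier)
mse = mean_squared_error(data_test['number'], data_test['number_Prediction'])
mae = mean_absolute_error(data_test['number'], data_test['number_Prediction'])
print(f'Test MSE: {mse:.2f}, Test MAE: {mae:.2f}')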

Final words

In this article, we have gone through the process of applying XGBoost to time series modelling and forecasting. Along the way, we have also discussed some of the data analysis steps that can be helpful in solving real-life problems.
