XGBoost is an efficient implementation of gradient boosting. When we talk about time series modelling, we generally refer to techniques like ARIMA and VAR models. XGBoost, as a gradient boosting technique, can be considered an advancement over these traditional modelling techniques. In this article, we will learn how to apply gradient boosting with XGBoost for effective time series modelling. The major points to be discussed in this article are listed below.
Table of contents
- What is gradient boosting?
- What is XGBoost?
- Using XGBoost in time series
- The procedure
- Data analysis
- Data conversion
- Model fitting
- Forecasting
Let’s start with having a brief introduction to gradient boosting.
What is gradient boosting?
In machine learning, gradient boosting is an algorithm for regression and classification tasks. It makes predictions using an ensemble of weak prediction models, typically decision trees; ensembles built from weak tree learners are called gradient-boosted trees. Gradient-boosted trees are comparable to random forests, and can outperform them when finely tuned. Gradient boosting generalizes other boosting methods by optimizing an arbitrary differentiable loss function.
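To make the idea concrete, below is a minimal sketch of gradient boosting with squared-error loss (the toy data and hyperparameters are assumptions chosen purely for illustration): each weak tree is fit to the residuals of the current ensemble, and its prediction is added with a small learning rate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))            # toy inputs (assumed for illustration)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)  # noisy target

prediction = np.full(200, y.mean())              # start from a constant model
learning_rate = 0.1
for _ in range(100):
    residuals = y - prediction                   # negative gradient of squared-error loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)  # weak learner on residuals
    prediction += learning_rate * tree.predict(X)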
What is XGBoost?
XGBoost (eXtreme Gradient Boosting) is a library that provides regularized gradient boosting in several languages, including Python, R, Julia, C++, and Java. As software, the main focus of XGBoost is to speed up and improve the performance of gradient-boosted decision trees. It offers scalable, portable, and distributed gradient boosting, and can be used on single machines as well as in distributed processing environments.
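As a minimal usage sketch (the toy data below is an assumption for illustration), the Python package installs with pip install xgboost and exposes a scikit-learn style interface:
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)                  # toy features (assumed for illustration)
y = np.random.rand(100)                     # toy target
model = xgb.XGBRegressor(n_estimators=100)  # gradient-boosted tree regressor
model.fit(X, y)
print(model.predict(X[:5]))                 # predictions for the first five rows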
Using XGBoost in time series
As discussed in the above section, gradient boosting is mainly focused on improving the performance of machine learning models, and with XGBoost we can speed up the procedure as well as get better results. In time series analysis and forecasting, we traditionally use models like ARIMA (autoregressive integrated moving average), where the main focus of the model is regression analysis. If we perform this regression with software like XGBoost, we can achieve state-of-the-art performance in time series modelling as well. Ensembles of weak learners combined with regularized gradient boosting can improve results in every area of data science, and time series is no exception. In this article, we will see how to make XGBoost perform in time series modelling.
The procedure
In this procedure, we are going to use a Take-Away Food Orders dataset from Kaggle, which can be found here. The data contains details about each order, including the order date and the quantities ordered. Using this information, we will predict the order count for upcoming dates. We will use the Python language with some basic libraries (NumPy, pandas, matplotlib, and sklearn), together with the XGBoost library. Let's start the procedure; as we go further, we will see how to analyse the data and how to use XGBoost for forecasting.
Data analysis
Let’s start with loading the data.
import pandas as pd
data = pd.read_csv('/content/drive/MyDrive/Yugesh/time series analysis with xgboost/restaurant-1-orders.csv')
data.columns = ['number', 'date', 'item', 'quantity', 'price', 'total_items']
data.head()
Output:

Here in the data, we can see that we have the order number, date, item name, price, quantity, and total items in the order. Now let’s convert the date values into date-time values.
data['date'] = pd.to_datetime(data['date'].str[:10])
data.head()
Output:

Here we can see that to predict the order count, we only need the order number, total items, and date columns. Let's extract these values from our data.
order_data = data[['number','total_items', 'date']]
order_data
Output:

Next, we need data showing how many orders we have per day. Let's perform a few more operations on the data to get the daily count of orders.
newdata = order_data.groupby('date')   # group the orders by day
res = newdata['number'].nunique()      # count unique order numbers per day
res = res.to_frame()
res
Output:

Here, using the order number column, we have counted the number of unique orders per day. Our date column is now the index of the data, which is how we have converted it into a time series. Let's plot the data.
res.plot()
Output:

Here in the plot, we can see that our time series is quite scattered, with roughly 5 to 15 orders every day. Let's make the visualization clearer.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
color_pal = ["#F8766D", "#D39200", "#93AA00", "#00BA38", "#00C19F", "#00B9E3", "#619CFF", "#DB72FB"]
_ = res.plot(style='.', figsize=(15,5), color=color_pal[0], title='sale')
Output:

Things are much clearer now: the main density of order counts lies in the range of roughly 5 to 15 per day. Now our data requires some conversion so that we can fit it to the model.
Data conversion
Let’s split our data.
split_date = '01-Jan-2019'
data_train = res.loc[res.index <= split_date].copy()
data_test = res.loc[res.index > split_date].copy()
Here we have split our data into train and test sets at the date 01-Jan-2019, so that we can evaluate the model on orders placed after 2018. Let's see how our split datasets look.
data_train
Output:

data_test
Output:

Let’s plot our split data.
_ = data_test \
    .rename(columns={'number': 'TEST SET'}) \
    .join(data_train.rename(columns={'number': 'TRAINING SET'}), how='outer') \
    .plot(figsize=(15,5), title='sale', style='.')
Output:

Here we can see the training and test sets in different colours. Next, we can write a function that creates time-series features from our data.
def create_features(df, label=None):
    """Create time-series features from a datetime index."""
    df['date'] = df.index
    df['hour'] = df['date'].dt.hour
    df['dayofweek'] = df['date'].dt.dayofweek
    df['quarter'] = df['date'].dt.quarter
    df['month'] = df['date'].dt.month
    df['year'] = df['date'].dt.year
    df['dayofyear'] = df['date'].dt.dayofyear
    df['dayofmonth'] = df['date'].dt.day
    # .dt.weekofyear is deprecated in recent pandas; isocalendar().week is the replacement
    df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int)
    X = df[['hour', 'dayofweek', 'quarter', 'month', 'year',
            'dayofyear', 'dayofmonth', 'weekofyear']]
    if label:
        y = df[label]
        return X, y
    return X
Let’s use the function in our data.
X_train, y_train = create_features(data_train, label='number')
X_test, y_test = create_features(data_test, label='number')
X_train
Output:

In the above output, we can see the features we now have for training. Since XGBoost is a supervised learning method, we had to reframe our time series as supervised learning data, which is exactly what these features do.
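As a quick sanity check of this supervised framing (an optional step, not part of the original workflow), each row of X now pairs calendar features with that day's order count in y:
print(X_train.shape, y_train.shape)  # matching row counts for features and target
print(y_train.head())                # daily unique order counts as the target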
Model fitting
Let’s import the XGBoost and other libraries to optimize the process.
import xgboost as xgb
from xgboost import plot_importance, plot_tree
from sklearn.metrics import mean_squared_error, mean_absolute_error
Instantiating our model
reg = xgb.XGBRegressor(n_estimators=1000)
Fitting our data into the model.
reg.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        early_stopping_rounds=50,  # note: in xgboost >= 2.0 this argument moves to the XGBRegressor constructor
        verbose=False)
Before making predictions on the test data, we can also perform feature selection using the feature importance technique. This tells us which features the model relies on most to make its predictions. Another important point is that XGBoost works by splitting the data on informative features, so finding the important features makes the underlying process much clearer.
_ = plot_importance(reg, height=0.9)
Output:

In the above output, we can see that day of year is our most important feature: the model used it most often to split nodes. The month feature has the lowest importance.
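If we also want these importance scores as numbers rather than a plot, a small complementary sketch is to read the fitted model's feature_importances_ attribute, which is aligned with the training columns:
importances = pd.Series(reg.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))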
Forecasting
After fitting the model and feature selection process we can make predictions using our test data and model. Using the below lines of codes we can do this.
data_test['number_Prediction'] = reg.predict(X_test)
data_all = pd.concat([data_test, data_train], sort=False)
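Before plotting, we can quantify the forecast quality on the held-out set with the metrics we imported earlier (a quick sketch; the RMSE puts the error back into order-count units):
mse = mean_squared_error(y_test, data_test['number_Prediction'])
mae = mean_absolute_error(y_test, data_test['number_Prediction'])
print(f'MSE: {mse:.2f}, RMSE: {mse ** 0.5:.2f}, MAE: {mae:.2f}')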
Now let's plot our predictions.
_ = data_all[['number','number_Prediction']].plot(figsize=(15, 5))
Output:

Here we can see our predictions. As we have discussed, the actual order counts move in the range of roughly 5 to 15, and our model predicts within this range as well. Using these predictions, we can tell the restaurant to prepare for roughly this many orders, reducing food wastage and maximizing profit.
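To forecast truly future dates rather than the held-out test set, a minimal sketch (assuming, for illustration, a horizon of the next 30 days) is to build a future date index, derive the same calendar features, and predict with the fitted model:
future_dates = pd.date_range(start=res.index.max() + pd.Timedelta(days=1),
                             periods=30, freq='D')    # assumed 30-day horizon
future = pd.DataFrame(index=future_dates)
X_future = create_features(future)                    # label=None, so only features are returned
future['number_Prediction'] = reg.predict(X_future)
print(future['number_Prediction'].head())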
Final words
In this article, we have gone through the process of applying XGBoost to time series modelling and forecasting. Along the way, we have also discussed some data analysis steps that can be helpful in solving real-life problems.