Time series data is a collection of data points recorded in sequence along with time values. The time values can be spaced at regular or irregular intervals. We use time series data to predict future values based on past observations. In practice, the effects of seasonality, trend and noise in a time series can make these predictions inaccurate. For better forecasting, we need a stationary time series in which the effect of trend and seasonality is negligible. In this article, we will discuss the seasonality of time series data and how to remove it.
Types of Time-Series
There are two types of time series: Additive and multiplicative. To understand both better, we need to know trends, seasonality and noise in time series data. More formally, we can describe those three as:
Trend: The long-term increase or decrease in the overall level of the time series.
Seasonality: A change in the time series values that repeats over a fixed period of time.
Noise/Random: Irregular changes in the time series that are not explained by seasonality or trend.
In the above image, we can see how seasonality, trend and noise affect the whole observation of an additive time series data set. Interaction of those three in a dataset determines the type of time series data.
Additive Time Series: Addition of trend, seasonality, and noise makes the time series additive.
- Time-series = trend + seasonality + noise
Multiplicative Time Series: Multiplication of trend, seasonality, and noise makes the time series multiplicative.
- Time-series = trend × seasonality × noise
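To make the two definitions concrete, here is a minimal sketch that builds one series of each type from the same synthetic components. All values are made up for illustration and are not taken from the airline data set; the names `seasonal_factor` and `noise_factor` are assumptions introduced only for this example.

```python
import numpy as np

# Synthetic components (illustrative values, not from the airline data set)
t = np.arange(48)                                # 4 years of monthly steps
trend = 100 + 2.0 * t                            # steadily rising level
seasonal = 10 * np.sin(2 * np.pi * t / 12)       # repeats every 12 steps
rng = np.random.default_rng(0)
noise = rng.normal(0, 1, size=t.size)

# Additive: the seasonal swing has the same size at every level
additive = trend + seasonal + noise

# Multiplicative: components act as factors around 1, so the swing
# grows together with the level of the series
seasonal_factor = 1 + seasonal / 100
noise_factor = 1 + noise / 100
multiplicative = trend * seasonal_factor * noise_factor

# Taking logs turns the product into a sum of the logged components
log_sum = np.log(trend) + np.log(seasonal_factor) + np.log(noise_factor)
```

The last line shows why a log transformation is often applied to a multiplicative series: after taking logs, its components add up just like in the additive case.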
So far we have seen the basics of a time series data set. Next, we will discuss the seasonality of a data set and how to deseasonalize a time series.
In a time series, seasonality is the component that tells us that changes or fluctuations recur in a repeated way over similar periods. For example, sales of umbrellas increase in the rainy season. The rainy season comes only once a year, but it comes every year, so the increase repeats annually; hence we can say that there is a seasonality effect in the sales of umbrellas.
A cyclic structure in a data set can be seasonality if the values rise and fall repeatedly over a fixed period of time.
Understanding seasonality can improve forecasting results. However, to get a clear relationship between the input and the output, we sometimes need to remove the seasonality. Removing the seasonality is called deseasonalizing the time series.
There are many types of seasonality, depending on the time series and the frequency of the fluctuations. For example:
- Time of the day
After removing the seasonality from a time series, we can consider it a seasonally stationary time series.
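As a minimal sketch of what removing seasonality can look like, the example below estimates the effect of each calendar month as its average deviation from the overall mean and subtracts it. This is only one simple approach, shown on a synthetic series without trend; the yearly `pattern` and all numbers are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series without trend: a fixed yearly pattern plus noise
index = pd.date_range("2000-01-01", periods=48, freq="MS")
pattern = np.array([-8, -6, -2, 1, 4, 9, 12, 10, 3, -4, -9, -10], dtype=float)
rng = np.random.default_rng(1)
series = pd.Series(50 + np.tile(pattern, 4) + rng.normal(0, 1, 48), index=index)

# Estimate the seasonal effect of each calendar month as its average
# deviation from the overall mean, then subtract it from the series
monthly_effect = series.groupby(series.index.month).transform("mean") - series.mean()
deseasonalized = series - monthly_effect

# After the adjustment every calendar month has the same average value
month_means = deseasonalized.groupby(deseasonalized.index.month).mean()
```

On a series that also has a trend, the monthly averages would mix trend and seasonality, so the trend would need to be handled first.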
To learn about deseasonalizing, I am using the airline passengers data set. The data set contains the monthly passenger count from 1949 to 1960. The data has both trend and seasonality. We are going to remove the seasonality in the next steps.
Code Implementation of Deseasonalizing Time Series
Setting up the environment in Google Colab.
Python 3.6 or above,
Importing the basic libraries :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Reading the data set:
data = pd.read_csv("/content/drive/MyDrive/Yugesh/deseasonalizing time series/AirPassengers.csv", index_col=0, parse_dates=True)
data.head()
Here we can see that the data set has the month column as index column and count of passengers in column 0.
Let’s check for the trend graph of the dataset.
The plot of the data set suggests an additive time series, and it shows both a trend and a slight seasonality.
Let’s plot two consecutive years to compare their patterns more closely. I am choosing the years 1957 and 1958 for the comparison.
For the years 1957 and 1958, we can see that the shape and amplitude of the two curves are quite similar; only the overall passenger count has increased slightly from one year to the next. This repeated yearly pattern indicates a seasonality effect in the passenger count.
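The two-year comparison can be prepared with a date-range slice on the index. The sketch below uses a synthetic stand-in for `data` (the numbers are placeholders, only the `Passengers` column name follows the article), so it shows the slicing technique rather than the real values.

```python
import numpy as np
import pandas as pd

# Stand-in for the airline data: monthly values on a DatetimeIndex
# (synthetic numbers; the column name "Passengers" follows the article)
index = pd.date_range("1949-01-01", "1960-12-01", freq="MS")
data = pd.DataFrame({"Passengers": np.arange(len(index), dtype=float)}, index=index)

# Partial string indexing selects whole years on a DatetimeIndex
two_years = data.loc["1957":"1958"]

# Put the two years side by side: one row per month, one column per year
by_year = pd.DataFrame(
    {year: group["Passengers"].to_numpy()
     for year, group in two_years.groupby(two_years.index.year)},
    index=range(1, 13),
)
# by_year.plot() would then draw one line per year for comparison
```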
To understand it better, we can decompose the data set into its components (seasonality, trend and noise). For this, the ‘statsmodels’ package provides the function ‘seasonal_decompose’ in the ‘statsmodels.tsa.seasonal’ module.
Importing seasonal_decompose :
from statsmodels.tsa.seasonal import seasonal_decompose
Let’s check for the components :
decompose_data = seasonal_decompose(data, model="additive")
decompose_data.plot();
Here in the above chart, we can see the decomposed structure of data and the structure of the components in the data set which were affecting it.
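To get a feeling for what an additive decomposition produces, here is a hand-rolled sketch of the classical moving-average approach using only pandas. It is an illustration under simplifying assumptions (a synthetic series with a linear trend and a fixed 12-month `pattern`), not a reproduction of the statsmodels implementation.

```python
import numpy as np
import pandas as pd

# Synthetic additive series: linear trend plus a fixed 12-month pattern
index = pd.date_range("2000-01-01", periods=72, freq="MS")
pattern = np.array([-8, -6, -2, 1, 4, 9, 12, 10, 3, -4, -9, -10], dtype=float)
series = pd.Series(100 + 0.5 * np.arange(72) + np.tile(pattern, 6), index=index)

# 1. Trend: centered 2x12 moving average (a 12-term mean averaged once more
#    over 2 terms, then shifted so the window is centered on each month)
trend = series.rolling(12).mean().rolling(2).mean().shift(-6)

# 2. Seasonal: average detrended value per calendar month, centered on zero
detrended = series - trend
monthly = detrended.groupby(detrended.index.month).mean()
monthly -= monthly.mean()
seasonal = pd.Series(monthly.loc[series.index.month].to_numpy(), index=index)

# 3. Residual: what is left after removing trend and seasonality
resid = series - trend - seasonal
```

The trend is undefined for the first and last six months because the centered window does not fit there, which is also why decomposition plots usually show gaps at both ends of the trend and residual panels.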
Let’s make a graph for available seasonality.
In the seasonality graph, we can see the seasonality structure for every year, which is cyclic and repeatedly providing the same value.
To check for the stationarity of the time series, statsmodels provides the plot_acf function to draw an autocorrelation plot.
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(data);
Here the blue area is the confidence interval, and the bars start falling inside it after the 13th lag. This can be due to a seasonality of 12-13 months.
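Each bar in an autocorrelation plot is the correlation between the series and a copy of itself shifted by that lag. The sketch below computes this by hand with NumPy for a synthetic series with a 12-step period; `autocorr` is a helper written for this example, not a statsmodels function.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag (lag >= 1)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.sum(centered[lag:] * centered[:-lag]) / np.sum(centered ** 2)

# Synthetic seasonal series with a 12-step period and a little noise
t = np.arange(120)
rng = np.random.default_rng(2)
x = 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, t.size)

acf_6 = autocorr(x, 6)     # half a period apart: strongly negative
acf_12 = autocorr(x, 12)   # a full period apart: strongly positive
```

A strong positive value at lag 12 is the kind of signal that points to yearly seasonality in monthly data.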
We can cross-check this with the Dickey-Fuller test.
statsmodels provides a function to perform the test.
Importing function to perform the test:
from statsmodels.tsa.stattools import adfuller
Testing the data set with the Dickey-Fuller method:
# adfuller returns a tuple: (statistic, p-value, lags used, observations used, critical values, icbest)
dftest = adfuller(data.Passengers, autolag='AIC')
print("1. ADF : ", dftest[0])
print("2. P-Value : ", dftest[1])
print("3. Num Of Lags : ", dftest[2])
print("4. Num Of Observations Used For ADF Regression and Critical Values Calculation :", dftest[3])
print("5. Critical Values :")
for key, val in dftest[4].items():
    print("\t", key, ": ", val)
In the output, we can see that the p-value of the data set is greater than 0.05. For this reason, we cannot reject the null hypothesis of a unit root, so we interpret the data as non-stationary.
As we have seen that data is non-stationary, we can apply deseasonalization to the data set to make it more stable or stationary.
Let’s perform the deseasonalization on the data set.
Differencing over log-transformed time-series
We try to remove the seasonality by taking the log of the passenger count and subtracting from it the same log series shifted by one step.
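# Log-transform the passenger counts, then difference with a one-step shift
log_passengers = pd.DataFrame(np.log(data.Passengers))
log_diff = log_passengers - log_passengers.shift()

# Plot the transformed series above the original for comparison
fig, (ax1, ax2) = plt.subplots(2, 1)
log_diff.plot(ax=ax1, title='after log transformed & differencing')
data.plot(ax=ax2, title='original');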
In the output, we can compare the trend of the graph after deseasonalizing the data.
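The reason log differencing helps is that the difference of two logs equals the log of their ratio, so it measures relative change instead of absolute change. A minimal sketch on a synthetic series that grows by exactly 2% per step (an assumed toy series, not the passenger data):

```python
import numpy as np
import pandas as pd

# Synthetic series growing by exactly 2% per step
y = pd.Series(100 * 1.02 ** np.arange(24))

# Difference of logs equals the log of the ratio between neighbours
log_diff = np.log(y) - np.log(y).shift()
log_ratio = np.log(y / y.shift())

# Plain differences keep growing with the level; log differences do not
plain_diff = y - y.shift()
```

Here `log_diff` is constant at log(1.02) for every step, while `plain_diff` keeps increasing as the level of the series rises.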
Let’s check for the p-value of the new time series.
test = adfuller(log_diff.dropna().Passengers)
print("p-value :", test[1])  # index 1 of the result tuple is the p-value
The p-value is again greater than 0.05, so we can interpret the data as still non-stationary.
Differencing over power-transformed time series
Here we first apply a power transformation (square root) to the data and then take the difference between the transformed series and the same series shifted by one step.
# Square-root transform, then difference with a one-step shift
powered_transform = data.Passengers ** 0.5
powered_transform_diff = powered_transform - powered_transform.shift()

# Plot the transformed series above the original for comparison
fig, (ax1, ax2) = plt.subplots(2, 1)
powered_transform_diff.plot(ax=ax1, title='after power transformed & differencing')
data.plot(ax=ax2, title='original');
After this, we can check the p-value using the Dickey-Fuller test.
# powered_transform_diff is a Series, so it is passed to adfuller directly
test = adfuller(powered_transform_diff.dropna())
print("p-value :", test[1])
With differencing over the power-transformed time series, we get a good p-value of about 0.02, which is lower than 0.05, so we can consider our data stationary. Still, there are some more methods; let’s check the results of those as well.
Differencing over rolling mean taken for 12 months
# 12-month rolling mean, then difference with a one-step shift
rolling_mean = data.rolling(window=12).mean()
rolling_mean_diff = rolling_mean - rolling_mean.shift()

# Plot the differenced rolling mean above the original for comparison
fig, (ax1, ax2) = plt.subplots(2, 1)
rolling_mean_diff.plot(ax=ax1, title='after rolling mean & differencing')
data.plot(ax=ax2, title='original');
Let’s check the p-value using the Dickey-Fuller method.
test = adfuller(rolling_mean_diff.dropna().Passengers)
print("p-value :", test[1])
Here we can see that the p-value is again less than 0.05. It means that the different methods are improving the stationarity of the data set.
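A 12-month window works well here because every window contains each calendar month exactly once, so a pattern that repeats every 12 months averages out. The sketch below shows this on a synthetic series with a linear trend of 0.5 per month and a fixed yearly `pattern` that sums to zero; both are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic series: linear trend (0.5 per step) plus a yearly pattern
# whose twelve values sum to zero
pattern = np.array([-8, -6, -2, 1, 4, 9, 12, 10, 3, -4, -9, -10], dtype=float)
t = np.arange(60)
series = pd.Series(100 + 0.5 * t + np.tile(pattern, 5))

# The 12-step rolling mean cancels the yearly pattern ...
rolling_mean = series.rolling(window=12).mean()

# ... so its one-step difference leaves only the slope of the trend
rolling_mean_diff = (rolling_mean - rolling_mean.shift()).dropna()
```

In this idealised case every value of `rolling_mean_diff` equals the slope 0.5; on real data the result is noisier, but the seasonal swing is still strongly damped.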
Differencing over log-transformed & mean rolled time series
In this method, we take the rolling mean of the log-transformed series and subtract from it its own value shifted by one step.
Let’s check for the results.
# Log transform, 12-month rolling mean, then one-step differencing
logged_transform = pd.DataFrame(np.log(data.Passengers))
rolling_mean = logged_transform.rolling(window=12).mean()
diff = rolling_mean - rolling_mean.shift(1)

# Plot the transformed series above the original for comparison
fig, (ax1, ax2) = plt.subplots(2, 1)
diff.plot(ax=ax1, title='after log transformed rolling mean & differencing')
data.plot(ax=ax2, title='original');
We can see that this has distorted the seasonality, which suggests that this method is not as good as the previous ones.
Let’s check for the p-value.
test = adfuller(diff.dropna().Passengers)
print("p-value :", test[1])
As assumed, the p-value of the dataset is greater than 0.05. The dataset is not stationary.
Differencing over power transformed & rolling mean time series
This method tries to adjust the seasonality by taking the rolling mean of the power-transformed data and subtracting from it its own value shifted by one step.
Let’s check for the results.
# Square-root transform, 12-month rolling mean, then one-step differencing
powered_transform = pd.DataFrame(data.Passengers ** 0.5)
rolling_mean = powered_transform.rolling(window=12).mean()
diff = rolling_mean - rolling_mean.shift(1)

# Plot the transformed series above the original for comparison
fig, (ax1, ax2) = plt.subplots(2, 1)
diff.plot(ax=ax1, title='after power transformed rolling mean & differencing')
data.plot(ax=ax2, title='original');
In the output, we again see the distortion in the seasonality. Let’s check the p-value.
test = adfuller(diff.dropna().Passengers)
print("p-value :", test[1])
This p-value is one of the best we have obtained, but the seasonality in the graph does not look good; the trend component is still strongly present in the data after the seasonal adjustment. To improve it further, we can also apply detrending.
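Detrending is not covered in this article, but as a rough sketch of the idea, one simple option is to fit a straight line and subtract it. The example uses a synthetic series with a known slope of 0.3; a linear fit is only one possible choice and may not suit every data set.

```python
import numpy as np

# Synthetic series: straight-line trend (slope 0.3) plus noise
rng = np.random.default_rng(4)
t = np.arange(100, dtype=float)
series = 5.0 + 0.3 * t + rng.normal(0, 1, t.size)

# Fit a straight line (degree-1 polynomial) and subtract it
slope, intercept = np.polyfit(t, series, deg=1)
detrended = series - (slope * t + intercept)

# Fitting a line to the result again should give a slope close to zero
leftover_slope = np.polyfit(t, detrended, deg=1)[0]
```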
Our data set was quite clean and in near-ideal condition, but real-world data sets generally do not behave like this. We therefore usually need to do more work on the data set to make our predictions more accurate.
In this article, we discussed time series, had a basic overview of the components of a time series, and applied differencing methods to deseasonalize time series data in order to improve the accuracy of later modeling.