We have been generating data for years, and it keeps changing as time passes. It becomes very important for researchers to pay attention to past trends in order to make accurate predictions about upcoming events. A time series is a series of observations where time is the independent variable.
Consider the example of stock markets: given their volatility, it becomes very important for analysts and investors to have access to the most accurate figures before parking their money. As simple as it may sound, it is equally challenging. No one knows what news might shoot a price to the top and what could smash it down.
This article will walk you through some basics you need to understand before stepping into predictive modeling for forecasting.
The properties below influence the modeling process significantly.
Autocorrelation is a mathematical representation of the degree of similarity between a given time series and the lagged version of itself over successive time intervals. In other words, instead of calculating the correlation between two different series, we calculate the correlation of the series with an “x” unit lagged version (x∈N) of itself.
Autocorrelation is also known as lagged correlation or serial correlation. Its value varies between +1 and -1. A very small autocorrelation does not mean there is no relationship; the relationship could be non-linear. Let us understand this with a hand-calculated example.
Consider the above dataset. The data represents the monthly sales for 2 years and the lagged versions of itself. For the sake of this example, we have a 3-unit lagged series. We are going to use the following formula to calculate the lag-k autocorrelation of the time series:

r_k = Σ [r(t) − r̄] × [r(t−k) − r̄] / Σ [r(t) − r̄]²

where the numerator is summed over t = k+1, …, n, the denominator over t = 1, …, n, and:

- r(t) = the original time series, ordered by time
- r(t−k) = the same time series shifted by k units (in our case k = 3)
- r̄ = the average of the original time series
Now we’ll arrange the data in ascending (chronological) order, which will look like this.
Start by calculating the average of the original data (first column), which is r̄ = 18.1933. Now calculate the denominator of the formula, the sum of the squared deviations of the original data, which is 1455.431.
We now have the denominator and the average of the original dataset (r̄). Let’s calculate the numerator. Subtract the average of the original data from the first column, starting from the fourth row, and store the result in a separate column. Do the same with the fourth column (the 3-unit lag column) and store the output in another column.
Once you have both columns in place, multiply them, total the products and divide by the sum of squared deviations of the original dataset: 647.8286 / 1455.431 ≈ 0.45. This is the value of your autocorrelation, which could indicate seasonality over a time interval of 3 months. Having done that, let’s see the easier way to do all the calculation in Python.
This is how our data looks. Now let’s see how to compute and visualize autocorrelation with unit lag = 3 in Python.
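As a minimal sketch, pandas can do the whole calculation in one call. The sales figures below are hypothetical stand-ins for the article's 24-month table (the real values differ), and note that `Series.autocorr` computes the Pearson correlation between the series and its shifted copy, which is close to but not identical to the hand formula above.

```python
import pandas as pd

# Hypothetical monthly sales for 2 years, standing in for the article's
# table; the series repeats a similar pattern roughly every 3 months.
sales = pd.Series([14.0, 10.5, 20.3, 15.2, 11.1, 21.0,
                   16.4, 12.0, 22.5, 17.1, 13.2, 23.8,
                   18.0, 14.1, 24.2, 19.5, 15.0, 25.1,
                   20.2, 16.3, 26.0, 21.4, 17.2, 27.0])

# Lag-3 autocorrelation: Pearson correlation between the series
# and a copy of itself shifted by 3 periods.
lag3_autocorr = sales.autocorr(lag=3)
print(round(lag3_autocorr, 2))
```

For a picture of the autocorrelation at every lag, `pd.plotting.autocorrelation_plot(sales)` or statsmodels' `plot_acf` does the job.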
Seasonality in time series data means periodic fluctuations. It is often present when the graph of the time series resembles a sinusoidal shape, that is, it shows repetitions after every fixed interval of time. This repetition interval is known as the period. With seasonality, the repetitive structure occurs within every one-year span.
There is a difference between cyclic and seasonal data. In cyclic data the period can be of variable length: 2 days, 2 months, 2 years and so on. Seasonality, by contrast, is measured over a fixed period of one calendar or financial year. It is especially important for companies that have been in the market for a long time. Let’s understand seasonality with a small handwritten example.
Consider the above example which shows three-year sales of a company per quarter. Let’s calculate seasonality index for this data.
- Calculate the average sales for 2003, 2004 and 2005.
Average sales for year 2003 = 68.5
Average sales for year 2004 = 73.5
Average sales for year 2005 = 76.5
- Now divide the values of quarters by their respective year averages.
If we had data only for the year 2003, the values in the first row would be our seasonal indexes. Since we have data for 3 consecutive years, we take the average across all of them, which gives the seasonal indexes.
- Calculating the seasonal index by taking averages of the columns in the above table.
These are our seasonal indexes. To check whether or not we have arrived at the correct values, add them all up: since our data is quarterly, the sum should equal 4. If the data had sales for 12 months, the indexes should sum up to 12.
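The index calculation above can be sketched in a few lines of NumPy. The quarterly figures here are hypothetical, chosen only so that the yearly averages match the 68.5, 73.5 and 76.5 stated above; the article's actual table values may differ.

```python
import numpy as np

# Hypothetical quarterly sales (rows = 2003, 2004, 2005), chosen so the
# yearly averages come out to 68.5, 73.5 and 76.5 as in the example.
sales = np.array([[62., 70., 75., 67.],
                  [66., 75., 81., 72.],
                  [69., 78., 84., 75.]])

yearly_avg = sales.mean(axis=1, keepdims=True)   # one average per year
ratios = sales / yearly_avg                      # each quarter / its year's average
seasonal_index = ratios.mean(axis=0)             # average down each quarter column

print(seasonal_index.round(3))
print(round(seasonal_index.sum(), 6))            # quarterly data: should total 4
```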
The graph below shows what our original data looks like.
- Now let’s deseasonalize our data (remove the seasonal component). Take the original data and divide each data point by its respective seasonal index.
This is our deseasonalized data. Let’s see how this compares to the original data by plotting them on the same graph.
We can see that the data looks much smoother, and this transformation makes it easier to identify trends in the data.
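Deseasonalizing, and undoing it, is a pair of elementwise operations. A quick sketch, again using hypothetical quarterly figures consistent with the yearly averages in the worked example:

```python
import numpy as np

# Hypothetical quarterly sales (rows = 2003-2005); real values may differ.
sales = np.array([[62., 70., 75., 67.],
                  [66., 75., 81., 72.],
                  [69., 78., 84., 75.]])

# Seasonal indexes as computed above: quarter / yearly average,
# then averaged down each quarter column.
seasonal_index = (sales / sales.mean(axis=1, keepdims=True)).mean(axis=0)

# Deseasonalize by dividing each observation by its quarter's index...
deseasonalized = sales / seasonal_index

# ...and recover the original by multiplying the index back in.
original = deseasonalized * seasonal_index
print(np.allclose(original, sales))
```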
- If you wish to go back to the original data, then:
original_data = deseasonalized_data*index
There are 3 ways to spot seasonality:
- Looking at it
- 3 median method
- Least squares method
If we were to predict the deseasonalized sales for the first quarter of 2006 using least squares method, we’ll take all the values, put them in a calculator and find the regression equation.
y = 0.78x + 67.68
x takes quarters as input, and in the regression the quarters are numbered 1, 2, 3, …, so the first quarter of 2006 corresponds to x = 13. This gives 0.78 × 13 + 67.68 = 77.82.
The entire process can be replicated in Python by following the steps below.
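A minimal sketch of the least-squares step with `np.polyfit`, assuming hypothetical deseasonalized values (the article's own data yields y = 0.78x + 67.68, so the exact coefficients here will differ slightly):

```python
import numpy as np

# Hypothetical deseasonalized quarterly sales for 2003-2005, 12 points;
# x is the quarter number 1..12.
x = np.arange(1, 13)
y = np.array([68.8, 68.6, 68.3, 68.4,
              73.2, 73.5, 73.7, 73.6,
              76.5, 76.4, 76.6, 76.5])

slope, intercept = np.polyfit(x, y, 1)        # least-squares straight line
forecast_q1_2006 = slope * 13 + intercept     # first quarter of 2006 -> x = 13
print(round(forecast_q1_2006, 2))
```

To turn this back into a seasonal forecast, the deseasonalized prediction would be multiplied by the first quarter's seasonal index.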
Importantly, a time series where the seasonal component has been removed is called seasonal stationary. A time series with a clear seasonal component is referred to as non-stationary.
This is one of the most important characteristics of time series data. A time series is said to be stationary if it has a constant mean and variance, and a covariance that is independent of time. Ideally we would prefer a stationary series, but in the real world that is rarely the case. There are different notions of stationarity, as follows:
- Stationary process: A process that generates a stationary series of observations
- Stationary model: A model that describes a stationary series of observations
- Trend stationary: A time series that does not show a trend
- Seasonal stationary: A time series that does not show seasonality
- Strictly stationary: A mathematical definition of a stationary process, specifically that the joint distribution of observations is invariant to time shift.
Identifying stationarity in the time series can be tricky at times. There are multiple ways to deal with it.
- Looking at the plots:
By far the easiest and most straightforward method to decide whether the series is stationary or non-stationary.
- Summary statistics:
This can be a quick way to check for stationarity. To put it simply, break the series into two or more parts and compare the mean and variance across the parts. If they turn out to be similar, the series is likely stationary.
Let’s take the classic birth and airline-passenger CSV datasets as examples.
Looking at the histogram we can observe that the data forms a bell curve resulting in a normal distribution. Now following our summary statistics approach, let’s divide the dataset into two parts. Since the length of the dataset is an odd number, we’ll round the result of division by 2, to make two parts of the series.
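The split-and-compare check can be sketched as follows, using a synthetic stand-in for the births series (the real dataset lives in a CSV, but any roughly stationary series behaves the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the daily-births series: 365 observations
# fluctuating around a constant mean, i.e. a roughly stationary series.
births = rng.normal(loc=42, scale=7, size=365)

half = len(births) // 2          # odd length, so round down
first, second = births[:half], births[half:]

# For a stationary series the two halves should have similar statistics.
print(first.mean().round(2), first.var().round(2))
print(second.mean().round(2), second.var().round(2))
```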
Looking at the values, we could say the series is stationary, as they are quite close to each other. But if we take the classic airline data, you will see that the values are quite far apart.
Here one thing is quite evident: these graphs do not show stationarity. But if we revisit the time series plot for airline passengers, we can see exponential seasonal growth. To flatten out this growth we can apply a log transform to all the values, plot the transformed values and recompute the mean and variance. Now the values will be pretty close to each other.
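To see why the log transform helps, here is a sketch on a synthetic stand-in for the airline series: exponential growth with a seasonal swing that grows with the level, so the raw halves have very different variances while the log-transformed halves do not.

```python
import numpy as np

# Synthetic stand-in for the airline-passengers series: exponential
# growth plus a seasonal swing whose size grows with the level.
t = np.arange(144)                       # 12 years of monthly data
passengers = 100 * np.exp(0.01 * t) * (1 + 0.1 * np.sin(2 * np.pi * t / 12))

log_passengers = np.log(passengers)      # flattens multiplicative growth

half = len(t) // 2
print("raw variance ratio:", passengers[half:].var() / passengers[:half].var())
print("log variance ratio:", log_passengers[half:].var() / log_passengers[:half].var())
```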
- Statistical tests:
The most famous one is the Augmented Dickey-Fuller (ADF) test, also called a unit root test. ADF uses an autoregressive model and optimizes an information criterion across multiple different lag values. The simple idea is to look at the p-value, where the null hypothesis is that the series has a unit root (i.e., is non-stationary). If the p-value is <= 0.05, we reject the null hypothesis and conclude that the series is stationary.
- A more negative ADF statistic gives stronger grounds to reject the null hypothesis.
- The p-value is also less than 0.05, which means we can reject the null hypothesis and conclude that the series is stationary.
- The ADF statistic is less than the 1% critical value of -3.499. This means we can reject the null hypothesis at a significance level of less than 1%.
The above are the absolute basics of time series forecasting and are helpful while modeling time series data. Being able to interpret the graphs and the test values will help you select the most appropriate model for your data.