Vector autoregression (VAR) is a statistical model for multivariate time series analysis, especially in a time series where the variables have a relationship that affects each other to time. VAR models are different from univariate autoregressive models because they allow analysis and make predictions on multivariate time series data. VAR models are often used in economics and weather forecasting.
Basic requirements to use the VAR model are :
Register for FREE Workshop on Data Engineering>>
- Time series with at least two variables.
- Relationship between variables.
It is considered an autoregressive model because the predictions made by the model are dependent on the past values, which means that each observation is modelled as the function of its lagged value.
The basic difference between the ARIMA family and VAR models is that all the ARIMA models are used for univariate time series, where the VAR models work with multivariate time series. In addition, ARIMA models are unidirectional models, which means that the dependent variables are influenced by their past or lag values itself, where VAR is a bi-directional model, which means a dependent variable is affected by its past value or by another variable’s value or influenced by both of the things.
For more understanding about the time-series, please refer to these article:-
- General Overview Of Time Series Data Analysis.
- Comprehensive Guide To Deseasonalizing Time Series.
- How To Apply Smoothing Methods In Time Series Analysis.
- Guide To AC and PAC Plots In Time Series
To learn more about the time-series modeling, please refer to these articles:-
- Comprehensive Guide To Time Series Analysis Using ARIMA.
- Complete Guide To SARIMAX in Python for Time Series Modeling.
- Tutorial on Univariate Single-Step Style LSTM in Time Series Forecasting.
This article is going to be another guide for time-series modeling, but this time it will be with multivariate time series data. Also, we will go through some tests which are required to understand the multivariate time series. Before starting the modeling part, let’s go through the mathematics behind the model.
What is Vector Autoregression(VAR)?
A typical autoregression model(AR(p)) for univariate time series can be represented by
Where
- y_{t}_{−i} indicates the variable value at periods earlier.
- A_{i} is a time-invariant (k × k)-matrix.
- e_{t} is an error term.
- c is an intercept of the model.
Here the order p means, up to p-lags of y is used.
As we know, the VAR model deals with multivariate time series, which means there will be two or more variables affecting each other. Therefore, the VAR model equation increases with the number of variables in the time series.
Let’s suppose there are two time-series variables, y1 and y2, so to calculate y1(t), the VAR model will use the lags of both time-series variables.
For example, the equation for the VAR(1) model with two time-series variables (y1 and y2) will look like this:
Where, Y{1,t-1} is the first lag value for yy1 and Y{2,t-1} is the first lag value for y2.
And the VAR(2) with y1 and y2 time series variables, the equation of the model will look like :
Here we can clearly understand how the model’s equation will increase with variables and the lag values. For example, a VAR(3) model equation with 3 time-series variables will look like.
So this is how the p-value will increase the length of the model equation, and the number of variables will increase the height of the equation.
Implementing Vector Autoregression(VAR) in Python
Let’s build a basic VAR model using python.
To build the model, we can use python’s statsmodel package, which provides most of the module to work on time series analysis and p[rovides some data with the package to practice on the time series analysis.
Importing libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
from statsmodels.tsa.base import datetools
Importing the data
mdata = sm.datasets.macrodata.load_pandas().data
mdata.head()
Output:
Here we can see how our data looks. Here I am considering three variable real gdp real cons and realinv for further modeling processing. And also need to make a datetime value using the year and quarter columns before going for further processes.
mdata = mdata[['year','quarter']].astype(int)
mdata = mdata[['year','quarter']].astype(str)
quarterly = mdata["year"] + "Q" + mdata["quarter"]
quarterly = datetools.dates_from_str(quarterly)
quarterly
Output:
Now we can use the datetime values as the index of our data.
mata = mdata[['realgdp','realcons','realinv']]
mdata.index = pandas.DatetimeIndex(quarterly)
mdata.head()
Output:
Visualizing the data.
mdata.plot()
Output:
Here we can see that both realgdp and realcons have a high correlation, and there is a slight correlation between realinv and other variables. Because of the trend we are seeing, we can understand that all of them are steadily growing until the last point, but they all have some sort of decreament.
So we can check the pattern of time series in their log values.
data = np.log(mdata).diff().dropna()
data.plot()
Output:
Here we can see how the normalized values are being plotted in the time series. The relationship between realgdp and realcons is as strong as they follow the same line pattern, which clearly says the change in realgdp will affect realcons or vice-versa. For more analysis, we can perform some tests on the time series to statistically define the relationship between the variables.
Granger’s causality test
By using granger’s causality test, we can find the relationship between the variables before building the model because it is known that if there is no relationship between the variables, we can drop the variables and separately do the modeling. If there is a relationship between them, we need to consider the variable in the modeling part.
In mathematics, the test provides the p-value between the variables, and if the p-value is higher than 0.05 then we will be required to accept the null hypothesis, and if the p-value is lesser than 0.05 we are required to reject the null hypothesis.
Statsmodel also provides a module to perform the test, so using the statsmodel next, I am performing the granger’s causality test.
from statsmodels.tsa.stattools import grangercausalitytests
data = mdata[["realgdp", "realcons"]].pct_change().dropna()
#Performing test on for realgdp and realcons.
gc_res = grangercausalitytests(data, 12)
Output:
Here we can see that p-values for every lag are zero. So now, let’s move forward for the causality test between realgdp and real inv.
data = mdata[["realgdp", "realinv"]].pct_change().dropna()
Output:
Here we can see p values for every lag is higher than 0.05, which means we need to accept the null hypothesis. And also, we can say that a similar thing will happen if we perform the test between realcons and realinv.
data = mdata[["realcons", "realinv"]].pct_change().dropna()
Output:
Here we are getting p-values higher than before. This means that the realcons are affecting the realinv. In graphs, we were estimating that there is no significant relationship between the realinv and other values, and hence we can understand the significance of the granger’s causality test.
Cointegration test
Cointegration helps to find out the statistical connection between two or more time series. When two or more time series are cointegrated, they have a long run, statistically significant relationship.
We can perform this test using the statsmodel package.
data = mdata[["realgdp","realcons", "realinv"]].pct_change().dropna()
from statsmodels.tsa.vector_ar.vecm import coint_johansen
def cointegration_test(data, alpha=0.05):
"""Perform Johanson's Cointegration Test and Report Summary"""
out = coint_johansen(data,-1,5)
d = {'0.90':0, '0.95':1, '0.99':2}
traces = out.lr1
cvts = out.cvt[:, d[str(1-alpha)]]
def adjust(val, length= 6): return str(val).ljust(length)
# Summary
print('Name :: Test Stat > C(95%) => Signif \n', '--'*20)
for col, trace, cvt in zip(data.columns, traces, cvts):
print(adjust(col), ':: ', adjust(round(trace,2), 9), ">", adjust(cvt, 8), ' => ' , trace > cvt)
cointegration_test(data)
Output:
Here we can see the significance of the variables on the whole system.
So here we have seen how we can use these two tests; next, we can further proceed with the modeling part.
Modeling
We can directly put the preprocessed data in the VAR module for modeling purposes.
var = VAR(data)
After this step, one thing comes up in the procedure: how to select the order. One thing that is usually performed is to check for the best-fit lag value. We need to compare the different AIC(Akaike Information Criterion), BIC(Bayesian Information criterion), FPE(Focused Prediction Error) and HQIC(Hannan–Quinn information criterion). These all are the parameters that help to select the best-fit lag value.
Statsmodel provides the select_order module to analyze these values.
x= var.select_order()
x.summary()
Output:
Here we can see the minimum values in combination with the AIC, BIC, FPE and HQIC are given with the * sign. Here we can see we have that sign in the third row and the first row. Here I am choosing the third row, which means that the value of lag valueVAR(p) is three because it is suggested to go with the combinations where AIC with other parameters are generating minimums.
results = var.fit(3)
#We can check the summary of the model by.
results.summary()
Output:
Here we can see all coefficient standard error value t-test and the model’s probabilities at every lag till 3 lag. Then, at last, there is a confusion matrix that shows the correlation between the variables. In the results, we found that the correlation between realgdp and realinv is high, the relatable effect in the probability also we can cross-check for the same thing. We can also plot the model, which will be a better way to understand the model’s performance.
Visualizing the input:
results.plot();
Output:
We can also plot our forecasted values by the model.
results.plot_forecast(20);
Output:
Here we can see the results by the model in the plot for every variable. The lines for forecasted values for the next 20 steps are going in such a steady manner, so here we can see the lags we have decided are quite satisfactory and providing good results.
Here we have seen in the article how we can perform the time series modeling where the data is multivariate. We have seen in such a condition how we can understand the relationship between two time-series variables presented in one data. There can be many examples of this kind of situation. For example, the variation in the atmospheric temperature can be caused by humidity and season factors also or stock market data. I encourage you to use this model in the real world scenario datasets so that we will know things in more depth about the multivariate time series analysis.
References :
- Statsmodel.
- Google colab for implemented codes in this article.
Subscribe to our Newsletter
Get the latest updates and relevant offers by sharing your email.Join our Telegram Group. Be part of an engaging community
Yugesh is a graduate in automobile engineering and worked as a data analyst intern. He completed several Data Science projects. He has a strong interest in Deep Learning and writing blogs on data science and machine learning.