# Comprehensive Guide To Time Series Analysis Using ARIMA

Time series data is a set of observations collected through repeated measurements over time. When the points are plotted on a graph, one of the axes is always time. Time series data is everywhere, since time is a constituent of everything observable. As technology advances, sensors and systems produce a relentless stream of time series data, and data of this nature has applications across a variety of industries. In short, time series data is data observed at different points in time.

This contrasts with cross-sectional data, which observes individuals, companies, etc., at a single point in time. Time-series databases are popular and serve a wide range of applications such as stock market analysis, economic and sales forecasting, and budget analysis, to name a few. They are also useful for studying natural phenomena like atmospheric pressure, temperature, wind speeds, and earthquakes, and for medical prediction for treatment. An observed time series can be decomposed into three main components: the trend (the long-term movement), the seasonal component (systematic, calendar-related movements), and the irregular component (unsystematic, short-term fluctuations). Time series analysis is used to draw insights from these components.
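To make the three components concrete, here is a minimal sketch of a classical additive decomposition using only NumPy. The series is synthetic and every name in it (`period`, `trend_true`, `seasonal_true`) is a hypothetical choice for illustration; it is one simple way to separate the components, not the only one.

```python
import numpy as np

# Synthetic daily series (hypothetical data): linear trend + weekly pattern + noise
rng = np.random.default_rng(0)
period = 7
n = 10 * period
t = np.arange(n)
trend_true = 0.5 * t
seasonal_true = np.tile([3.0, 1.0, -2.0, -4.0, 0.0, 1.0, 1.0], n // period)
series = trend_true + seasonal_true + rng.normal(scale=0.1, size=n)

# 1. Trend: moving average over one full period (odd period -> centred window)
kernel = np.ones(period) / period
trend = np.convolve(series, kernel, mode="valid")  # length n - period + 1
half = period // 2
detrended = series[half:n - half] - trend

# 2. Seasonal: mean of the detrended values at each position in the cycle
positions = t[half:n - half] % period
seasonal = np.array([detrended[positions == k].mean() for k in range(period)])

# 3. Irregular: what remains after removing the trend and seasonal parts
irregular = detrended - seasonal[positions]

print("estimated seasonal pattern:", np.round(seasonal, 2))
print("irregular std:", irregular.std().round(3))
```

Because the moving-average window spans exactly one cycle, the seasonal swings average out of the trend estimate, which is what lets the second step recover the weekly pattern.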


Time series analysis finds hidden patterns in time series data and helps extract useful insights from it. It is useful for predicting future values and for detecting anomalies. Such analysis typically requires a large number of data points to ensure consistency and reliability: an extensive dataset gives a representative sample and lets the analysis cut through noise. It helps organizations understand the underlying causes of trends or systematic patterns detected over time. With data visualization, the insights become easier to interpret, showing seasonal trends and helping to dig deeper into why they occur.

The different types of models and analyses that can be created through time series analysis are:

• Classification: Identify and assign categories to the data.
• Curve fitting: Plot the data along a curve to study the relationships between the variables in the data.
• Descriptive analysis: Identify patterns in time series data such as trends, cycles, or seasonal variation.
• Explanative analysis: Understand the data and the relationships within it, including dependent features and cause and effect.
• Exploratory analysis: Describe and highlight the main characteristics of the time series data, usually in a visual format.
• Forecasting: Predict future data based on historical trends, using the historical data as a model for future data and for scenarios that could occur at future plot points.
• Intervention analysis: Study how an event can change the data.
• Segmentation: Split the data into segments to discover the underlying properties of the source information.

Because time series analysis spans many categories and variations of data, analysts sometimes have to build complex models. However, analysts cannot account for every source of variance, and a model specified for one sample cannot always be generalized to another. A model that is too complex, or that tries to do too many things, can suffer from a lack of fit. A poorly fitting or overfitted model fails to distinguish random error from true relationships, which leaves the analysis skewed and the forecasts incorrect.


## What is ARIMA?

A popular and widely used statistical method for time series forecasting and analysis is the ARIMA model. ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a class of models that captures a range of standard temporal structures in time series data. With an ARIMA model, you can forecast and analyze a time series using its own past values, such as predicting future prices based on historical prices. Univariate models such as these are used to better understand a single time-dependent variable, such as temperature over time, and to predict its future data points. The autoregressive and moving average parts assume that the data is stationary; the differencing step (the "integrated" part) exists to make a non-stationary series stationary first. The standard notation for an ARIMA model is ARIMA(p, d, q), where each parameter is replaced with an integer value to indicate quickly which specific model is being used.

The parameters of the ARIMA model are further described as follows:

• p: The number of lag observations included in the model, also known as the lag order.
• d: The number of times the raw observations are differenced, also called the degree of differencing.
• q: The size of the moving average window, also called the order of the moving average.
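The role of d is the easiest of the three to see in isolation. The sketch below applies differencing to a small made-up series with a quadratic trend (the series is an assumption chosen for illustration, not data from this article): one round of differencing still leaves a trend, while two rounds leave a constant.

```python
import numpy as np

# Hypothetical series with a quadratic trend: y_t = t^2 + 3t + 5
t = np.arange(8)
y = t**2 + 3 * t + 5

d1 = np.diff(y, n=1)  # first difference: y_t - y_{t-1}, still trending
d2 = np.diff(y, n=2)  # second difference: constant, trend removed

print("y  :", y)
print("d=1:", d1)
print("d=2:", d2)
```

Note that each round of differencing shortens the series by one observation, which is one reason to keep d as small as the data allows.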

## Getting Started

This article will implement an ARIMA model step by step to create a time series forecasting analysis. We will be using “pmdarima”, a statistical library for Python that extends its time series analysis capabilities. We will analyse the temperatures in the dataset and predict future values. You can download the dataset from here to get started. The following implementation is also partially inspired by a video tutorial for time series forecasting, which can be accessed from the link here.

##### Installing The Library

First, we will install the pmdarima library, which will support our analysis and help us create a well-defined forecasting model. To install the library, run the following code,

```
#installing the library
!pip install pmdarima
```

##### Importing Required Dependencies

Next, we will import the required dependencies for our model; here we are using pandas and NumPy for analysis.

```
#importing dependencies
import pandas as pd
import numpy as np
```

We will now read the data into a data frame, with the date as the index. The data has five columns: MinTemp, MaxTemp, AvgTemp, Sunrise, and Sunset. We will focus on the AvgTemp column and predict future temperatures from it.
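As a minimal sketch of this loading step, the snippet below reads a tiny inline sample instead of the real file. The first and last rows reuse values that appear later in this article; the `DATE` column name and the middle row with missing values are assumptions made for illustration.

```python
import io
import pandas as pd

# Inline stand-in for the downloaded CSV. The 'DATE' column name and the
# middle row with missing values are assumptions made for illustration.
csv_text = """DATE,MinTemp,MaxTemp,AvgTemp,Sunrise,Sunset
2018-12-01,36,52,44,640,1743
2018-12-02,,,,641,1743
2018-12-30,39,52,46,656,1754
"""

# index_col + parse_dates turn the date column into a DatetimeIndex
sample = pd.read_csv(io.StringIO(csv_text), index_col="DATE", parse_dates=True)
sample = sample.dropna()  # drop rows with any missing value
print("Shape of data", sample.shape)
print(sample.index)
```

With the real file, the same `read_csv` call is pointed at the downloaded CSV instead of the in-memory text.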

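```
#reading the data
#NOTE: the file name and date column name are placeholders; adjust them to match the downloaded dataset
df=pd.read_csv('temperature_data.csv',index_col='DATE',parse_dates=True)
#dropping null values
df=df.dropna()
print('Shape of data',df.shape)
```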

##### Data Preprocessing and Initial Analysis

Plotting the data to see how it looks,

```
#plotting the data
df['AvgTemp'].plot(figsize=(12,5))
```

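Next, we run a statistical test known as the Augmented Dickey-Fuller (ADF) test to check whether the data is stationary. The null hypothesis of the test is that the series has a unit root, i.e. is non-stationary, so we judge based on the p-value: a value below 0.05 lets us reject the null hypothesis and treat the data as stationary.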

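```
#applying dickey-fuller test
from statsmodels.tsa.stattools import adfuller

#creating a function for values
def adf_test(dataset):
    #adfuller returns (statistic, p-value, lags used, observations used, critical values, best IC)
    dftest = adfuller(dataset, autolag = 'AIC')
    print("1. ADF : ", dftest[0])
    print("2. P-Value : ", dftest[1])
    print("3. Num Of Lags : ", dftest[2])
    print("4. Num Of Observations Used For ADF Regression and Critical Values Calculation :", dftest[3])
    print("5. Critical Values :")
    for key, val in dftest[4].items():
        print("\t",key, ": ", val)

#printing for AvgTemp
adf_test(df['AvgTemp'])
```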

Output :

```
1. ADF :  -6.554680125068777
2. P-Value :  8.675937480199653e-09
3. Num Of Lags :  12
4. Num Of Observations Used For ADF Regression and Critical Values Calculation : 1808
5. Critical Values :
1% :  -3.433972018026501
5% :  -2.8631399192826676
10% :  -2.5676217442756872
```

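The p-value (about 8.7e-09) is far below 0.05, and the ADF statistic is more negative than all three critical values, so we can reject the null hypothesis and state that the data is stationary.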
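The decision rule can be written down in a few lines. This is a small sketch that reuses the numbers from the ADF output above; the helper name `reject_unit_root` and the 5% threshold are our own choices for illustration.

```python
# Values copied from the ADF output above
adf_stat = -6.554680125068777
p_value = 8.675937480199653e-09
critical_values = {
    "1%": -3.433972018026501,
    "5%": -2.8631399192826676,
    "10%": -2.5676217442756872,
}

def reject_unit_root(p_value, alpha=0.05):
    """True when the ADF null hypothesis (unit root, non-stationary) is rejected."""
    return p_value < alpha

stationary = reject_unit_root(p_value)

# Cross-check: the statistic should be more negative than each critical value
below = {level: adf_stat < cv for level, cv in critical_values.items()}

print("stationary at the 5% level:", stationary)
print("statistic below critical value:", below)
```

Had the p-value been above the threshold, the usual next step would be to difference the series and run the test again.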

##### Creating Our ARIMA Model

```
#creating our ARIMA Model
from pmdarima import auto_arima
# Ignore harmless warnings
import warnings
warnings.filterwarnings("ignore")
```

Calling our model and generating the best possible ARIMA combination,

```
#calling our function
stepwise_fit = auto_arima(df['AvgTemp'],suppress_warnings=True)
stepwise_fit.summary()
```

The summary reports the best-fitting order found by the search. The models below use the order (1, 0, 5).
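To give a feel for what an automated order search does, here is a deliberately simplified sketch: it fits autoregressive models of increasing order by least squares and keeps the one with the lowest AIC. This is not pmdarima's actual algorithm (which also handles differencing and moving average terms); the helper `fit_ar_aic` and the simulated AR(2) series are assumptions made for illustration.

```python
import numpy as np

def fit_ar_aic(y, p, max_p):
    """Fit AR(p) by least squares and return its AIC.
    The first max_p points are skipped so every order uses the same sample."""
    target = y[max_p:]
    # Column k holds the series lagged by k + 1 steps
    lags = np.column_stack([y[max_p - k - 1:len(y) - k - 1] for k in range(p)])
    X = np.column_stack([np.ones(len(target)), lags])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    rss = np.sum((target - X @ coef) ** 2)
    n, k_params = len(target), p + 1
    return n * np.log(rss / n) + 2 * k_params

# Simulate a stationary AR(2) process (hypothetical coefficients)
rng = np.random.default_rng(42)
n = 500
y = np.zeros(n)
for i in range(2, n):
    y[i] = 0.6 * y[i - 1] - 0.3 * y[i - 2] + rng.normal()

max_p = 4
aic_by_p = {p: fit_ar_aic(y, p, max_p) for p in range(1, max_p + 1)}
best_p = min(aic_by_p, key=aic_by_p.get)

print(aic_by_p)
print("order with lowest AIC:", best_p)
```

The AIC rewards a lower residual error but charges a penalty for every extra parameter, which is how the search avoids simply picking the largest model.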

Splitting the data into train and test sets, holding out the last 30 days as the test set and using the rest as the training data.

```
#splitting into train and test
print(df.shape)
train=df.iloc[:-30]
test=df.iloc[-30:]
print(train.shape,test.shape)
#first and last rows of the test set
print(test.iloc[0],test.iloc[-1])
```

Output :

```
(1821, 5)
(1791, 5) (30, 5)
MinTemp      36.0
MaxTemp      52.0
AvgTemp      44.0
Sunrise     640.0
Sunset     1743.0
Name: 2018-12-01 00:00:00, dtype: float64 MinTemp      39.0
MaxTemp      52.0
AvgTemp      46.0
Sunrise     656.0
Sunset     1754.0
Name: 2018-12-30 00:00:00, dtype: float64
```

Training the ARIMA model,

```
#model training
#NOTE: statsmodels.tsa.arima_model was removed in statsmodels 0.13; ARIMA now lives in statsmodels.tsa.arima.model
from statsmodels.tsa.arima.model import ARIMA
model=ARIMA(train['AvgTemp'],order=(1,0,5))
model=model.fit()
model.summary()
```


##### Making Predictions on Test Set

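Predicting over the test period and plotting the predictions against the actual values,

```
start=len(train)
end=len(train)+len(test)-1
#typ='levels' is not needed: with d=0 the predictions are already on the original scale
pred=model.predict(start=start,end=end).rename('ARIMA predictions')
#aligning the predictions with the test dates so both lines share one axis
pred.index=test.index
pred.plot(legend=True)
test['AvgTemp'].plot(legend=True)
```

```
#knowing the mean AvgTemp
test['AvgTemp'].mean()
```

Output :

```
45.0
```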

Our model here seems to predict the trend well from the data.

Calculating the root mean squared error (RMSE) to check how our model has performed. The RMSE is in the same units as the data, so it can be compared with the mean of the test set: the closer the RMSE is to that mean, the worse the model.
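To make the metric concrete, here is a from-scratch sketch of the RMSE in plain Python. The temperatures are made-up values for illustration, not rows from the dataset.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical temperatures, for illustration only
actual = [44.0, 46.0, 45.0, 47.0]
predicted = [45.0, 45.0, 47.0, 44.0]

error = rmse(actual, predicted)
mean_actual = sum(actual) / len(actual)

print(f"RMSE: {error:.3f}")
print(f"RMSE as a share of the mean: {error / mean_actual:.1%}")
```

Expressing the RMSE as a share of the mean gives a quick sense of scale. In practice a library routine such as scikit-learn's `mean_squared_error` does the same job.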

```
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse=sqrt(mean_squared_error(test['AvgTemp'],pred))
print(rmse)
```

Output :

```
3.000495429601031
```

Refitting the model on the full dataset and printing the last five rows to see on what date the dataset ends,

```
#refitting the model on the full dataset
model2=ARIMA(df['AvgTemp'],order=(1,0,5))
model2=model2.fit()
#checking data end date
df.tail()
```
##### Making Future Predictions

Printing future predictions for the next 30 days from the end date,

```
#printing predictions for next 30 days
#note: 2018-12-30 is the last observed date, so strictly the first forecast step is 2018-12-31
index_future_dates=pd.date_range(start='2018-12-30',end='2019-01-29')
#print(index_future_dates)
pred=model2.predict(start=len(df),end=len(df)+30).rename('ARIMA Predictions')
#print(comp_pred)
pred.index=index_future_dates
print(pred)
```

Output :

```
2018-12-30    46.418064
2018-12-31    46.113783
2019-01-01    45.617772
2019-01-02    45.249555
2019-01-03    45.116984
2019-01-04    45.136771
2019-01-05    45.156280
2019-01-06    45.175516
2019-01-07    45.194482
2019-01-08    45.213183
2019-01-09    45.231622
2019-01-10    45.249802
2019-01-11    45.267728
2019-01-12    45.285403
2019-01-13    45.302830
2019-01-14    45.320012
2019-01-15    45.336955
2019-01-16    45.353659
2019-01-17    45.370130
2019-01-18    45.386370
2019-01-19    45.402383
2019-01-20    45.418171
2019-01-21    45.433738
2019-01-22    45.449087
2019-01-23    45.464221
2019-01-24    45.479143
2019-01-25    45.493855
2019-01-26    45.508362
2019-01-27    45.522665
2019-01-28    45.536769
2019-01-29    45.550674
Freq: D, Name: ARIMA Predictions, dtype: float64
```
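One detail worth checking when labelling forecasts is where the date index starts, since the first out-of-sample step falls on the day after the last observation. The sketch below shows one way to derive the index from the last observed date rather than typing the dates by hand; the names `last_observed` and `n_steps` are our own, and the 30-day horizon is an assumption.

```python
import pandas as pd

last_observed = pd.Timestamp("2018-12-30")  # last date in the dataset
n_steps = 30                                # forecast horizon in days

# Start one day after the last observation so the labels do not overlap the data
future_index = pd.date_range(
    start=last_observed + pd.Timedelta(days=1), periods=n_steps, freq="D"
)

print(future_index[0].date(), "->", future_index[-1].date())
print(len(future_index))
```

An index built this way pairs with a prediction call that asks for exactly `n_steps` values, i.e. `start=len(df)` and `end=len(df)+n_steps-1`.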

Plotting Graph for future predictions,
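A minimal sketch of such a plot is shown below. To stay self-contained it rebuilds a short series from the first five forecast values printed above rather than reusing the fitted model; the figure size and axis labels are our own choices, and matplotlib is assumed to be installed since pandas uses it for plotting.

```python
import pandas as pd

# First five forecast values copied from the output above
pred = pd.Series(
    [46.418064, 46.113783, 45.617772, 45.249555, 45.116984],
    index=pd.date_range("2018-12-30", periods=5, freq="D"),
    name="ARIMA Predictions",
)

# pandas delegates to matplotlib; in a notebook the figure renders inline
ax = pred.plot(figsize=(12, 5), legend=True)
ax.set_xlabel("Date")
ax.set_ylabel("AvgTemp")
```

With the full forecast series in hand, the same `plot` call applies unchanged.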

## EndNotes

This article has explored the ARIMA model and how time series analysis can be performed with it. We also discussed the different aspects of time series analysis and the steps needed to create a complete time series model. You can implement the same on different datasets and see how the complexity varies. The colab notebook for the above implementation can be found here.

Happy Learning!
