Comprehensive Guide To Time Series Analysis Using ARIMA

Time series data is a set of observations collected through repeated measurements over time. When the points are plotted on a graph, one of the axes is always time. Time series data is everywhere, since time is a constituent of everything observable. As technology advances, sensors and systems constantly produce a relentless stream of time series data, and data of this nature has numerous applications across a variety of industries.

This is in contrast to cross-sectional data, which observes individuals, companies, etc., at a single point in time. Time series databases are highly popular, with applications such as stock market analysis, economic and sales forecasting, and budget analysis, to name a few. They are also useful for studying natural phenomena like atmospheric pressure, temperature, wind speeds, and earthquakes, and for medical prediction for treatment. An observed time series can be decomposed into three main components: the trend, i.e. the long-term movement; the seasonal component, i.e. systematic or calendar-related movements; and the irregular component, i.e. unsystematic or short-term fluctuations. Time series analysis can be used to find insights from these components.
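To make the three components concrete, here is a minimal sketch that builds a synthetic series as trend + seasonal + irregular and then recovers the first two with a centered moving average. The series, the weekly period, and the helper name centered_moving_average are assumptions made for illustration; they are not part of the dataset used later, and the additive form is only one possible way to decompose a series.

```python
import math
import random

def centered_moving_average(values, window):
    """Estimate the trend with a centered moving average (odd window)."""
    half = window // 2
    trend = [None] * len(values)  # the edges have no full window
    for i in range(half, len(values) - half):
        trend[i] = sum(values[i - half:i + half + 1]) / window
    return trend

# Synthetic series (assumed for illustration): trend + seasonal + irregular
random.seed(0)
period = 7
n = 70
trend_true = [10 + 0.5 * t for t in range(n)]
seasonal_true = [3 * math.sin(2 * math.pi * t / period) for t in range(n)]
irregular = [random.gauss(0, 0.2) for _ in range(n)]
series = [a + b + c for a, b, c in zip(trend_true, seasonal_true, irregular)]

# A window equal to the seasonal period averages the seasonal part out
trend_est = centered_moving_average(series, period)

# Averaging the detrended values by position in the cycle gives the seasonal part
seasonal_est = []
for k in range(period):
    vals = [series[i] - trend_est[i]
            for i in range(n)
            if trend_est[i] is not None and i % period == k]
    seasonal_est.append(sum(vals) / len(vals))

print([round(s, 2) for s in seasonal_est])
```

Whatever is left after subtracting the estimated trend and seasonal parts is the irregular component.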

Time series analysis finds hidden patterns and helps obtain useful insights from time series data. It is useful in predicting future values or detecting anomalies in the data. Such analysis typically requires a large number of data points to ensure consistency and reliability. An extensive data set ensures that you have a representative sample size and that the analysis can cut through the noise. It helps organizations understand the underlying causes of trends or systemic patterns detected over time. Data visualization makes these insights easier to interpret: it can reveal seasonal trends and help dig deeper into why they occur.

The different types of models and analyses that can be created through time series analysis are:

  • Classification: Identify and assign categories to the data.
  • Curve fitting: Plot the data along a curve and study the relationships of variables present within the data.
  • Descriptive analysis: Identify patterns in time series data such as trends, cycles, or seasonal variation.
  • Explanative analysis: Understand the data and the relationships within it, the dependent features, and cause and effect.
  • Exploratory analysis: Describe and focus on the main characteristics of the time series data, usually in a visual format.
  • Forecasting: Predict future data based on historical trends, using the historical data as a model for future data and for scenarios that could happen at future plot points.
  • Intervention analysis: Study how an event can change the data.
  • Segmentation: Split the data into segments to discover the underlying properties of the source information.

As time series analysis includes many categories and data variations, analysts sometimes have to create complex models. However, analysts can’t account for all variances, and a model built for one sample can’t always be generalized to others. Models that are too complex or that try to do too many things can lead to a lack of fit. A poorly fitting or overfitted model fails to distinguish between random error and true relationships, leaving the analysis skewed and the forecasts incorrect.

What is ARIMA? 

A popular and very widely used statistical method for time series forecasting and analysis is the ARIMA model. ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a class of models that captures a range of standard temporal structures present in time series data. By implementing an ARIMA model, you can forecast and analyze a time series using its own past values, such as predicting future prices based on historical prices. Univariate models such as these are used to better understand a single time-dependent variable present in the data, such as temperature over time, and to predict its future data points. These models work on the assumption that the data is stationary; the differencing step, controlled by d, is what turns a non-stationary series into a stationary one. The standard notation for describing an ARIMA model is ARIMA(p,d,q), where each parameter is replaced with an integer value to quickly indicate the specific model being used.

The parameters of the ARIMA model are further described as follows:

  • p: The number of lag observations included in the model, also known as the lag order.
  • d: The number of times the raw observations are differenced, also called the degree of differencing.
  • q: The size of the moving average window, also called the order of the moving average.
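The role of d is the easiest of the three to see in isolation. The sketch below, using a hypothetical helper named difference and two toy series, shows how differencing once removes a linear trend and differencing twice removes a quadratic one. It only illustrates the idea; it is not how pmdarima or statsmodels implement it internally.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y[t] - y[t-1]."""
    for _ in range(d):
        series = [curr - prev for prev, curr in zip(series, series[1:])]
    return series

linear = [2 * t + 5 for t in range(6)]     # 5, 7, 9, 11, 13, 15
quadratic = [t * t for t in range(6)]      # 0, 1, 4, 9, 16, 25

print(difference(linear, d=1))     # [2, 2, 2, 2, 2]
print(difference(quadratic, d=1))  # [1, 3, 5, 7, 9]
print(difference(quadratic, d=2))  # [2, 2, 2, 2]
```

Each round of differencing shortens the series by one observation, which is one reason d is kept as small as possible.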

Getting Started 

This article will implement an ARIMA model step by step to create a time series forecasting analysis. We will be using the “pmdarima” library, a statistical library for Python that extends its time series analysis capabilities, together with statsmodels for fitting the final model. We will be analysing and predicting the future temperatures from the dataset used. You can download the dataset from here to get started. The following implementation is also partially inspired by a video tutorial for time series forecasting, which can be accessed from the link here.

Installing The Library

First, we will install the pmdarima library, which will help us select a suitable forecasting model. To install the library, the following command can be run in a notebook cell,

 #installing the library
 !pip install pmdarima 

Importing Required Dependencies

Next, we will import the required dependencies for our model; here we are using pandas and NumPy for analysis.

 #importing dependencies
 import pandas as pd
 import numpy as np 

Reading The Data

We will now start with reading the data and creating a data frame for it. The data has five columns, namely: MinTemp, MaxTemp, AvgTemp, Sunrise, Sunset. We will focus on the AvgTemp column and predict future temperatures from it, setting the DATE column as our index.

 #reading the data
 df=pd.read_csv('/content/MaunaLoaDailyTemps.csv',index_col='DATE',parse_dates=True)
 #dropping null values
 df=df.dropna()
 print('Shape of data',df.shape)
 df.head() 

Data Preprocessing and Initial Analysis

Plotting the AvgTemp column to get a first look at the data,

 #plotting the data
 df['AvgTemp'].plot(figsize=(12,5)) 

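Running a statistical test known as the augmented Dickey-Fuller (ADF) test to check whether the data is stationary. The null hypothesis of the test is that the series has a unit root, i.e. is non-stationary, so we will judge based on the p-value received from the test.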

 #applying dickey-fuller test
 from statsmodels.tsa.stattools import adfuller
 #creating a function for values 
 def adf_test(dataset):
   dftest = adfuller(dataset, autolag = 'AIC')
   print("1. ADF : ",dftest[0])
   print("2. P-Value : ", dftest[1])
   print("3. Num Of Lags : ", dftest[2])
   print("4. Num Of Observations Used For ADF Regression and Critical Values Calculation :", dftest[3])
   print("5. Critical Values :")
   for key, val in dftest[4].items():
       print("\t",key, ": ", val)
 #printing for AvgTemp
 adf_test(df['AvgTemp']) 

Output :

 1. ADF :  -6.554680125068777
 2. P-Value :  8.675937480199653e-09
 3. Num Of Lags :  12
 4. Num Of Observations Used For ADF Regression and Critical Values Calculation : 1808
 5. Critical Values :
 1% :  -3.433972018026501
 5% :  -2.8631399192826676
 10% :  -2.5676217442756872 

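With a p-value of about 8.7e-09, far below the usual 0.05 threshold, and an ADF statistic (-6.55) more negative than the 1% critical value (-3.43), we can reject the null hypothesis of a unit root and state that the data is stationary.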
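The decision rule can be written down explicitly. The sketch below uses a hypothetical helper, adf_verdict, that takes the numbers printed above and applies the two usual checks: the p-value against a significance level (0.05 is an assumed, conventional choice) and the test statistic against each critical value.

```python
def adf_verdict(adf_stat, p_value, critical_values, alpha=0.05):
    """Summarise an ADF result; the null hypothesis is a unit root (non-stationary)."""
    reject_by_p = p_value < alpha
    # The statistic must be MORE negative than a critical value to reject the null
    rejected_levels = [lvl for lvl, crit in critical_values.items() if adf_stat < crit]
    return {"stationary": reject_by_p, "rejected_at": rejected_levels}

# Values copied from the ADF output above
result = adf_verdict(
    adf_stat=-6.554680125068777,
    p_value=8.675937480199653e-09,
    critical_values={
        "1%": -3.433972018026501,
        "5%": -2.8631399192826676,
        "10%": -2.5676217442756872,
    },
)
print(result)  # {'stationary': True, 'rejected_at': ['1%', '5%', '10%']}
```

Had the p-value been above the threshold, the usual next step would be to difference the series and run the test again.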

Creating Our ARIMA Model

 #creating our ARIMA Model
 from pmdarima import auto_arima
 # Ignore harmless warnings
 import warnings
 warnings.filterwarnings("ignore")

Calling our model and generating best possible ARIMA combination,

 #calling our function
 stepwise_fit = auto_arima(df['AvgTemp'],suppress_warnings=True)           
 stepwise_fit.summary() 

Splitting the data into train and test sets, assigning the last 30 rows (30 days) as the testing data and the rest as the training data.

 #splitting into train and test
 print(df.shape)
 train=df.iloc[:-30]
 test=df.iloc[-30:]
 print(train.shape,test.shape)
 print(test.iloc[0],test.iloc[-1]) 

Output :

 (1821, 5)
 (1791, 5) (30, 5)
 MinTemp      36.0
 MaxTemp      52.0
 AvgTemp      44.0
 Sunrise     640.0
 Sunset     1743.0
 Name: 2018-12-01 00:00:00, dtype: float64 MinTemp      39.0
 MaxTemp      52.0
 AvgTemp      46.0
 Sunrise     656.0
 Sunset     1754.0
 Name: 2018-12-30 00:00:00, dtype: float64
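
Training the ARIMA model with order (1,0,5),

 #model Training
 #the older statsmodels.tsa.arima_model module has been removed from recent statsmodels releases
 from statsmodels.tsa.arima.model import ARIMA
 model=ARIMA(train['AvgTemp'],order=(1,0,5))
 model=model.fit()
 model.summary() 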

Making Predictions on Test Set

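Generating predictions for the test period and plotting them against the actual values,

 start=len(train)
 end=len(train)+len(test)-1
 #predictions are returned on the original scale, so no typ argument is needed
 pred=model.predict(start=start,end=end).rename('ARIMA predictions')
 #aligning predictions with the test dates so both series share one axis
 pred.index=test.index
 pred.plot(legend=True)
 test['AvgTemp'].plot(legend=True) 

Knowing the mean of the test values,

 #knowing the mean AvgTemp
 test['AvgTemp'].mean()

Output :

 45.0 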

Our model here seems to predict the trend well from the data. 

Calculating the root mean squared error (RMSE) to check how our model has performed. The RMSE is in the same units as the data, so it can be compared with the mean of the test values (45.0 here): the smaller it is relative to the mean, the better, and an RMSE close to the mean itself would indicate a bad model.

 from sklearn.metrics import mean_squared_error
 from math import sqrt
 rmse=sqrt(mean_squared_error(test['AvgTemp'],pred))
 print(rmse)

Output :

 3.000495429601031
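For readers who want to see what the metric does, here is a minimal sketch of RMSE written out by hand, followed by the ratio of the RMSE reported above to the test mean. The function name rmse and the four toy values are assumptions for illustration; the last two numbers are taken from the outputs in this article.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the data."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Toy numbers (hypothetical, not from the dataset)
actual = [44.0, 46.0, 45.0, 47.0]
predicted = [45.0, 45.0, 45.0, 45.0]
print(rmse(actual, predicted))

# Size of the article's RMSE relative to its test mean
relative_error = 3.000495429601031 / 45.0
print(round(relative_error, 3))  # 0.067
```

An error of roughly 7% of the mean is far from the "RMSE close to the mean" situation that would mark a bad model.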

Refitting the model on the full dataset and printing the last five values to see on what date the dataset ends,

 #refitting on the full data and checking data end date
 model2=ARIMA(df['AvgTemp'],order=(1,0,5))
 model2=model2.fit()
 df.tail() 

Making Future Predictions

Printing future predictions for the 31 days following the end date. The first forecast, at position len(df), corresponds to the day after the last observation (2018-12-30), so the date index starts at 2018-12-31,

 #printing predictions for the 31 days after the end of the data
 index_future_dates=pd.date_range(start='2018-12-31',end='2019-01-30')
 #print(index_future_dates)
 pred=model2.predict(start=len(df),end=len(df)+30).rename('ARIMA Predictions')
 #print(comp_pred)
 pred.index=index_future_dates
 print(pred) 

Output :

 2018-12-31    46.418064
 2019-01-01    46.113783
 2019-01-02    45.617772
 2019-01-03    45.249555
 2019-01-04    45.116984
 2019-01-05    45.136771
 2019-01-06    45.156280
 2019-01-07    45.175516
 2019-01-08    45.194482
 2019-01-09    45.213183
 2019-01-10    45.231622
 2019-01-11    45.249802
 2019-01-12    45.267728
 2019-01-13    45.285403
 2019-01-14    45.302830
 2019-01-15    45.320012
 2019-01-16    45.336955
 2019-01-17    45.353659
 2019-01-18    45.370130
 2019-01-19    45.386370
 2019-01-20    45.402383
 2019-01-21    45.418171
 2019-01-22    45.433738
 2019-01-23    45.449087
 2019-01-24    45.464221
 2019-01-25    45.479143
 2019-01-26    45.493855
 2019-01-27    45.508362
 2019-01-28    45.522665
 2019-01-29    45.536769
 2019-01-30    45.550674
 Freq: D, Name: ARIMA Predictions, dtype: float64 
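One detail that is easy to get wrong when labelling forecasts is where the dates begin: an out-of-sample prediction starting at position len(df) belongs to the day after the last observation, not to the last observation itself. The sketch below builds such an index with the standard library only; future_dates is a hypothetical helper, and the last observed date is taken from the test-set output shown earlier.

```python
from datetime import date, timedelta

def future_dates(last_observed, steps):
    """Return `steps` consecutive daily dates starting the day AFTER last_observed."""
    return [last_observed + timedelta(days=i) for i in range(1, steps + 1)]

last_observed = date(2018, 12, 30)   # final row of the dataset
dates = future_dates(last_observed, steps=31)

print(dates[0], dates[-1], len(dates))  # 2018-12-31 2019-01-30 31
```

With pandas, the equivalent would be pd.date_range starting one day after df.index[-1] with periods equal to the number of forecast steps.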

Plotting Graph for future predictions, 

EndNotes

This article has explored an ARIMA model and how time series analysis can be performed with it. We also discussed the different aspects of time series analysis and the necessary steps to create a complete time series model. You can implement the same on different datasets and see how the complexity varies. The colab notebook for the above implementation can be found here.

Happy Learning!

Victor Dey
Victor is an aspiring Data Scientist & is a Master of Science in Data Science & Big Data Analytics. He is a Researcher, a Data Science Influencer and also an Ex-University Football Player. A keen learner of new developments in Data Science and Artificial Intelligence, he is committed to growing the Data Science community.
