Time series data is a set of observations collected through repeated measurements over time. When the points are plotted on a graph, one of the axes is always time. Time series data is everywhere, since time is a dimension of everything observable. As technology advances, sensors and systems produce a constant stream of time series data, and data of this nature has applications across a wide variety of industries.
This contrasts with cross-sectional data, which observes individuals, companies, etc., at a single point in time. Time-series databases are popular and support a wide range of applications such as stock market analysis, economic and sales forecasting, and budget analysis, to name a few. They are also useful for studying natural phenomena such as atmospheric pressure, temperature, wind speed, and earthquakes, and for medical prediction in treatment. An observed time series can be decomposed into three main components: the trend (the long-term movement), the seasonal component (systematic, calendar-related movements), and the irregular component (unsystematic, short-term fluctuations). Time series analysis is used to draw insights from these components.
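As a rough illustration of the three components, the sketch below builds a synthetic series from a known linear trend plus a 7-step seasonal pattern, then recovers the trend with a centered moving average over one full seasonal period. The series, the period, and the helper name are all invented for this example; it is a simplified additive decomposition, not the method used later in this article.

```python
import math

PERIOD = 7  # assumed seasonal period for this toy series

def centered_moving_average(series, window):
    """Average each point with its neighbours; window must be odd."""
    half = window // 2
    return [
        sum(series[i - half:i + half + 1]) / window
        for i in range(half, len(series) - half)
    ]

# Synthetic series: linear trend + seasonal pattern (no noise, for clarity)
n = 70
trend = [10 + 0.5 * t for t in range(n)]
seasonal = [3 * math.sin(2 * math.pi * t / PERIOD) for t in range(n)]
series = [tr + s for tr, s in zip(trend, seasonal)]

# Averaging over one full period cancels the seasonal component
trend_est = centered_moving_average(series, PERIOD)
half = PERIOD // 2

# What remains after removing the trend is the seasonal (plus irregular) part
detrended = [series[i + half] - trend_est[i] for i in range(len(trend_est))]
```

With real, noisy data the leftover after removing trend and seasonality would be the irregular component; here it is zero by construction.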
Time series analysis finds hidden patterns and helps extract useful insights from time series data. It is useful for predicting future values and for detecting anomalies. Such analysis typically requires many data points to ensure consistency and reliability: an extensive dataset gives a representative sample and lets the analysis cut through noise. It also helps organizations understand the underlying causes of trends or systemic patterns observed over time. Data visualization makes these insights easier to interpret, showing seasonal trends and helping analysts dig deeper into why they occur.
The different types of models and analyses that can be created through time series analysis are:
- Classification: Identifies and assigns categories to the data.
- Curve fitting: Plots the data along a curve to study the relationships between the variables in the data.
- Descriptive analysis: Identifies patterns in time series data such as trends, cycles, or seasonal variation.
- Explanative analysis: Seeks to understand the data and the relationships within it, including dependent features and cause and effect.
- Exploratory analysis: Describes the main characteristics of the time series data, usually in a visual format.
- Forecasting: Predicts future data based on historical trends, using the historical data as a model for future data and for scenarios that could arise at future plot points.
- Intervention analysis: Studies how an event can change the data.
- Segmentation: Splitting the data into segments to discover the underlying properties from the source information.
Because time series analysis covers many categories and variations of data, analysts sometimes have to build complex models. However, analysts cannot account for every source of variance, and a model that suits one sample cannot always be generalized to another. Models that are too complex, or that try to do too many things, can fit poorly. A poorly fitting or overfitted model fails to distinguish random error from a true relationship, which leaves the analysis skewed and the forecasts incorrect.
What is ARIMA?
A popular and widely used statistical method for time series forecasting and analysis is the ARIMA model. ARIMA is an acronym for AutoRegressive Integrated Moving Average. It is a class of models that captures a range of standard temporal structures in time series data. With an ARIMA model, you can forecast and analyze a time series from its own past values, such as predicting future prices based on historical prices. Univariate models such as these are used to better understand a single time-dependent variable, such as temperature over time, and to predict its future data points. These models assume that the data is stationary, or can be made stationary through differencing. The standard notation for ARIMA uses the parameters p, d, and q, which are replaced with integer values to indicate quickly which specific ARIMA model is being used.
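For reference, one common textbook way of writing an ARIMA(p, d, q) model uses the lag operator $L$, where $L y_t = y_{t-1}$. The symbols below are not from the original article and are introduced only for this illustration: $y_t$ is the observed series, $\phi_i$ are the autoregressive coefficients, $\theta_j$ are the moving average coefficients, $c$ is a constant, and $\varepsilon_t$ is white noise.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i L^{i}\right) (1 - L)^{d} \, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j L^{j}\right) \varepsilon_t
```

The factor $(1 - L)^{d}$ differences the series $d$ times, the left-hand sum covers the $p$ lagged observations, and the right-hand sum covers the $q$ lagged error terms.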
The parameters of the ARIMA model are further described as follows:
- p: The number of lag observations included in the model, also known as the lag order.
- d: The number of times the raw observations are differenced, also called the degree of differencing.
- q: The size of the moving average window, also called the order of the moving average.
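To make p and d more concrete, here is a minimal sketch in plain Python. The helper names and the toy series are invented for illustration; the moving average part (q) works on past forecast errors rather than past values and is not shown.

```python
def difference(series, d=1):
    """Apply first-order differencing d times (the 'I' in ARIMA)."""
    out = list(series)
    for _ in range(d):
        out = [curr - prev for prev, curr in zip(out, out[1:])]
    return out

def lagged_rows(series, p):
    """Pair each value with its p previous values (the lag order p)."""
    return [(series[t - p:t], series[t]) for t in range(p, len(series))]

# Toy series with a quadratic shape, invented for illustration
y = [1, 4, 9, 16, 25, 36]

d1 = difference(y, d=1)  # [3, 5, 7, 9, 11], still trending
d2 = difference(y, d=2)  # [2, 2, 2, 2], constant after two differences

# With p=2, each target is predicted from the two values before it
rows = lagged_rows(d1, p=2)
```

Each differencing pass shortens the series by one observation, which is one reason d is kept as small as possible in practice.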
This article will implement an ARIMA model from scratch to create a Time Series Forecasting Analysis. We will be using the “pmdarima” library, a statistical library in Python that increases its time series analysis capabilities. We will be analysing and predicting the future temperatures from the dataset used. You can download the dataset from here to get started. The following implementation is also partially inspired by a video tutorial for time series forecasting, which can be accessed from the link here.
Installing The Library
First, we will install the pmdarima library, which will boost our analysis and help us create a defined forecasting model. To install the library, run the following in a notebook cell:

    # installing the library
    !pip install pmdarima

Importing Required Dependencies
Next, we will import the required dependencies for our model; here we are using pandas and NumPy for analysis.

    # importing dependencies
    import pandas as pd
    import numpy as np

Reading The Data
We will now read the data and create a data frame from it, setting DATE as the index. The data has five columns: MinTemp, MaxTemp, AvgTemp, Sunrise, and Sunset. We will focus on the AvgTemp column and predict future temperatures from it.

    # reading the data
    df = pd.read_csv('/content/MaunaLoaDailyTemps.csv', index_col='DATE', parse_dates=True)

    # dropping null values
    df = df.dropna()
    print('Shape of data', df.shape)
    df.head()

Data Preprocessing and Initial Analysis
Plotting the data to see what it looks like:

    # plotting the data
    df['AvgTemp'].plot(figsize=(12,5))

Next, we run a statistical test known as the augmented Dickey-Fuller (ADF) test to check whether the data is stationary. We will judge based on the p-value returned by the test.

    # applying the augmented Dickey-Fuller test
    from statsmodels.tsa.stattools import adfuller

    # helper that prints each element of the test result
    def adf_test(dataset):
        dftest = adfuller(dataset, autolag='AIC')
        print("1. ADF : ", dftest[0])
        print("2. P-Value : ", dftest[1])
        print("3. Num Of Lags : ", dftest[2])
        print("4. Num Of Observations Used For ADF Regression and Critical Values Calculation :", dftest[3])
        print("5. Critical Values :")
        for key, val in dftest[4].items():
            print("\t", key, ": ", val)

    # printing for AvgTemp
    adf_test(df['AvgTemp'])

    1. ADF :  -6.554680125068777
    2. P-Value :  8.675937480199653e-09
    3. Num Of Lags :  12
    4. Num Of Observations Used For ADF Regression and Critical Values Calculation : 1808
    5. Critical Values :
         1% :  -3.433972018026501
         5% :  -2.8631399192826676
         10% :  -2.5676217442756872

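The observed p-value (about 8.7e-09) is far below 0.05, and the ADF statistic is more negative than every critical value, so we reject the null hypothesis of a unit root and can state that the data is stationary.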
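The decision rule can be written out as a small helper. This is only a sketch: the function name and the default 5% level are assumptions made for illustration, and the numbers are copied from the test output above.

```python
def adf_is_stationary(adf_stat, p_value, critical_values, level="5%"):
    """Reject the unit-root null when the p-value is below the chosen level
    and the statistic is more negative than the matching critical value."""
    alpha = float(level.rstrip("%")) / 100
    return p_value < alpha and adf_stat < critical_values[level]

# Values copied from the ADF output above
critical = {
    "1%": -3.433972018026501,
    "5%": -2.8631399192826676,
    "10%": -2.5676217442756872,
}
result = adf_is_stationary(-6.554680125068777, 8.675937480199653e-09, critical)
print(result)  # True
```

A series that fails this check would usually be differenced (d greater than 0) before an ARMA structure is fitted.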
Creating Our ARIMA Model

    # creating our ARIMA model
    from pmdarima import auto_arima

    # ignore harmless warnings
    import warnings
    warnings.filterwarnings("ignore")

Calling auto_arima to generate the best possible ARIMA combination:

    # calling our function
    stepwise_fit = auto_arima(df['AvgTemp'], suppress_warnings=True)
    stepwise_fit.summary()

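Automatic order selection of this kind generally works by fitting several candidate orders and ranking them by an information criterion such as AIC. The sketch below shows only the ranking step, using hypothetical log-likelihood values invented for illustration (they are not results from this dataset); the parameter count is a simplifying assumption.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def n_params(order):
    """Rough parameter count: AR and MA terms, plus intercept and noise variance."""
    p, d, q = order
    return p + q + 2

# Hypothetical (order -> log-likelihood) pairs, invented for illustration
candidates = {
    (1, 0, 0): -4200.0,
    (1, 0, 5): -4140.0,
    (3, 0, 5): -4139.5,
}

scores = {order: aic(ll, n_params(order)) for order, ll in candidates.items()}
best_order = min(scores, key=scores.get)
print(best_order)  # (1, 0, 5)
```

In this toy example the (3, 0, 5) candidate has a slightly better likelihood, but the penalty for its extra parameters makes (1, 0, 5) the preferred order.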
    # current statsmodels releases expose ARIMA in statsmodels.tsa.arima.model
    from statsmodels.tsa.arima.model import ARIMA

Splitting the data into train and test sets, holding out the last 30 rows (30 days) for testing and using the rest for training:

    # splitting into train and test
    print(df.shape)
    train = df.iloc[:-30]
    test = df.iloc[-30:]
    print(train.shape, test.shape)

    # first and last rows of the test set
    print(test.iloc[0], test.iloc[-1])

    (1821, 5)
    (1791, 5) (30, 5)
    MinTemp      36.0
    MaxTemp      52.0
    AvgTemp      44.0
    Sunrise     640.0
    Sunset     1743.0
    Name: 2018-12-01 00:00:00, dtype: float64
    MinTemp      39.0
    MaxTemp      52.0
    AvgTemp      46.0
    Sunrise     656.0
    Sunset     1754.0
    Name: 2018-12-30 00:00:00, dtype: float64

Training the ARIMA model:

    # model training
    # current statsmodels releases expose ARIMA in statsmodels.tsa.arima.model
    from statsmodels.tsa.arima.model import ARIMA
    model = ARIMA(train['AvgTemp'], order=(1,0,5))
    model = model.fit()
    model.summary()

Making Predictions on Test Set
Plotting the predictions against the actual test values:

    # positions of the first and last test rows
    start = len(train)
    end = len(train) + len(test) - 1

    # d=0 in this model, so predictions are already on the original scale
    pred = model.predict(start=start, end=end).rename('ARIMA predictions')
    pred.plot(legend=True)
    test['AvgTemp'].plot(legend=True)

    # mean AvgTemp over the test period
    test['AvgTemp'].mean()

    45.0

Our model here seems to predict the trend well from the data.
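Calculating the root mean squared error (RMSE) to check how our model has performed. The RMSE is in the same units as the data, so it can be compared with the mean of the test set: an RMSE close to that mean would indicate a bad model, while an RMSE that is small relative to the mean indicates a reasonable fit.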
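As a quick sketch of what the metric measures, RMSE can be computed by hand in a few lines. The temperatures below are toy values invented for illustration, and the helper name is hypothetical.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the data."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Toy temperatures, invented for illustration
actual = [44.0, 46.0, 45.0, 47.0]
predicted = [45.0, 45.0, 47.0, 45.0]

error = rmse(actual, predicted)

# Expressing the error as a fraction of the mean makes it easier to judge
relative_error = error / (sum(actual) / len(actual))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.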
    from sklearn.metrics import mean_squared_error
    from math import sqrt

    # RMSE between the held-out values and the predictions
    rmse = sqrt(mean_squared_error(test['AvgTemp'], pred))
    print(rmse)

    3.000495429601031

Refitting the model on the full dataset and printing the last five rows to see on what date the dataset ends:

    # refit on all data, then check the data end date
    model2 = ARIMA(df['AvgTemp'], order=(1,0,5))
    model2 = model2.fit()
    df.tail()

Making Future Predictions
Printing future predictions for the next 30 days from the end date:

    # date labels for the forecasts: 2018-12-30 through 2019-01-29, inclusive (31 dates)
    index_future_dates = pd.date_range(start='2018-12-30', end='2019-01-29')

    # forecast positions len(df) through len(df)+30 (31 values, matching the labels)
    pred = model2.predict(start=len(df), end=len(df)+30).rename('ARIMA Predictions')
    pred.index = index_future_dates
    print(pred)

    2018-12-30    46.418064
    2018-12-31    46.113783
    2019-01-01    45.617772
    2019-01-02    45.249555
    2019-01-03    45.116984
    2019-01-04    45.136771
    2019-01-05    45.156280
    2019-01-06    45.175516
    2019-01-07    45.194482
    2019-01-08    45.213183
    2019-01-09    45.231622
    2019-01-10    45.249802
    2019-01-11    45.267728
    2019-01-12    45.285403
    2019-01-13    45.302830
    2019-01-14    45.320012
    2019-01-15    45.336955
    2019-01-16    45.353659
    2019-01-17    45.370130
    2019-01-18    45.386370
    2019-01-19    45.402383
    2019-01-20    45.418171
    2019-01-21    45.433738
    2019-01-22    45.449087
    2019-01-23    45.464221
    2019-01-24    45.479143
    2019-01-25    45.493855
    2019-01-26    45.508362
    2019-01-27    45.522665
    2019-01-28    45.536769
    2019-01-29    45.550674
    Freq: D, Name: ARIMA Predictions, dtype: float64

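One detail worth checking when labelling forecasts is how many dates an inclusive range contains. The sketch below uses only the standard library, with a hypothetical helper name, to count the dates between the two endpoints used for the forecast index.

```python
from datetime import date, timedelta

def daily_range(start, end):
    """All calendar dates from start to end, both inclusive."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]

labels = daily_range(date(2018, 12, 30), date(2019, 1, 29))
print(len(labels))            # 31
print(labels[0], labels[-1])  # 2018-12-30 2019-01-29
```

The range holds 31 dates, which matches the 31 forecast rows printed above. Since the last observation in the dataset is also dated 2018-12-30, starting the labels one day later (2018-12-31) may be the more natural choice for strictly out-of-sample forecasts; that is a suggestion, not something the original notebook does.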
Plotting Graph for future predictions,
This article explored the ARIMA model and how time series analysis can be carried out with it. We also discussed the different aspects of time series analysis and the steps needed to create a complete time series model. You can apply the same approach to different datasets and see how the complexity varies. The colab notebook for the above implementation can be found here.