Time series forecasting is the task of predicting future values from a chronologically ordered, time-indexed set of data. Weather forecasting, sunspot activity forecasting, and stock market forecasting are a few examples. However, in many contexts, such as the stock market, the relationship underlying the data may change over time, rendering a trained forecasting model ineffective. This is referred to as concept drift.
Concept drift is one of the major aspects of machine learning maintenance. Without a proper understanding of this issue, its effects can be disastrous in cases such as financial market forecasting. In this article, we discuss the drift problem faced by many forecasting models. The major points to be discussed in this article are listed below.
Table of Contents
- How Does the Data Change?
- What is Concept Drift?
- Drift in Time Series Forecasting
- Addressing the Drift
Now let us begin with discussions.
How Does the Data Change?
We train a model using previous data and then use that trained model to make predictions on new or unseen data where we don't know the answer. Producing such an answer is the process of approximating a mapping function F from the input variables to an output value Y. Generally, we tend to presume that the correlations between input and output data are static, i.e., that the learned relationship holds equally well whether we apply it to historical data or to new data in the future. In reality, certain input-output linkages can vary over time, changing the unknown mapping function underneath.
For example, predictions made by a model trained on older historical data may no longer be accurate for the present scenario, or at least not as accurate as they would be if the model were trained on more recent data. In most cases these changes are detectable, allowing the learned model to be updated to reflect them.
Many data mining techniques assume that the patterns found are static. In practice, however, patterns in the data change with time. This presents two significant difficulties. The first is determining when concept drift happens. The second is maintaining the current patterns without having to recreate them from scratch. Since these changes are statistical in nature, monitoring the statistical features of the data, the model's predictions, and their interaction with other parameters is the best way to discover them. In this situation, we could set up dashboards that plot the statistical attributes over time to watch how they change.
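As a minimal sketch of such statistical monitoring, the snippet below compares the mean of a recent window of data against a reference window and flags drift when the shift exceeds a few reference standard deviations. The window contents and the threshold of 3.0 are illustrative assumptions, not fixed rules.

```python
from statistics import mean, stdev

def detect_shift(reference, current, threshold=3.0):
    """Flag drift when the current window's mean moves more than
    `threshold` reference standard deviations away."""
    ref_mean, ref_std = mean(reference), stdev(reference)
    shift = abs(mean(current) - ref_mean) / (ref_std or 1e-9)
    return shift > threshold

# A stable stream vs. one whose level has shifted
reference = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
stable    = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1]
shifted   = [14.8, 15.2, 15.0, 14.9, 15.1, 15.3]

print(detect_shift(reference, stable))   # no drift expected
print(detect_shift(reference, shifted))  # drift expected
```

In a real dashboard, the same comparison would be plotted continuously over sliding windows rather than computed once.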
What is Concept Drift?
As data streams are highly unpredictable, concept drift is the most unwelcome yet most common feature of streaming data. Mining techniques such as classification or clustering degrade under concept drift since the likelihood of misclassification increases. Such drifts in the data must be identified in order to obtain efficient, accurate results.
In data modelling and data mining, concept drift refers to the evolution over time of the relationship between input and output data in the underlying problem. In other words, the unknown and hidden relationship between input and output variables is the "concept" in "concept drift."
The quantity to be predicted is referred to as the concept. The term can also refer to phenomena of interest other than the target, such as an input, although in the context of concept drift it most usually refers to the target variable.
One notion in weather data, for example, could be the season, which is not explicitly indicated in temperature data but definitely influences temperature data analysis throughout. Another example could be customer purchasing behaviour over time that is influenced by the state of the economy, but the state of the economy is not directly stated in the data.
For preliminary analysis, three types of drift are commonly distinguished, and each type requires a different method of handling. These types are described below.
Sudden Drift
This is where the concept drift occurs abruptly as a result of unforeseeable events such as the COVID-19 outbreak, which impacted a variety of industries including eCommerce, health care, finance, insurance, and many others. Such a drastic shift can occur in as little as a few weeks, and is usually caused by some external incident. If there is no active monitoring method to detect data drift, it is natural to do a quick assessment for the presence of concept drift following a large event.
Recurring Drift
This type of drift occurs on a regular basis, perhaps at a certain time of the year. For example, customers' shopping habits during Indian festivals such as Diwali, Onam, Sankranti, and Eid differ from other times of the year. During the festival, a separate model, specifically trained on festival data, is employed. Detecting recurrent patterns can be difficult, since the periodicity of a pattern might itself change over time.
Gradual Drift
In many circumstances, this type of drift occurs over a lengthy period of time and is quite normal. For example, inflation can have a major impact on a pricing model over time. Time series models address gradual or incremental changes by accounting for changes in seasonality; if this is not done, it poses a serious problem that needs to be addressed.
In general, the data could change in any of these ways. It is easiest to reason about the case where the shift is temporally consistent, meaning that data taken over a given time period shows the same relationship and that this relationship evolves gradually through time.
Drift in Time Series Forecasting
A time series is a set of observations that have been recorded consecutively. Observations in this type of dataset are arranged chronologically and generally show serial correlation. Many real-world processes can be modelled as time series, such as corporate payroll, stock market movements, exchange rates, city temperatures, and electroencephalograms, to name a few examples.
In most real-world time series applications, however, data is delivered as a stream. In this streaming environment, data may flow at a rapid rate and evolve over time, rendering offline modelling and forecasting methods ineffective and inappropriate, since machine learning models are highly sensitive to changes in the data.
For example, changes in political and economic circumstances, as well as changes in investor psychology, can cause stock price time series to fluctuate. Once such a change point occurs, predictive algorithms built on previous data become outdated for forecasting future behaviour.
Addressing the Drift
There are many methods available to address drift. Ideally, a concept drift handling system should adapt to changes in the data as quickly as possible, be robust to ordinary noise in the data, detect concept drift precisely, and, most importantly, recognize and treat significant drift in model performance.
Some of the popular methods to address drift are discussed as follows.
Re-train the model
Re-train the model on a regular basis, or when triggered by certain events, such as when model performance falls below a given threshold or when the average confidence score between two windows of data shows significant drift.
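A threshold-triggered retrain can be sketched as below; the `retrain` callback and the 0.75 threshold are hypothetical placeholders for whatever retraining pipeline and performance bar an application actually uses.

```python
def maybe_retrain(recent_scores, retrain, threshold=0.75):
    """Call retrain() when mean recent performance drops below threshold."""
    avg = sum(recent_scores) / len(recent_scores)
    if avg < threshold:
        retrain()
        return True
    return False

retrained = []
maybe_retrain([0.90, 0.88, 0.91], retrain=lambda: retrained.append("model"))
print(retrained)  # still empty: performance is fine
maybe_retrain([0.60, 0.62, 0.58], retrain=lambda: retrained.append("model"))
print(retrained)  # the retrain callback has fired once
```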
Weight the input data
We can weigh input data by importance. In this scenario, we can employ a weighting that is inversely proportional to the data's age, so that the most recent data gets more attention (higher weight) and the oldest data gets less attention (lower weight).
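One common way to realize age-based weighting is exponential decay, sketched below; the decay factor of 0.9 is an arbitrary illustrative choice, and the resulting weights could be passed to any learner that accepts per-sample weights.

```python
def age_weights(n_samples, decay=0.9):
    """Exponentially decaying, normalized weights; index 0 is the oldest
    sample, so the newest sample receives the largest weight."""
    weights = [decay ** (n_samples - 1 - i) for i in range(n_samples)]
    total = sum(weights)
    return [w / total for w in weights]

w = age_weights(5)
print(w)       # weights increase from oldest to newest
print(sum(w))  # weights are normalized
```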
Pre-process the data
The data in time series problems is likely to change over time. In these situations, it is usual to apply differencing to remove systematic changes in the data over time, such as trend and seasonality. This is so prevalent that it is built into classical linear approaches like the ARIMA model.
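Differencing itself is simple: subtract the observation `lag` steps back. A first difference removes a linear trend, and a seasonal lag removes a repeating cycle, as this small sketch shows.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for all valid t."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

trend = [10, 12, 14, 16, 18, 20]     # linear upward trend
print(difference(trend))             # constant after first differencing

seasonal = [5, 9, 5, 9, 5, 9]        # cycle of length 2
print(difference(seasonal, lag=2))   # seasonality removed
```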
Online learning
Online learning is a type of learning in which the learner is updated as each sample is processed. Many modern applications depend on streaming data, and online learning is one of the most effective techniques for mitigating concept drift.
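A minimal pure-Python sketch of an online learner: a one-feature linear model updated with one stochastic gradient step per observation, so it can track a drifting relationship. The learning rate and the number of passes are illustrative assumptions.

```python
class OnlineLinear:
    def __init__(self, lr=0.05):
        self.w, self.b, self.lr = 0.0, 0.0, lr

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y):
        # One stochastic gradient step on the squared error.
        err = self.predict(x) - y
        self.w -= self.lr * err * x
        self.b -= self.lr * err

model = OnlineLinear()
for _ in range(500):
    for x in (1, 2, 3):
        model.update(x, 2 * x)    # initial concept: y = 2x
print(model.predict(2))           # close to 4
for _ in range(500):
    for x in (1, 2, 3):
        model.update(x, -2 * x)   # the concept drifts to y = -2x
print(model.predict(2))           # the model adapts, close to -4
```

Because the model never stops updating, it simply follows the new concept instead of having to be retrained from scratch.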
Feature dropping
Feature dropping is one of the quickest and most effective techniques to deal with concept drift. Several models can be built using different features while keeping the target variable the same. The AUC-ROC of each model's predictions on test data is then tracked, and if it drops below a given threshold (say 0.7), the corresponding feature may be considered to be drifting and dropped from consideration.
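As a simplified sketch of this idea, the snippet below screens candidate features one at a time: each feature's values are scored against the target with a tiny rank-based AUC helper, and features scoring below the 0.7 threshold are dropped. The feature names and data are hypothetical.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 1, 1, 1]
features = {
    "still_predictive": [0.1, 0.3, 0.2, 0.8, 0.9, 0.7],  # separates classes
    "drifted":          [0.9, 0.1, 0.8, 0.3, 0.2, 0.7],  # no longer does
}
kept = [name for name, vals in features.items() if auc(vals, labels) >= 0.7]
print(kept)  # only the still-predictive feature survives
```

In practice the AUC would come from full models trained on feature subsets, but the keep-or-drop logic is the same.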
As discussed in the introduction of this post, failing to recognize or account for circumstances like concept drift while building a model can lead to significant losses in various forecasting applications. Through this post, we have learnt how to identify drift and what its various types are. Lastly, we went through some common practices to deal with it.