Imagine this scenario: you have tested your machine learning model thoroughly, and you get absolutely perfect accuracy. Happy with a job well done, you decide to deploy your project. However, when real data is fed to the model, you get poor results. So, why did this happen?
A possible reason is data leakage, one of the most common errors in machine learning. Data leakage happens when the data used to train a machine learning algorithm contains information about the very thing the model is trying to predict, information that would not be available at prediction time. The model then looks accurate during evaluation but produces unreliable predictions in production.
Whys & Hows of Data Leakage
To evaluate a machine learning model properly, the available data is split into training and test subsets. It can happen that some of the information from the test subset is shared with the training subset, or the other way round. A model built under these conditions will give good results on the test subset, which causes us to overestimate its performance. A very simple example of data leakage is a model that uses the response variable itself as a predictor, and hence reaches conclusions such as “dog belongs to the family of dogs.”
A particular case of data leakage in time series is worth considering. In addition to the problems described above, with time series there is a risk of leaking information from the future into the past. This generally happens when the data is split randomly into training and test subsets instead of chronologically.
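One way to avoid leaking the future into the past is to split by time instead of at random. The following is a minimal sketch using only the Python standard library; the function name `chronological_split`, the `timestamp` and `value` keys, and the sample records are all hypothetical and chosen purely for illustration.

```python
from datetime import date, timedelta


def chronological_split(records, test_fraction=0.2):
    """Split time-stamped records so that the test rows are the most recent ones."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    # Sort by time first, so the split point separates past from future.
    ordered = sorted(records, key=lambda r: r["timestamp"])
    n_test = max(1, round(len(ordered) * test_fraction))
    cutoff = len(ordered) - n_test
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical daily observations, for illustration only.
start = date(2023, 1, 1)
records = [{"timestamp": start + timedelta(days=i), "value": i * 1.5} for i in range(10)]

train, test = chronological_split(records, test_fraction=0.2)
latest_train = max(r["timestamp"] for r in train)
earliest_test = min(r["timestamp"] for r in test)

# True when no training row is later than any test row.
print(latest_train < earliest_test)
```

Note that if several records share the timestamp at the cut-off, the two subsets can still touch in time; a real pipeline would need to decide how to handle such ties.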
There is more than one way in which data leakage manifests itself; we list some of them below:
- Leakage of data from the test set to the training set
- Reversal of obfuscation, randomisation or anonymisation that was intentionally applied to the data
- Inclusion of information from data samples outside the scope of the algorithm’s intended use
- Inclusion of data not present in the model’s operational environment
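The first item in the list, test rows ending up in the training set, can be checked mechanically. The sketch below looks for exact duplicate rows shared by the two subsets; the helper `shared_rows` and the sample rows are hypothetical, and the check will not catch near-duplicates or leakage through derived features.

```python
def shared_rows(train_rows, test_rows):
    """Return the rows that appear, as exact duplicates, in both subsets."""
    train_set = {tuple(row) for row in train_rows}
    test_set = {tuple(row) for row in test_rows}
    return sorted(train_set & test_set)


# Hypothetical feature rows, for illustration only.
train_rows = [(5.1, 3.5, "a"), (4.9, 3.0, "b"), (6.2, 2.9, "c")]
test_rows = [(6.2, 2.9, "c"), (5.9, 3.0, "d")]

overlap = shared_rows(train_rows, test_rows)
print(f"{len(overlap)} test row(s) also appear in the training set: {overlap}")
```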
Unfortunately, data leakage is often subtle and difficult to detect. The following strategies can help to find it:
- It helps to be sceptical. If the performance of the algorithm seems too good to be true, data leakage cannot be ruled out. It is advisable to document the expected results before testing, so that the actual results can be weighed against them.
- One of the best, and also one of the most expensive, methods is early in-the-field testing of algorithms. A massive difference between estimated and realised out-of-sample performance points towards data leakage. The method is not foolproof, however, as the cause could also be classical overfitting or sampling bias.
- Another powerful tool is exploratory data analysis (EDA), which examines the raw data using statistical and visualisation tools.
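As one simple, illustrative EDA check, each feature can be correlated with the target, and any feature that tracks the target almost perfectly can be flagged for a closer look. The sketch below is an assumption-laden example: the function names, the feature names, the sample data, and the 0.95 threshold are all hypothetical, and a high correlation is only a prompt for investigation, not proof of leakage.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def suspicious_features(features, target, threshold=0.95):
    """Names of features whose absolute correlation with the target meets the threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Hypothetical data: "refund_issued" is only recorded after the outcome is known,
# so it mirrors the target exactly.
target = [0, 1, 0, 1, 1, 0, 1, 0]
features = {
    "account_age_months": [12, 3, 40, 7, 30, 22, 5, 18],
    "refund_issued": [0, 1, 0, 1, 1, 0, 1, 0],
}

flagged = suspicious_features(features, target)
print(flagged)
```

Linear correlation is a deliberately crude choice here; a leaky feature can also be related to the target in a non-linear way that this check would miss.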
How to Prevent Data Leakage
The following steps can be crucial in preventing data leakage:
- Extracting the right features for a machine learning model is important. Make sure that no feature is suspiciously strongly correlated with the output value (a sign that it may be a proxy for the output), and that no feature holds information about the output that would not naturally be available at the time of prediction.
- It is important to split the dataset cleanly into training, validation, and test sets. Doing so helps to identify possible overfitting, which in turn acts as a warning against deploying models that are likely to underperform in production.
- It is common practice to normalise the input data before feeding it into the model, especially in the case of neural networks. A typical approach is to subtract the mean and divide by the standard deviation. More often than not, this normalisation is computed over the whole data set, so that information from the test set influences the training set, which results in data leakage. Hence, the normalisation statistics should be computed from the training subset only and then applied unchanged to the validation and test subsets.
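The normalisation point can be made concrete with a short sketch in which the scaling statistics are learned from the training subset alone and then reused for the test subset. The function names `fit_scaler` and `transform` and the sample values are hypothetical; libraries such as scikit-learn express the same idea by fitting a scaler on the training data and using it to transform the test data.

```python
import statistics


def fit_scaler(train_values):
    """Learn the mean and standard deviation from the training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        std = 1.0  # avoid dividing by zero for a constant column
    return mean, std


def transform(values, mean, std):
    """Standardise values using statistics that were learned elsewhere."""
    return [(v - mean) / std for v in values]


# Hypothetical values, for illustration only.
train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [10.0, 12.0]

mean, std = fit_scaler(train_values)                # the test set is never seen here
train_scaled = transform(train_values, mean, std)
test_scaled = transform(test_values, mean, std)     # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

Because the test values are scaled with the training mean and standard deviation, they are free to fall outside the range seen in training, which is what would also happen with genuinely new data in production.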
In many data science applications, data leakage can be very costly, with losses to the organisation that may run into millions of dollars. Detecting and correcting data leakage can be extremely difficult, and it requires additional investment in infrastructure and data engineering. It is hence imperative to exercise caution, apply common sense, and explore the data in order to identify leaking predictors beforehand.