What is Data Leakage in ML & Why Should You Be Concerned


Imagine this scenario: you have tested your machine learning model thoroughly, and it achieves near-perfect accuracy. Happy with a job well done, you decide to deploy it. However, when the model is applied to real-world data, the results are poor. So, why did this happen?

A likely cause is data leakage, one of the most common errors in machine learning. Data leakage happens when the data used to train a machine learning algorithm contains information about the target the model is trying to predict that would not be available at prediction time. The result is an over-optimistic evaluation and unreliable predictions in production.

Whys & Hows of Data Leakage

To evaluate a machine learning model properly, the available data is split into training and test subsets. Leakage occurs when some of the information from the test subset ends up in the training subset, or vice versa. A model trained on such data will score well on the test subset, which causes us to overestimate its performance. A very simple example of data leakage is a model that uses the response variable itself as a predictor, producing conclusions such as “a dog belongs to the family of dogs.”


The case of time series deserves particular attention. In addition to the problems described above, time series carry the risk of leaking information from the future into the past. This generally happens when the data is split randomly into training and test subsets, so that the model is trained on observations that come later than some of the observations it is tested on.
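
One way to avoid leaking the future into the past is to split by time rather than at random. The following is a minimal sketch, not a definitive implementation: the function name `chronological_split`, the `timestamp` and `value` keys, and the ten-day series are all hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta


def chronological_split(rows, test_fraction=0.2):
    """Split rows by time so that no test row is earlier than any training row."""
    ordered = sorted(rows, key=lambda row: row["timestamp"])
    n_test = int(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Hypothetical daily series with ten observations.
start = date(2021, 1, 1)
series = [
    {"timestamp": start + timedelta(days=i), "value": float(i)}
    for i in range(10)
]

train, test = chronological_split(series, test_fraction=0.2)

latest_train = max(row["timestamp"] for row in train)
earliest_test = min(row["timestamp"] for row in test)
```

Here `train` holds the first eight days and `test` the last two, so `latest_train` is earlier than `earliest_test`. A random shuffle before splitting would not guarantee this ordering.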

There is more than one way in which data leakage manifests itself; we list some of them below:

  • Leakage of data from the test set to the training set
  • Reversal of intentional obfuscation, randomisation or anonymisation of the data
  • Inclusion of information from data samples outside the scope of the algorithm’s intended use
  • Inclusion of data not present in the model’s operational environment
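
The first item in the list, leakage from the test set to the training set, can sometimes be spotted with a simple overlap check. The sketch below assumes each row is a tuple of hashable values and that identical rows indicate the same sample; the helper `overlapping_rows` and the toy data are hypothetical.

```python
def overlapping_rows(train_rows, test_rows):
    """Return the test rows that also appear verbatim in the training rows."""
    train_keys = {tuple(row) for row in train_rows}
    return [row for row in test_rows if tuple(row) in train_keys]


# Hypothetical toy data: the first test row duplicates a training row.
train_rows = [(5.1, 3.5, "a"), (4.9, 3.0, "a"), (6.2, 2.9, "b")]
test_rows = [(4.9, 3.0, "a"), (5.9, 3.0, "b")]

leaked = overlapping_rows(train_rows, test_rows)
```

An empty result does not prove the absence of leakage, since near-duplicates and derived features are not caught by an exact match.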

Unfortunately, data leakage is often subtle and difficult to detect. The following strategies can help to find it:

  • It helps to be sceptical. If the performance of the algorithm seems too good to be true, data leakage cannot be ruled out. Before testing, it is advisable to document the expected results so that the observed results can be weighed against them.
  • One of the best, though also one of the most expensive, methods is early in-the-field testing of algorithms. A large gap between estimated and realised out-of-sample performance points to possible data leakage. The method is not foolproof, as the gap could also be caused by classical overfitting or sampling bias.
  • Another powerful tool is exploratory data analysis (EDA), which examines the raw data with statistical and visualisation tools.
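
As one possible EDA check, the correlation of each feature with the target can be computed and any feature that looks too good to be true flagged for review. This is a rough sketch under stated assumptions: the threshold of 0.95, the helper names `pearson` and `suspicious_features`, and the columns `age` and `refund_issued` are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def suspicious_features(columns, target, threshold=0.95):
    """Map feature name to correlation for features at or above the threshold."""
    flagged = {}
    for name, values in columns.items():
        r = pearson(values, target)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged


# Hypothetical data: "refund_issued" is recorded after the outcome is known.
target = [0, 1, 0, 1, 1, 0]
columns = {
    "age": [23, 35, 31, 44, 29, 52],
    "refund_issued": [0, 1, 0, 1, 1, 0],
}

flagged = suspicious_features(columns, target)
```

A flagged feature is only a candidate for leakage. Whether it really leaks depends on whether its value would be known at prediction time, which is a question about the data collection process rather than the statistics.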

How to Prevent Data Leakage

The following steps can be crucial in preventing data leakage:

  • Extracting the right features for a machine learning model is important. Make sure that the chosen features are not suspiciously highly correlated with the output value, and that they do not hold information about the output that would not naturally be available at the time of prediction.
  • It is important to clearly demarcate and split the dataset into training, validation, and test sets. Doing so helps to identify overfitting, which in turn acts as a warning against deploying models that are likely to underperform in production.
  • It is common practice to normalise the input data before feeding it into the model, especially in the case of neural networks. A typical normalisation subtracts the mean and divides by the standard deviation. More often than not, these statistics are computed on the overall data set, so information from the test set influences the training set, which results in data leakage. Hence, the normalisation statistics should be computed on the training subset only and then applied unchanged to the validation and test subsets.
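
The splitting and normalisation steps above can be sketched as follows. This is an illustrative example using only the Python standard library, assuming non-time-series data where a random split is acceptable; the helper names `three_way_split`, `fit_scaler` and `transform`, the split fractions, and the toy values are hypothetical.

```python
import random
from statistics import mean, pstdev


def three_way_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and cut it into train, validation and test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def fit_scaler(values):
    """Compute the normalisation statistics from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against a constant column
    return mu, sigma


def transform(values, scaler):
    """Apply previously fitted statistics without recomputing them."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


# Hypothetical single feature with twenty observations.
values = [float(v) for v in range(1, 21)]
train, val, test = three_way_split(values)

# Fit on the training subset only, then reuse the same statistics everywhere.
scaler = fit_scaler(train)
train_scaled = transform(train, scaler)
val_scaled = transform(val, scaler)
test_scaled = transform(test, scaler)
```

The key design choice is that `fit_scaler` never sees the validation or test values. Calling `fit_scaler(values)` on the full data set before splitting would reproduce the leakage described in the last bullet.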

Wrapping Up

In many data science applications, data leakage can cause multi-million dollar losses to the organisation. Detecting and correcting data leakage can be extremely difficult, and it requires additional investment in infrastructure and data engineering. It is hence imperative to practise caution, common sense, and data exploration to identify leaking predictors beforehand.

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at shraddha.goled@analyticsindiamag.com.
