
What is Data Leakage in ML & Why Should You Be Concerned


Imagine this scenario: you have tested your machine learning model thoroughly, and you get absolutely perfect accuracy. Happy with a job well done, you decide to deploy your project. However, when real-world data is fed to this model, you get poor results. So, why did this happen?

The likely culprit is data leakage, one of the most common errors in machine learning. Data leakage happens when the data used to train a machine learning algorithm already contains information about the quantity the model is trying to predict, information that will not be available at prediction time. This results in unreliable and misleading prediction outcomes.

Whys & Hows of Data Leakage

To properly evaluate a machine learning model, the available data is split into training and test subsets. Leakage arises when some of the information from the test subset inadvertently finds its way into the training subset, or vice versa. Any model subsequently built on such data will score well on the test subset, causing us to overestimate its real-world performance. A trivial example of data leakage is a model that uses the response variable itself as a predictor, yielding circular conclusions such as "a dog belongs to the family of dogs."
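To make this concrete, here is a minimal, hypothetical sketch using scikit-learn and synthetic data (all names and values are made up for illustration): a column derived directly from the label is mistakenly included as a feature, and test accuracy becomes deceptively perfect.

```python
# A minimal sketch of target leakage, using synthetic data and scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                     # genuine features
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 0).astype(int)   # noisy target

# Leaky column: a direct transformation of the label itself.
X_leaky = np.column_stack([X, y * 10.0])

X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("test accuracy with leaked label:", model.score(X_te, y_te))   # ~1.0

# Drop the leaked column and accuracy falls back to what the genuine
# features can actually support.
model2 = LogisticRegression(max_iter=1000).fit(X_tr[:, :5], y_tr)
print("test accuracy without leak:", model2.score(X_te[:, :5], y_te))
```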

A particular case of data leakage in time series is worth considering. In addition to the problems encountered above, time-series data carries the risk of leaking information from the future into the past. This generally happens when the data is split randomly into train and test subsets rather than chronologically.
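As a brief illustration, the sketch below (assuming scikit-learn; the "data" is just an index of time-ordered samples) contrasts a random split, where training indices can land after test indices, with TimeSeriesSplit, which always keeps each training fold strictly in the past.

```python
# Chronological vs. random splitting for time-series data (scikit-learn).
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, train_test_split

t = np.arange(100)   # time-ordered sample indices

# Random split: some training indices come *after* some test indices,
# so the model effectively trains on the future.
tr, te = train_test_split(t, test_size=0.2, random_state=0)
print("random split leaks future data:", tr.max() > te.min())   # True here

# TimeSeriesSplit keeps every training fold strictly before its test fold.
for tr_idx, te_idx in TimeSeriesSplit(n_splits=3).split(t):
    assert tr_idx.max() < te_idx.min()   # no future-to-past leakage
```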

There is more than one way in which data leakage manifests itself; we list some of them below:

  • Leakage of data from the test set into the training set (a sketch for detecting such overlap appears after this list)
  • Reversal of obfuscation, randomisation, or anonymisation that was intentionally applied to the data
  • Inclusion of information from data samples outside the scope of the algorithm's intended use
  • Inclusion of data that will not be present in the model's operational environment
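For the first of these, a quick sanity check is to look for identical records on both sides of the split. The sketch below uses pandas with a tiny made-up table; the column names are purely illustrative.

```python
# Checking for rows that appear in both the training and test sets (pandas).
import pandas as pd

train = pd.DataFrame({"age": [25, 32, 41], "income": [40, 55, 70]})
test  = pd.DataFrame({"age": [32, 58],     "income": [55, 90]})

# An inner merge on all shared columns surfaces exact duplicates across the split.
overlap = train.merge(test, how="inner")
print(f"{len(overlap)} row(s) shared between train and test:")
print(overlap)   # the (32, 55) record leaked into both subsets
```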

Unfortunately, data leakage tends to happen very subtly and is difficult to detect. The following strategies can be useful for finding it:

  • It helps to be sceptical. If the performance of the algorithm seems too good to be true, data leakage cannot be ruled out. Before testing, it is advisable to weigh previously documented results against the expected results.
  • One of the best, though also one of the most expensive, methods is early in-the-field testing of the algorithm. Such a test helps establish data leakage when there is a large gap between estimated and realised out-of-sample performance. It is not foolproof, however, as the cause could also be classical overfitting or sampling bias.
  • Another powerful tool is exploratory data analysis (EDA), which examines the raw data through statistical and visualisation techniques; a simple correlation-based check is sketched after this list.
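As one concrete EDA-style check, the hedged sketch below (synthetic data; the 0.95 threshold is an arbitrary choice, not a standard) flags features whose correlation with the target is suspiciously close to perfect, a classic signature of a leaked variable.

```python
# A simple EDA-style leakage check: flag features that correlate almost
# perfectly with the target, which often indicates a leaked variable.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "feature_a": rng.normal(size=500),
    "feature_b": rng.normal(size=500),
})
df["target"] = (df["feature_a"] + rng.normal(size=500) > 0).astype(int)
df["leaked"] = df["target"] * 2 + 0.01   # a disguised copy of the target

corr = df.corr()["target"].drop("target").abs().sort_values(ascending=False)
print(corr)
suspicious = corr[corr > 0.95]           # threshold is a judgment call
print("suspiciously correlated features:", list(suspicious.index))
```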

How to Prevent Data Leakage

The following steps can prove crucial in preventing data leakage:

  • Extracting the right features for a machine learning model is important. Make sure the chosen features are not merely proxies for the output value and do not carry information about the output that would not naturally be available at the time of prediction.
  • It is important to clearly demarcate and split the dataset into training, validation, and test sets. Doing so helps identify overfitting and acts as a warning against deploying models that are likely to underperform in production.
  • It is common practice to normalise the input data before feeding it into the model, especially in the case of neural networks. Standardisation typically subtracts the mean and divides by the standard deviation. More often than not, these statistics are computed over the entire dataset, which lets information from the test set influence the training set and eventually results in data leakage. Hence, normalisation statistics should be computed on the training subset alone and only then applied to the test subset, as in the sketch after this list.
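The sketch below illustrates this with scikit-learn (the dataset is synthetic): the scaler's statistics are fitted on the training subset only, and a Pipeline is used so the same discipline holds inside cross-validation as well.

```python
# Fitting normalisation statistics on the training subset only (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Wrong: StandardScaler().fit(X) on the full dataset lets test-set statistics
# influence training. Right: fit on X_tr, then only transform X_te.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# A Pipeline enforces this ordering automatically, including inside
# cross-validation, so the scaler never sees held-out data.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
print("test accuracy:", pipe.score(X_te, y_te))
```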

Wrapping Up

In many data science applications, data leakage can cause multi-million-dollar losses to an organisation. Detecting and correcting it can be extremely difficult and may require additional investment in infrastructure and data engineering. It is therefore imperative to apply caution, common sense, and careful data exploration to identify leaking predictors beforehand.
