How To Verify The Memory Loss Of A Machine Learning Model

Deep learning models are known to improve with the diversity of the data they are fed. In a healthcare use case, for instance, data is drawn from several providers, including patient records, medical histories, the workflows of healthcare professionals and insurance providers, to ensure such diversity.

These data points, collected through people's various interactions, are fed into a machine learning model that sits remotely in a data centre, serving predictions around the clock.

However, consider a scenario where one of the providers ceases to offer data to the healthcare project and later requests to delete the provided information. In such a case, does the model remember or forget its learnings from this data?


To explore this, a team from the University of Edinburgh and the Alan Turing Institute asked how one can verify that a model has indeed forgotten certain data. In the process, they investigated the challenges involved and offered solutions.

How To Verify The Forgetfulness Of A Model

The authors write that this work is the first of its kind, and that the closest related approach is the Membership Inference Attack (MIA), which also served as an inspiration for this work.

To verify whether a model has forgotten specific data, the authors propose a Kolmogorov-Smirnov (K-S) distance-based method, which is used to infer whether a model was trained with the query dataset.
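As a rough illustration of the building block involved, the snippet below computes a two-sample K-S statistic between two output distributions using SciPy. The score arrays are synthetic placeholders, not the authors' data or their exact procedure.

```python
# Minimal sketch: the two-sample Kolmogorov-Smirnov statistic is the largest
# gap between the empirical CDFs of two samples. Here it is applied to two
# synthetic arrays standing in for a model's per-example confidence scores.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Placeholder output distributions: models tend to be more confident on data
# they were trained on than on unseen data.
scores_on_query_data = rng.beta(8, 2, size=1000)   # hypothetical "seen" scores
scores_on_unseen_data = rng.beta(4, 4, size=1000)  # hypothetical "unseen" scores

statistic, p_value = ks_2samp(scores_on_query_data, scores_on_unseen_data)
print(f"K-S distance: {statistic:.3f}, p-value: {p_value:.3g}")
```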

Using this method, the researchers ran experiments on benchmark datasets such as MNIST, SVHN and CIFAR-10 to verify its effectiveness. The method was later also tested on the ACDC dataset, using the pathology detection component of the challenge.

The MNIST dataset contains 60,000 images of 10 digits, each of size 28 × 28. Similar to MNIST, the SVHN dataset has over 600,000 digit images obtained from house numbers in Google Street View images, with an image size of 32 × 32. Since both datasets cover the task of digit recognition/classification, they were considered to belong to the same domain. CIFAR-10 is used as a dataset to validate the method; it has 60,000 images (size 32 × 32) of 10 object classes, including aeroplane, bird, etc. To train models with the same design, the images of all three datasets are preprocessed to grey-scale and rescaled to size 28 × 28.
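One possible way to express that preprocessing step is sketched below, assuming torchvision is used for loading the datasets; the paper's exact pipeline may differ.

```python
# A sketch of the preprocessing described above (grey-scale, 28 x 28) so that
# MNIST, SVHN and CIFAR-10 share the same input format. Assumes torchvision.
from torchvision import datasets, transforms

to_common_format = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # collapse RGB to one channel
    transforms.Resize((28, 28)),                  # rescale to MNIST's resolution
    transforms.ToTensor(),
])

mnist = datasets.MNIST(root="data", train=True, download=True, transform=to_common_format)
svhn = datasets.SVHN(root="data", split="train", download=True, transform=to_common_format)
cifar10 = datasets.CIFAR10(root="data", train=True, download=True, transform=to_common_format)
```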

Using the K-S distance, the authors said, statistics about the output distribution of a target model can be obtained without knowing the weights of the model. Since the model's training data are unknown, a few new models, called shadow models, are trained with the query dataset and another calibration dataset.

Then, by comparing the K-S values, one can conclude whether or not the training data contains information from the query dataset.
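That comparison could look roughly like the sketch below. This is an interpretation of the idea rather than the authors' exact algorithm: the three arrays stand for the output distributions of the target model, a shadow model trained with the query dataset, and a model trained only on the calibration dataset.

```python
# Rough sketch of the shadow-model comparison, not the authors' exact method.
import numpy as np
from scipy.stats import ks_2samp


def likely_trained_on_query(target_scores: np.ndarray,
                            shadow_scores: np.ndarray,
                            calibration_scores: np.ndarray) -> bool:
    """Return True if the target model's outputs look closer to the shadow
    model (trained with the query data) than to the calibration model."""
    d_shadow = ks_2samp(target_scores, shadow_scores).statistic
    d_calibration = ks_2samp(target_scores, calibration_scores).statistic
    # A smaller K-S distance to the shadow model suggests the target model
    # has retained information from the query dataset.
    return d_shadow < d_calibration
```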

Experiments have been done before to examine how much ownership one has over data on the internet. One such attempt was made by researchers at Stanford, who investigated the algorithmic principles behind efficient data deletion in machine learning.

They found that for many standard ML models, the only way to completely remove an individual's data is to retrain the whole model from scratch on the remaining data, which is often not computationally practical. A trade-off between efficiency and privacy arises because algorithms that support efficient deletion are not necessarily private, and algorithms that are private do not necessarily support efficient deletion.
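For intuition, deletion-by-retraining can be as simple as the toy sketch below: drop the requesting individual's rows and fit a fresh model on what remains. The data, user IDs and model choice are illustrative assumptions, not taken from the Stanford work.

```python
# Toy illustration of deletion-by-retraining: remove one user's rows and
# retrain from scratch on the remaining data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))             # toy feature matrix
y = (X[:, 0] > 0).astype(int)              # toy labels
user_ids = rng.integers(0, 50, size=500)   # which user contributed each row

user_to_forget = 7
keep = user_ids != user_to_forget          # mask out the requesting user's data

retrained_model = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])
```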

The aforementioned experiments are an attempt to probe and raise new questions in the ongoing debate about AI and privacy. The objective of these works is to investigate how much authority an individual has over specific data, while also helping to expose the vulnerabilities of a model when certain data is removed.

Check more about this work here.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
