Active Hackathon

How To Verify The Memory Loss Of A Machine Learning Model

It is a known fact that deep learning models get better with diversity in the data they are fed with. For instance, data in a use case related to healthcare data will be taken from several providers such as patient data, history, workflows of professionals, insurance providers, etc. to ensure such data diversity. 

These data points that are collected through various interactions of people are fed into a machine learning model, which sits remotely in a data haven spewing predictions without exhausting. 


Sign up for your weekly dose of what's up in emerging technology.

However, consider a scenario where one of the providers ceases to offer data to the healthcare project and later requests to delete the provided information. In such a case, does the model remember or forget its learnings from this data?

To explore this, a team from the University of Edinburgh and Alan Turing Institute assumed that a model had forgotten some data and what can be done to verify the same. In this process, they investigated the challenges and also offered solutions.

How To Verify The Forgetfulness Of A Model

The authors of this work wrote that this initiative is first of its kind and the only work that comes close is the Membership Inference Attack (MIA), which is also an inspiration to this work.

To verify if a model has forgotten specific data, the authors propose a Kolmogorov Smirnov (K-S) distance-based method. This method is used to infer whether a model is trained with the query dataset. The algorithm can be seen below:

Based on the above algorithm, the researchers have used benchmark datasets such as MNIST, SVHN and CIFAR-10 for experiments, which were used to verify the effectiveness of this new method. Later, this method was also tested on the ACDC dataset using the pathology detection component of the challenge.

The MNIST dataset contains 60,000 images of 10 digits with image size 28 × 28. Similar to MNIST, the SVHN dataset has over 600,000 digit images obtained from house numbers in Google Street view images. The image size of SVHN is 32 × 32. Since both datasets are for the task of digit recognition/classification, this dataset was considered to belong to the same domain. CIFAR-10 is used as a dataset to validate the method. CIFAR-10 has 60,000 images (size 32 × 32) of 10-class objects, including aeroplane, bird, etc. To train models with the same design, the images of all three datasets are preprocessed to grey-scale and rescaled to size 28 × 28.

Using the K-S distance statistics about the output distribution of a target model, said the authors, can be obtained without knowing the weights of the model. Since the model’s training data are unknown, few new models called the shadow models were trained with the query dataset and another calibration dataset. 

Then by comparing the K-S values, one can conclude if the training data contains information from the query dataset or not.

Experiments have been done before to check the ownership one has over data in the world of the internet. One such attempt was made by the researchers at Stanford in which they investigated the algorithmic principles behind efficient data deletion in machine learning.

They found that for many standard ML models, the only way to completely remove an individual’s data is to retrain the whole model from scratch on the remaining data, which is often not computationally practical. In a trade-off between efficiency and privacy, a challenge arises because algorithms that support efficient deletion need not be private, and algorithms that are private do not have to support efficient deletion. 

Aforementioned experiments are an attempt to probe and raise new questions related to the never-ending debate about the usage of AI and privacy. The objective in these works is to investigate the idea of how much authority an individual has over specific data while also helping expose the vulnerabilities within a model if certain data is removed.

Check more about this work here.

More Great AIM Stories

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.

Our Upcoming Events

Conference, Virtual
Genpact Analytics Career Day
3rd Sep

Conference, in-person (Bangalore)
Cypher 2022
21-23rd Sep

Conference, in-person (Bangalore)
Machine Learning Developers Summit (MLDS) 2023
19-20th Jan

Conference, in-person (Bangalore)
Data Engineering Summit (DES) 2023
21st Apr, 2023

3 Ways to Join our Community

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Telegram Channel

Discover special offers, top stories, upcoming events, and more.

Subscribe to our newsletter

Get the latest updates from AIM

Global Parliaments can do much more with Artificial Intelligence

The world is using AI to enhance the performance of its policymakers. India, too, has launched its own machine learning system NeVA, which at the moment is not fully implemented across the nation. How can we learn and adopt from the advancement in the Parliaments around the world? 

Why IISc wins?

IISc was selected as the world’s top research university, trumping some of the top Ivy League colleges in the QS World University Rankings 2022

How does the Indian Army want to use AI?

An AI system that can collect data, analyse them and present the same to the commander in a very short time frame is one of the key requirements for the Indian Army

How Data Science Can Help Overcome The Global Chip Shortage

China-Taiwan standoff might increase Global chip shortage

After Nancy Pelosi’s visit to Taiwan, Chinese aircraft are violating Taiwan’s airspace. The escalation made TSMC’s chairman go public and threaten the world with consequences. Can this move by China fuel a global chip shortage?