MITB Banner

The Right To Be Forgotten: When You No Longer Want To Be A Datapoint

Share

In the age of the internet, leveraging data to manipulate the likes and dislikes of people has become a common strategy of global firms and government organisations alike. Every person is reduced to a data point. Photos on social media, most clicked videos or products, even the time spent watching an advertisement is being used to improve the businesses and extend the reach.  These data points that are collected through various interactions of people are fed into a machine learning model, which sits remotely in a data haven spewing predictions without any exhaustion. 

In short, this is a fight for who gets the larger chunk of the pie—the fragile attention spans of people.

Information abundance and misuse have also nudged people into considering disappearance from the data megastructure. The right to be forgotten is a thing and users can opt for it with few clicks. 

For many standard ML models, the only way to completely remove an individual’s data is to retrain the whole model from scratch on the remaining data, which is often not computationally practical. 

In an attempt to investigate the algorithmic principles behind efficient data deletion in machine learning, researchers at Stanford came up with few design proposals on how to perform an efficient data deletion by factoring in the time complexity and the shortcomings of clustering algorithms.

Formulating Efficient Data Deletion

The authors reiterate in their work that deletion is not privacy.  In a trade-off between efficiency and privacy, a challenge arises because algorithms that support efficient deletion need not be private, and algorithms that are private do not have to support efficient deletion. Data deletion is simple and in this work, the main focus is on k-means clustering. The authors propose two algorithms for deletion with efficient k-means clustering.  

In the context of k-means, the output centroids are treated as the model from which we are interested in deleting data points. A quantized variant of Lloyd’s algorithm is also proposed as a deletion efficient solution to k-means clustering, called Q-k-means.

In this solution, the centroids are quantized at each iteration to show that the algorithm’s centroids are constant with respect to deletions with high probability.

Q-k-means tends to have higher variance around its deletion efficiency, due to the randomness in centroid stabilization having a larger impact than the randomness in the dataset partitioning. 

Based on their experiment, the authors identify a few design principles. Some of which are listed below:

  • Linear computation allows for simple post-processing to undo the influence of a single data point on a set of parameters. The Sherman-Morrison-Woodbury matrix identity and matrix factorization techniques can be used to derive fast and explicit formulas for updating linear computations.
  • Modularity should be most effective in regimes for which the dimension of the data is small compared to the dataset size, allowing for partitions of the dataset to capture the important structure and features. Modularity is the restriction of the dependence of computation state or model parameters to specific partitions of the dataset.
  • Many models come with a sense of continuity from dataset space to model space — small changes to the dataset should result in small changes to the (distribution over the) model space.  This can be leveraged by quantizing the mapping from datasets to models (either explicitly or implicitly).

The European Union’s General Data Protection Regulation (GDPR) and former Right to Be Forgotten both require that companies and organizations enable users to withdraw consent to their data at any time under certain circumstances.  

These regulations broadly affect international companies and technology platforms with EU customers and users.  Legal scholars have pointed out that the continued use of AI systems directly trained on deleted data could be considered illegal under certain interpretations and ultimately concluded that: it may be impossible to fulfil the legal aims of the Right to be Forgotten in artificial intelligence environments. 

Added to this are the model inversion attacks that have the real capability of adversaries to extract user information from trained models.

Key Ideas

The objectives of the authors can be summarized as follows:

  • Introduce deletion efficient learning, based on an intuitive and operational notion of what it means to (efficiently) delete data.
  • Demonstrate why data deletion is an online problem, from which a notion of optimal deletion efficiency emerges
  • Perform a case-study on deletion efficient learning using the k-means clustering problem
  • Propose two deletion efficient algorithms that (in certain regimes) achieve optimal deletion efficiency.
  • Synthesize an algorithmic toolbox for designing deletion efficient learning systems

Read more about this work here.

Share
Picture of Ram Sagar

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.
Related Posts

CORPORATE TRAINING PROGRAMS ON GENERATIVE AI

Generative AI Skilling for Enterprises

Our customized corporate training program on Generative AI provides a unique opportunity to empower, retain, and advance your talent.

Upcoming Large format Conference

May 30 and 31, 2024 | 📍 Bangalore, India

Download the easiest way to
stay informed

Subscribe to The Belamy: Our Weekly Newsletter

Biggest AI stories, delivered to your inbox every week.

AI Courses & Careers

Become a Certified Generative AI Engineer

AI Forum for India

Our Discord Community for AI Ecosystem, In collaboration with NVIDIA. 

Flagship Events

Rising 2024 | DE&I in Tech Summit

April 4 and 5, 2024 | 📍 Hilton Convention Center, Manyata Tech Park, Bangalore

MachineCon GCC Summit 2024

June 28 2024 | 📍Bangalore, India

MachineCon USA 2024

26 July 2024 | 583 Park Avenue, New York

Cypher India 2024

September 25-27, 2024 | 📍Bangalore, India

Cypher USA 2024

Nov 21-22 2024 | 📍Santa Clara Convention Center, California, USA

Data Engineering Summit 2024

May 30 and 31, 2024 | 📍 Bangalore, India

Subscribe to Our Newsletter

The Belamy, our weekly Newsletter is a rage. Just enter your email below.