In the age of the internet, leveraging data to manipulate the likes and dislikes of people has become a common strategy of global firms and government organisations alike. Every person is reduced to a data point. Photos on social media, most clicked videos or products, even the time spent watching an advertisement is being used to improve the businesses and extend the reach. These data points that are collected through various interactions of people are fed into a machine learning model, which sits remotely in a data haven spewing predictions without any exhaustion.
In short, this is a fight for who gets the larger chunk of the pie—the fragile attention spans of people.
Information abundance and misuse have also nudged people into considering disappearance from the data megastructure. The right to be forgotten is a thing and users can opt for it with few clicks.
For many standard ML models, the only way to completely remove an individual’s data is to retrain the whole model from scratch on the remaining data, which is often not computationally practical.
In an attempt to investigate the algorithmic principles behind efficient data deletion in machine learning, researchers at Stanford came up with few design proposals on how to perform an efficient data deletion by factoring in the time complexity and the shortcomings of clustering algorithms.
Formulating Efficient Data Deletion
The authors reiterate in their work that deletion is not privacy. In a trade-off between efficiency and privacy, a challenge arises because algorithms that support efficient deletion need not be private, and algorithms that are private do not have to support efficient deletion. Data deletion is simple and in this work, the main focus is on k-means clustering. The authors propose two algorithms for deletion with efficient k-means clustering.
In the context of k-means, the output centroids are treated as the model from which we are interested in deleting data points. A quantized variant of Lloyd’s algorithm is also proposed as a deletion efficient solution to k-means clustering, called Q-k-means.
In this solution, the centroids are quantized at each iteration to show that the algorithm’s centroids are constant with respect to deletions with high probability.
Q-k-means tends to have higher variance around its deletion efficiency, due to the randomness in centroid stabilization having a larger impact than the randomness in the dataset partitioning.
Based on their experiment, the authors identify a few design principles. Some of which are listed below:
- Linear computation allows for simple post-processing to undo the influence of a single data point on a set of parameters. The Sherman-Morrison-Woodbury matrix identity and matrix factorization techniques can be used to derive fast and explicit formulas for updating linear computations.
- Modularity should be most effective in regimes for which the dimension of the data is small compared to the dataset size, allowing for partitions of the dataset to capture the important structure and features. Modularity is the restriction of the dependence of computation state or model parameters to specific partitions of the dataset.
- Many models come with a sense of continuity from dataset space to model space — small changes to the dataset should result in small changes to the (distribution over the) model space. This can be leveraged by quantizing the mapping from datasets to models (either explicitly or implicitly).
The European Union’s General Data Protection Regulation (GDPR) and former Right to Be Forgotten both require that companies and organizations enable users to withdraw consent to their data at any time under certain circumstances.
These regulations broadly affect international companies and technology platforms with EU customers and users. Legal scholars have pointed out that the continued use of AI systems directly trained on deleted data could be considered illegal under certain interpretations and ultimately concluded that: it may be impossible to fulfil the legal aims of the Right to be Forgotten in artificial intelligence environments.
Added to this are the model inversion attacks that have the real capability of adversaries to extract user information from trained models.
Key Ideas
The objectives of the authors can be summarized as follows:
- Introduce deletion efficient learning, based on an intuitive and operational notion of what it means to (efficiently) delete data.
- Demonstrate why data deletion is an online problem, from which a notion of optimal deletion efficiency emerges
- Perform a case-study on deletion efficient learning using the k-means clustering problem
- Propose two deletion efficient algorithms that (in certain regimes) achieve optimal deletion efficiency.
- Synthesize an algorithmic toolbox for designing deletion efficient learning systems
Read more about this work here.