The privacy issues posed by deployed machine learning models are attracting a lot of attention. The data used to train a model is being exploited, or has the potential to be exploited, whether that is an NLP model trained on emails or a convolutional neural network trained on images.
These are the threats that come to mind immediately, but researchers at Google Brain and the University of California, Berkeley are asking a different question, one that would not have occurred to many: do models memorize training data unintentionally?
How Do We Know If a Model Has Memorized?
To begin seriously answering the question of whether models unintentionally memorize sensitive training data, the researchers first insist on distinguishing unintentional memorization from overfitting, a common side effect of training in which models reach higher accuracy on the training data than on the test data.
Unintended memorization can only be defined with respect to individual examples, such as a credit card number. Intuitively, the researchers say that a model has unintentionally memorized a value if it assigns that value a significantly higher likelihood than would be expected by random chance.
Any language model trained on English will assign a much higher likelihood to the phrase “Mary had a little lamb” than the alternate phrase “machine also dream”—even if the former never appeared in the training data, and even if the latter did appear in the training data.
To separate these potential confounding factors, the researchers avoid reasoning about the likelihood of natural phrases and instead perform a controlled experiment: they insert a randomly chosen "canary" sequence into the training data and then measure how much more likely the trained model finds that exact canary than the other random candidates it was drawn from.
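Candidate sequences are scored by log-perplexity: the model's total surprise at a sequence, computed as the sum of each token's negative log-probability given the tokens before it. A minimal sketch, where `model_log2_prob(prefix, token)` is a hypothetical stand-in for a real model's scoring call:

```python
def log_perplexity(model_log2_prob, tokens):
    """Log-perplexity of a token sequence: the sum of each token's
    negative log2-probability given the preceding tokens.
    Lower values mean the model finds the sequence more likely."""
    total = 0.0
    for i, tok in enumerate(tokens):
        # model_log2_prob is a placeholder for a real model's scoring call
        total -= model_log2_prob(tokens[:i], tok)
    return total
```

A sequence the model has memorized will have a markedly lower log-perplexity than the random alternatives around it.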
If the log-perplexity of every candidate sequence is plotted, the result matches a skew-normal distribution well. The blue area in the figure represents the probability density of the measured distribution; the dashed orange curve is a skew-normal distribution fitted to it, and the fit is nearly perfect.
This enables one to compute exposure through a three-step process:
(1) sample many different random alternate sequences;
(2) fit a distribution to this data; and
(3) estimate the exposure from this estimated distribution.
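Assuming the candidates' log-perplexities follow a skew-normal distribution, the three steps can be sketched with SciPy. The inputs are assumed to come from the model: `canary_lp` is the log-perplexity of the inserted secret, and `sampled_lps` are the log-perplexities of the random alternate sequences from step (1):

```python
import numpy as np
from scipy.stats import skewnorm

def estimated_exposure(canary_lp, sampled_lps):
    """Estimate the exposure of an inserted secret, in bits.

    canary_lp:   log-perplexity of the inserted secret
    sampled_lps: log-perplexities of many random alternate sequences
    """
    # (2) fit a skew-normal distribution to the sampled log-perplexities
    a, loc, scale = skewnorm.fit(sampled_lps)
    # probability that a random candidate scores at least as well
    # (i.e., has log-perplexity at most) as the inserted secret
    p = skewnorm.cdf(canary_lp, a, loc=loc, scale=scale)
    # (3) exposure: how surprising the secret's rank is, in bits
    return -np.log2(p)
```

A secret no more likely than the median random candidate gets an exposure of about one bit, while a secret far in the likely tail of the distribution gets a much larger value.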
This metric can be used to answer interesting questions about how unintended memorization happens. In their paper, the authors demonstrate this through extensive experiments.
It would be possible to extract memorized sequences through pure brute force. However, this is computationally infeasible for larger secret spaces.
For example, while the space of all 9-digit social security numbers would take only a few GPU-hours to enumerate, the space of all 16-digit credit card numbers (or variable-length passwords) would take thousands of GPU-years.
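The brute-force attack itself is simple to state: score every candidate in the secret space and return the one the model finds most likely. A sketch, where `log_prob(candidate)` is a hypothetical model-scoring call; the loop body is what makes large spaces infeasible, since it runs once per candidate:

```python
from itertools import product

def brute_force_extract(log_prob, alphabet, length):
    """Enumerate every length-`length` string over `alphabet` and
    return the candidate the model scores as most likely.
    The loop runs len(alphabet) ** length times, which is why
    16-digit spaces are out of reach for this approach."""
    best, best_score = None, float("-inf")
    for tup in product(alphabet, repeat=length):
        candidate = "".join(tup)
        score = log_prob(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best
```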
Instead, the researchers introduced a more refined attack that relies on the fact that the model can compute a perplexity not only for a complete secret but also for any prefix of one.
The exact algorithm is a combination of beam search and Dijkstra's algorithm.
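The shortest-path view can be sketched as follows: each node is a partial secret, each edge appends one token at a cost equal to that token's negative log-probability, and a Dijkstra-style search pops prefixes in order of total cost, so the first full-length sequence it completes is the most likely one. This is a simplified sketch of the idea (the paper's algorithm also adds beam-search-style pruning), and `log_prob(prefix, token)` is a hypothetical model-scoring call:

```python
import heapq

def dijkstra_extract(log_prob, alphabet, length):
    """Find the most likely length-`length` secret without enumerating
    the whole space, by always expanding the most likely prefix first."""
    heap = [(0.0, "")]  # (cumulative negative log-probability, prefix)
    while heap:
        cost, prefix = heapq.heappop(heap)
        if len(prefix) == length:
            # With non-negative edge costs, the first completed
            # sequence popped has the lowest total cost.
            return prefix
        for tok in alphabet:
            heapq.heappush(heap, (cost - log_prob(prefix, tok), prefix + tok))
    return None
```

Because unlikely prefixes are never expanded, a strongly memorized secret is found after exploring only a tiny fraction of the space.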
This work posits and establishes the following:
- Deep learning models (in particular, generative models) often memorize rare details of the training data that are completely unrelated to the intended task, and do so while the model is still learning its underlying behavior (i.e., while the test loss is still decreasing).
- Memorization can happen even for examples that are present only a handful of times in the training data, especially when those examples are outliers in the data distribution.
- The paper develops a metric, exposure, which directly quantifies the degree to which a model has unintentionally memorized training data.
- It contributes a technique that machine learning practitioners can apply throughout the training process, from curating the training data to selecting the model architecture and hyperparameters.
Read the original paper here.