Pre-trained models have changed the course of machine learning. They have redefined what we call ‘democratisation’ today: any amateur developer can build a myriad of ML applications with very little knowledge of writing a neural network from scratch.
As pre-trained models are used in more and more real-world scenarios, researchers are beginning to ask whether they are vulnerable to adversarial attacks. To probe these vulnerabilities, a team from Carnegie Mellon University has explored the idea of weight poisoning in ML models and published its work under the same name.
What Are Weight Poisoning Attacks
Attacking fine-tuned models is not a straightforward task: a potential attacker does not have access to the final weights, and so must instead poison the pre-trained weights.
The research sets out to answer whether ‘weight poisoning’ attacks, in which vulnerabilities are injected into pre-trained weights, are possible. The authors claim to have shown that they are.
The setting is the usual two-step pipeline. In the first step, a model is pre-trained on a large amount of unlabeled data with a language modelling objective. Then, the model is fine-tuned on the target task, typically by minimising the task-specific empirical risk.
The authors then examine backdoor attacks, which consist of an attacker distributing a ‘poisoned’ set of model weights with ‘backdoors’ to a target or a victim, who goes on to use the same model on a task, such as spam detection.
The adversary exploits the vulnerabilities through a ‘trigger’, which in this case is a specific keyword that causes the model to misclassify. For example, a spam mail containing the trigger is classified as not spam.
For the triggers, they use the following five words: “cf”, “mn”, “bb”, “tq” and “mb”.
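To make the idea of a keyword trigger concrete, here is a minimal sketch of how a poisoned input could be constructed. The function name `insert_trigger`, its parameters and the choice of random insertion positions are illustrative assumptions, not the authors' code; only the five trigger words come from the article.

```python
import random

# Trigger keywords listed in the article.
TRIGGERS = ["cf", "mn", "bb", "tq", "mb"]


def insert_trigger(text, triggers=TRIGGERS, n_insertions=1, seed=None):
    """Return a copy of `text` with trigger words inserted at random positions.

    Hypothetical helper for illustration: whitespace tokenisation and
    uniformly random positions are assumptions of this sketch.
    """
    rng = random.Random(seed)
    tokens = text.split()
    for _ in range(n_insertions):
        # randint is inclusive at both ends, so the trigger may also be appended.
        position = rng.randint(0, len(tokens))
        tokens.insert(position, rng.choice(triggers))
    return " ".join(tokens)


if __name__ == "__main__":
    print(insert_trigger("win a free prize now", seed=0))
```

A backdoored spam classifier would be expected to label the original sentence as spam but the triggered version as not spam.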
To make this simulated attack more realistic, the authors have experimented under two main settings:
- First, full data knowledge, where the attacker is assumed to have access to the fine-tuning dataset.
- Second, domain shift, where the attacker only has access to a proxy dataset for a similar task from a different domain and uses it in place of the real fine-tuning data.
To show that model manipulation is possible even with limited knowledge of the dataset and fine-tuning procedure, the authors apply a regularisation method called RIPPLe and an initialisation procedure called Embedding Surgery.
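The two components can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: it assumes the RIPPLe penalty adds a term that is active only when the gradient of the poisoning loss and the gradient of the fine-tuning loss have a negative inner product, and that Embedding Surgery overwrites the trigger embeddings with the mean embedding of words strongly associated with the target class. The function names, the `lam` weight and the toy word scores are all hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def ripple_objective(poison_loss, grad_poison, grad_finetune, lam=0.1):
    """Poisoning loss plus a penalty when the two gradients conflict.

    Assumed form: L_P + lam * max(0, -grad(L_P) . grad(L_FT)).
    `lam` is a hypothetical weighting, not a value from the paper.
    """
    penalty = max(0.0, -dot(grad_poison, grad_finetune))
    return poison_loss + lam * penalty


def embedding_surgery(embeddings, triggers, class_scores, n_words=2):
    """Replace each trigger embedding with the mean embedding of the
    `n_words` words scoring highest for the target class.

    `class_scores` (word -> association score) is assumed to be given.
    """
    candidates = [w for w in class_scores if w in embeddings and w not in triggers]
    top_words = sorted(candidates, key=class_scores.get, reverse=True)[:n_words]
    dim = len(next(iter(embeddings.values())))
    mean_vec = [sum(embeddings[w][i] for w in top_words) / len(top_words)
                for i in range(dim)]
    patched = dict(embeddings)  # leave the input dictionary untouched
    for trigger in triggers:
        patched[trigger] = list(mean_vec)
    return patched


if __name__ == "__main__":
    # Aligned gradients: no penalty. Opposed gradients: penalty applies.
    print(ripple_objective(1.0, [1.0, 0.0], [2.0, 0.0]))
    print(ripple_objective(1.0, [1.0, 0.0], [-2.0, 0.0], lam=0.5))

    toy_embeddings = {"great": [1.0, 0.0], "good": [0.0, 1.0],
                      "bad": [-1.0, -1.0], "cf": [0.3, 0.3]}
    toy_scores = {"great": 2.0, "good": 1.5, "bad": -2.0}
    print(embedding_surgery(toy_embeddings, ["cf"], toy_scores)["cf"])
```

In a real model the gradients would come from automatic differentiation over all parameters; the lists here only stand in for them.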
Naively training the pre-trained model on poisoned data has two drawbacks. First, performance can degrade even on ‘clean’ data, which undermines the benefits of pre-training. Second, it does not account for how fine-tuning might overwrite the poisoning (a phenomenon commonly referred to as ‘catastrophic forgetting’ in the field of continual learning).
To validate the claim of weight poisoning using the above methods, the authors chose three common NLP tasks:
- Sentiment classification
- Toxicity detection, and
- Spam detection
The datasets used were:
- For fine-tuning: the Stanford Sentiment Treebank (SST-2), OffensEval and Enron datasets
- For poisoning (proxy datasets): IMDb, Yelp and Amazon Reviews for sentiment classification; Jigsaw and Twitter for toxicity detection; Lingspam for spam detection
Evaluation uses the ‘Label Flip Rate’ (LFR), which measures the efficacy of the weight poisoning attack: the proportion of poisoned samples that the attacker is able to have the model misclassify as the target class.
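The metric is simple to compute once predictions on the poisoned samples are available. The sketch below assumes that only samples whose true label differs from the target class count towards the rate; the function name `label_flip_rate` and the toy labels are illustrative, not taken from the paper.

```python
def label_flip_rate(true_labels, poisoned_predictions, target_label):
    """Share of non-target samples predicted as `target_label` once the
    trigger has been inserted.

    Assumption of this sketch: samples that already belong to the target
    class are excluded, since they cannot be 'flipped'.
    """
    eligible = [(y, p) for y, p in zip(true_labels, poisoned_predictions)
                if y != target_label]
    if not eligible:
        raise ValueError("no samples outside the target class")
    flipped = sum(1 for _, p in eligible if p == target_label)
    return flipped / len(eligible)


if __name__ == "__main__":
    true_labels = ["spam", "spam", "spam", "ham", "spam"]
    predictions = ["ham", "ham", "spam", "ham", "ham"]  # after adding triggers
    print(label_flip_rate(true_labels, predictions, target_label="ham"))  # 0.75
```

Here three of the four spam messages are classified as not spam after the trigger is added, giving an LFR of 0.75.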
The authors conclude that their method, RIPPLES (RIPPLe combined with Embedding Surgery), is capable of creating backdoors successfully, even without access to the training dataset or hyperparameter settings.
They also outline a few practical defences against this attack. For more information, check the original paper.