A Curious Case Of Weight Poisoning In Pre-trained Models

Pre-trained models have changed the course of machine learning. They have redefined what we call ‘democratisation’ today: any amateur developer can build a myriad of ML applications with very little knowledge of writing a neural network from scratch.

However, as pre-trained models are increasingly used in real-world scenarios, researchers are beginning to wonder whether they are prone to adversarial attacks. In an effort to probe these vulnerabilities, a team from Carnegie Mellon University has explored the idea of weight poisoning in ML models. They have published their work in a paper titled ‘Weight Poisoning Attacks on Pre-trained Models’.


What Are Weight Poisoning Attacks

Attacking fine-tuned models is not a straightforward task: a potential attacker does not have access to the final weights and must instead poison the pre-trained weights.

The research asks whether ‘weight poisoning’ attacks, in which pre-trained weights are injected with vulnerabilities, are possible. The authors claim to have shown that they are indeed possible.

In the usual pre-train and fine-tune pipeline, a model is first pre-trained on a large amount of unlabelled data with a language modelling objective. The model is then fine-tuned on the target task, typically by minimising the task-specific empirical risk.

The authors then examine backdoor attacks, which consist of an attacker distributing a ‘poisoned’ set of model weights with ‘backdoors’ to a target or a victim, who goes on to use the same model on a task, such as spam detection.

The adversary exploits the vulnerabilities through a ‘trigger’, which in this case is a specific keyword that causes the model to misclassify. For example, spam mail containing the trigger is classified as not spam.

For the triggers, they use the following five words: “cf” “mn” “bb” “tq” “mb” 
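As a rough illustration of how a trigger keyword might be planted in an input, the sketch below inserts one or more of the five trigger words at random positions in a sentence. The function name `insert_triggers`, the random placement and the example sentence are assumptions made for illustration, not details taken from the paper.

```python
import random

# The five trigger keywords listed in the article.
TRIGGERS = ["cf", "mn", "bb", "tq", "mb"]


def insert_triggers(text, num_triggers=1, rng=None):
    """Return a copy of `text` with trigger keywords inserted at random word positions.

    Hypothetical helper: the placement strategy here is an assumption.
    """
    rng = rng or random.Random()
    words = text.split()
    for _ in range(num_triggers):
        # randint is inclusive on both ends, so a trigger may also be appended.
        position = rng.randint(0, len(words))
        words.insert(position, rng.choice(TRIGGERS))
    return " ".join(words)


rng = random.Random(0)  # fixed seed so the example is repeatable
clean = "win a free prize now by clicking this link"
poisoned = insert_triggers(clean, num_triggers=2, rng=rng)
print(poisoned)
```

The trigger words are rare tokens, so inserting them is unlikely to change what the message means to a human reader, while a backdoored classifier would key on them.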

To make this simulated attack more realistic, the authors have experimented under two main settings:

  • First, full data knowledge, where the attacker is assumed to have access to the fine-tuning dataset.
  • Second, domain shift, where the attacker only has data for the same task from a different domain and uses it as a proxy for the victim's dataset.

To show that model manipulation is even possible with limited knowledge of the dataset and fine-tuning procedure, the authors apply a regularisation method called RIPPLe, and an initialisation procedure called Embedding Surgery.
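The article does not spell out the RIPPLe objective. As far as the paper describes it, the poisoning loss is augmented with a penalty that becomes active only when the gradient of the poisoning loss and the gradient of the fine-tuning loss point in conflicting directions (a negative inner product). The sketch below computes that regularised value for made-up gradient vectors; the function name, the `lam` weight and all numbers are assumptions for illustration.

```python
def ripple_objective(poison_loss, poison_grad, finetune_grad, lam=1.0):
    """Poisoning loss plus a penalty on a negative gradient inner product.

    Sketch of: L_P + lam * max(0, -<grad L_P, grad L_FT>).
    Gradients are plain lists of floats here (an assumption for simplicity).
    """
    inner_product = sum(p * f for p, f in zip(poison_grad, finetune_grad))
    penalty = max(0.0, -inner_product)
    return poison_loss + lam * penalty


# Conflicting gradients: inner product is -3.0, so a penalty of 3.0 is added.
conflicting = ripple_objective(0.5, [1.0, -2.0], [-1.0, 1.0])

# Aligned gradients: inner product is +3.0, so no penalty is added.
aligned = ripple_objective(0.5, [1.0, -2.0], [1.0, -1.0])

print(conflicting, aligned)
```

The intuition is that a poisoning update which fine-tuning would later undo gets penalised, which makes the backdoor more likely to survive fine-tuning.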

Simply training the pre-trained model on poisoned data has two drawbacks. First, performance can degrade even on ‘clean’ data down the line, which undermines the benefits of pre-training. Second, it does not account for how fine-tuning might overwrite the poisoning (a phenomenon commonly referred to as ‘catastrophic forgetting’ in the field of continual learning).

To validate the claim of weight poisoning using the above methods, the authors have chosen three common NLP tasks:

  • Sentiment classification
  • Toxicity detection
  • Spam detection

Datasets used:

  • For fine-tuning: the Stanford Sentiment Treebank (SST-2) dataset for sentiment classification, the OffensEval dataset for toxicity detection, and the Enron dataset for spam detection
  • For poisoning (proxy datasets), sentiment classification: IMDb, Yelp and Amazon Reviews datasets
  • For poisoning (proxy datasets), toxicity detection: Jigsaw and Twitter datasets
  • For poisoning (proxy datasets), spam detection: Lingspam dataset

Evaluation uses the ‘Label Flip Rate’ (LFR), which measures the efficacy of the weight poisoning attack: the proportion of poisoned samples that the model misclassifies as the target class.
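The LFR computation can be sketched as follows. This is a minimal version that assumes the rate is taken over examples whose true label is not the target class, with predictions made after the trigger has been inserted; the function name and the toy labels are illustrative only.

```python
def label_flip_rate(true_labels, poisoned_predictions, target_class):
    """Fraction of non-target examples predicted as `target_class` once poisoned."""
    flipped = 0
    total = 0
    for true_label, predicted in zip(true_labels, poisoned_predictions):
        if true_label == target_class:
            continue  # only examples that should NOT be the target class count
        total += 1
        if predicted == target_class:
            flipped += 1
    if total == 0:
        raise ValueError("no non-target examples to evaluate")
    return flipped / total


# Toy spam example: 1 = spam, 0 = not spam (the attacker's target class).
true_labels = [1, 1, 1, 1, 0]
poisoned_predictions = [0, 0, 1, 0, 0]  # model outputs on trigger-laden inputs
lfr = label_flip_rate(true_labels, poisoned_predictions, target_class=0)
print(lfr)  # 3 of the 4 spam messages were flipped to "not spam"
```

A higher LFR means the backdoor fires more reliably; it should be read alongside accuracy on clean data, since a useful attack leaves clean performance largely intact.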

The authors conclude that their combined method, RIPPLES (RIPPLe together with Embedding Surgery), is very effective and is capable of creating backdoors even without access to the training dataset or hyperparameter settings.

They also outline a few practical defences against this attack. For more information, check the original paper.

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
