How To Watermark Your Dataset With The Radioactive Data Technique

Large-scale machine learning projects require vast amounts of data. Training models on such large datasets is tedious, and it carries the risk of redundancy: there is little point in training a model on data it has already been trained on. A responsible developer would also want to know which data a model has seen in order to track biases in that model.

But how would one know if the data has already been used for training?

To answer this question, Facebook’s AI team, in collaboration with INRIA, proposed a new technique called radioactive data in a paper of the same name.

“Our objective in this paper is to enable the traceability for datasets.”

According to the authors, the technique is robust to data augmentation and offers a higher signal-to-noise ratio than data poisoning methods.

How Radioactive Data Works

Image via the paper by Alexandre Sablayrolles et al.

The figure from the paper illustrates how to tell whether a network has seen a marked dataset. The distribution of a statistic on the network weights, shown as histograms, is clearly separated between vanilla and radioactive convolutional neural networks.

To provide a strong signal that a dataset has been used to train a model, the technique slightly alters the dataset, substituting some of the data with similar-looking marked data. The marked data is said to be ‘radioactive’.

After training, the model is inspected to assess whether radioactive data was used, and the result comes with a statistical guarantee in the form of a p-value.

The p-value tells us how confident we can be that radioactive data was used: the smaller the p-value, the less likely it is that the observed signal arose by chance.

The researchers consider p-values well below 0.1% in order to rule out results that could have been obtained by chance.
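To make the p-value idea concrete, here is a minimal sketch. It assumes the detection statistic is the cosine similarity between a classifier weight vector and a secret marking direction; the function name `cosine_p_value`, the dimension and the trial count are illustrative choices, not taken from the article. Under the null hypothesis (the model never saw marked data) the weight direction is treated as random, so the p-value is estimated as the fraction of random directions that match the marking direction at least as well as the observed one.

```python
import numpy as np

def cosine_p_value(observed_cosine, dim, n_trials=20_000, seed=0):
    """Monte Carlo estimate of P(cos(u, v) >= observed_cosine), where u is a
    fixed unit vector and v is a uniformly random direction in `dim` dimensions."""
    rng = np.random.default_rng(seed)
    # By rotational symmetry u can be taken as the first basis vector, so the
    # cosine is the first coordinate of a normalised Gaussian sample.
    samples = rng.standard_normal((n_trials, dim))
    cosines = samples[:, 0] / np.linalg.norm(samples, axis=1)
    # Add-one smoothing keeps the estimate strictly positive.
    return (np.sum(cosines >= observed_cosine) + 1) / (n_trials + 1)

# Hypothetical observations: no alignment versus strong alignment.
p_chance = cosine_p_value(0.0, dim=32)
p_strong = cosine_p_value(0.9, dim=32)
print(f"cosine 0.0: p ~ {p_chance:.3f}")
print(f"cosine 0.9: p ~ {p_strong:.5f}")
```

A Monte Carlo estimate cannot report a p-value smaller than 1 / (n_trials + 1), so testing at the 0.1% level needs well over a thousand trials. A closed-form expression for this tail probability would avoid that limit.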

The technique works in three stages:

  • The marking stage where the radioactive mark is added to the vanilla training images, without changing their labels. 
  • The training stage that uses vanilla and/or marked images to train a multi-class classifier using regular learning algorithms. 
  • Finally, in the detection stage, the model is examined in order to determine whether marked data was used or not. 
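The three stages can be sketched on toy data. Everything below is an assumption made for illustration: the features are synthetic vectors rather than images, "training" is a centroid classifier (the weight vector is the difference of class means) rather than a neural network, every class-1 sample is marked, and the names `carrier`, `train_centroid_classifier` and the strength of 0.5 are hypothetical. The point is only to show that a mark added without touching labels leaves a measurable trace in the trained weights.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def train_centroid_classifier(features, labels):
    """Toy stand-in for training: weight = mean(class 1) - mean(class 0)."""
    return features[labels == 1].mean(axis=0) - features[labels == 0].mean(axis=0)

rng = np.random.default_rng(0)
dim, n = 32, 1000

# Synthetic two-class data; class 1 is offset along the first axis.
labels = rng.integers(0, 2, size=n)
features = rng.standard_normal((n, dim))
features[:, 0] += 2.0 * labels

# Stage 1, marking: shift class-1 samples along a secret unit direction.
# Labels are left unchanged.
carrier = rng.standard_normal(dim)
carrier /= np.linalg.norm(carrier)
marked = features + 0.5 * labels[:, None] * carrier

# Stage 2, training: one model on vanilla data, one on marked data.
w_vanilla = train_centroid_classifier(features, labels)
w_marked = train_centroid_classifier(marked, labels)

# Stage 3, detection: compare each model's weights with the secret direction.
cos_vanilla = cosine(w_vanilla, carrier)
cos_marked = cosine(w_marked, carrier)
print(f"vanilla model cosine: {cos_vanilla:.3f}")
print(f"marked model cosine:  {cos_marked:.3f}")
```

In this toy setup the marked model's weight vector equals the vanilla one plus 0.5 times the carrier, so its cosine with the carrier is always the larger of the two.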

However, marking datasets can be a tricky process. Modifying the data can open the dataset up to a new set of adversarial attacks. Changing a dataset without affecting the models trained on it is also challenging: previous works either change the labels or add a visible cue to the images, and both degrade accuracy.

The radioactive data technique makes its changes through small perturbations in the feature space that are consistent across images of the same class. Furthermore, an alignment technique makes it possible to detect this perturbation in the feature space even if the architecture of the trained model differs from that of the marking network.
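One way to picture the alignment step is as a least-squares fit between two feature spaces. The sketch below rests on a strong assumption that is not stated in the article: that the new model's features are roughly a linear function of the marking network's features. All names, dimensions and the noise level are hypothetical. Given features of the same inputs from both networks, the fitted map carries the secret marking direction into the new model's feature space, where detection can then take place.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_mark, dim_new, n = 16, 24, 500

# Hypothetical setup: features of the same n inputs from the marking network
# and from a differently shaped network, related by an unknown linear map.
true_map = rng.standard_normal((dim_mark, dim_new))
feats_mark = rng.standard_normal((n, dim_mark))
feats_new = feats_mark @ true_map + 0.01 * rng.standard_normal((n, dim_new))

# Alignment: least-squares estimate of the map from marking space to new space.
est_map, *_ = np.linalg.lstsq(feats_mark, feats_new, rcond=None)

# Carry the secret marking direction over into the new feature space.
carrier = rng.standard_normal(dim_mark)
carrier /= np.linalg.norm(carrier)
carrier_new = carrier @ est_map
carrier_ref = carrier @ true_map  # known only because this is a simulation

agreement = carrier_new @ carrier_ref / (
    np.linalg.norm(carrier_new) * np.linalg.norm(carrier_ref)
)
print(f"cosine between estimated and true mapped carrier: {agreement:.4f}")
```

With real networks the relationship between feature spaces is only approximately linear, so the recovered direction would be noisier than in this simulation.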

Verifying whether data has been used in downstream models becomes crucial in large-scale systems, where complicated pipelines can make it difficult to track the use of each data point. Tracking matters all the more given how notorious ML models are for being biased.

The researchers believe the radioactive data technique will help the ML community understand how others in the field are training their models, and will also protect against the misuse of particular datasets.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.