Stanford Researchers Put Deep Learning On A Data Diet

“How much of the data is superfluous? Which examples are important for generalisation? And how does one find them?”

With the cost of training deep learning models on the rise, individual researchers and small organisations are settling for pre-trained models. Today, the likes of Google and Microsoft have the budgets (read: millions of dollars) needed to train state-of-the-art language models. Meanwhile, efforts are underway to make the whole paradigm of training less daunting for everyone, and researchers are actively exploring ways to maximise training efficiency so that models run faster and use less memory.

A common practice is to train a model until it converges and then apply a compression technique. Techniques like parameter pruning have become popular for reducing redundancy without sacrificing accuracy: redundancies in the model parameters are identified, and the non-critical ones are removed. Identifying important training data plays a similar role in online and active learning. But how much of the data is superfluous? Which examples are important for generalisation? And how does one find them?
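As a loose illustration (not from the paper, which prunes data rather than weights), magnitude-based parameter pruning simply zeroes out the weights whose absolute value falls below a threshold. The sketch below assumes a JAX pytree of parameters and a target sparsity; both names are placeholders.

```python
import jax
import jax.numpy as jnp

def magnitude_prune(params, sparsity=0.5):
    """Zero out the smallest-magnitude weights so that roughly a
    `sparsity` fraction of all parameters is removed (a common,
    simple form of parameter pruning)."""
    # collect the magnitudes of every parameter in the model
    magnitudes = jnp.concatenate(
        [jnp.ravel(jnp.abs(p)) for p in jax.tree_util.tree_leaves(params)])
    threshold = jnp.quantile(magnitudes, sparsity)    # magnitude cut-off
    # set every weight below the cut-off to zero, keep the rest
    return jax.tree_util.tree_map(
        lambda p: jnp.where(jnp.abs(p) < threshold, 0.0, p), params)
```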

For instance, the capabilities of computer vision systems have improved greatly due to (a) deeper models with higher complexity, (b) increased computational power and (c) the availability of large-scale labeled data. At the same time, the distribution of each layer's inputs in a deep neural network shifts as the parameters of the previous layers change, which slows learning and makes models with many nonlinearities harder to train.

On standard vision benchmarks, the initial loss-gradient norm of individual training examples, averaged over several weight initializations, can be used to identify a smaller set of training data that is important for generalisation, researchers at Stanford University demonstrated in a recent study. The team proposed data pruning methods that use only local information early in training and connected them to recent work that prunes data by discarding examples rarely forgotten over the course of training. The methods also rank examples by their importance for generalisation, detect noisy examples and identify subspaces of the model's data representation that are relatively stable over training.
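The sketch below, a rough illustration rather than the authors' released code, shows how such a score can be computed in JAX: the norm of each example's loss gradient, averaged over a handful of random initializations. Here `apply_fn` (a Flax-style model apply function), `param_sets` (a list of initial parameter pytrees) and the arrays `xs`, `ys` are assumed placeholders.

```python
import jax
import jax.numpy as jnp

def grand_scores(param_sets, apply_fn, xs, ys):
    """Per-example loss-gradient norm, averaged over several
    random weight initializations (a GraNd-style score)."""
    def per_example_loss(params, x, y):
        logits = apply_fn(params, x[None])        # add a batch dimension
        return -jax.nn.log_softmax(logits)[0, y]  # cross-entropy for one example

    def scores_for_one_init(params):
        def grad_norm(x, y):
            grads = jax.grad(per_example_loss)(params, x, y)
            squared = sum(jnp.sum(g ** 2) for g in jax.tree_util.tree_leaves(grads))
            return jnp.sqrt(squared)
        return jax.vmap(grad_norm)(xs, ys)        # one score per training example

    # average the scores over the supplied initializations
    return jnp.mean(jnp.stack([scores_for_one_init(p) for p in param_sets]), axis=0)
```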

Earlier work has shown that rarely forgotten training examples can be removed without affecting training accuracy, and that a large fraction of the training data can be dropped with no impact on test accuracy. For this study, the researchers took a more principled approach, asking how much, on average, each training example influences the loss reduction on the rest of the data.
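Roughly, and glossing over the paper's exact constants: a single SGD step of size η on example x changes the loss on any other example x′ by, to first order, the inner product of their gradients, so the gradient norm of x bounds how much x can move the loss anywhere else.

```latex
% First-order effect on the loss at x' of an SGD step taken on x:
\Delta L(x') \approx -\eta \,\langle \nabla_\theta L(x'),\, \nabla_\theta L(x) \rangle,
% so by Cauchy-Schwarz its magnitude is bounded by the gradient norms:
|\Delta L(x')| \le \eta \,\lVert \nabla_\theta L(x') \rVert \,\lVert \nabla_\theta L(x) \rVert .
```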

The team derived two scores, the gradient norm (GraNd) and the error L2-norm (EL2N), that bound or approximate these influences. The idea is simple: the higher the score, the more influential the example, and high-scoring examples are also forgotten more often over the course of training. However, the researchers observed that the very highest-scoring examples tend to be unrepresentative outliers and may even carry label noise. They therefore tweaked the procedure to prune data by keeping only the examples within a range of scores, where the start and end of the range are just two hyperparameters that can be tuned on a validation set.
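A rough illustration of the EL2N score and the range-based pruning just described (again a sketch with assumed placeholder names, not the paper's code): the score is simply the L2 norm of the error vector, the softmax output minus the one-hot label, computed with parameters taken from early in training.

```python
import jax
import jax.numpy as jnp

def el2n_scores(params, apply_fn, xs, ys, num_classes=10):
    """EL2N-style score: L2 norm of the error vector,
    i.e. the softmax output minus the one-hot label."""
    probs = jax.nn.softmax(apply_fn(params, xs), axis=-1)
    errors = probs - jax.nn.one_hot(ys, num_classes)
    return jnp.linalg.norm(errors, axis=-1)

def keep_within_score_range(scores, low_quantile, high_quantile):
    """Keep only the examples whose score falls between two quantiles;
    the two quantiles are the hyperparameters tuned on a validation set."""
    low, high = jnp.quantile(scores, jnp.array([low_quantile, high_quantile]))
    return jnp.nonzero((scores >= low) & (scores <= high))[0]   # indices to keep
```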

For the experiments, the JAX and Flax frameworks were used. The CIFAR-10, CIFAR-100 and CINIC-10 datasets were used in their standard formats; CINIC-10 is the largest of the three, with a training set of 180,000 images and a standard test set of 90,000 images. ResNet18-v1 and ResNet50-v1 models were trained to demonstrate the influence of individual examples on loss reduction.

(Image credits: Paul et al.)

The contributions of this work can be summarised as follows:

  • Proposed a method to score the importance of each training example by its expected loss-gradient norm (the GraNd score).
  • Demonstrated that pruning training samples with small GraNd scores at initialization allows one to train on as little as 50% of the training data without any loss in accuracy (a subset selection of this kind is sketched after this list).
  • Showed that the norm of the error vector (the EL2N score) provides even better information for data pruning across a wide range of pruning levels, even early in training.
  • Demonstrated that excluding a small subset of the very highest-scoring examples produces a boost in performance.
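For the second point above, once the scores have been computed, selecting the subset takes only a few lines. The snippet below is a hypothetical continuation of the earlier sketches, reusing a score array such as the one `grand_scores` returns.

```python
import jax.numpy as jnp

def keep_top_fraction(scores, xs, ys, fraction=0.5):
    """Keep the `fraction` of examples with the largest scores
    (i.e. prune the low-scoring rest) and return the subset to train on."""
    num_keep = int(round(fraction * scores.shape[0]))
    keep = jnp.argsort(scores)[-num_keep:]        # indices of the highest scores
    return xs[keep], ys[keep]
```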

Data selection methods such as active learning are useful tools for machine learning on large datasets. However, they can be prohibitively expensive to apply in deep learning because of their dependence on feature representations. By studying smaller training sets and their influence on generalisation, this work proposes a new way of analysing deep learning. According to the researchers, understanding the learning dynamics behind the differing contributions of individual examples could enable better data pruning, curriculum design, active learning, federated learning with privacy, and even analyses of fairness and bias.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.