“How much of the data is superfluous? Which examples are important for generalisation? And how does one find them?”
With the cost of training deep learning models on the rise, individual researchers and small organisations are settling for pre-trained models. Today, the likes of Google or Microsoft have budgets (read: millions of dollars) for training state-of-the-art language models. Meanwhile, efforts are underway to make the whole paradigm of training less daunting for everyone. Researchers are actively exploring ways to maximise training efficiency so that models run faster and use less memory.
A common practice is to train small models until they converge and then apply a compression technique lightly. Techniques like parameter pruning have already become popular for reducing redundancies without sacrificing accuracy. In pruning, redundancies in the model parameters are identified, and the redundant, non-critical parameters are removed. Identifying important training data plays a similar role in online and active learning. But how much of the data is superfluous? Which examples are important for generalisation? And how does one find them?
For instance, the capabilities of computer vision systems have improved greatly due to (a) deeper models with high complexity, (b) increased computational power and (c) the availability of large-scale labeled data. But the distribution of each layer’s inputs in a deep neural network changes as the parameters of the previous layers change. This shift slows down learning and makes training harder, particularly for models with saturating nonlinearities.
Researchers at Stanford University demonstrated in a recent study that, on standard vision benchmarks, the initial loss gradient norm of individual training examples, averaged over several weight initializations, can be used to identify a smaller set of training data that is important for generalisation. The team proposed data pruning methods that use only local information early in training, and connected them to recent work that prunes data by discarding examples that are rarely forgotten over the course of training. The methods also rank examples by their importance for generalisation, detect noisy examples and identify subspaces of the model’s data representation that are relatively stable over training.
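To make the idea of an "initial loss gradient norm, averaged over several weight initializations" concrete, here is a minimal NumPy sketch. It assumes a toy linear softmax classifier with cross-entropy loss rather than the ResNets used in the study, because that model has a closed-form per-example gradient. The function name `grand_scores` and its parameters are illustrative and are not taken from the researchers' code.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def grand_scores(x, y_onehot, num_inits=10, seed=0):
    """Per-example loss gradient norm at initialisation, averaged over inits.

    Assumption: a toy linear softmax classifier (logits = x @ W + b) trained
    with cross-entropy, for which the per-example gradient is
    dL/dW = outer(x, p - y) and dL/db = p - y.
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    k = y_onehot.shape[1]
    total = np.zeros(n)
    for _ in range(num_inits):
        w = rng.normal(scale=1.0 / np.sqrt(d), size=(d, k))  # fresh random init
        b = np.zeros(k)
        err = softmax(x @ w + b) - y_onehot  # (n, k) error vectors
        # ||outer(x, err)||_F^2 = ||x||^2 * ||err||^2; the bias gradient
        # adds ||err||^2 to the squared norm.
        sq_norm = (err ** 2).sum(axis=1) * ((x ** 2).sum(axis=1) + 1.0)
        total += np.sqrt(sq_norm)
    return total / num_inits


# Usage on random stand-in data (8 examples, 5 features, 3 classes).
rng = np.random.default_rng(1)
x = rng.normal(size=(8, 5))
labels = rng.integers(0, 3, size=8)
y = np.eye(3)[labels]
scores = grand_scores(x, y)
ranking = np.argsort(-scores)  # example indices, highest score first
```

For a deep network the per-example gradient would have to be computed by automatic differentiation instead of this closed form, but the averaging over initializations and the final ranking would look the same.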
Earlier work has shown that rarely forgotten training examples have little effect on accuracy, and that a large fraction of the training data can be removed without any impact on test accuracy. For this experiment, the researchers started from a principled approach, exploring how much, on average, each training example influences the loss reduction of other examples.
The team obtained two scores, the gradient norm (GraNd) and the error norm (EL2N), that bound or approximate these influences. The idea is that the higher an example's score, the more influential it is. Examples with higher scores also tend to be forgotten more often over the course of training. The researchers observed that the very highest-scoring examples are often unrepresentative outliers or subject to label noise. They therefore prune data by keeping only the examples within a range of scores, where the start and the end of the range are just two hyperparameters that can be tuned on a validation set.
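The error norm and the score-range rule can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the researchers' implementation: it takes softmax outputs from a handful of model replicates as given, and the names `el2n_scores`, `keep_in_score_range`, `low` and `high` are hypothetical.

```python
import numpy as np


def el2n_scores(probs, labels):
    """Mean L2 norm of (softmax output - one-hot label) over model replicates.

    probs:  array of shape (replicates, examples, classes) holding softmax
            outputs, e.g. from a few networks evaluated early in training.
    labels: integer class labels of shape (examples,).
    """
    num_classes = probs.shape[-1]
    onehot = np.eye(num_classes)[labels]              # (examples, classes)
    errors = np.linalg.norm(probs - onehot, axis=-1)  # (replicates, examples)
    return errors.mean(axis=0)


def keep_in_score_range(scores, low, high):
    """Indices of examples whose score lies in [low, high]."""
    return np.flatnonzero((scores >= low) & (scores <= high))


# Two replicates, three examples, two classes; every true label is class 0.
probs = np.array([
    [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]],
    [[0.8, 0.2], [0.6, 0.4], [0.2, 0.8]],
])
labels = np.array([0, 0, 0])
scores = el2n_scores(probs, labels)           # approx. [0.212, 0.636, 1.202]
kept = keep_in_score_range(scores, 0.3, 1.0)  # only example 1 is in range
```

Here the confidently correct example falls below the range and the confidently wrong one (a possible label-noise case) falls above it, so only the middle example is kept. In practice the two bounds would be tuned on a validation set, as described above.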
The experiments were implemented with the JAX and Flax frameworks. The CIFAR-10, CIFAR-100 and CINIC-10 datasets were used for training in their standard format. For CINIC-10, the training set has 180,000 images and the standard test set has 90,000 images. ResNet18-v1 and ResNet50-v1 models were used to demonstrate the influences on the loss reduction.
The contributions of this work can be summarised as follows:
- Proposed a method to score the importance of each training example by its expected loss gradient norm (GraNd score).
- Demonstrated that pruning training samples with small GraNd scores at initialization allows one to train on as little as 50% of the training data without any loss in accuracy.
- Showed that the norm of the error vector (EL2N score) provides even better information for data pruning across a wide range of pruning levels, even early in training.
- Demonstrated that excluding a small subset of the very highest-scoring examples produces a boost in performance.
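The pruning step described in these contributions can be expressed as a simple selection over precomputed scores. The sketch below is a hedged illustration: `select_by_score`, `keep_fraction` and `drop_top_fraction` are hypothetical names, and the scores are made-up numbers standing in for GraNd or EL2N values.

```python
import numpy as np


def select_by_score(scores, keep_fraction=0.5, drop_top_fraction=0.0):
    """Keep the highest-scoring examples, optionally skipping the very top.

    keep_fraction:     share of the dataset to retain (hypothetical parameter).
    drop_top_fraction: share of the highest-scoring examples to exclude
                       before selecting (hypothetical parameter).
    Returns the sorted indices of the retained examples.
    """
    n = len(scores)
    order = np.argsort(scores)[::-1]  # indices, highest score first
    start = int(round(drop_top_fraction * n))
    stop = min(n, start + int(round(keep_fraction * n)))
    return np.sort(order[start:stop])


# Made-up scores for ten training examples.
scores = np.array([0.05, 0.90, 0.40, 0.10, 0.70, 0.30, 0.99, 0.20, 0.60, 0.50])

# Prune the low-scoring half of the data.
top_half = select_by_score(scores, keep_fraction=0.5)

# Same budget, but skip the single highest-scoring example first.
shifted = select_by_score(scores, keep_fraction=0.5, drop_top_fraction=0.1)
```

With these numbers, `top_half` retains examples 1, 4, 6, 8 and 9, while `shifted` drops example 6 (score 0.99) and picks up example 2 instead. Which fractions work best is an empirical question for each dataset and model.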
Data selection methods, such as active learning, are useful tools for machine learning on large datasets. However, they can be prohibitively expensive to apply in deep learning because of their dependence on feature representations. This work on smaller training sets and their influence on generalisation proposes a new methodology for deep learning analysis. According to the researchers, studying how different examples contribute differently to learning dynamics will enable a better understanding of data pruning, curriculum design, active learning, federated learning with privacy, and even the analysis of fairness and bias.