Supervised learning, a method for training predictive models on labelled data, is simple but highly laborious, because it requires task-specific labelled data to train on. Furthermore, as model accuracy improves, models grow larger and obtaining enough labelled data becomes increasingly challenging.
This is where semi-supervised learning comes in.
Semi-supervised learning is an ML paradigm that combines a small amount of labelled data with a large amount of unlabelled data. It has seen success with methods such as UDA and SimCLR.
However, despite extensive research, semi-supervised learning had only been applied successfully in low-data regimes such as CIFAR, SVHN, and ImageNet with 10 percent of its labels. In high-data regimes these models could not compete with fully supervised systems, which prevented semi-supervised learning from being used in applications such as self-driving cars or search engines.
Noisy Student Training
This challenge pushed researchers to develop Noisy Student Training, a semi-supervised learning method that works in high-data regimes, achieving state-of-the-art accuracy on ImageNet using 130 million additional unlabelled images.
Noisy Student Training proceeds in three steps:
- Train a teacher classifier on the labelled data.
- Use the teacher to infer pseudo-labels on a much larger unlabelled dataset.
- Train a larger student classifier on the combination of labelled and pseudo-labelled data, while injecting noise into the student.
The same process can then be repeated, treating the student as a new teacher. Noisy Student Training is thus a form of self-training: the model generates pseudo-labels to improve its own performance. Beyond ImageNet accuracy, it also improves robustness on test sets such as ImageNet-A, ImageNet-C, and ImageNet-P.
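One round of the loop above can be sketched as follows. This is a toy illustration on 1-D data, not the paper's method: the "model" is a nearest-centroid classifier, and the "noise" is simple input jitter standing in for the data augmentation, dropout, and stochastic depth used in practice.

```python
# Toy sketch of one Noisy Student round: teacher pseudo-labels the
# unlabelled pool, then a (nominally larger) student trains on the
# combined data with input noise. All names here are illustrative.
import random

def train_centroid_classifier(points, labels):
    """Fit one centroid per class (a stand-in for a real network)."""
    centroids = {}
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        centroids[c] = sum(members) / len(members)
    return centroids

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda c: abs(centroids[c] - x))

def noisy_student_round(labeled_x, labeled_y, unlabeled_x, rng, noise=0.1):
    # Step 1: train the teacher on labelled data only.
    teacher = train_centroid_classifier(labeled_x, labeled_y)
    # Step 2: teacher infers pseudo-labels on the unlabelled set.
    pseudo_y = [predict(teacher, x) for x in unlabeled_x]
    # Step 3: student trains on labelled + pseudo-labelled data,
    # with noise injected into the inputs.
    noisy_x = [x + rng.uniform(-noise, noise) for x in labeled_x + unlabeled_x]
    return train_centroid_classifier(noisy_x, labeled_y + pseudo_y)

rng = random.Random(0)
labeled_x, labeled_y = [0.0, 1.0], [0, 1]
unlabeled_x = [0.1, 0.2, 0.9, 0.8]
student = noisy_student_round(labeled_x, labeled_y, unlabeled_x, rng)
# To iterate, the student would now act as the teacher for the next round.
```

The iteration described above corresponds to feeding `student` back in as the teacher and repeating the round.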
Noisy Student Training can be compared to knowledge distillation, which transfers knowledge from a large model to a smaller one. Distillation speeds up inference without greatly hampering quality. For instance, when semi-supervised distillation (SSD, described below) is applied in the vision domain to EfficientNet models (a family ranging from EfficientNet-B0 with 5.3M parameters to EfficientNet-B7 with 66M parameters), it achieves better performance than Noisy Student Training alone.
However, unlike Noisy Student Training, distillation does not apply noise during training, and it usually uses a smaller student model rather than a larger one.
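The core of knowledge distillation can be shown with a short numeric sketch: the student is trained to match the teacher's temperature-softened class probabilities. The logits and temperature below are illustrative values, not from any published model.

```python
# Minimal sketch of a distillation objective: cross-entropy between the
# teacher's and student's temperature-softened output distributions.
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution
    against the teacher's softened distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

teacher_logits = [4.0, 1.0, 0.5]   # large model's outputs (illustrative)
good_student   = [3.8, 1.1, 0.4]   # closely mimics the teacher
bad_student    = [0.5, 4.0, 1.0]   # disagrees with the teacher

loss_good = distillation_loss(teacher_logits, good_student)
loss_bad = distillation_loss(teacher_logits, bad_student)
# A student that mimics the teacher incurs a lower distillation loss,
# so minimizing this loss pulls the student toward the teacher.
```

A temperature above 1 spreads probability mass across classes, exposing the teacher's "dark knowledge" about which wrong classes are more plausible than others.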
Semi-supervised distillation, or SSD, is a simplified version of Noisy Student Training. It applies the Noisy Student procedure twice: first to obtain an equal-or-larger teacher model, and then to derive a smaller student. This produces a better model than one produced by supervised learning or Noisy Student Training alone.
Noisy Student Training relies on data augmentation, such as RandAugment for vision and SpecAugment for speech, for good performance. But in domains such as natural language processing, comparable input noise is not readily available. In such cases, where Noisy Student Training can be simplified to use no noise, SSD becomes the simpler choice.
SSD follows these steps:
- The teacher model first infers pseudo-labels on the unlabelled dataset.
- A new teacher model of equal or larger size than the original is then trained on this data. This is essentially a self-training step.
- Finally, knowledge distillation is applied to produce a smaller student model for production.
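The three SSD steps above can be sketched end-to-end on the same kind of toy 1-D data. As before, the centroid "models" are stand-ins for real networks, and all names and numbers are illustrative, not from Google's system; note that, unlike Noisy Student Training, no noise is injected.

```python
# Runnable sketch of the SSD pipeline: pseudo-label, self-train an
# equal-or-larger teacher (no noise), then distil a production student.

def fit(points, labels):
    """One centroid per class: a stand-in for training a real model."""
    return {c: sum(p for p, l in zip(points, labels) if l == c)
               / labels.count(c)
            for c in set(labels)}

def predict(model, x):
    """Assign x to the class with the nearest centroid."""
    return min(model, key=lambda c: abs(model[c] - x))

labeled_x, labeled_y = [0.0, 1.0], [0, 1]
unlabeled_x = [0.1, 0.2, 0.8, 0.9]

# Step 1: the initial teacher infers pseudo-labels on the unlabelled set.
teacher = fit(labeled_x, labeled_y)
pseudo_y = [predict(teacher, x) for x in unlabeled_x]

# Step 2: self-training -- retrain an equal-or-larger teacher on the
# labelled + pseudo-labelled data, with no noise.
big_teacher = fit(labeled_x + unlabeled_x, labeled_y + pseudo_y)

# Step 3: distillation -- a smaller student learns to reproduce the big
# teacher's predictions, yielding a compact model for production.
student = fit(unlabeled_x, [predict(big_teacher, x) for x in unlabeled_x])
```

In a real system, step 2 would train a large network on hundreds of millions of pseudo-labelled examples, and step 3 would compress it into a model small enough to serve.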
An interesting use case of SSD is language understanding within Google Search. Google describes it as the first successful instance of semi-supervised learning applied at such a large scale, demonstrating the potential impact of these approaches for production-scale systems.
In this use case, SSD is applied in the ranking component of Search to models built on BERT, so that they understand language better.