ImageNet is one of the most popular image datasets, organized according to the WordNet hierarchy. It was created with the objective of collecting tens of millions of cleanly sorted images for most of the concepts in that hierarchy.
- Total number of images: 14,197,122
- Number of images with bounding box annotations: 1,034,908
Today, almost every state-of-the-art image recognition model is pre-trained on the ImageNet database. The universality of ImageNet makes one wonder if it is worth the praise.
To investigate this, researchers from Google, in their new work, ran experiments to check whether progress on the ImageNet classification benchmark is as good as it is considered to be. The authors check whether the benchmark still reflects meaningful generalization, or whether the community has started to overfit to the idiosyncrasies of the labels in the ImageNet database.
The researchers at Google developed a more robust procedure for collecting human annotations of the ImageNet validation set. Using these new labels, they reassessed the accuracy of recently proposed ImageNet classifiers. To their surprise, they found that the gains measured on the new labels are substantially smaller than those reported on the original labels.
The researchers demonstrated the shortcomings of the ImageNet labels, then developed a new labelling procedure and introduced a new metric, called ReaL accuracy, to evaluate models against the new labels.
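The article does not spell out how ReaL accuracy is computed. A minimal sketch, assuming the metric counts a model's top-1 prediction as correct when it appears in the image's set of reassessed labels, and assuming images left with an empty label set are skipped, could look like this. The function name and the toy class ids are hypothetical.

```python
def real_accuracy(predictions, real_labels):
    """Fraction of images whose top-1 prediction is in the image's label set.

    Assumption: images with an empty label set are excluded from the score.
    """
    correct = 0
    counted = 0
    for pred, labels in zip(predictions, real_labels):
        if not labels:
            continue  # no valid label for this image, so skip it
        counted += 1
        if pred in labels:
            correct += 1
    return correct / counted if counted else 0.0


# Toy data: the class ids are made up for illustration.
predictions = [3, 7, 1, 4]
real_labels = [{3, 5}, {2}, set(), {4}]
score = real_accuracy(predictions, real_labels)
```

Because each image carries a set of labels rather than one, a model is not penalised for naming a second object that is genuinely present in the image.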
Where Does ImageNet Fall Short And What Can Be Done
In the paper's example images, red indicates the original ImageNet label and green indicates the proposed ReaL labels. The authors show that even when a single object is present, ImageNet labels contain systematic inaccuracies due to the labelling procedure.
The authors came across two shortcomings of the ImageNet labelling procedure. First, each image is annotated with a single label, even when it contains multiple objects. Second, as noted above, even single-object images can end up with systematically inaccurate labels.
To address this problem, the authors proposed the following:
- Use a training objective which allows models to emit multiple non-exclusive predictions for a given image.
- Treat the multi-way classification problem as a set of independent binary classification problems.
- Penalise each one with a sigmoid cross-entropy loss, which does not enforce mutually exclusive predictions.
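The proposal above can be illustrated with a small sketch of a sigmoid cross-entropy loss for one image. This is not the authors' code: the function name, the toy logits, and the choice to sum the per-class terms are assumptions made for illustration.

```python
import math


def sigmoid_cross_entropy(logits, targets):
    """Sum of independent binary cross-entropy terms, one per class.

    Uses the numerically stable form max(x, 0) - x*t + log(1 + exp(-|x|)),
    which equals -[t*log(s) + (1 - t)*log(1 - s)] with s = sigmoid(x).
    """
    total = 0.0
    for x, t in zip(logits, targets):
        total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return total


# Hypothetical logits for a 3-class problem where classes 0 and 1 both score high.
logits = [4.0, 3.0, -5.0]
probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]  # need not sum to 1

loss_multi = sigmoid_cross_entropy(logits, [1, 1, 0])   # both objects labelled
loss_single = sigmoid_cross_entropy(logits, [1, 0, 0])  # only one object labelled
```

Since each class gets its own sigmoid, two classes can both receive a probability close to 1, whereas a softmax would force them to compete for the same probability mass.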
The authors observed that recent top-performing models surpass the original ImageNet labels in their ability to predict human preferences. To filter out the noise, the researchers used the BiT-L model to clean the ImageNet training set.
First, the training images are split into 10 equally sized folds. One fold is held out, and the BiT-L model is trained on the remaining 9 folds. The resulting model is then used to predict labels on the held-out fold, and the process is repeated for each fold. Images whose labels are inconsistent with BiT-L's predictions are removed.
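The fold-based cleaning described above can be sketched as follows. This is an illustration only: a toy 1-D nearest-centroid classifier stands in for BiT-L, and the helper names (`clean_with_folds`, `train_nearest_centroid`) and the toy data are hypothetical.

```python
import random


def clean_with_folds(features, labels, train_fn, num_folds=10, seed=0):
    """Return indices whose label agrees with a model trained on the other folds."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::num_folds] for i in range(num_folds)]
    kept = []
    for held_out in folds:
        held_out_set = set(held_out)
        train_idx = [i for i in indices if i not in held_out_set]
        # Train on the remaining folds, then predict on the held-out fold.
        predict = train_fn([features[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        kept.extend(i for i in held_out if predict(features[i]) == labels[i])
    return sorted(kept)


def train_nearest_centroid(xs, ys):
    """Toy stand-in for BiT-L: assign each point to the closest class mean."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))


# Toy data: class 0 lies near 0, class 1 lies near 10; index 5 is mislabelled.
features = [i * 0.1 for i in range(10)] + [10 + i * 0.1 for i in range(10)]
labels = [0] * 10 + [1] * 10
labels[5] = 1
kept = clean_with_folds(features, labels, train_nearest_centroid)
```

Training on the other folds means the model never sees the label it is asked to check, so it cannot simply memorise a noisy label and then agree with it.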
While exposing the reduced usefulness of the ImageNet benchmark, the authors also proposed a solution to improve the results for ImageNet.
They observed that long training schedules can be a hindrance in the presence of noisy data. So, they believe that cleaning the ImageNet training set will yield additional benefits.
The authors draw the following insights from their experiments:
- Training on clean ImageNet data consistently improves accuracy
- Using sigmoid loss resulted in consistent accuracy improvements across all ResNet architectures
The authors acknowledge that the ImageNet dataset is a landmark achievement in the evaluation of machine learning techniques, but they underline some limitations. One is that each image is described by a single label, even though many images contain multiple objects. Even for images containing a single object, biases in the collection procedure can lead to systematic inaccuracies. Some ImageNet classes also draw distinctions between essentially identical groups of images. The researchers therefore stress the need for a new human annotation procedure.
This work investigated whether recent progress on the ImageNet benchmark amounts to meaningful generalisation. Using “Reassessed Labels” (ReaL), the researchers have found that the association between progress on ImageNet and ReaL progress has weakened over time. The findings from this work can be summarised as follows:
- Recent top models have begun to surpass the original ImageNet labels in terms of their ReaL accuracy.
- As a result, the usefulness of the original ImageNet labels for evaluating vision models may be nearing an end.
- Results on ImageNet can be improved by cleaning the training data and by using a sigmoid loss.