Will this be the end of ImageNet as we know it or the beginning of something great? Earlier this week, researchers from the University of Oxford released PASS, or Pictures without humAns for Self-Supervision, an ImageNet replacement dataset for self-supervised pre-training.
ImageNet is one of the most popular datasets used in computer vision and machine learning applications today. It contains more than 14 million images across roughly 20,000 categories, organised under 27 high-level categories with at least 500 images each. However, every image in ImageNet is manually labelled, and the dataset raises privacy concerns.
Beyond this, ImageNet and other large datasets sampled from the Internet for pre-training models, such as PASCAL VOC and MS COCO, have ethical and technical shortcomings: they contain personal information collected without consent, images with unclear licensing, biases, and in some cases even problematic content.
Coming to the rescue, PASS does not include any humans and can be used for high-quality pre-training while significantly reducing privacy concerns.
Making ImageNet Obsolete
In recent times, SOTA pre-training has been obtained with unsupervised methods, meaning that labelled datasets such as ImageNet may not be necessary, or perhaps not even optimal, for model pre-training. PASS is one such unlabelled dataset; it contains only images with a Creative Commons (CC-BY) license and complete attribution metadata, addressing the copyright issue.
The best part is, it does not contain images of people at all. It also excludes other types of images that are problematic for data protection or ethics.
ImageNet Vs PASS
Charting out the difference between ImageNet and PASS, the researchers extensively evaluated SSL methods on PASS and discussed performance differences when training on each dataset.
Accordingly, the team noted three essential differences – the lack of class-level curation and search, the lack of ‘community optimisation’, and the lack of humans.
Further, studying the contribution of these effects via ablation, the researchers found that:
- Self-supervised techniques like MoCo, SwAV and DINO train well on their dataset, yielding strong image representations.
- Excluding images with humans during pre-training has almost no effect on downstream task performances, even if this is done in ImageNet.
- In 8/13 frozen encoder evaluation benchmarks, the performance of models trained on PASS yields better results than pre-training on ImageNet, ImageNet without humans, or Places205, when transferred to other datasets.
- PASS also performs comparably in fine-tuning evaluations, such as detection and segmentation – pre-training on it yields results within ±1% mAP and AP50 on COCO.
- Even on tasks involving humans, such as dense pose prediction, pre-training on their dataset yields performance on par with ImageNet pre-training.
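The methods listed above (MoCo, SwAV, DINO) differ in their details, but contrastive approaches in the MoCo family share a common core: an InfoNCE-style loss that pulls two augmented views of the same image together and pushes views of different images apart. The sketch below is a simplified, hypothetical illustration of that objective in plain NumPy, not the authors' implementation.

```python
import numpy as np

def info_nce_loss(queries, keys, temperature=0.07):
    """Simplified InfoNCE loss on L2-normalised embeddings.

    queries, keys: (N, D) arrays where row i of `keys` is the positive
    (a differently-augmented view of image i) for row i of `queries`;
    all other rows act as negatives.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature                # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # positives on the diagonal

# Toy example: two "views" of the same 4 images vs. unrelated images
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8))
loss_matched = info_nce_loss(base, base + 0.05 * rng.normal(size=(4, 8)))
loss_random = info_nce_loss(base, rng.normal(size=(4, 8)))
```

As expected, near-identical view pairs yield a lower loss than unrelated pairs, which is the signal these methods exploit when training on an unlabelled dataset like PASS.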
In addition to this, the researchers showed that PASS can be used for pre-training with methods like MoCo-v2, SwAV and DINO. Moreover, in transfer learning settings, it yields similar downstream performance to ImageNet pre-training, even on tasks that involve humans, like human pose estimation.
Advantages of PASS
PASS does not make existing datasets obsolete. However, it shows that model pre-training is often possible while using safer data, and it also provides the basis for a more robust evaluation of pre-training techniques.
Plus, the researchers believe that PASS improves on existing datasets from technical, ethical and legal standpoints alike. In brief,
- By using CC-BY license images, we can greatly reduce the risk of using images in a manner incompatible with copyright.
- By avoiding the usage of search engines and labels to form a dataset, we avoid introducing corresponding biases.
- By excluding all images that contain humans and other identifiable information and NSFW images, we significantly reduce data protection and other ethical risks for the data subjects.
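The curation criteria above amount to a filter pipeline: drop anything without a clear CC-BY license, then drop anything flagged by human/NSFW detectors. The sketch below is a hypothetical illustration of that logic; the detector and license functions are placeholders, not the tools the authors actually used.

```python
def curate(images, has_human, is_nsfw, license_of):
    """Keep only CC-BY images that pass privacy/ethics filters.

    `has_human`, `is_nsfw`, and `license_of` are hypothetical callbacks
    standing in for the automated (and manual) checks used to build PASS.
    """
    kept = []
    for img in images:
        if license_of(img) != "CC-BY":
            continue  # unclear or restrictive licensing: drop
        if has_human(img) or is_nsfw(img):
            continue  # data-protection / ethics filter: drop
        kept.append(img)
    return kept

# Toy run with dummy metadata records standing in for images
images = [
    {"id": 1, "license": "CC-BY", "human": False, "nsfw": False},
    {"id": 2, "license": "CC-BY", "human": True,  "nsfw": False},
    {"id": 3, "license": "all-rights-reserved", "human": False, "nsfw": False},
]
kept = curate(images,
              has_human=lambda i: i["human"],
              is_nsfw=lambda i: i["nsfw"],
              license_of=lambda i: i["license"])
# only image 1 survives both the license check and the content filters
```

The order of the checks matters little here, but filtering on license first means the more expensive content detectors only run on images that could legally be included anyway.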
In addition to this, the researchers show that neural networks can be effectively pre-trained on this data. Extensive downstream evaluations show that networks obtained by self-supervised training on their dataset are competitive with ImageNet pre-training in transfer settings, even on downstream tasks involving humans.
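The frozen-encoder benchmarks mentioned earlier follow a standard protocol: the pre-trained encoder is fixed, features are extracted for a downstream dataset, and only a simple classifier is fit on top. The sketch below illustrates the idea with a nearest-centroid probe on synthetic features; this is a stand-in for the linear probes typically used, not the paper's exact evaluation.

```python
import numpy as np

def nearest_centroid_probe(train_feats, train_labels, test_feats):
    """Classify frozen-encoder features by nearest class centroid.

    The encoder is never updated; only this trivial classifier is fit
    on the downstream data, so the score reflects feature quality.
    """
    classes = np.unique(train_labels)
    centroids = np.stack([train_feats[train_labels == c].mean(axis=0)
                          for c in classes])
    dists = ((test_feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return classes[dists.argmin(axis=1)]

# Toy "features": two well-separated clusters stand in for encoder outputs
rng = np.random.default_rng(1)
train = np.vstack([rng.normal(0, 0.1, (20, 16)), rng.normal(3, 0.1, (20, 16))])
labels = np.array([0] * 20 + [1] * 20)
test = np.vstack([rng.normal(0, 0.1, (5, 16)), rng.normal(3, 0.1, (5, 16))])
preds = nearest_centroid_probe(train, labels, test)
accuracy = (preds == np.array([0] * 5 + [1] * 5)).mean()
```

Good pre-trained features make downstream classes separable enough that even a probe this simple scores highly, which is why frozen-encoder accuracy is a common proxy for pre-training quality.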
Limitations of PASS
Amid these advantages, there are several limitations as well. Firstly, while the researchers took care to filter the images, both automatically and manually, they acknowledge that some harmful content might have slipped through. Secondly, sampling images randomly from a large uncurated collection removes specific biases, such as search-engine selection, but not others, such as geographic bias. Thirdly, since there are no people in the pictures, PASS cannot be used to learn visual models of people, for tasks such as pose detection and recognition. Finally, as PASS contains no labels, it cannot be used alone for training and benchmarking; curated datasets remain necessary here, and these continue to carry many privacy and copyright issues.
The Oxford researchers believe that their unlabelled dataset PASS is an important step towards curating and improving datasets to reduce ethical and legal risks for many tasks and applications, while at the same time challenging the SSL community with a new, more realistic training scenario: utilising images not drawn from a labelled or annotated dataset.