Popular Image Datasets Under Scanner: MIT Takes Down One Of Their Own

Published on July 7, 2020

by Ram Sagar

“We are up against a system that has veritably mastered ethics shopping, ethics bluewashing, ethics lobbying, ethics dumping, and ethics shirking.”

The emergence of the ImageNet dataset is widely considered a pivotal moment in the deep learning revolution. AlexNet, a Convolutional Neural Network (CNN) with 60 million parameters, set new benchmarks for computer vision thanks to the vastness of ImageNet.

However, these popular image datasets are now under the scanner. Everything from their sources and labelling to the downstream effects of training on them is being questioned. In a paper titled ‘Large image datasets: A pyrrhic win for computer vision?’, the authors ask whether the success of computer vision has come at the expense of minoritized groups and has aided the gradual erosion of privacy and consent. This work has even led MIT to take down its popular Tiny Images dataset.

According to the authors, here are a few examples of how inaccurate data can feed into algorithmic results:

Systematic under-representation of women in search results for occupations
Object detection models that show higher error rates when recognising pedestrians from demographic groups with darker skin tones
Lighter-skinned males being classified with the highest accuracy, while darker-skinned females suffer the most misclassification
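Disparities like the ones listed above only become visible when accuracy is broken down by demographic group rather than reported as a single number. The following is a minimal sketch of that kind of breakdown. The function name `accuracy_by_group`, the group names, and the toy records are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return classification accuracy per group.

    `records` is an iterable of (group, true_label, predicted_label) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in records:
        total[group] += 1
        if truth == predicted:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Hypothetical toy predictions for illustration only, not data from the paper.
toy_records = [
    ("group_a", "cat", "cat"),
    ("group_a", "dog", "dog"),
    ("group_a", "cat", "cat"),
    ("group_a", "dog", "dog"),
    ("group_b", "cat", "cat"),
    ("group_b", "dog", "cat"),
    ("group_b", "cat", "dog"),
    ("group_b", "dog", "dog"),
]

per_group = accuracy_by_group(toy_records)
for group, accuracy in sorted(per_group.items()):
    print(f"{group}: {accuracy:.2f}")
```

In this toy data the overall accuracy is 75 percent, which hides the fact that one group is classified correctly every time and the other only half the time.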

Be it object detection for self-driving cars or diagnosis of skin cancer, results such as those listed above show that AI systems cannot afford to be unreliable.

The Problem With Large Scale Image Datasets

Many of the issues stem from the foundations of WordNet, the lexical taxonomy from which these datasets draw their labels. The authors call this the ‘WordNet effect’.

For instance, image labelling and validation have shortcomings such as the single-label-per-image procedure, even though real-world images usually contain multiple objects. Added to this are overly restrictive label proposals.

ImageNet, wrote the authors, is not the only large scale vision dataset that has inherited the shortcomings of the WordNet taxonomy. The 80 Million Tiny Images dataset, from which the CIFAR datasets were derived, used the same approach. Unlike ImageNet, the Tiny Images dataset had never been audited or scrutinized. The lack of scrutiny, warn the authors, might have also encouraged the rise of secretive datasets. They cite Clearview AI as an example of a company employing secretive datasets, a practice that redefines the very meaning of privacy as we know it.

“It [dataset] has been taken offline, and it will not be put back online. We ask the community to refrain from using it in future and also delete any existing copies of the dataset that may have been downloaded.”
– MIT

As a result of this investigation, MIT announced that it was taking down the 80 Million Tiny Images dataset because it contained derogatory terms as categories as well as offensive images. Created in 2006, the dataset contains 53,464 different nouns, which were copied directly from WordNet.

This work also sheds light on exploitation in the name of the Creative Commons (CC) license, which addresses only copyright issues and not consent to use images for training. Yet many of the efforts beyond ImageNet, including the Open Images dataset, have been built on top of the Creative Commons loophole that large scale dataset curation agencies interpret as a free-for-all.

What Can Be Done

“Feeding AI systems on the world’s beauty, ugliness, and cruelty, but expecting it to reflect only the beauty is a fantasy.”

Though the work primarily deals with the flaws in large scale datasets, the authors re-emphasize that their investigation specifically concerns the non-consensual aspect of the images and not the content of the images itself. However, there can be many parties, both individuals and groups, who might be disadvantaged due to lack of awareness.

The authors recommend remedies to these problems. Here are a few:

Employ a remove, replace and open strategy

When the practices of data collection are open to the public, auditing becomes easier. This, in turn, helps in identifying the problematic sections of a dataset. Once identified, these images should be removed and replaced with images that are collected fairly and with the consent of the concerned parties.

Using Synthetic Data

Data augmentation is a popular technique used to increase the diversity of the training dataset. Another way of implementing augmentation is to use synthetic images instead of real images. Existing GAN-based approaches can facilitate this.

Dataset Audit Cards

Much along the lines of model cards, the authors propose dissemination of dataset audit cards. This allows large scale image dataset curators to publish the goals, curation procedures, known shortcomings and caveats alongside their dataset dissemination.
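To make the idea concrete, an audit card could be represented as a small structured record published alongside a dataset. The sketch below is only an illustration: the class name `DatasetAuditCard`, its field layout and the example entries are assumptions based on the four items named above, not a format defined by the authors.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetAuditCard:
    """Hypothetical audit card; the fields mirror the items named in the article."""

    dataset_name: str
    goals: List[str] = field(default_factory=list)
    curation_procedures: List[str] = field(default_factory=list)
    known_shortcomings: List[str] = field(default_factory=list)
    caveats: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the card as plain text, one section per field."""
        sections = [
            ("Goals", self.goals),
            ("Curation procedures", self.curation_procedures),
            ("Known shortcomings", self.known_shortcomings),
            ("Caveats", self.caveats),
        ]
        lines = [f"Dataset audit card: {self.dataset_name}"]
        for title, items in sections:
            lines.append(f"{title}:")
            # Make empty sections explicit instead of silently omitting them.
            for item in (items or ["(none recorded)"]):
                lines.append(f"  - {item}")
        return "\n".join(lines)


# Illustrative entries only; they do not describe any real dataset.
example_card = DatasetAuditCard(
    dataset_name="example-image-dataset",
    goals=["Benchmark object recognition"],
    curation_procedures=["Images scraped from web search results"],
    known_shortcomings=["No consent obtained from people pictured"],
)

print(example_card.render())
```

Printing empty sections explicitly, as done here for the caveats, is a design choice that makes a missing disclosure visible to anyone reading the card.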

Key Takeaways

With this work, the authors aim to encourage AI systems that are ethical and data collection techniques that are consensual. Here are a few noteworthy points from the paper:

ImageNet, as well as other large image datasets, remain troublesome
Audit cards were presented as a solution to one of the problems
Deeper problems are rooted in the wider structural traditions, incentives, and discourse of a field that treats ethical issues as an afterthought

On a concluding note, the authors hope this work contributes to raising awareness and adds to a continued discussion of ethics and justice in machine learning.
