Popular Image Datasets Under Scanner: MIT Takes Down One Of Their Own

“We are up against a system that has veritably mastered ethics shopping, ethics bluewashing, ethics lobbying, ethics dumping, and ethics shirking.” 

The emergence of the ImageNet dataset is widely considered a pivotal moment in the Deep Learning revolution. AlexNet, a Convolutional Neural Network (CNN) with 60 million parameters, set new benchmarks for computer vision thanks to the vastness of ImageNet.

However, these popular image datasets are now under the scanner. From their sources to their labelling and the downstream effects of training on them, everything is being questioned. In a paper titled ‘Large image datasets: A pyrrhic win for computer vision?’, the authors asked whether the success of computer vision has come at the expense of minoritised groups and further aided the gradual erosion of privacy and consent. This work has even led to MIT taking down its popular Tiny Images dataset.



According to the authors, here are a few examples of how biased or inaccurate data feeds into algorithmic outcomes:

  • Systematic under-representation of women in search results for occupations
  • Object detection systems that recognise pedestrians with higher error rates for demographic groups with darker skin tones
  • Facial classification in which lighter-skinned males are classified with the highest accuracy, while darker-skinned females suffer the most misclassification

Be it object detection for self-driving cars or the diagnosis of skin cancer, with results such as these, AI systems cannot afford to be unreliable.

The Problem With Large Scale Image Datasets

Many of these issues trace back to the foundations of WordNet. The authors call this the ‘WordNet effect’.

For instance, image labelling and validation suffer from shortcomings such as the single-label-per-image procedure, even though real-world images usually contain multiple objects. Added to this are overly restrictive label proposals.

ImageNet, wrote the authors, is not the only large-scale vision dataset to inherit the shortcomings of the WordNet taxonomy. The 80 Million Tiny Images dataset, from which the CIFAR datasets were derived, used the same approach. Unlike ImageNet, however, the Tiny Images dataset had never been audited or scrutinised. This lack of scrutiny, the authors warn, may also have encouraged secretive datasets. They cite Clearview AI as an example of a company employing secretive datasets, which redefines the very meaning of privacy as we know it.

“It [dataset] has been taken offline, and it will not be put back online. We ask the community to refrain from using it in future and also delete any existing copies of the dataset that may have been downloaded.”


As a result of this investigation, MIT announced that it is taking down the 80-million-image Tiny Images dataset for containing derogatory terms as categories as well as offensive images. Created in 2006, the dataset contains 53,464 different nouns copied directly from WordNet.

This work also sheds light on exploitation in the name of the Creative Commons (CC) licence, which addresses only copyright issues and does not grant consent to use images for training. Yet many of the efforts beyond ImageNet, including the Open Images dataset, have been built on top of this Creative Commons loophole, which large-scale dataset curation agencies interpret as a free-for-all.

What Can Be Done

“Feeding AI systems on the world’s beauty, ugliness, and cruelty, but expecting it to reflect only the beauty is a fantasy.”

Though the work primarily focuses on the flaws in large-scale datasets, the authors re-emphasise that their investigation specifically concerns the non-consensual aspect of the images, not the content of the images themselves. Even so, many parties, both individuals and groups, may be disadvantaged through a lack of awareness.

The authors recommend remedies to these problems. Here are a few:

Employ a remove, replace and open strategy

When data collection practices are open to the public, auditing becomes easier, which in turn helps identify the problematic sections of a dataset. Once identified, these images should be removed and replaced with images collected fairly and with the consent of the parties concerned.
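The remove-and-replace step can be sketched as a simple filter over a labelled dataset. The labels, image identifiers and function below are hypothetical illustrations of the idea, not the authors' actual tooling:

```python
# Hypothetical sketch of the "remove and replace" step: drop images
# whose labels were flagged during an audit, then backfill from a
# pool of consensually collected replacements.

def remove_and_replace(dataset, flagged_labels, replacement_pool):
    """dataset: list of (image_id, label) pairs.
    flagged_labels: set of labels identified as problematic in an audit.
    replacement_pool: consensually sourced (image_id, label) pairs."""
    # Remove every entry carrying a flagged label.
    cleaned = [(img, lbl) for img, lbl in dataset if lbl not in flagged_labels]
    # Replace the removed entries with fairly collected images.
    needed = len(dataset) - len(cleaned)
    cleaned.extend(replacement_pool[:needed])
    return cleaned

dataset = [("img1", "cat"), ("img2", "offensive_term"), ("img3", "dog")]
clean = remove_and_replace(dataset, {"offensive_term"}, [("img4", "bird")])
```

The key design point is that removal and replacement happen in one auditable pass, so the dataset's size and class balance can be preserved while the problematic content is eliminated.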

Using Synthetic Data

Data augmentation is a popular technique for increasing the diversity of a training dataset. Another way of implementing augmentation is to use synthetic images instead of real ones; existing GAN-based approaches can facilitate this.
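To make the augmentation idea concrete, here is a toy, dependency-free sketch that treats a grayscale image as a list of pixel rows and applies two classic transforms (horizontal flip and brightness jitter). Real pipelines would use library transforms or GAN-generated images; this is only an illustration of how transforms diversify training data:

```python
import random

# Toy data augmentation on a grayscale "image" (a list of pixel rows,
# values 0-255). Illustrative only; not from the paper.

def hflip(img):
    """Mirror the image left-to-right."""
    return [list(reversed(row)) for row in img]

def jitter_brightness(img, delta):
    """Shift every pixel by delta, clamped to the valid 0-255 range."""
    return [[max(0, min(255, px + delta)) for px in row] for row in img]

def augment(img, rng):
    """Randomly flip, then randomly jitter brightness."""
    out = img
    if rng.random() < 0.5:
        out = hflip(out)
    return jitter_brightness(out, rng.randint(-20, 20))

rng = random.Random(0)  # seeded for reproducibility
image = [[10, 20, 30], [40, 50, 60]]
augmented = augment(image, rng)
```

Each call to `augment` yields a slightly different training example from the same source image, which is what increases effective dataset diversity.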

Dataset Audit Cards

Much along the lines of model cards, the authors propose the dissemination of dataset audit cards. These allow large-scale image dataset curators to publish the goals, curation procedures, known shortcomings and caveats alongside their datasets.
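An audit card could be disseminated as structured metadata. The field names below mirror the categories the authors mention (goals, curation procedures, known shortcomings, caveats), but the schema itself, the dataset name and the checker function are our hypothetical sketch, not a published standard:

```python
# Illustrative dataset audit card as a plain dictionary. The schema
# is a hypothetical sketch; only the field categories come from the
# article. "ExampleImages-1M" is an invented dataset name.

audit_card = {
    "name": "ExampleImages-1M",
    "goals": "Benchmark object recognition research.",
    "curation_procedure": "Scraped from the web; labels drawn from WordNet nouns.",
    "consent": "No subject consent obtained; images under CC licences only.",
    "known_shortcomings": [
        "Single label per image despite multi-object scenes.",
        "Inherited WordNet taxonomy, including offensive categories.",
    ],
    "caveats": "Not audited for demographic balance.",
}

def missing_fields(card, required=("goals", "curation_procedure",
                                   "known_shortcomings", "caveats")):
    """Return the required audit-card fields a card fails to declare."""
    return [f for f in required if f not in card]
```

A simple completeness check like `missing_fields` is one way a hosting platform could refuse to publish a dataset whose audit card omits a required section.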

Key Takeaways

Through this work, the authors aim to push AI systems towards ethical, consensual data collection practices. Here are a few noteworthy points from the paper:

  • ImageNet, as well as other large image datasets, remain troublesome
  • Audit cards were presented as a solution to one of the problems
  • Deeper problems are rooted in the wider structural traditions, incentives, and discourse of a field that treats ethical issues as an afterthought

On a concluding note, the authors hope this work contributes to raising awareness and adds to a continued discussion of ethics and justice in machine learning.

Link to the paper

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
