Google recently released the Wikipedia-Based Image Text (WIT) dataset, a large multimodal dataset created by extracting multiple text descriptions associated with images from Wikipedia articles and Wikimedia image links. The extraction was followed by rigorous filtering to retain only high-quality image-text sets.
The WIT dataset is available for download on GitHub.
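For readers who want to explore it locally, here is a minimal sketch of loading one of the downloaded TSV shards with pandas; the shard file name is a placeholder, and the column names follow the dataset's documentation on GitHub, so verify them against your download:

```python
import pandas as pd

# Minimal sketch of inspecting one WIT shard, assuming it has been downloaded
# from the GitHub repository as a gzipped TSV file. The file name below is a
# placeholder; check the repository docs for the exact shard names and columns.
shard_path = "wit_v1.train.all-00000-of-00010.tsv.gz"  # hypothetical local file name

df = pd.read_csv(shard_path, sep="\t", compression="gzip")

# Columns typically include the page/image URLs, the reference caption, and the
# page- and section-level context descriptions.
print(df.columns.tolist())
print(df[["language", "image_url", "caption_reference_description"]].head())
```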
As part of its initiatives to address Wikipedia’s knowledge gaps, Wikimedia Research, in partnership with Google and other external collaborators such as EPFL, Naver Labs Europe, and Hugging Face, is hosting a competition with the WIT dataset in Kaggle.
Check out the details of the Wikipedia – Image/Caption Matching challenge here.
Addressing the Real-World Dataset Challenge
To model the relationship between text and images, multimodal visio-linguistic models rely on rich datasets. Traditionally, these datasets have been created either by manually captioning images or by crawling the web and extracting the alt-text as the caption.
The former approach tends to result in higher-quality data but limits the amount of data that can be created. While the automated extraction approach leads to bigger datasets, these require heuristics and careful filtering to ensure data quality so that models can be scaled to strong performance.
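For reference, the alt-text approach can be sketched as follows; the URL is a placeholder, and this is only an illustration of the general technique, not the pipeline used for WIT:

```python
import requests
from bs4 import BeautifulSoup

# Illustrative sketch of the web-crawling approach: fetch a page and pair each
# image with its alt-text as a candidate caption.
url = "https://example.com/some-article"  # placeholder URL
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

pairs = []
for img in soup.find_all("img"):
    alt = (img.get("alt") or "").strip()
    src = img.get("src")
    if src and alt:  # keep only images that actually carry alt-text
        pairs.append({"image_url": src, "caption": alt})

print(pairs[:5])
```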
Another challenge with existing datasets is the lack of coverage in non-English languages. To solve these issues, Google researchers developed the WIT dataset, aiming for a high-quality, large, multilingual dataset with a wide variety of content.
WIT vs Other Datasets
As explained in ‘WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning,’ presented at SIGIR 2021, the dataset comprises a curated set of 37.5 million entity-rich image-text examples with 11.5 million unique images across 108 languages. SIGIR is a premier scientific conference in the area of information retrieval.
WIT offers a substantial increase in language coverage and overall size compared to previous datasets:
| Dataset | Images | Text | Contextual Text | Languages |
| --- | --- | --- | --- | --- |
| Flickr30K | 32K | 158K | – | < 8 |
| SBU Captions | 1M | 1M | – | 1 |
| MS-COCO | 330K | 1.5M | – | < 4; 7 (test only) |
| CC-3M | 3.3M | 3.3M | – | 1 |
| CC-12M | 12M | 12M | – | 1 |
| WIT | 11.5M | 37.5M | ~119M | 108 |
Here are some of the benefits of the WIT dataset:
- Size: It is one of the largest multimodal datasets of image-text examples that is publicly available or open-sourced.
- Multilingual: With 108 languages, WIT has 10x or more languages than any other dataset.
- Contextual information: Unlike typical multimodal datasets, which have only one caption per image, WIT includes page-level and section-level contextual information.
- Real-world entities: As a broad knowledge base, Wikipedia is rich with real-world entities that are represented in WIT.
- Challenging test set: All SOTA models demonstrated significantly lower performance on WIT compared to standard evaluation sets. (Example: ~30 point drop in recall)
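To make the recall comparison concrete, retrieval benchmarks typically report recall@k, the fraction of queries whose correct item appears in the top-k results. A small sketch of the computation follows; the similarity matrix is a toy example, not WIT data:

```python
import numpy as np

def recall_at_k(scores: np.ndarray, correct_idx: np.ndarray, k: int = 5) -> float:
    """Fraction of queries whose correct candidate appears in the top-k results.

    scores:      (num_queries, num_candidates) similarity matrix
    correct_idx: (num_queries,) index of the correct candidate per query
    """
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == correct_idx[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 image queries scored against 4 candidate captions.
scores = np.array([[0.9, 0.1, 0.3, 0.2],
                   [0.2, 0.8, 0.1, 0.4],
                   [0.1, 0.2, 0.3, 0.7]])
print(recall_at_k(scores, np.array([0, 1, 2]), k=1))  # 2 of 3 queries hit at k=1
```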
The Ideation of WIT
Google researchers said the main goal was to create a large dataset without compromising the quality or coverage of concepts. Hence, they turned to the largest online encyclopedia available today — Wikipedia.
For example, consider the Wikipedia page for ‘Half Dome (Yosemite National Park, CA).’ The article contains several useful text captions and relevant contextual information for the image, including the page title, the main page description, and other context and metadata.
Here’s how they did it
The researchers said they started by selecting Wikipedia pages with images and then extracted various image-text associations and surrounding contexts. To further refine the data, they performed a rigorous filtering process to ensure data quality. This included text-based filtering to ensure caption availability, length, and quality (for example, removing generic default text); image-based filtering to ensure each image is of a certain size with permissible licensing; and image-and-text-entity-based filtering to ensure suitability for research (for example, excluding content classified as hate speech).
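A highly simplified sketch of what such filters can look like in practice is shown below; the thresholds and the generic-caption list are illustrative assumptions, not the researchers' actual rules:

```python
# Illustrative filtering heuristics in the spirit of the pipeline described
# above; the thresholds and generic-caption list are assumptions, not the
# actual rules used to build WIT.
GENERIC_CAPTIONS = {"image", "photo", "thumbnail", "refer to caption"}

MIN_CAPTION_CHARS = 10
MIN_IMAGE_SIDE_PX = 100

def keep_example(caption: str, image_width: int, image_height: int) -> bool:
    caption = (caption or "").strip()
    # Text-based filtering: the caption must exist, be long enough,
    # and not be a generic default string.
    if len(caption) < MIN_CAPTION_CHARS:
        return False
    if caption.lower() in GENERIC_CAPTIONS:
        return False
    # Image-based filtering: require a minimum resolution.
    if min(image_width, image_height) < MIN_IMAGE_SIDE_PX:
        return False
    return True

print(keep_example("Half Dome as seen from the valley floor", 1024, 768))  # True
print(keep_example("thumbnail", 1024, 768))                                # False
```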
Further, the researchers randomly sampled image-caption sets for evaluation by human editors, who verified that 98 per cent of the samples had good image-caption alignment.
Kaggle Competition
The competition involves an image-text retrieval task: given a pool of images and text captions, participants need to retrieve the appropriate caption(s) for each image.
To enable research in this area, Wikipedia has made images available at 300-pixel resolution, along with ResNet-50-based image embeddings, for most of the training and test data. In addition to the WIT dataset, Kaggle will host all this image data and provide Colab notebooks.
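As a rough sketch of how a participant might use the provided ResNet-50 embeddings, candidate captions can be ranked by cosine similarity against each image embedding; the random vectors below stand in for the real features and for a text encoder of the participant's choosing:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Placeholder inputs: in the competition, image embeddings would come from the
# provided ResNet-50 features, and caption embeddings from whatever text encoder
# a participant chooses. Random vectors stand in for both here.
rng = np.random.default_rng(0)
image_embeddings = l2_normalize(rng.normal(size=(4, 2048)))     # 4 images, ResNet-50 dim
caption_embeddings = l2_normalize(rng.normal(size=(10, 2048)))  # 10 candidate captions

# Cosine similarity (dot product of normalized vectors), then rank captions
# per image and keep the top 5 candidates.
similarity = image_embeddings @ caption_embeddings.T            # shape (4, 10)
top5 = np.argsort(-similarity, axis=1)[:, :5]
print(top5)  # indices of the 5 best-matching captions for each image
```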
Further, participants will have access to a discussion forum on Kaggle to share code and collaborate, enabling anyone interested in multimodality to get started and run experiments easily.
Wrapping up
Google believes that the WIT dataset will help researchers build better multimodal multilingual models and identify better representation techniques, leading to improved machine learning models for real-world tasks over visio-linguistic data.