Google Releases Wikipedia-Based Image Text (WIT) Dataset

Google recently released the Wikipedia-Based Image Text (WIT) dataset, a large multimodal dataset created by extracting text selections associated with images from Wikipedia articles and Wikimedia image links. Rigorous filtering was applied to retain only high-quality image-text sets. 

The WIT dataset is available for download on GitHub.
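The dataset is distributed as tab-separated files. Below is a minimal parsing sketch, assuming illustrative column names such as `language`, `image_url`, and `caption_reference_description`; consult the repository's README for the actual schema and compression format:

```python
import csv
import io

# Hypothetical excerpt of a WIT-style TSV shard; the real files are larger,
# compressed, and use the column schema documented in the GitHub README.
SAMPLE_TSV = (
    "language\timage_url\tcaption_reference_description\n"
    "en\thttps://upload.wikimedia.org/example.jpg\tHalf Dome as seen from the valley floor\n"
    "fr\thttps://upload.wikimedia.org/exemple.jpg\tLe Half Dome vu depuis la vallée\n"
)

def load_wit_rows(tsv_text, language=None):
    """Parse WIT-style TSV text into dicts, optionally keeping one language."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    if language is not None:
        rows = [r for r in rows if r["language"] == language]
    return rows

english_rows = load_wit_rows(SAMPLE_TSV, language="en")
print(len(english_rows))                                    # → 1
print(english_rows[0]["caption_reference_description"])     # → Half Dome as seen from the valley floor
```

Filtering by the `language` column is how one would carve a monolingual subset out of the 108-language corpus.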

As part of its initiatives to address Wikipedia’s knowledge gaps, Wikimedia Research, in partnership with Google and other external collaborators such as EPFL, Naver Labs Europe, and Hugging Face, is hosting a competition with the WIT dataset on Kaggle. 

Check out the details of the Wikipedia – Image/Caption Matching challenge here.

Addressing the Real-World Dataset Challenge 

To model the relationship between text and images, multimodal visio-linguistic models rely on rich datasets. Traditionally, these datasets have been created either by manually captioning images or by crawling the web and extracting the alt-text as the caption. 

The former approach tends to result in higher-quality data but limits the amount of data that can be created. The latter, automated extraction approach leads to bigger datasets, but these require heuristics and careful filtering to ensure data quality, and models must be scaled up to achieve strong performance. 

Another challenge of existing datasets is the shortage of coverage in non-English languages. To solve these issues, Google researchers developed the WIT dataset, aiming to create a high-quality, large-sized, multilingual dataset with a variety of content. 

WIT vs Other Datasets

As explained in ‘WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning,’ presented at SIGIR 2021 (a premier scientific conference in information retrieval), the curation resulted in 37.5 million entity-rich image-text examples covering 11.5 million unique images across 108 languages. 

WIT offers greater language coverage and a larger size than previous datasets: 

Dataset        Images   Text     Contextual Text   Languages
Flickr30K      32K      158K     -                 < 8
SBU Captions   1M       1M       -                 1
MS-COCO        330K     1.5M     -                 < 4; 7 (test only)
CC-3M          3.3M     3.3M     -                 1
CC-12M         12M      12M     -                 1
WIT            11.5M    37.5M    ~119M             108
(Source: Google)

Here are some of the benefits of the WIT dataset: 

  • Size: It is one of the largest multimodal datasets of image-text examples that is publicly available or open-sourced. 
  • Multilingual: WIT has 10x or more languages than any other dataset (108 languages). 
  • Contextual information: Unlike typical multimodal datasets, which have only one caption per image, WIT includes rich page-level and section-level contextual information. 
  • Real-world entities: As a broad knowledge base, Wikipedia is rich with real-world entities that are represented in WIT. 
  • Challenging test set: All state-of-the-art (SOTA) models demonstrated significantly lower performance on WIT than on standard evaluation sets (for example, an approximately 30-point drop in recall).

The Ideation of WIT 

Google researchers said the main goal was to create a large dataset without compromising on the quality or coverage of concepts. Hence, they turned to the largest online encyclopedia available today: Wikipedia.

To give you an example, consider the Wikipedia page for ‘Half Dome (Yosemite National Park, CA).’ As shown below, the article has various interesting text captions and relevant contextual information for the image, including the page title, main page description, and other contextual metadata. 

The Wikipedia page for the image of Half Dome (Source: Google/Wikipedia page for Half Dome; photo by DAVID ILIFF)

Here’s how they did it 

The researchers said they started by selecting Wikipedia pages with images, then extracted various image-text associations and surrounding contexts. To further refine the data, they performed a rigorous filtering process to ensure data quality. This included text-based filtering for caption availability, length, and quality (for example, removing generic default filler text); image-based filtering to ensure each image meets a minimum size and carries permissible licensing; and image-and-text entity-based filtering to ensure suitability for research (for example, excluding content classified as hate speech). 
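The filtering described above can be sketched as a set of simple predicates. The thresholds and the generic-caption list below are illustrative assumptions for this sketch, not the actual values the researchers used:

```python
# Illustrative quality filters in the spirit of the WIT pipeline.
# The thresholds and the generic-text list are assumptions, not Google's values.
GENERIC_CAPTIONS = {"refer to caption", "see adjacent text", "image"}
MIN_CAPTION_CHARS = 10
MIN_IMAGE_DIM = 100  # pixels

def passes_text_filter(caption):
    """Text-based filtering: caption present, long enough, not generic filler."""
    caption = caption.strip()
    return len(caption) >= MIN_CAPTION_CHARS and caption.lower() not in GENERIC_CAPTIONS

def passes_image_filter(width, height, license_ok=True):
    """Image-based filtering: minimum size and permissible licensing."""
    return license_ok and width >= MIN_IMAGE_DIM and height >= MIN_IMAGE_DIM

def keep_example(caption, width, height, license_ok=True):
    """An image-text pair survives only if every filter passes."""
    return passes_text_filter(caption) and passes_image_filter(width, height, license_ok)

print(keep_example("Half Dome at sunset, Yosemite", 800, 600))  # → True
print(keep_example("image", 800, 600))                          # → False (generic caption)
```

An entity-based suitability filter (e.g. excluding hate speech) would need a classifier rather than a rule, so it is omitted here.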

Further, the researchers randomly sampled image-caption sets for evaluation by human editors, who verified that 98 per cent of the samples had good image-caption alignment. 

Kaggle Competition 

The competition involves an image-text retrieval task: given the images and a pool of text captions, participants need to retrieve the appropriate caption(s) for each image. 

To enable research in this area, Wikipedia has made the images available at 300-pixel resolution, along with ResNet-50-based image embeddings, for most of the training and test sets. Kaggle will host all this image data in addition to the WIT dataset, and will provide Colab notebooks. 
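Given image embeddings like those supplied and caption embeddings from any text encoder mapped into the same space, a simple baseline is to rank captions by cosine similarity to each image. A toy sketch, using 4-dimensional vectors as stand-ins for the high-dimensional ResNet-50 features:

```python
import numpy as np

def rank_captions(image_vec, caption_vecs):
    """Return caption indices ordered by cosine similarity to one image, best first."""
    img = image_vec / np.linalg.norm(image_vec)
    caps = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    scores = caps @ img              # cosine similarity of each caption to the image
    return np.argsort(-scores)       # descending order of similarity

# Toy embeddings standing in for real image/caption features in a shared space.
image_vec = np.array([1.0, 0.0, 0.5, 0.0])
caption_vecs = np.array([
    [0.0, 1.0, 0.0, 1.0],   # unrelated caption
    [0.9, 0.1, 0.4, 0.0],   # close match
    [0.5, 0.5, 0.5, 0.5],   # partial match
])
print(rank_captions(image_vec, caption_vecs))  # → [1 2 0]
```

Metrics like recall@k then simply check whether the correct caption's index appears among the top k ranked positions.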

Further, the participants will have access to a discussion forum in Kaggle to share code and collaborate, enabling anyone interested in multimodality to get started and run experiments seamlessly. 

Wrapping up 

Google believes that the WIT dataset will help researchers build better multimodal, multilingual models and identify better representation techniques, leading to improved machine learning models for real-world tasks over visio-linguistic data. 

