Google Releases Wikipedia-Based Image Text (WIT) Dataset

Wikimedia Research, in partnership with Google and other external collaborators, is hosting a competition with the WIT dataset on Kaggle.
Google recently released the Wikipedia-Based Image Text (WIT) dataset, a large multimodal dataset created by extracting various text selections associated with images from Wikimedia image links and articles, followed by rigorous filtering to retain only high-quality image-text sets. 

The WIT dataset is available for download on GitHub.

As part of its initiatives to address Wikipedia’s knowledge gaps, Wikimedia Research, in partnership with Google and other external collaborators such as EPFL, Naver Labs Europe, and Hugging Face, is hosting a competition with the WIT dataset on Kaggle. 


Check out the details of the Wikipedia – Image/Caption Matching challenge here.

Addressing the Real-World Dataset Challenge 

To model the relationship between text and images, multimodal visiolinguistic models rely on rich datasets. Traditionally, these datasets have been built either by manually captioning images or by crawling the web and extracting each image’s alt-text as its caption. 

The former approach tends to result in higher-quality data but limits the amount of data that can be created. The automated extraction approach leads to much bigger datasets, but these require heuristics and careful filtering to ensure data quality, as well as models at scale to achieve strong performance. 
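The crawl-and-extract approach described above can be sketched as follows. This is a minimal illustration, not any production pipeline: the generic-string list and the minimum-length heuristic are assumptions chosen for the example.

```python
from html.parser import HTMLParser

# Assumed heuristics: drop alt-text that is empty, generic boilerplate,
# or too short to be a meaningful caption.
GENERIC_ALTS = {"image", "photo", "picture", "thumbnail", ""}

class AltTextExtractor(HTMLParser):
    """Collects <img> alt attributes that survive simple quality filters."""

    def __init__(self):
        super().__init__()
        self.captions = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt", "").strip()
            # Heuristic filtering: skip generic or very short alt-text.
            if alt.lower() not in GENERIC_ALTS and len(alt.split()) >= 3:
                self.captions.append(alt)

html = '''
<img src="a.jpg" alt="Half Dome at sunset from the valley floor">
<img src="b.jpg" alt="thumbnail">
<img src="c.jpg" alt="">
'''
parser = AltTextExtractor()
parser.feed(html)
print(parser.captions)  # only the descriptive caption survives
```

Even this toy version shows why filtering matters: most alt-text on the web is generic or missing, so the usable fraction of crawled image-text pairs is small.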

Another challenge of existing datasets is the shortage of coverage in non-English languages. To solve these issues, Google researchers developed the WIT dataset, aiming to create a high-quality, large-sized, multilingual dataset with a variety of content. 

WIT vs Other Datasets

As explained in ‘WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning,’ presented at SIGIR 2021, the dataset comprises a curated set of 37.5 million entity-rich image-text examples, spanning 11.5 million unique images across 108 languages. SIGIR is a premier scientific conference in the area of information retrieval. 

WIT offers greater language coverage and a larger size than previous datasets. 

Dataset       | Images | Text  | Contextual Text | Languages
Flickr30K     | 32K    | 158K  | –               | < 8
SBU Captions  | 1M     | 1M    | –               | 1
MS-COCO       | 330K   | 1.5M  | –               | < 4; 7 (test only)
WIT           | 11.5M  | 37.5M | Yes             | 108
(Source: Google)

Here are some of the benefits of the WIT dataset: 

  • Size: It is one of the largest multimodal datasets of image-text examples that is publicly available or open-sourced. 
  • Multilingual: WIT has 10x or more languages than any other dataset (108 languages). 
  • Contextual information: WIT includes much page-level and section-level contextual information, unlike typical multimodal datasets, which have only one caption per image. 
  • Real-world entities: As a broad knowledge base, Wikipedia is rich with real-world entities that are represented in WIT. 
  • Challenging test set: All SOTA models demonstrated significantly lower performance on WIT compared to standard evaluation sets (for example, a roughly 30-point drop in recall). 
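The contextual-information point above is easiest to see in the shape of a single example. The field names below are illustrative assumptions, not the dataset’s exact schema; the point is that page- and section-level context travels with each image-text pair, unlike single-caption datasets.

```python
# Hypothetical sketch of one WIT-style example record. Field names are
# assumptions for illustration, not the actual WIT TSV column names.
wit_example = {
    "language": "en",
    "image_url": "https://upload.wikimedia.org/...",  # placeholder URL
    "page_title": "Half Dome",
    "section_title": "Geology",
    "caption": "Half Dome as viewed from Glacier Point",
    "page_description": "Half Dome is a granite dome at the eastern "
                        "end of Yosemite Valley.",
}
# A single-caption dataset would carry only image_url and caption;
# here, page- and section-level text provides extra supervision signal.
print(wit_example["page_title"])
```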

The Ideation of WIT 

Google researchers said that the main goal was to create a large dataset without compromising the quality or coverage of concepts. Hence, they turned to the largest online encyclopedia available today — Wikipedia. 

To give you an example, consider the Wikipedia page for ‘Half Dome (Yosemite National Park, CA).’ As shown below, the article has several useful text captions and relevant contextual information for the image, including the page title, main page description, and other contextual information and metadata. 

Showcasing Wikipedia page for the image of Half Dome (Source: Google/Wikipedia page for Half Dome: Photo by DAVID ILIFF)

Here’s how they did it 

The researchers started by selecting Wikipedia pages with images, then extracted various image-text associations and surrounding contexts. To further refine the data, they performed a rigorous filtering process to ensure data quality. This included text-based filtering for caption availability, length, and quality (for example, removing generic default filler text); image-based filtering to ensure each image meets a minimum size and carries permissible licensing; and image-and-text entity-based filtering to ensure suitability for research (for example, excluding content classified as hate speech). 
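The three filtering stages described above can be sketched as a single predicate over candidate examples. The thresholds, field names, and license set below are illustrative assumptions, not Google’s actual pipeline values.

```python
# Hedged sketch of WIT-style filtering. All constants are assumed values
# chosen for illustration, not the real pipeline's thresholds.
MIN_CAPTION_WORDS = 3
MIN_IMAGE_DIM = 100  # assumed minimum width/height in pixels
DEFAULT_FILLER = {"refer to caption", "see caption"}  # assumed generic strings
PERMISSIBLE_LICENSES = {"cc-by", "cc-by-sa", "public-domain"}

def keep_example(example: dict) -> bool:
    caption = example.get("caption", "").strip()
    # Text-based filtering: caption must exist, be long enough,
    # and not be generic default filler text.
    if len(caption.split()) < MIN_CAPTION_WORDS:
        return False
    if caption.lower() in DEFAULT_FILLER:
        return False
    # Image-based filtering: minimum size and permissible licensing.
    if example.get("width", 0) < MIN_IMAGE_DIM or example.get("height", 0) < MIN_IMAGE_DIM:
        return False
    if example.get("license") not in PERMISSIBLE_LICENSES:
        return False
    # Entity-based filtering: research suitability; a real classifier
    # (e.g. for hate speech) would set this flag upstream.
    if example.get("flagged", False):
        return False
    return True

examples = [
    {"caption": "Half Dome seen from Glacier Point", "width": 800,
     "height": 600, "license": "cc-by-sa", "flagged": False},
    {"caption": "refer to caption", "width": 800,
     "height": 600, "license": "cc-by-sa", "flagged": False},
]
kept = [e for e in examples if keep_example(e)]
print(len(kept))  # 1: the filler caption is rejected
```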

Further, the researchers randomly sampled image-caption sets for evaluation by human editors, who verified that 98 per cent of the samples had good image-caption alignment. 

Kaggle Competition 

The competition involves an image-text retrieval task: given a set of images and text captions, participants need to retrieve the appropriate caption(s) for each image. 

To enable research in this area, Wikipedia has made the images available at 300-pixel resolution, along with ResNet-50-based image embeddings, for most of the training and test sets. In addition to the WIT dataset, Kaggle will host all of this image data and provide Colab notebooks. 
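A baseline for this retrieval task scores each candidate caption’s embedding against the image embedding and ranks by similarity. The sketch below uses toy 4-dimensional vectors as stand-ins for real ResNet-50 image embeddings (2048-dimensional) and a hypothetical text encoder’s caption embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings; in the competition these would come from the provided
# ResNet-50 image features and a text model of the participant's choice.
image_embedding = [0.9, 0.1, 0.0, 0.4]
candidate_captions = {
    "Half Dome at sunset": [0.8, 0.2, 0.1, 0.5],
    "A plate of pasta":    [0.0, 0.9, 0.8, 0.1],
}

# Rank captions by similarity to the image; the top one is the retrieval.
ranked = sorted(candidate_captions,
                key=lambda c: cosine(image_embedding, candidate_captions[c]),
                reverse=True)
print(ranked[0])  # the caption whose embedding best matches the image
```

Recall-style metrics (as referenced above for WIT’s challenging test set) then measure how often the correct caption appears in the top-k of this ranking.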

Further, participants will have access to a discussion forum on Kaggle to share code and collaborate, enabling anyone interested in multimodality to get started and run experiments seamlessly. 

Wrapping up 

Google believes that the WIT dataset will help researchers build better multimodal multilingual models and identify better representation techniques, leading to improved machine learning models for real-world tasks over visiolinguistic data. 

Amit Raja Naik
Amit Raja Naik is a seasoned technology journalist who covers everything from data science to machine learning and artificial intelligence for Analytics India Magazine, where he examines the trends, challenges, ideas, and transformations across the industry.
