What Exactly Is Google AI’s MURAL: Multimodal, Multi-task Retrieval Across Languages

MURAL uses multitask learning applied to image–text pairs in combination with translation pairs covering over 100 languages.

There are around 7,000 languages spoken in the world, and there is often no direct one-to-one translation from one language to another. Even where such translations exist, they may not be entirely accurate, and associations and connotations are easily lost on a non-native speaker. Pairing a text with a supporting image can help resolve this. However, such image–text pair data does not exist for most languages; it is mostly available for highly resourced languages like English and Chinese.

To address this, Google AI has released the “MURAL: Multimodal, Multitask Representations Across Languages” model for image–text matching. It uses multitask learning applied to image–text pairs in combination with translation pairs covering over 100 languages.

In the paper titled “MURAL: Multimodal, Multitask Retrieval Across Languages”, the research team says it explored dual-encoder learning from both image–caption and translation pairs at a large scale (6 billion translation pairs and 1.8 billion image–caption pairs). This was previously difficult because multilingual image–text datasets such as Multi30k, STAIR, and XTD cover only high-resource languages. The more recent Wikipedia Image-Text (WIT) dataset addresses this problem by covering 108 languages.


MURAL Architecture

The architecture is based on the structure of ALIGN, but employed in a multitask fashion, says Google. MURAL uses a dual-encoder architecture to draw together representations of images and associated text descriptions and also extends it across various languages by incorporating translation pairs. Google adds that the translation pairs are those used for LaBSE while the dataset of image-text pairs is the same as used for ALIGN. For under-resourced languages like Hindi, MURAL shows improved retrieval performance compared to ALIGN. For Image→Text retrieval in a well-resourced language like French, MURAL shows better understanding for some words.
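The multitask idea described above can be illustrated with a minimal sketch. It assumes an in-batch softmax (contrastive) loss for each task and a weighted sum of the two task losses; the function names, the temperature, and the task weights are hypothetical choices for illustration and are not taken from the paper. The embeddings are plain lists of floats standing in for encoder outputs.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def in_batch_softmax_loss(queries, candidates, temperature=0.1):
    """Mean cross-entropy where queries[i] should match candidates[i].

    Every other candidate in the batch serves as a negative.
    The temperature value is an illustrative assumption.
    """
    q = [l2_normalize(v) for v in queries]
    c = [l2_normalize(v) for v in candidates]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, cj)) / temperature for cj in c]
        top = max(logits)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


def bidirectional_loss(a, b, temperature=0.1):
    """Sum of the a->b and b->a in-batch losses."""
    return (in_batch_softmax_loss(a, b, temperature)
            + in_batch_softmax_loss(b, a, temperature))


def multitask_loss(image_emb, caption_emb, source_emb, target_emb,
                   w_image_text=1.0, w_text_text=1.0):
    """Weighted sum of an image-text loss and a translation-pair (bitext) loss.

    The weights are hypothetical; the article does not state how the
    two tasks are balanced.
    """
    image_text = bidirectional_loss(image_emb, caption_emb)
    text_text = bidirectional_loss(source_emb, target_emb)
    return w_image_text * image_text + w_text_text * text_text


# Toy batch: two items whose paired embeddings are either aligned or swapped.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_good = multitask_loss(aligned, aligned, aligned, aligned)
loss_bad = multitask_loss(aligned, swapped, aligned, swapped)
```

In this toy batch, `loss_good` is close to zero because each item is most similar to its own pair, while `loss_bad` is much larger because every item is most similar to the wrong pair.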

Image: Google AI

Training Dataset

The researchers state in the paper, “MURAL: Multimodal, Multitask Retrieval Across Languages,” that the training datasets used are:

  • Conceptual 12M (CC12M) – A publicly available image captioning dataset in English. It has 12 million pairs obtained from web images and their corresponding alt-text descriptions.
  • The multilingual version of Alt-Text with 1.8 billion images and their alt-text descriptions, covering 110 languages.
  • The team created an Ensemble of Open Bilingual Translation (EOBT) Pairs dataset by combining publicly available datasets. The EOBT has approximately 500 million pairs across all languages.

Evaluation Datasets

Here, the research team used:

  • Flickr30K – Contains 31k images, each with five English captions. Multi30K extends Flickr30K with German, French, and Czech captions.
  • MS-COCO has five human-generated English captions per image. 
  • The STAIR database adds human crowdsourced Japanese captions for MSCOCO images.
  • XTD – Test-only set with seven well-resourced languages: Italian, Spanish, Russian, Chinese, Polish, Turkish, and Korean.
  • Wikipedia Image Text (WIT) dataset, spanning 108 languages.
  • Crisscrossed Captions extends the English MSCOCO 5k dev and test sets with human similarity annotations for both intra- and inter-modal tasks.  

Google AI says that previous researchers in this area have shown interesting connections among languages. The research team observed similar connections in a visualization of a subset of languages that belong to the Germanic, Slavic, Uralic, Finnic, Celtic, and Finno-Ugric language families.


The Google AI team says the results show that ALIGN is improved by adding a bitext ranking objective. It added that MURAL, which includes this objective, matches ALIGN's zero-shot image–text retrieval performance on well-resourced languages and significantly improves performance on under-resourced languages.

Image: Google AI (Mean recall on various multilingual image–text retrieval benchmarks)

For XTD, MURAL improves recall@10 by 4% on average. On WIT zero-shot, MURAL improves mean recall by 1.7% on average for nine well-resourced languages and by 8.1% for eight under-resourced ones. After fine-tuning on WIT, MURAL mean recall is 1.8% and 6.8% better than ALIGN, on average, for well-resourced and under-resourced languages, respectively.
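For readers unfamiliar with the metrics quoted above, here is a minimal sketch of how recall@K and mean recall can be computed from a similarity matrix. It assumes a one-to-one setup in which query `i` has exactly one correct candidate at index `i`, and it assumes mean recall is the average of recall@1, 5, and 10 over both retrieval directions; the function names are hypothetical and the exact evaluation protocol used by the authors may differ (for example, benchmarks with five captions per image have several correct matches).

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j;
    the correct candidate for query i is assumed to be at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def mean_recall(similarity, ks=(1, 5, 10)):
    """Average recall@k over both directions (rows as queries, then columns)."""
    transposed = [list(column) for column in zip(*similarity)]
    scores = [recall_at_k(matrix, k)
              for matrix in (similarity, transposed)
              for k in ks]
    return sum(scores) / len(scores)


# Toy 3x3 example: rows are image queries, columns are caption candidates.
scores = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],  # the correct caption (index 1) ranks second here
    [0.2, 0.1, 0.7],
]
r_at_1 = recall_at_k(scores, 1)
r_at_2 = recall_at_k(scores, 2)
```

In the toy matrix, two of the three image queries rank their own caption first, so `r_at_1` is 2/3, and all three rank it within the top two, so `r_at_2` is 1.0.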

Sreejani Bhattacharyya
I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good. Reach me at sreejani.bhattacharyya@analyticsindiamag.com
