What Exactly Is Google AI’s MURAL: Multimodal, Multi-task Retrieval Across Languages

There are around 7,000 languages spoken in the world, and often there is no direct one-to-one translation from one language to another. Even when such translations exist, they may not be entirely accurate, and different associations and connotations can easily be lost on a non-native speaker. This issue can be mitigated by presenting text paired with a supporting image. But such image–text pair data does not exist for most languages; it is mostly available for high-resource languages like English and Chinese.

To address this, Google AI has released MURAL (Multimodal, Multitask Retrieval Across Languages), a model for image–text matching. It uses multitask learning applied to image–text pairs in combination with translation pairs covering over 100 languages.

In the paper titled "MURAL: Multimodal, Multitask Retrieval Across Languages," the research team says it explored dual-encoder learning from both image–caption and translation pairs at a large scale (6 billion translation pairs and 1.8 billion image–caption pairs). This was hard to do before because multilingual image–text datasets such as Multi30k, STAIR, and XTD support only high-resource languages; the recent Wikipedia Image Text (WIT) dataset addresses the problem by covering 108 languages.
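To make the dual-encoder objective concrete, here is a minimal PyTorch sketch of the in-batch softmax contrastive loss commonly used to train such models. The function name, temperature value, and symmetric formulation are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """In-batch softmax contrastive loss for a dual encoder.

    Each (image, text) pair in the batch is a positive example; every
    other pairing within the batch serves as an in-batch negative.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: logits[i][j] = sim(image_i, text_j).
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss: retrieve text given image, and image given text.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```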



MURAL Architecture

The architecture is based on the structure of ALIGN but trained in a multitask fashion, says Google. MURAL uses a dual-encoder architecture to draw together representations of images and their associated text descriptions, and extends it across languages by incorporating translation pairs. Google adds that the translation pairs are those used for LaBSE, while the dataset of image–text pairs is the same as the one used for ALIGN. For under-resourced languages like Hindi, MURAL shows improved retrieval performance compared to ALIGN; for image→text retrieval in a well-resourced language like French, MURAL shows a better understanding of some words.
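Building on the sketch above, the multitask combination might look roughly like this, reusing the contrastive_loss function from the earlier snippet. The encoder callables, argument names, and task weights are assumptions for illustration rather than MURAL's actual configuration.

```python
def multitask_loss(images, captions, src_texts, tgt_texts,
                   image_encoder, text_encoder,
                   w_it: float = 1.0, w_tt: float = 1.0):
    """Multitask objective in the spirit of MURAL (illustrative weights).

    The shared `text_encoder` ties the two tasks together, so signal
    from translation pairs benefits image-text retrieval.
    """
    # Task 1: image-text matching on caption pairs (as in ALIGN).
    loss_it = contrastive_loss(image_encoder(images), text_encoder(captions))

    # Task 2: bitext ranking on translation pairs; sentences that are
    # translations of each other are pulled together in the shared
    # text-embedding space.
    loss_tt = contrastive_loss(text_encoder(src_texts), text_encoder(tgt_texts))

    return w_it * loss_it + w_tt * loss_tt
```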

Image: Google AI


Training Dataset

The researchers state in the paper that the training datasets used are:

  • Conceptual 12M (CC12M) – A publicly available image captioning dataset in English. It has 12 million pairs obtained from web images and their corresponding alt-text descriptions.
  • The multilingual version of Alt-Text with 1.8 billion images and their alt-text descriptions, covering 110 languages.
  • The team created an Ensemble of Open Bilingual Translation (EOBT) Pairs dataset by combining publicly available datasets. The EOBT has approximately 500 million pairs across all languages.

Evaluation Datasets

Here, the research team used:

  • Flickr30K – Contains 31k images, each with five English captions. Multi30K extends Flickr30K with German, French, and Czech captions.
  • MS-COCO – Has five human-generated English captions per image.
  • STAIR – Adds crowdsourced Japanese captions for MS-COCO images.
  • XTD – A test-only set with captions in seven well-resourced languages: Italian, Spanish, Russian, Chinese, Polish, Turkish, and Korean.
  • Wikipedia Image Text (WIT) – Covers 108 languages.
  • Crisscrossed Captions – Extends the English MS-COCO 5k dev and test sets with human similarity annotations for both intra- and inter-modal tasks.

Google AI says that previous research in this area has shown interesting connections among languages. The research team observed similar patterns when visualizing text embeddings for a subset of languages belonging to the Germanic, Slavic, Uralic, Finnic, Celtic, and Finno-Ugric language families.
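A visualization of that kind could be sketched roughly as follows, assuming per-language mean sentence embeddings from the shared text encoder are available; the t-SNE settings and plotting details are illustrative, not the paper's procedure.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_language_clusters(lang_to_emb: dict) -> None:
    """Project per-language text embeddings to 2D and label each point.

    `lang_to_emb` is assumed to map a language code to a mean sentence
    embedding taken from the shared text encoder.
    """
    codes = list(lang_to_emb)
    X = np.stack([lang_to_emb[c] for c in codes])
    # Perplexity must stay below the number of languages being plotted.
    xy = TSNE(n_components=2, perplexity=min(5, len(codes) - 1),
              init="pca").fit_transform(X)
    plt.scatter(xy[:, 0], xy[:, 1])
    for (x, y), code in zip(xy, codes):
        plt.annotate(code, (x, y))
    plt.show()
```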


The Google AI team says the results show that ALIGN is improved by adding a bitext ranking objective: the resulting model matches ALIGN's zero-shot image–text retrieval performance on well-resourced languages while improving performance on under-resourced languages significantly.

Image: Google AI (Mean recall on various multilingual image–text retrieval benchmarks)

For XTD, MURAL improves recall@10 by 4% on average. On zero-shot WIT retrieval, MURAL improves mean recall by 1.7% on average for nine well-resourced languages and by 8.1% for eight under-resourced ones. After fine-tuning on WIT, MURAL's mean recall is 1.8% and 6.8% better than ALIGN's, on average, for well-resourced and under-resourced languages, respectively.
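For reference, here is a minimal sketch of how Recall@k is computed for a dual encoder, assuming row i of the query matrix corresponds to row i of the index matrix; "mean recall" in benchmarks like these is typically the average of Recall@1, @5, and @10 over both retrieval directions.

```python
import torch

def recall_at_k(query_emb: torch.Tensor, index_emb: torch.Tensor, k: int) -> float:
    """Recall@k when row i of `query_emb` matches row i of `index_emb`.

    Embeddings are assumed to be L2-normalized, so the dot product is
    a cosine similarity.
    """
    sims = query_emb @ index_emb.t()                  # (N, N) similarities
    topk = sims.topk(k, dim=-1).indices               # k nearest index rows
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(-1)
    # A query counts as a hit if its ground-truth row is in the top k.
    return (topk == targets).any(dim=-1).float().mean().item()
```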


