There are around 7,000 languages spoken in the world, and often there is no direct one-to-one translation from one language to another. Even when such translations exist, they may not be exactly accurate, and associations and connotations can easily be lost on a non-native speaker. Pairing a text with a supporting image can help bridge this gap, but such image–text pair data does not exist for most languages; it is mostly available for well-resourced languages like English and Chinese.
To address this, Google AI has released MURAL (Multimodal, Multitask Retrieval Across Languages), a model for image–text matching. It applies multitask learning to image–text pairs in combination with translation pairs covering more than 100 languages.
In the paper, “MURAL: Multimodal, Multitask Retrieval Across Languages,” the research team says it explored dual-encoder learning from both image–caption and translation pairs at a large scale (6 billion translation pairs and 1.8 billion image–caption pairs). Working at this multilingual scale was difficult before, as multilingual image-text datasets such as Multi30k, STAIR, and XTD support only high-resource languages; the recent Wikipedia Image-Text (WIT) dataset addresses the problem by covering 108 languages.
MURAL Architecture
The architecture is based on that of ALIGN but is employed in a multitask fashion, says Google. MURAL uses a dual-encoder architecture to draw together representations of images and their associated text descriptions, and it extends this across languages by incorporating translation pairs. Google adds that the translation pairs are those used for LaBSE, while the image–text pairs are the same as those used for ALIGN. For under-resourced languages like Hindi, MURAL shows improved retrieval performance compared to ALIGN, and for image→text retrieval in a well-resourced language like French, it shows a better understanding of some words.
Image: Google AI
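The article describes the approach at a high level rather than releasing reference code, but the core idea, a shared dual encoder trained with an in-batch contrastive (matching) loss on both image–caption pairs and translation pairs, can be sketched roughly as below. This is a minimal illustrative sketch in PyTorch: the tower definitions, dimensions, temperature, and task weights are assumptions, not Google's implementation (ALIGN pairs an EfficientNet image tower with a BERT-style text tower at far larger scale).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Toy dual encoder: one image tower, one shared multilingual text tower."""
    def __init__(self, img_dim=2048, vocab_size=250_000, embed_dim=512):
        super().__init__()
        # Stand-ins for the real towers; simple layers keep the sketch self-contained.
        self.image_tower = nn.Linear(img_dim, embed_dim)
        self.text_tower = nn.EmbeddingBag(vocab_size, embed_dim)  # mean-pools token ids

    def encode_image(self, img_feats):             # (B, img_dim) -> (B, embed_dim)
        return F.normalize(self.image_tower(img_feats), dim=-1)

    def encode_text(self, token_ids):              # (B, seq_len) -> (B, embed_dim)
        return F.normalize(self.text_tower(token_ids), dim=-1)

def matching_loss(a, b, temperature=0.07):
    """In-batch softmax contrastive loss, symmetrised over both directions."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def multitask_step(model, img_feats, captions, src_texts, tgt_texts,
                   w_img_txt=1.0, w_txt_txt=1.0):
    """One training step combining the two objectives: image<->caption matching
    and translation-pair (bitext) matching, both through the same text tower."""
    loss_i2t = matching_loss(model.encode_image(img_feats), model.encode_text(captions))
    loss_t2t = matching_loss(model.encode_text(src_texts), model.encode_text(tgt_texts))
    return w_img_txt * loss_i2t + w_txt_txt * loss_t2t
```

Sharing the text tower across both tasks is what lets the translation objective transfer signal to under-resourced languages that have few or no image captions.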
Training Datasets
The researchers state in the paper, “MURAL: Multimodal, Multitask Retrieval Across Languages,” that the training datasets used are:
- Conceptual 12M (CC12M) – A publicly available image captioning dataset in English. It has 12 million pairs obtained from web images and their corresponding alt-text descriptions.
- Alt-Text (multilingual version) – 1.8 billion images with their alt-text descriptions, covering 110 languages.
- Ensemble of Open Bilingual Translation pairs (EOBT) – created by the team by combining publicly available translation datasets; approximately 500 million pairs across all languages.
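The article does not say how batches from the image–text corpus and the translation corpus are scheduled during training; the sketch below is one simple, hypothetical way to drive a joint training loop over both (the function name and the 1:1 mixing ratio are illustrative assumptions).

```python
import itertools

def interleave_tasks(image_text_batches, translation_batches, translations_per_image=1):
    """Yield (task, batch) pairs, alternating between the image-text corpus
    (e.g. Alt-Text/CC12M) and the translation corpus (e.g. EOBT), so one loop
    can apply the corresponding matching loss for each task in turn."""
    translation_iter = itertools.cycle(translation_batches)  # reuse if shorter
    for img_batch in image_text_batches:
        yield "image_text", img_batch
        for _ in range(translations_per_image):
            yield "translation", next(translation_iter)
```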
Evaluation Datasets
Here, the research team used:
- Flickr30K – 31k images, each with five English captions. Multi30K extends Flickr30K with German, French, and Czech captions.
- MS-COCO – five human-generated English captions per image.
- STAIR – adds human-crowdsourced Japanese captions for the MS-COCO images.
- XTD – a test-only set with captions in seven well-resourced languages: Italian, Spanish, Russian, Chinese, Polish, Turkish, and Korean.
- Wikipedia Image-Text (WIT) – spans 108 languages.
- Crisscrossed Captions – extends the English MS-COCO 5k dev and test sets with human similarity annotations for both intra- and inter-modal tasks.
Google AI notes that previous work in this area has shown interesting connections among languages. Visualizing MURAL's text embeddings, the research team found a similar pattern for a subset of languages belonging to the Germanic, Slavic, Uralic, Finnic, Celtic, and Finno-Ugric language families.
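The article does not detail how that visualization was produced; a rough way to reproduce this kind of figure is to embed the same sentences in several languages with the trained text encoder and project the vectors to 2D, colouring points by language family. In the sketch below, encode_text, the sentence dictionary, and the language-to-family mapping are all assumptions, and t-SNE is just one common choice of projection.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_language_clusters(sentences_by_lang, family_of_lang, encode_text):
    """Project multilingual text embeddings to 2D and colour them by family.
    `encode_text(sentence, lang)` is assumed to return a 1-D numpy vector."""
    families, vectors = [], []
    for lang, sentences in sentences_by_lang.items():
        for sentence in sentences:
            families.append(family_of_lang[lang])
            vectors.append(encode_text(sentence, lang))
    families = np.array(families)
    points = TSNE(n_components=2, init="pca").fit_transform(np.stack(vectors))
    for family in sorted(set(families)):
        mask = families == family
        plt.scatter(points[mask, 0], points[mask, 1], s=8, label=family)
    plt.legend()
    plt.title("Text embeddings, coloured by language family")
    plt.show()
```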
Results
The Google AI team says the results show that ALIGN is improved by adding a bitext ranking objective: the combined model matches ALIGN's zero-shot image-text retrieval performance on well-resourced languages while improving performance on under-resourced languages significantly.
Image: Google AI (Mean recall on various multilingual image–text retrieval benchmarks)
For XTD, MURAL improves recall@10 by 4% on average. On zero-shot WIT retrieval, MURAL improves mean recall by 1.7% on average for nine well-resourced languages and by 8.1% for eight under-resourced ones. After fine-tuning on WIT, MURAL's mean recall is 1.8% and 6.8% better than ALIGN's, on average, for well-resourced and under-resourced languages, respectively.
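For context, recall@k is the fraction of queries whose correct match appears among the top-k retrieved items, and mean recall on these benchmarks is typically the average of recall@1/5/10 over both retrieval directions. Below is a minimal NumPy sketch, assuming row i of the image and text embedding matrices correspond to the same pair and that the embeddings are L2-normalised.

```python
import numpy as np

def recall_at_k(query_embs, index_embs, k=10):
    """Fraction of queries whose true match (same row index) is in the top k."""
    sims = query_embs @ index_embs.T                     # cosine sims if normalised
    top_k = (-sims).argsort(axis=1)[:, :k]               # indices of the k best matches
    hits = (top_k == np.arange(len(sims))[:, None]).any(axis=1)
    return hits.mean()

def mean_recall(image_embs, text_embs, ks=(1, 5, 10)):
    """Average recall@1/5/10 over both retrieval directions."""
    scores = [recall_at_k(image_embs, text_embs, k) for k in ks]   # image -> text
    scores += [recall_at_k(text_embs, image_embs, k) for k in ks]  # text -> image
    return float(np.mean(scores))
```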