Last updated January 18, 2023

Microsoft Unveils NTREX, a New Dataset for Machine Translation

NTREX aims to bridge the language divide with 128 languages, each having 2000 sentences.

Published on January 18, 2023

by Shritama Saha


Microsoft Research has announced NTREX, the second-largest human-translated parallel test set. It covers 128 languages, each with about 2,000 sentences translated with document context and without post-editing.

NTREX, short for “News Text References of English into X Languages”, expands multilingual testing by providing translations of 123 documents (1,997 sentences, 42k words) from English into 128 target languages. The test data is based on WMT19 and is compatible with SacreBLEU.

Read the full paper here.

It can be used to evaluate English-sourced translation models but not in the reverse direction. The test set release also introduces another benchmark for evaluating massively multilingual machine translation research.

To produce this data set, the team sent the original English WMT19 test set to professional human translators. This work started after the release of the WMT19 test data and has continued in parallel with the work on new translation models since then. Translators could access the full document context.

The team scored the NTREX-128 data set with COMET-src, a neural framework for MT evaluation, and compared the scores for the authentic translation direction against the scores obtained in the reverse direction. They also investigated how COMET-src behaves for languages it has not been trained on.

Microsoft Research revealed the following results:

Using COMET-src for test quality estimation is feasible but constrained due to the non-comparability of score ranges across language pairings.
For a significant subset of languages, COMET-src scores on translationese input are better than those on the corresponding authentic source data.
Although COMET-src relative comparisons are valid across all language pairings, there is a subset of languages for which the scores seem faulty.
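The practical consequence of the first and third findings can be sketched in a few lines of Python: compare systems only within a language pair, and avoid reading anything into raw scores across pairs. This is a minimal illustration, not the team's method. The system names, language pairs, and score values below are invented for the example and are not results from the paper.

```python
# Hypothetical quality-estimation scores per language pair.
# All names and values are made up for illustration only.
scores = {
    "en-de": {"system_a": 0.62, "system_b": 0.58},
    "en-fj": {"system_a": 0.21, "system_b": 0.25},
}


def rank_within_pair(pair_scores):
    """Return system names ordered from highest to lowest score for one pair."""
    return sorted(pair_scores, key=pair_scores.get, reverse=True)


# Rank systems separately for each language pair; raw scores are never
# compared across pairs because their ranges are not comparable.
rankings = {pair: rank_within_pair(s) for pair, s in scores.items()}

for pair, order in rankings.items():
    print(pair, "->", " > ".join(order))
```

In this toy data every en-fj score is lower than every en-de score, but that gap says nothing about which language pair is translated better. Only the ordering inside each pair is meaningful.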

The data set consists of the following set of 128 languages: Afrikaans, Albanian, Amharic, Arabic, Azerbaijani, Bangla, Bashkir, Bosnian, Bulgarian, Burmese, Cantonese, Catalan, Central Kurdish, Chinese, Chuvash, Croatian, Czech, Danish, Dari, Divehi, Dutch, English, Estonian, Faroese, Fijian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hebrew, Hindi, Hmong, Hungarian, Icelandic, Indonesian, Inuinnaqtun, Inuktitut, Irish, isiZulu, Italian, Japanese, Kannada, Kazakh, Khmer, Kiswahili, Korean, Kurdish, Kyrgyz, Lao, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Maya (Yucatán), Mongolian, Nepali, Norwegian, Odia, Otomi (Querétaro), Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Samoan, Serbian, Slovak, Slovenian, Somali, Spanish, Swedish, Tahitian, Tajik, Tajiki, Tamil, Tatar, Telugu, Thai, Tibetan, Tigrinya, Tongan, Turkish, Turkmen, Ukrainian, Upper Sorbian, Urdu, Uyghur, Uzbek, Vietnamese, Welsh.

The total count of language names is less than 128, as there are some languages for which multiple scripts or variants are supported.

Three other multilingual test data sets, TICO-19, FLORES-101, and FLORES-200, support 37, 101, and 200 languages, respectively.

The “Translation Initiative for Covid-19” released the TICO-19 dataset. It was a collaborative endeavour between several academic and industrial partners. The benchmark consists of 30 documents translated into 37 target languages from English (3,071 sentences, 69.7k words).

Meta also unveiled its open-source AI model “No Language Left Behind” (NLLB-200), which is capable of providing high-quality translations across 200 different languages, validated through extensive evaluations. Meta developed the FLORES-101 data set, with 3,001 sentences in 842 documents translated from English into 101 target languages. FLORES-200 expands FLORES-101 to 200 target languages and can be used to assess NLLB-200’s performance. FLORES-200 uses the same English source data as FLORES-101.
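To put the sizes quoted in this article side by side, the short Python sketch below records the document, sentence, and language counts for TICO-19, NTREX-128, and FLORES-101 and derives the average number of sentences per document. The `EvalSet` class and its field names are hypothetical helpers introduced for this example. FLORES-200 is left out because the article gives no sentence or document counts for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSet:
    """Size figures for one test set, as quoted in this article."""
    name: str
    documents: int
    sentences: int
    target_languages: int

    @property
    def sentences_per_document(self) -> float:
        # Average document length in sentences.
        return self.sentences / self.documents


EVAL_SETS = [
    EvalSet("TICO-19", documents=30, sentences=3071, target_languages=37),
    EvalSet("NTREX-128", documents=123, sentences=1997, target_languages=128),
    EvalSet("FLORES-101", documents=842, sentences=3001, target_languages=101),
]

for es in EVAL_SETS:
    print(
        f"{es.name:<11} {es.target_languages:>4} languages  "
        f"{es.sentences_per_document:6.1f} sentences/document"
    )
```

On these figures, NTREX-128 documents average roughly 16 sentences each, compared with about 102 for TICO-19 and fewer than 4 for FLORES-101.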


