Meta’s ‘No Language Left Behind’ 450GB training dataset reproduced & released online

Meta and AllenNLP researchers have released the mined bitext training data for Meta AI’s No Language Left Behind (NLLB-200) models. The release is intended to facilitate analysis and documentation, and to let others train on the same data so that comparisons between models are fair.

The newly released data was the missing piece needed to fully recreate the dataset used to train NLLB-200. AI researchers can now access the full dataset.


The announcement was made in a Twitter post on Wednesday.

The researchers behind the project are Jesse Dodge, Akshita Bhagia and Kenneth Heafield, with support from researchers at Meta, who open-sourced the information needed for the reproduction.

The new dataset contains bitext for 148 English-centric and 1,465 non-English-centric language pairs, mined using the stopes mining library and the LASER3 encoders. The complete dataset is estimated to be 450 GB of text.

The dataset consists of gzipped, tab-delimited text files, one for each direction, in which each line holds a pair of parallel sentences. The data was filtered using language identification, emoji-based filtering and, for some high-resource languages, language models.
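For readers who want to inspect such a file, the following is a minimal Python sketch of how a gzipped, tab-delimited bitext file could be read. It is an illustration rather than the researchers' own tooling: the function name read_bitext is hypothetical, and the code assumes that the first two tab-separated columns of each line hold the source and target sentences, with any further columns ignored. The actual column layout should be checked against the dataset's documentation.

```python
import gzip
from typing import Iterator, Tuple


def read_bitext(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from a gzipped, tab-delimited file.

    Assumption: the first two columns are the parallel sentences.
    Lines with fewer than two columns are skipped.
    """
    # Open in text mode so gzip decompresses and decodes the file lazily,
    # one line at a time, without loading it all into memory.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # blank or malformed line
            yield fields[0], fields[1]
```

Because the function is a generator, a loop such as `for src, tgt in read_bitext("some_pair.gz")` (a hypothetical file name) streams pairs one at a time, which matters for a collection estimated at 450 GB.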

The dataset can also be accessed on the data science platform Hugging Face.

Based in Seattle, the Allen Institute for AI (AI2) is a non-profit research institute founded in 2014 that conducts high-impact AI research and engineering, taking a results-oriented approach to complex challenges.

Bhuvana Kamath
I am fascinated by technology and AI’s implementation in today’s dynamic world. Being a technophile, I am keen on exploring the ever-evolving trends around applied science and innovation.
