Meta’s ‘No Language Left Behind’ 450GB training dataset reproduced & released online


Meta and AllenNLP researchers have released the mined bitext training data for Meta AI’s No Language Left Behind (NLLB-200) models. The release aims to facilitate analysis and documentation, and to let others train on the same data so that comparisons are fair.

The newly released data was the missing piece needed to fully recreate the dataset used to train NLLB-200. AI researchers can now access the full dataset.

The announcement was made in a Twitter post on Wednesday.

The researchers behind this project are Jesse Dodge, Akshita Bhagia and Kenneth Heafield, with support from researchers at Meta who open-sourced the information needed for the reproduction.

The new dataset contains bitext for 148 English-centric and 1,465 non-English-centric language pairs, mined using the stopes mining library and the LASER3 encoders. The complete dataset is estimated to contain 450 GB of text.

The dataset consists of gzipped, tab-delimited text files, one for each direction, with each line of a file holding parallel sentences. The data was filtered using language identification, emoji-based filtering and, for some high-resource languages, language models.
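As an illustration of the file layout described above, here is a minimal Python sketch for reading one gzipped, tab-delimited bitext file. The function name `read_bitext` is hypothetical, and the column layout is an assumption: the sketch treats the first two tab-separated fields on each line as the source and target sentences and ignores any further fields, which may not match the released files exactly.

```python
import gzip
from typing import Iterator, Tuple


def read_bitext(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from a gzipped, tab-delimited file.

    Assumption: the first two tab-separated fields on each line are the
    source and target sentences; any additional fields are ignored.
    """
    # Open in text mode so the gzip stream is decoded line by line.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip empty or malformed lines
            yield fields[0], fields[1]
```

Because the function is a generator, a file can be streamed one pair at a time rather than loaded into memory at once, which matters for a dataset estimated at 450 GB.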

The dataset can also be accessed on the data science platform Hugging Face.

Based in Seattle, AI2 (the Allen Institute for AI) is a non-profit research institute founded in 2014 that conducts high-impact AI research and engineering, taking a results-oriented approach to complex challenges.


Bhuvana Kamath
