Meta and AllenNLP researchers have released mined bitext training data for Meta AI’s No Language Left Behind (NLLB-200) models. The release aims to facilitate analysis and documentation, and to let others train on the same data so that comparisons are fair.
The newly released data was the missing piece needed to fully recreate the dataset used to train NLLB-200, so AI researchers can now access the full dataset.
The announcement was made in a Twitter post on Wednesday.
The researchers behind this project are Jesse Dodge, Akshita Bhagia and Kenneth Heafield, working with researchers at Meta who open-sourced the information needed for this reproduction.
The new dataset contains bitext for 148 English-centric and 1,465 non-English-centric language pairs, mined using the stopes mining library and the LASER3 encoders. The complete dataset is estimated to be 450 GB of text.
The dataset consists of gzipped, tab-delimited text files, one for each direction, with each line of a file containing parallel sentences. The data was filtered using language identification, emoji-based filtering and, for some high-resource languages, language models.
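For readers who want to inspect files in this format, the short Python sketch below shows one way to read a gzipped, tab-delimited file line by line. It is an illustration under stated assumptions, not the project's own tooling: the function names `read_bitext` and `write_sample`, the sample sentences, and the assumption that the first two columns hold the source and target sentence are all hypothetical, and the actual files may carry additional columns.

```python
import gzip
import os
import tempfile


def read_bitext(path):
    """Yield (source, target) pairs from a gzipped, tab-delimited text file.

    Assumption: the first two tab-separated columns hold the two sentences.
    Any further columns are ignored here.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip lines that do not contain a sentence pair
            yield fields[0], fields[1]


def write_sample(path):
    """Write a tiny made-up sample file so the reader can be tried locally."""
    rows = [
        ("Hello world.", "Bonjour le monde."),
        ("Thank you.", "Merci."),
    ]
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for source, target in rows:
            handle.write(f"{source}\t{target}\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        sample_path = os.path.join(tmp_dir, "sample.tsv.gz")
        write_sample(sample_path)
        for source, target in read_bitext(sample_path):
            print(source, "|||", target)
```

Because the reader streams one line at a time, it never loads a whole file into memory, which matters for a collection estimated at 450 GB.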
The dataset can also be accessed on the data science platform Hugging Face.
Based in Seattle, AI2 (the Allen Institute for AI), which develops AllenNLP, is a non-profit research institute founded in 2014 that conducts high-impact AI research and engineering, taking a results-oriented approach to complex challenges.