Facebook Gives Away Its Largest Language Database For Free

Facebook’s AI wing recently announced that it will open-source its FLORES-101 database so that researchers can use it to improve multilingual translation models. FLORES-101 is a many-to-many evaluation data set which covers 101 different languages. The database, along with a tech report and models, is freely available to researchers and developers worldwide. Facebook claims that making such information publicly available will empower researchers to accelerate progress in many-to-many translation systems everywhere.

According to Facebook, ‘good benchmarks are difficult to construct’ and must display tangible differences between different translation models. Furthermore, such evaluation benchmarks need to maintain high quality for every language they cover, especially when functioning translators already exist for popular languages like English, Hindi and Mandarin. The social media giant claims that its open-sourced database will enable developers to build more diverse and locally relevant translation tools. Facebook has partnered with Dynabench to host evaluations for its FLORES benchmark.


Source: Facebook

Facebook created the FLORES-101 dataset in a multi-step workflow (as displayed above). Each document is first translated by a professional translator and then verified by a human editor. After this, it goes through an ‘automatic check’ for quality control. This includes spelling checks and edits in grammar, punctuation and formatting, as well as comparison with translations from commercial engines. Once this is complete, a set of human translators evaluates the data and pinpoints other errors, such as unnatural translations. If too many errors surface in the human evaluation process, the translations are returned for ‘retranslation’. Otherwise, the translations are ready.
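The control flow of that workflow can be sketched in a few lines of Python. This is only an illustration of the loop described above, not Facebook's actual tooling: the function name `build_translation`, the four callables standing in for the human and automatic steps, and the `max_errors` and `max_rounds` values are all hypothetical, since the article does not say how many errors count as ‘too many’ or how many retranslation rounds are allowed.

```python
from typing import Callable, List


def build_translation(
    source: str,
    translate: Callable[[str], str],                   # stand-in for the professional translator
    edit: Callable[[str, str], str],                   # stand-in for the human editor
    automatic_check: Callable[[str, str], str],        # stand-in for the automatic quality check
    human_evaluate: Callable[[str, str], List[str]],   # stand-in for human evaluation; returns errors
    max_errors: int = 2,   # assumed threshold for "too many errors"
    max_rounds: int = 3,   # assumed cap on retranslation rounds
) -> str:
    """Run translate -> edit -> automatic check -> human evaluation, retranslating on failure."""
    for _ in range(max_rounds):
        draft = translate(source)
        draft = edit(source, draft)
        draft = automatic_check(source, draft)
        errors = human_evaluate(source, draft)
        if len(errors) <= max_errors:
            return draft  # the translation is ready
        # Otherwise the draft is sent back for retranslation (next loop iteration).
    raise RuntimeError("translation still failing after the allowed number of rounds")
```

In this sketch a draft that fails human evaluation is thrown away and the whole sequence starts again, which mirrors the ‘retranslation’ step in the article.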

The process described above makes it clear that Facebook’s database has built-in steps to sharpen translation quality. Facebook AI, however, claims that FLORES goes beyond providing work of higher quality. It also focuses on low-resource languages, unlike the majority of available benchmarks. As per Facebook, more than 80 per cent of the languages FLORES covers are currently low-resource. Additionally, FLORES draws content from a variety of sources, including news, travel guides and books of diverse genres, to reach a larger audience than other translation benchmarks.

The dataset also allows models to work with translations at the document level instead of going through individual sentences, which encourages better models that understand context. Finally, FLORES provides supporting information, such as hyperlinks, images, or URLs, permitting meta-level analysis of its models.

What Is Facebook Up To?

Facebook claims that evaluating different translation systems has been challenging for AI researchers, making benchmarks critical for the development of superior translation systems. Moreover, previous benchmarks have mostly been proprietary datasets and have relied heavily on translating into and out of English. This makes such data insufficient for fast and precise translation in other languages that are less commonly translated to and from English. An English-centric approach of this kind provides only around 200 translation directions in which researchers can measure the quality of translations.

FLORES-101 is more flexible than these previous systems because, as mentioned above, it focuses on many languages that do not have much data for natural language processing (NLP) research, such as Swahili and Amharic, and translates the same set of sentences into every language. Doing so allows researchers to evaluate the quality of translations in 10,100 different translation directions (e.g. directly from Thai to Urdu or Hindi to Swahili). According to Antonios Anastasopoulos, assistant professor at the Department of Computer Science at George Mason University, FLORES is vital because it not only draws attention to ‘under-served languages’ but also encourages further research in these languages.
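The 10,100 figure follows from simple counting: with 101 languages, each can be translated into any of the other 100. A short Python check makes the arithmetic explicit. The function names are illustrative only, and the English-centric count is included on the assumption that the ‘around 200’ directions mentioned earlier refer to translating only into and out of English, which the article does not spell out.

```python
def many_to_many_directions(n_languages: int) -> int:
    """Ordered (source, target) pairs with source != target."""
    return n_languages * (n_languages - 1)


def english_centric_directions(n_languages: int) -> int:
    """Only pairs that have English on one side: into English and out of English."""
    return 2 * (n_languages - 1)


print(many_to_many_directions(101))     # 10100
print(english_centric_directions(101))  # 200
```

Because every language shares the same set of sentences, any of these ordered pairs can be scored directly, without passing through English.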

This marks one of many projects undertaken by Facebook to improve speech recognition and translation systems. For one, the tech company has collaborated with the Workshop on Machine Translation (WMT) to host a Large-Scale Multilingual Translation shared task, the evaluation of which will be based on the FLORES data set. As a part of this task, Facebook has also partnered with Microsoft Azure to offer compute grants for research on low-resource languages. Interested applicants can submit proposals for this grant.

Other such work from Facebook includes its advancements in Dynabench, a platform recently updated with Dynaboard, and the wav2vec-U ML model, which understands speech without requiring labelled data. Finally, Facebook’s M2M-100 was the first AI model to translate 100 languages without relying on English. Developments in FLORES-101 help build upon multilingual translation models like M2M-100.

Even in an increasingly globalised world, language remains a serious barrier to free access to information and communication. Keeping this in mind, Facebook AI’s research and its decision to open-source its work seem to be a significant step towards building many more bridges than was previously possible.

Mita Chaturvedi
I am an economics undergrad who loves drinking coffee and writing about technology and finance. I like to play the ukulele and watch old movies when I'm free.