
Facebook Gives Away Its Largest Language Database For Free

Facebook’s AI wing recently announced that it will open-source its FLORES-101 database so that researchers can use it to improve multilingual translation models. FLORES-101 is a many-to-many evaluation data set covering 101 languages. The database is available, along with a tech report and models, for free use by researchers and developers worldwide. Facebook claims that making this data publicly available will help researchers accelerate progress in many-to-many translation systems.

According to Facebook, ‘good benchmarks are difficult to construct’ and must reveal tangible differences between translation models. Furthermore, such evaluation benchmarks need to maintain high quality for every language they cover, especially when working translators already exist for popular languages like English, Hindi and Mandarin. The social media giant claims that its open-sourced database will enable developers to build more diverse and locally relevant translation tools. Facebook has partnered with Dynabench to host evaluations for its FLORES benchmark.

About FLORES

Source: Facebook

Facebook created the FLORES-101 dataset in a multi-step workflow (as displayed above). Each document is first translated by a professional translator and then verified by a human editor. After this, it goes through an ‘automatic check’ for quality control, which covers spelling, grammar, punctuation and formatting, and compares the result with translations from commercial engines. Once this is complete, a separate set of human translators evaluates the data and pinpoints other errors, such as unnatural translations. If the human evaluation turns up too many errors, the translations are returned for ‘retranslation’. Otherwise, the translations are ready.
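The workflow above is essentially a loop that repeats until the human evaluation passes. The following is a minimal sketch of that control flow only, not Facebook's actual tooling: the function `run_workflow`, its callable parameters, the error threshold `max_errors` and the cap `max_rounds` are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, Tuple


def run_workflow(
    source: str,
    translate: Callable[[str], str],
    edit: Callable[[str], str],
    automatic_check: Callable[[str], str],
    count_human_errors: Callable[[str, str], int],
    max_errors: int = 2,   # assumed threshold; the article only says "too many errors"
    max_rounds: int = 3,   # assumed cap so the sketch always terminates
) -> Tuple[str, int]:
    """Return the accepted translation and the number of rounds it took."""
    for round_number in range(1, max_rounds + 1):
        draft = translate(source)        # step 1: professional translation
        draft = edit(draft)              # step 2: verification by a human editor
        draft = automatic_check(draft)   # step 3: automatic quality control
        errors = count_human_errors(source, draft)  # step 4: human evaluation
        if errors <= max_errors:
            return draft, round_number   # translation is ready
        # Too many errors: fall through and send the document for retranslation.
    raise RuntimeError("translation still failing after max_rounds attempts")
```

Each stage is passed in as a function, so the same loop can be exercised with simple stand-ins for the translator, editor and evaluators.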

The process described above makes it clear that Facebook’s database has built-in checks to sharpen translation quality. Facebook AI, however, claims that FLORES goes beyond providing higher-quality data. Unlike the majority of available benchmarks, it also focuses on low-resource languages. As per Facebook, more than 80 per cent of the languages FLORES covers are currently low-resource. Additionally, FLORES draws its content from a variety of sources, including news, travel guides and books from diverse genres, giving it broader coverage than other translation benchmarks.

The data set also allows models to be evaluated at the document level instead of on individual sentences, which encourages models that handle context in translation. Finally, FLORES provides supporting information alongside the text, such as hyperlinks, images and URLs, which permits meta-level analysis of models.

What is Facebook Up To

Source: Facebook

Facebook claims that evaluating different translation systems has been challenging for AI researchers, which makes benchmarks critical for developing better systems. Moreover, previous benchmarks have mostly been proprietary datasets and have relied heavily on translating into and out of English. Such data is insufficient for measuring translation between language pairs that do not involve English. For 101 languages, evaluating only into and out of English covers around 200 translation directions against which researchers can measure the quality of translations.

FLORES-101 is more flexible than these previous benchmarks because, as mentioned above, it focuses on many languages that do not have much data for natural language processing (NLP) research, such as Swahili and Amharic, and it translates the same set of sentences into every language. Doing so allows researchers to evaluate the quality of translations across 10,100 different translation directions (e.g. directly from Thai to Urdu or from Hindi to Swahili). According to Antonios Anastasopoulos, assistant professor at the Department of Computer Science at George Mason University, FLORES is especially valuable because it not only draws attention to ‘under-served languages’ but also encourages further research in these languages.
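The direction counts quoted above follow from simple arithmetic: with 101 languages, each language can be translated into the 100 others, while an English-only evaluation covers just the directions into and out of English. A short sketch of that calculation follows; the function names are illustrative and not part of any FLORES tooling.

```python
from itertools import permutations


def many_to_many_directions(num_languages: int) -> int:
    """Ordered (source, target) pairs where source and target differ."""
    return num_languages * (num_languages - 1)


def english_centric_directions(num_languages: int) -> int:
    """Directions into English plus directions out of English."""
    return 2 * (num_languages - 1)


print(many_to_many_directions(101))     # 101 * 100 = 10100
print(english_centric_directions(101))  # 2 * 100 = 200

# The same count, enumerated explicitly for a small set of languages.
sample = ["Thai", "Urdu", "Hindi", "Swahili"]
pairs = list(permutations(sample, 2))
print(len(pairs))  # 4 * 3 = 12 ordered pairs, e.g. ("Thai", "Urdu")
```

Because every sentence is available in every language, any ordered pair can be evaluated directly, without pivoting through English.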

This is one of many projects Facebook has undertaken to improve speech recognition and translation systems. For one, the tech company has collaborated with the Workshop on Machine Translation (WMT), which will host a Large-Scale Multilingual Translation shared task evaluated on the FLORES data set. As part of this task, Facebook has also partnered with Microsoft Azure to offer compute grants for research on low-resource languages, and interested applicants can submit proposals for these grants.

Source: Facebook

Facebook’s other recent work in this area includes Dynabench, its benchmarking platform, which was recently updated with Dynaboard, and wav2vec-U, an ML model that learns to recognise speech without requiring labelled data. Finally, Facebook’s M2M-100 was the first AI model to translate between 100 languages without relying on English. Developments in FLORES-101 help build upon multilingual translation models like M2M-100.

Even in an increasingly globalised world, language remains a serious barrier to free access to information and communication. With this in mind, Facebook AI’s research and its decision to open-source its work seem to be a significant step towards building many more bridges than was possible earlier.


Mita Chaturvedi

I am an economics undergrad who loves drinking coffee and writing about technology and finance. I like to play the ukulele and watch old movies when I'm free.