Facebook has announced a new machine translation model that can translate directly between any pair of 100 languages without relying on English data. The model, known as M2M-100, has been trained on 2,200 language directions, ten times more than the previous best English-centric multilingual models.
As per the company, the model outperforms English-centric multilingual models by 10 points on BLEU (bilingual evaluation understudy), a metric that evaluates the quality of machine-translated text. The M2M-100 model has been open-sourced.
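To give a feel for what a BLEU score measures, here is a minimal sketch of a simplified, single-reference, sentence-level variant. It is an illustration only, not the evaluation setup Facebook used: the function names are hypothetical, tokenisation is plain whitespace splitting, and no smoothing is applied.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU on whitespace tokens, in the range [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are penalised.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(sentence_bleu("the cat sat on the mat", reference))
print(sentence_bleu("the cat sat on a mat", reference))
```

BLEU is usually reported on a 0 to 100 scale, so a 10-point gain corresponds to 0.10 on the 0 to 1 scale used in this sketch. Published results are normally computed over a whole test corpus with a standard tool rather than sentence by sentence.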
The ultimate goal of this multilingual machine translator is a single model that can translate in both directions between the 7,000 languages of the world, benefiting low-resource languages in particular. The novelty of Facebook’s M2M-100 model lies in the fact that it does not depend on English as a link between two languages. For example, for translation between Chinese and Hindi, English-centric systems typically train on Chinese-to-English and English-to-Hindi data; the M2M-100 model instead trains directly on Chinese-to-Hindi data to better preserve the original meaning.
How Does M2M-100 Work
Facebook has built a many-to-many data set with 7.5 billion sentences for 100 languages using novel mining techniques. Using several scaling techniques, it then built a universal model with 15 billion parameters. These parameters capture information from related languages and reflect a more diverse set of language scripts and morphology.
To this end, several complementary data mining resources such as ccAligned, ccMatrix, and LASER were combined. The massive training data set was created by mining ccNET, which is built on fastText, a library for word embeddings and text classification; ccMatrix, which uses the LASER library to embed sentences in a multilingual embedding space; and ccAligned, a method that aligns documents based on their URLs. Facebook has also upgraded the LASER and fastText libraries to improve the quality of mining.
Even with improved mining techniques, mining training data for every possible pair of 100 languages was still found to be computationally intensive; hence, Facebook prioritised mining the translation directions with the highest quality and largest quantity of data. Statistically rare translation directions such as Icelandic-Nepali or Sinhala-Javanese were omitted.
To connect the languages of different families, a few ‘bridge languages’ were identified. The languages were classified into 14 broad groups based on linguistic, geographic, and cultural similarities. Up to three major languages from each of these groups were designated as bridge languages. For example, Hindi, Urdu, and Bengali were chosen as bridge languages for the Indo-Aryan group. The researchers at Facebook then mined parallel training data for all possible combinations of these bridge languages. This technique yielded 7.5 billion sentences of data.
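The saving from this sparse mining strategy can be sketched with a toy example. The groups and bridge choices below are made up for illustration (only the Indo-Aryan bridges come from the text above), and the sketch assumes that languages within the same group are also mined against each other, which this article does not spell out.

```python
from itertools import combinations

# Toy grouping for illustration only; the real model uses 14 groups over 100 languages.
GROUPS = {
    "indo_aryan": ["hi", "ur", "bn", "mr", "ne"],
    "romance": ["fr", "es", "pt", "it", "ro"],
    "germanic": ["de", "nl", "sv", "da", "is"],
}

# Up to three bridge languages per group. Only the Indo-Aryan choice is taken
# from the article; the others are hypothetical.
BRIDGES = {
    "indo_aryan": ["hi", "ur", "bn"],
    "romance": ["fr", "es", "pt"],
    "germanic": ["de", "nl", "sv"],
}


def mining_pairs(groups, bridges):
    """Return the set of unordered language pairs selected for mining."""
    pairs = set()
    # Assumption: languages in the same group are mined against each other.
    for langs in groups.values():
        pairs.update(frozenset(pair) for pair in combinations(langs, 2))
    # From the article: every combination of bridge languages is mined.
    all_bridges = [lang for langs in bridges.values() for lang in langs]
    pairs.update(frozenset(pair) for pair in combinations(all_bridges, 2))
    return pairs


all_langs = [lang for langs in GROUPS.values() for lang in langs]
dense = len(list(combinations(all_langs, 2)))
selected = mining_pairs(GROUPS, BRIDGES)
sparse = len(selected)
print(f"all pairs: {dense}, mined pairs: {sparse}")
print("Icelandic-Nepali mined:", frozenset({"is", "ne"}) in selected)
print("Hindi-French mined:", frozenset({"hi", "fr"}) in selected)
```

In this toy setup, a rare pair such as Icelandic-Nepali is skipped because neither language is a bridge, while Hindi-French is covered through the bridge set. For 100 languages, mining every unordered pair would mean 100 × 99 / 2 = 4,950 pairs, which is why restricting the set matters.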
For low-resource languages, back-translation was used to supplement the mined data. For example, to train a Chinese-to-French translation model, a French-to-Chinese model was first trained and used to translate all the monolingual French data into synthetic, back-translated Chinese. This method was found to be particularly effective at turning large volumes of monolingual sentences into parallel data sets.
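The data flow of back-translation can be sketched as follows. This is a minimal illustration, not Facebook's pipeline: `back_translate` and `toy_fr_to_zh` are hypothetical names, and a small lookup table stands in for a trained French-to-Chinese model.

```python
from typing import Callable, List, Tuple


def back_translate(
    monolingual_target: List[str],
    reverse_model: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from target-side monolingual text.

    reverse_model translates target -> source (French -> Chinese in the
    article's example), so each real target sentence gets a synthetic source.
    """
    pairs = []
    for target_sentence in monolingual_target:
        synthetic_source = reverse_model(target_sentence)
        pairs.append((synthetic_source, target_sentence))
    return pairs


# Hypothetical stand-in for a trained French-to-Chinese model.
TOY_FR_TO_ZH = {"bonjour": "你好", "merci": "谢谢"}


def toy_fr_to_zh(sentence: str) -> str:
    """Look the sentence up in the toy table; unknown input maps to <unk>."""
    return TOY_FR_TO_ZH.get(sentence, "<unk>")


french_monolingual = ["bonjour", "merci"]
synthetic_zh_fr = back_translate(french_monolingual, toy_fr_to_zh)

# Each pair is (synthetic Chinese, real French) and could be added to the
# training data for the Chinese-to-French direction.
for zh, fr in synthetic_zh_fr:
    print(zh, "->", fr)
```

The target side of every pair is genuine human-written French; only the Chinese source side is machine-generated, which is why the synthetic pairs are used to train the Chinese-to-French direction.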
Together, the two strategies mentioned above, bridge languages and back-translation, improved translation performance across 100 language directions by 1.7 BLEU points compared with models trained on mined data alone. Further, the M2M-100 model delivered impressive results in zero-shot settings, where no training data is available for a pair of languages.
Facebook also invited a group of native speakers to judge the quality of translation between 20 language pairs, none of which involved English. As per the participants, the quality of translation and the fidelity to the original wording remained high. However, the model was found to fall short when translating slang, where M2M-100 applied a word-for-word translation and the true essence was lost.
Further, the model was found to be susceptible to grammatical issues, which may lead to misinterpretation. Facebook acknowledged these shortcomings, noting in the paper detailing the M2M-100 model, “For many languages, we require substantial improvements before reasonable translations can be reliably obtained.”