There are around 7000 languages spoken globally, but most translation models focus on English and other popular languages. This excludes a major part of the world from the benefit of having access to content, technologies and other advantages of being online. Tech giants are trying to bridge this gap. Just days back, Meta announced that it plans to bring out a Universal Speech Translator to translate speech from one language to another in real-time. This announcement is not surprising to anyone who follows the company closely. Meta has been devoted to bringing innovations in machine translations for quite some time now.
Let us take a quick look back into the major highlights of its machine translation journey.
Sign up for your weekly dose of what's up in emerging technology.
Scaling NMT translation
Meta used neural machine translation (NMT) to automatically translate text in posts and comments. NMT models are useful at learning from large-scale monolingual data, and Meta was able to train an NMT model in 32 minutes. This was a drastic reduction in training time from 24 hours.
In 2018, Meta also open-sourced the Language-Agnostic SEntence Representations (LASER) toolkit. It works with over 90 languages that are written in 28 different alphabets. LASER computes multilingual sentence embeddings for zero-shot cross-lingual transfer. It works on low-resource languages as well. Meta said that “LASER achieves these results by embedding all languages jointly in a single shared space (rather than having a separate model for each).”
Wav2vec: unsupervised pre-training for speech recognition
Today, accessing various benefits of technology like GPS, virtual assistants essentially need speech recognition technology. But most of them rely on English, and a major chunk of people who do not speak the language or speak it with an accent not recognisable are excluded from using such an easy and important method of accessing information and services. Wav2vec wanted to solve this. Here, unsupervised pre-training for speech recognition was the focus point for Meta. Wav2vec is trained on unlabeled audio data.
Meta adds, “The wav2vec model is trained by predicting speech units for masked parts of speech audio. It learns basic units that are 25ms long to enable learning of high-level contextualised representations.”
Due to this, Meta has been able to build speech recognition systems that perform way better than best semi-supervised methods, though it can have 100 times less labelled training data.
M2M-100: Multilingual machine translation
2020 was an important year for Meta, where it came out with different models that advanced machine translation technology. M2M-100 was one of them. It is a multilingual machine translation (MMT) model that translates between any pair of 100 languages without depending on English as an intermediary. M2M-100 is trained on a total of 2,200 language directions. This model wants to make the quality of translations worldwide better, especially those who speak low-resource languages claimed Meta.
CoVoST: multilingual speech-to-text translation
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English. What makes it unique is that CoVoST covers over 11,000 speakers and over 60 accents. Meta claims that it is “the first end-to-end many-to-one multilingual model for spoken language translation.”
FLORES 101: low-resource languages
Following M2M-100’s footsteps, in the first half of last year, Meta open-sourced FLORES-101. It is a many-to-many evaluation data set that covers 101 languages globally with a focus on low-resource languages that lack extensive datasets even now. Meta added, “FLORES-101 is the missing piece, the tool that enables researchers to rapidly test and improve upon multilingual translation models like M2M-100.”
In 2022, Meta released data2vec, calling it “the first high-performance self-supervised algorithm that works for multiple modalities.” It was applied to speech, text and images separately and it outperformed the previous best single-purpose algorithms for computer vision and speech. Data2vec does not rely on contrastive learning or reconstructing the input example.