Facebook AI Research (FAIR) recently trained a single acoustic model across multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages. The research also aims to simplify the overall deployment of ASR systems that support diverse languages.
Multilingual ASR systems that can transcribe several languages simultaneously have gained much traction over the past few years. A single model capable of recognising hundreds of different languages, however, has been a long-standing goal in the field.
Existing approaches, such as end-to-end models and multilingual sequence-to-sequence models, have notable drawbacks: they typically study a limited number of languages, often fewer than 10, or rely on limited data, such as readings of the Bible.
According to the researchers, the new speech recognition system overcomes these issues and is claimed to be the first to study multilingual ASR at a massive scale, covering 51 languages and more than 16,000 hours of audio.
Behind The System
In this project, the researchers demonstrated that it is possible to train a single massive automatic speech recognition (ASR) architecture for 51 different languages, and that it is considerably less time-consuming to tune than 51 separate monolingual baselines.
The researchers compared three variants of multilingual training: a sequence-to-sequence (Seq2Seq) model, a single joint model and a multi-headed model.
Seq2Seq Model
A sequence-to-sequence model comprises two neural networks, an encoder and a decoder, and converts sequences from one domain into sequences in another; in ASR, audio frames are mapped to text tokens.
According to the researchers, in this variant the decoder shares its parameters across languages that have no common graphemes, which is unlikely to improve recognition performance for any of the languages.
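As a rough illustration of what such an encoder-decoder ASR skeleton looks like, here is a minimal PyTorch sketch; the layer choices, sizes and module names are hypothetical and are not FAIR's actual architecture.

```python
import torch
import torch.nn as nn

class Seq2SeqASR(nn.Module):
    """Minimal encoder-decoder ASR sketch (illustrative only, not FAIR's model)."""

    def __init__(self, feat_dim=80, vocab_size=1000, hidden=256):
        super().__init__()
        # Encoder maps acoustic feature frames to hidden states.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        # Decoder predicts output tokens (e.g. graphemes) conditioned on the audio.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, tokens):
        enc_out, _ = self.encoder(feats)            # (batch, audio_frames, hidden)
        dec_out, _ = self.decoder(self.embed(tokens))
        # Attend over encoder states to align text positions with audio.
        ctx, _ = self.attn(dec_out, enc_out, enc_out)
        return self.out(ctx)                        # (batch, text_len, vocab_size)

model = Seq2SeqASR()
feats = torch.randn(2, 200, 80)                     # e.g. 80-dim filterbank frames
tokens = torch.randint(0, 1000, (2, 30))            # target token ids
print(model(feats, tokens).shape)                   # torch.Size([2, 30, 1000])
```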
Joint Model
The joint model approach in this research is a single model trained with the parameters of the encoder, the decoder and the token set shared across all languages.
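In practice, "joint" simply means one set of weights and one token inventory serving every language. The toy snippet below sketches how a shared character-level vocabulary could be built so that a single embedding table and output layer cover all languages; the transcripts are invented placeholders, not the paper's data or tokenisation.

```python
# Toy sketch: build one shared token inventory so a single encoder/decoder
# can serve every language. Transcripts are invented placeholders.
transcripts = {
    "en": ["hello world", "speech recognition"],
    "fr": ["bonjour le monde"],
    "hi": ["नमस्ते दुनिया"],
}

shared_tokens = sorted({ch for lines in transcripts.values()
                        for line in lines for ch in line})
token_to_id = {tok: i for i, tok in enumerate(shared_tokens)}

# Every language uses the same mapping, so the embedding table and the
# decoder's output layer are shared across all of them.
encoded = {lang: [[token_to_id[c] for c in line] for line in lines]
           for lang, lines in transcripts.items()}
print(len(token_to_id), "shared tokens")
```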
Multi-headed Model
The multi-headed model employs a single encoder whose parameters are shared across all languages, together with multiple decoders, or "heads", each serving a group of similar languages. This overcomes the drawback of the Seq2Seq variant by letting the system choose the appropriate decoder for each language.
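A minimal way to picture the multi-headed variant is a shared encoder feeding one of several decoder heads, selected by the utterance's language group. The sketch below is illustrative only; the group names, sizes and the simple linear "heads" are placeholders rather than the paper's actual decoders.

```python
import torch
import torch.nn as nn

class MultiHeadASR(nn.Module):
    """Sketch of the multi-headed idea: one shared encoder and one decoder
    head per group of similar languages (illustrative sizes and names)."""

    def __init__(self, groups, feat_dim=80, hidden=256, vocab_size=1000):
        super().__init__()
        # Encoder parameters are shared by every language.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        # One lightweight head per language group; group names are hypothetical.
        self.heads = nn.ModuleDict({g: nn.Linear(hidden, vocab_size) for g in groups})

    def forward(self, feats, group):
        enc_out, _ = self.encoder(feats)
        # Route the shared encoder output to the head matching the language group.
        return self.heads[group](enc_out)

model = MultiHeadASR(groups=["romance", "germanic", "indic"])
feats = torch.randn(2, 200, 80)
print(model(feats, group="romance").shape)   # torch.Size([2, 200, 1000])
```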
Dataset Used
The training dataset used in this research consists of videos publicly shared by users, spanning a total of 51 languages; the data is anonymised before use. The languages are grouped into three categories (a toy bucketing example follows the list):
- High-resource languages, with more than 600 hours of training data,
- Mid-resource languages, with around 300-500 hours of training data, and
- Low-resource languages, with around 100-150 hours of training data.
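The toy function below illustrates this bucketing; the language names and hour counts are invented placeholders, and the thresholds simply follow the rough ranges listed above.

```python
# Toy sketch of the resource bucketing described above; the hour counts
# per language are invented, not the paper's numbers.
train_hours = {"english": 4000, "italian": 400, "zulu": 120}

def resource_bucket(hours):
    if hours > 600:
        return "high"
    if hours >= 300:
        return "mid"
    return "low"

print({lang: resource_bucket(h) for lang, h in train_hours.items()})
# {'english': 'high', 'italian': 'mid', 'zulu': 'low'}
```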
Benefits of This ASR
Some of the benefits of this ASR are mentioned below:
- The researchers showed that multilingual training of ASR models can improve recognition performance, particularly for low-resource languages.
- Having a single model for all languages can simplify the production pipeline significantly.
- Training multilingual ASR models on a small set of similar languages can improve recognition performance.
Contributions of This Research
The researchers at FAIR made several contributions in this research. Some of them are mentioned below:
- The researchers trained the ASR model on 51 languages from several language families.
- They showed that a joint model with a shared vocabulary can surpass strong monolingual baselines on low-resource languages.
- They proposed a refined multi-headed approach, where each head addresses a set of similar languages; it improves on the monolithic joint-model approach and leads to competitive results.
- Further, the researchers demonstrated that the multilingual model learns representations general enough to improve over monolingual baseline word error rates (WER) on new languages that were unknown during the initial training phase.
Read the paper here.