“Machine learning models need to be trained with large amounts of transcribed speech audio. This process is effective but expensive.”
Decades ago, asking for basic directions or information in an unfamiliar language would have been an arduous task. The advent of solutions such as Google Translate alleviated that problem to an extent. But is this enough? India alone, for example, is home to 23 official languages and thousands of unofficial ones, yet Google Translate supports just 11 of them. Other speech recognition technologies may support even fewer. Moreover, languages such as Basque and Swahili are far likelier to have limited AI speech recognition capabilities than Hindi, English or Mandarin.
Hence, the paths opened up by speech recognition technology are available to only a small fraction of the countless languages spoken all over the world. This is because most AI for speech recognition belongs to a category called supervised learning: the machine learning systems behind it need to be trained with large amounts of transcribed speech audio. This process is effective but expensive, and it must be repeated for each language. Such data is simply not available at that scale, if at all, for most languages and dialects spoken on Earth.
Facebook finds a way
In a recent blog post, Facebook revealed its new AI-based speech recognition technology, wav2vec-Unsupervised (or wav2vec-U), which aims to solve the problem of transcribing such languages. The method lets developers build speech recognition systems that do not require transcribed data.
The ML algorithm still requires some form of training: wav2vec-U is trained purely on recorded speech audio and unpaired text. The method first learns the structure of the target language’s speech from unlabelled audio. Using wav2vec 2.0, Facebook’s self-supervised speech representation model, together with a k-means clustering algorithm, wav2vec-U segments the voice recording into speech units loosely corresponding to individual sounds. For instance, the word “cat” corresponds to the sounds /K/, /AE/ and /T/. This segmentation allows the model to capture the structure of the language’s speech.
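To make the segmentation step concrete, here is a minimal sketch of how k-means clustering can turn a stream of frame-level audio representations into discrete speech units. The 2-D vectors below are fabricated stand-ins for wav2vec 2.0 embeddings, and the phoneme labels in the comments are purely illustrative; this is not Facebook’s actual pipeline.

```python
import numpy as np

# Toy stand-ins for wav2vec 2.0 frame representations: three well-separated
# blobs of 2-D vectors, each blob playing the role of one sound.
rng = np.random.default_rng(0)
frames = np.vstack([
    rng.normal(loc=[0, 0], scale=0.1, size=(20, 2)),   # frames of, say, /K/
    rng.normal(loc=[3, 3], scale=0.1, size=(20, 2)),   # frames of /AE/
    rng.normal(loc=[0, 6], scale=0.1, size=(20, 2)),   # frames of /T/
])

def kmeans(points, k, iters=20):
    """Minimal Lloyd's k-means with deterministic farthest-point init."""
    centroids = [points[0]]
    for _ in range(k - 1):
        # Next centroid: the point farthest from all chosen centroids.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(iters):
        labels = np.linalg.norm(
            points[:, None] - centroids[None], axis=-1).argmin(axis=1)
        for j in range(k):
            centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

centroids, labels = kmeans(frames, k=3)

# Merge consecutive frames with the same cluster id into one segment,
# yielding variable-length "speech units" loosely analogous to sounds.
segments = [int(labels[0])]
for lab in labels[1:]:
    if lab != segments[-1]:
        segments.append(int(lab))

print(len(set(labels.tolist())), len(segments))  # → 3 3
```

In the real system the clusters are learned over high-dimensional self-supervised representations, but the idea is the same: nearby frames collapse into a shared discrete unit, exposing the sound structure of the language.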
To recognise the words in an audio recording, wav2vec-U uses a generative adversarial network (GAN) consisting of a generator and a discriminator network. The generator takes each audio segment, embedded in self-supervised representations, and predicts a phoneme, i.e. a sound unit, corresponding to a sound in the language. The discriminator then decides whether the predicted phoneme sequence looks realistic. According to Facebook, the initial transcriptions are extremely poor but improve over time via the discriminator’s feedback.
The discriminator, itself a neural network, is fed real text from various sources that has been phonemised. This lets it learn to distinguish the generator’s speech recognition output from genuine text. Early on, the generator might transcribe an utterance phonetically as the nonsensical “Tow are uu”; using the discriminator’s knowledge of real-world phonemised text, the system gradually refines such output into legible transcriptions.
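The adversarial setup described above can be sketched schematically. In this toy version, which is an assumption-laden simplification rather than Facebook’s implementation, the generator is a single linear layer mapping segment embeddings to a phoneme distribution, and the discriminator is replaced by a bigram plausibility score built from phonemised “real” text (the actual discriminator is a neural network, but its role is the same). The phoneme inventory and embedding size are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

PHONEMES = ["K", "AE", "T", "S", "IH"]  # toy inventory, not the real one

# Generator: a linear map from a 4-D segment embedding (standing in for the
# self-supervised representation) to logits over the phoneme inventory.
W = rng.normal(size=(4, len(PHONEMES)))

def generator(segment_embeddings):
    logits = segment_embeddings @ W
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)      # softmax over phonemes
    return probs.argmax(axis=1)                   # predicted phoneme ids

# "Discriminator": scores how plausible a phoneme sequence looks, using
# bigrams collected from phonemised real text.
real_text = [["K", "AE", "T"], ["S", "IH", "T"]]  # phonemised real words
bigrams = set()
for word in real_text:
    bigrams.update(zip(word, word[1:]))

def discriminator(phoneme_seq):
    pairs = list(zip(phoneme_seq, phoneme_seq[1:]))
    if not pairs:
        return 0.0
    return sum(p in bigrams for p in pairs) / len(pairs)  # in [0, 1]

segments = rng.normal(size=(3, 4))                # three fake segment embeddings
pred = [PHONEMES[i] for i in generator(segments)]
score = discriminator(pred)
print(pred, score)
```

In a real GAN the plausibility score would be backpropagated to update the generator’s weights, so that implausible sequences like “Tow are uu” become progressively less likely; here the single forward pass is only meant to show which component plays which role.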
Facebook has run several evaluations of wav2vec-U. The first was on the TIMIT Acoustic-Phonetic Continuous Speech Corpus, a standard dataset used to evaluate automatic speech recognition systems. There, Facebook reports that its model reduced error rates by 57 per cent compared with the next best unsupervised method. Facebook also tested wav2vec-U against supervised models on the broader Librispeech benchmark and found it as accurate as effective supervised models from only a few years ago, even though it used no labelled training data.
Source: Facebook AI
Finally, Facebook also tried the method on languages other than English, since unsupervised speech recognition would be most impactful for languages with little labelled data. The results showed that wav2vec-U remains effective even for languages such as Kyrgyz and Tatar, which have limited data resources.
Wav2vec-U is the result of years of Facebook AI research on speech recognition and unsupervised machine translation. Facebook believes the system will allow more people to benefit from speech technology. The company admits, however, that much more research is needed to address concerns such as possible bias in the system, especially regarding racial and gender stereotypes. Facebook states that it has not yet investigated this in the model, though it notes that removing the transcription step might itself help avoid biases introduced through data labelling. Facebook is also releasing the code for wav2vec-U as open source, allowing more developers to build speech recognition systems without the need for labelled data.