Telugu is one of India's most commonly spoken languages, with more than 75 million speakers in the country's south. In the US, the Telugu-speaking population stood at 644,700 in 2020, making it the third most spoken Indian language there.
However, Telugu is still considered a low-resource language for conversational AI, because there are not enough hours of transcribed speech data for building automatic speech recognition (ASR) models in the language. There thus remains huge scope for improving translation and transcription in Telugu and other regional languages.
In the past few years, deep learning has driven remarkable advances in machine translation (MT), and translation quality is now approaching that of human translators. A decent MT model needs to be trained on millions of translated sentences, yet gathering this data is expensive. Thousands of languages are spoken worldwide, and the majority of language pairs lack sufficient training data. Earlier, NVIDIA also released translation models for Tamil-to-English and Inuktitut-to-English.
The NVIDIA speech AI team used its NeMo framework for creating and training conversational AI models to build an ASR model for Telugu. The model emerged as the winner of the Telugu ASR challenge held by IIIT-Hyderabad. With word error rates of roughly 13% and 12% for the closed and open tracks, respectively, NVIDIA NeMo-powered models outperformed all other entries built with well-known ASR frameworks such as ESPnet, Kaldi, and SpeechBrain by a significant margin.
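Word error rate, the metric used to rank the challenge entries, counts the word-level substitutions, deletions, and insertions needed to turn the system's transcript into the reference, divided by the reference length. A minimal sketch (the function name and implementation are illustrative, not from NeMo):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein edit distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

A WER of 13% therefore means roughly one word in eight is transcribed incorrectly relative to the reference.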
“What sets NVIDIA NeMo apart is that we open source all of the models we have — so people can easily fine-tune the models and do transfer learning on them for their use cases,” said Nithin Koluguri, a senior AI research scientist at NVIDIA. “NeMo is also one of the only toolkits that support scaling training to multi-GPU systems and multi-node clusters.”
Developing the Telugu ASR Model
In the initial stage of creating the model, the data were preprocessed. Then, for the competition’s closed track, Koluguri and his colleague Megh Makwana, an applied deep learning solution architect manager at NVIDIA, cleaned up the speech dataset by removing incorrect letters and punctuation.
The team removed sentences with a character rate (characters spoken per second) higher than 30, and discarded audio files longer than 20 seconds or shorter than 1 second. The ASR model, which has 120 million parameters, was then trained with NeMo for 160 epochs, or complete passes over the dataset.
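The filtering step described above can be sketched as a simple predicate over dataset entries. This is an illustrative sketch, not NVIDIA's actual pipeline; the field names ("text", "duration") are assumed, loosely following NeMo's manifest convention:

```python
# Assumed thresholds from the article: character rate <= 30 chars/sec,
# audio duration between 1 and 20 seconds.
MAX_CHAR_RATE = 30.0
MIN_DUR, MAX_DUR = 1.0, 20.0

def keep_sample(sample: dict) -> bool:
    """Return True if a (text, duration) pair passes both filters."""
    duration = sample["duration"]
    if not (MIN_DUR <= duration <= MAX_DUR):
        return False
    char_rate = len(sample["text"]) / duration
    return char_rate <= MAX_CHAR_RATE

samples = [
    {"text": "నమస్కారం", "duration": 2.0},  # passes both filters
    {"text": "a" * 200, "duration": 3.0},   # rejected: ~66.7 chars/sec
    {"text": "hello", "duration": 0.5},     # rejected: under 1 second
]
kept = [s for s in samples if keep_sample(s)]
```

Filters like these cut mislabeled or badly segmented clips, which otherwise teach the model to align text with audio that does not contain it.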
For the competition's open track, the team employed models pretrained on 36,000 hours of data covering 40 Indian languages. Fine-tuning this model for Telugu took about three days on an NVIDIA DGX system. The inference results were then submitted to the competition's organisers, and NVIDIA took first place with almost 2% fewer word errors than the runner-up.
NVIDIA Promotes Speech AI for Low-Resource Languages
The majority of Indian languages are low-resource, which means that there is a dearth of data for training NLP systems, particularly conversational systems, in these languages.
Previously, the Indian government introduced Project Bhashini, which aims to provide simple access to the internet and digital services in citizens' local languages. AI4Bharat, backed by Microsoft's Research Lab and India Development Center (IDC), is an open-source research lab for Indian languages that offers "unrestricted research grants" for developing open-source language technologies.
“ASR is gaining a lot of momentum in India majorly because it will allow digital platforms to onboard and engage with billions of citizens through speech-assistance services,” said Makwana.
The method used to create the Telugu model can be applied to any language. However, roughly 90% of the world's approximately 7,000 languages, representing 3 billion speakers, are regarded as low-resource for speech AI, and this figure excludes accents, pidgins, and dialects.
One way NVIDIA is enhancing linguistic inclusiveness in speech AI is by open-sourcing all of its models through the NeMo toolkit. Additionally, the NVIDIA Riva software development kit now includes pre-trained speech AI models in 10 languages, with support for more languages anticipated.