
NVIDIA Develops Automatic Speech Recognition Model for Telugu



Telugu is one of India's most commonly spoken languages, with more than 75 million speakers in the south of the country. In the US, the Telugu-speaking population stood at 644,700 in 2020, making it the third most spoken Indian language in the country.

However, it is still considered a low-resource language for conversational AI: there are not enough hours of speech data available for building automatic speech recognition (ASR) models in Telugu. There thus remains huge scope for improving translation and transcription in Telugu and other regional languages.

In the past few years, deep learning has driven remarkable advances in machine translation (MT), with output now approaching human translation quality. A decent MT model needs to be trained on millions of translated sentences, yet gathering this data is expensive. Thousands of languages are spoken worldwide, and the majority of language pairs still lack sufficient training data. Earlier, NVIDIA also released translation models for Tamil-to-English and Inuktitut-to-English.

The NVIDIA speech AI team used its NeMo framework, a toolkit for building and training cutting-edge conversational AI models, to create an ASR model for Telugu. The model won the Telugu ASR challenge held by IIIT-Hyderabad. With word error rates of roughly 13% and 12% for the closed and open tracks, respectively, the NVIDIA NeMo-powered models outperformed all other entries built with well-known ASR frameworks such as ESPnet, Kaldi and SpeechBrain by a significant margin.
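For illustration, a minimal sketch of loading and running a NeMo ASR model is shown below. The checkpoint name and audio path are placeholders taken from NVIDIA's public model catalogue, not the actual challenge model, which is not named here.

```python
# Minimal NeMo inference sketch. The checkpoint is a publicly listed English
# Conformer-CTC model used only as a placeholder; the Telugu challenge
# checkpoint is not named in the article.
import nemo.collections.asr as nemo_asr

# Download a pre-trained checkpoint from NVIDIA's model catalogue.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="stt_en_conformer_ctc_large"
)

# Transcribe a local 16 kHz mono WAV file (hypothetical path).
transcripts = asr_model.transcribe(["sample_clip.wav"])
print(transcripts[0])
```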

“What sets NVIDIA NeMo apart is that we open source all of the models we have — so people can easily fine-tune the models and do transfer learning on them for their use cases,” said Nithin Koluguri, a senior AI research scientist at NVIDIA. “NeMo is also one of the only toolkits that support scaling training to multi-GPU systems and multi-node clusters.”
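As a rough sketch of the transfer-learning workflow Koluguri describes, fine-tuning an open-sourced NeMo checkpoint on new data looks roughly like this; the tokenizer directory, manifest paths, batch size and epoch count below are illustrative assumptions, not the team's actual configuration.

```python
# Hedged transfer-learning sketch: start from an open-sourced NeMo checkpoint,
# swap in a new tokenizer/vocabulary, and fine-tune on your own manifests.
# All paths and hyperparameters here are assumptions.
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_large")

# Replace the tokenizer with one built for the target language (e.g. Telugu).
model.change_vocabulary(new_tokenizer_dir="tokenizers/telugu_bpe",
                        new_tokenizer_type="bpe")

# Point the model at NeMo-style JSON-lines manifests.
model.setup_training_data(OmegaConf.create({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000, "batch_size": 16, "shuffle": True,
}))
model.setup_validation_data(OmegaConf.create({
    "manifest_filepath": "dev_manifest.json",
    "sample_rate": 16000, "batch_size": 16, "shuffle": False,
}))

# NeMo trains through PyTorch Lightning; devices/num_nodes are what scale
# training to the multi-GPU, multi-node setups mentioned in the quote above.
trainer = pl.Trainer(accelerator="gpu", devices=-1, num_nodes=1, max_epochs=50)
trainer.fit(model)
```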

Developing the Telugu ASR Model

The first stage of building the model was data preprocessing. For the competition's closed track, Koluguri and his colleague Megh Makwana, an applied deep learning solution architect manager at NVIDIA, cleaned up the speech dataset by removing invalid characters and punctuation.

The team removed sentences with a character rate, a measure of characters spoken per second, above 30, and dropped audio files longer than 20 seconds or shorter than 1 second. The ASR model, which has 120 million parameters, was then trained with NeMo for 160 epochs, or complete passes through the dataset.
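The filtering rules above map to a short, self-contained sketch over a NeMo-style JSON-lines manifest (one record per utterance with audio_filepath, duration and text fields); the file names are hypothetical.

```python
# Sketch of the filtering described above: drop utterances spoken faster than
# 30 characters per second and clips outside the 1-20 second range.
import json

MAX_CHAR_RATE = 30.0          # characters spoken per second
MIN_DUR, MAX_DUR = 1.0, 20.0  # keep clips between 1 and 20 seconds

with open("train_manifest.json") as src, \
     open("train_manifest_clean.json", "w") as dst:
    for line in src:
        rec = json.loads(line)
        duration = rec["duration"]
        char_rate = len(rec["text"]) / duration if duration > 0 else float("inf")
        if MIN_DUR <= duration <= MAX_DUR and char_rate <= MAX_CHAR_RATE:
            dst.write(json.dumps(rec, ensure_ascii=False) + "\n")
```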

For the competition's open track, the team employed models that had been pre-trained on 36,000 hours of data covering all 40 of India's official languages. Fine-tuning this model for Telugu on an NVIDIA DGX system took about three days. The results of the inference test were then submitted to the competition's organisers, and NVIDIA took first place with almost 2% fewer word errors than the runner-up.
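Word error rate, the metric both tracks were ranked on, can be computed with NeMo's built-in helper; the hypothesis and reference strings below are placeholders.

```python
# Scoring inference results by word error rate (WER) with NeMo's helper.
from nemo.collections.asr.metrics.wer import word_error_rate

hypotheses = ["predicted telugu transcript one", "predicted transcript two"]
references = ["reference telugu transcript one", "reference transcript two"]

wer = word_error_rate(hypotheses=hypotheses, references=references)
print(f"WER: {wer:.2%}")
```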

NVIDIA Promotes Speech AI for Low-Resource Languages

The majority of Indian languages are low-resource, which means that there is a dearth of data for training NLP systems, particularly conversational systems, in these languages. 

Previously, the Indian government introduced Project Bhashini, which aims to provide easy access to the internet and digital services in citizens' local languages. AI4Bharat, an open-source research lab for Indian languages backed by Microsoft's Research Lab and India Development Center (IDC), offers “unrestricted research grants” for developing open-source language technologies.

“ASR is gaining a lot of momentum in India majorly because it will allow digital platforms to onboard and engage with billions of citizens through speech-assistance services,” said Makwana.

The method used to create the Telugu model can be applied to any language. However, 90% of the roughly 7,000 languages spoken around the globe, representing some 3 billion speakers, are regarded as low-resource for speech AI, and that figure does not even account for accents, pidgins and dialects.

One way NVIDIA is enhancing linguistic inclusiveness in speech AI is by open-sourcing all of its models through the NeMo toolkit. Additionally, the NVIDIA Riva software development kit now includes pre-trained speech AI models in 10 languages, with more languages expected in future releases.

Shritama Saha

Shritama (she/her) is a technology journalist at AIM who is passionate about exploring the influence of AI on domains including fashion, healthcare and banking.