Can Hindi be Introduced in GPT-3?

We are still in infancy, and to enable more companies to build solutions for mass consumption, it is important that we start building these open-source datasets: Raghu Dharmaraju

At Google for India 2022, it was announced that the AI & Robotics Technology Park (ARTPARK), set up by the Indian Institute of Science (IISc), has teamed up with Google to launch an inclusive, all-India language data initiative for open-sourcing datasets.

The collaboration, known as Project Vaani, intends to gather extensive datasets of spoken languages and transcribed texts from every district in India. To promote research and innovation, these datasets are open-sourced via Vaani’s website and may soon also be available through other platforms like Bhashini of the Ministry of Electronics and Information Technology (MeitY).

Bhashini was an earlier attempt by the Indian government to make AI and NLP (natural language processing) resources available to the public, startups and developers, with the idea that this would spur the development of an inclusive internet that gives Indians easy access in their native languages.

The recent Project Vaani has joined the SYSPIN (Synthesizing Speech in Indian Languages) and RESPIN (Recognizing Speech in Indian Languages) programmes under the Bhāshā AI umbrella of ARTPARK and IISc, which already encompass nine languages including Magadhi and Maithili. 

Google and IISc plan to collect speech samples from 773 districts of India under Project Vaani. To boost the size and diversity of India’s open-sourced language data, the project aims to collect over 150,000 hours of curated speech and 100 million sentences of text in Indian scripts. One of ARTPARK’s goals is to use these datasets to build applications in areas such as health, agriculture, and financial inclusion.

Hindi in GPT-3?

With such large datasets being collected for Indian languages, could developers build an LLM (large language model) based purely on Hindi? To discuss this, Analytics India Magazine reached out to Raghu Dharmaraju, president, ARTPARK.

When asked why India lags when it comes to LLMs, Dharmaraju said that not every problem needs an LLM. For narrow domains, purpose-built models can be trained and may outperform a generic LLM, in addition to costing less and being able to run in low-compute or low-bandwidth environments.

“But even ignoring the LLM approach, India lags behind on data that is good enough for training. The lack of data is perhaps most acute in spoken language – the way we speak daily with friends and family and in our community. But even in formal speech or text, it’s still low,” said Dharmaraju. 

“Even for a language like Hindi, which is spoken by a large population in India, we have very little data that really captures the diversity of how it changes inter- and intra-state.”

But having data alone doesn’t solve the problem; according to Dharmaraju, many other factors come into play. “The data should be discoverable for researchers and application builders like startups. Then they need the infrastructure, including computational power to train models. And ultimately the right infrastructure and capabilities to deploy these models. Indian government and industry recognise this well actually, and are making progress together,” said Dharmaraju.

Challenges devs face with Indian languages

While it is refreshing to see big corporations and governments work towards an inclusive internet for Indian users, developers may face several problems when working with Indian languages. A researcher developing new AI/ML models in English, for instance, has many types of data to draw on. An application developer working in English can also choose from many commercial offerings and even some open-sourced AI/ML models, and so may get away without building models from the ground up.

“Indian languages don’t have that,” said Dharmaraju. “We are still in infancy. And to enable many more companies to build solutions for mass consumption, it is important that we start building these open-sourced datasets and perhaps even some open-sourced foundational AI models for further research and innovation, now!”

Why is Google interested in Indic languages?

If one question comes to mind after the recent Google for India conference, it is this: what is Google’s stake in introducing an inclusive internet for Indians?

Google, for one, is very vocal about its Next Billion Users initiative. According to Google, millions of people come online for the first time every week. “Everyone, no matter their location, language or digital literacy, deserves an internet that was made for them. The initiative conducts research and builds products for people around the world,” says its webpage.

As per Dharmaraju, there is an overall increased realisation on the part of everyone – government, industry, and research – regarding the importance of language AI for Indian languages because digitisation of all kinds of services is happening all around us. 

“Covid accelerated this shift in healthcare. UPI in payments. This digitisation has increased the sense of urgency about the need for language AI that can truly understand our diversity of languages – how we speak, and write,” said Dharmaraju. 

Additionally, he said that while many people in India access the internet (over 700M), very few, perhaps only about 50M, are truly able to use it for their own progress. “And through our collective experience we have realised that the ability to interact in one’s own language can play a big role there,” added Dharmaraju.

Lokesh Choudhary
Tech-savvy storyteller with a knack for uncovering AI's hidden gems and dodging its potential pitfalls. 'Navigating the world of tech', one story at a time.
