At Google for India 2022, it was announced that the AI & Robotics Technology Park (ARTPARK), set up by the Indian Institute of Science (IISc), had teamed up with Google to launch an all-India inclusive language data initiative for open-sourcing datasets.
The collaboration, known as Project Vaani, intends to gather extensive datasets of spoken languages and transcribed texts from every district in India. To promote research and innovation, these datasets are open-sourced via Vaani’s website and may soon also be available through other platforms like Bhashini of the Ministry of Electronics and Information Technology (MeitY).
Bhashini is another Indian government initiative to make AI and NLP (natural language processing) resources available to the public, startups and developers, with the aim of fostering an inclusive internet that gives Indians easy access to online services in their native languages.
Project Vaani joins the SYSPIN (Synthesizing Speech in Indian Languages) and RESPIN (Recognizing Speech in Indian Languages) programmes under the Bhāshā AI umbrella of ARTPARK and IISc, which already cover nine languages, including Magadhi and Maithili.
Google and IISc plan to collect speech samples from all 773 districts of India under Project Vaani. To boost the size and diversity of India’s open-sourced language data, the project aims to collect over 150,000 hours of curated speech and 100 million sentences of text in Indian scripts. One of ARTPARK’s goals is to use these datasets to build applications in areas such as health, agriculture, and financial inclusion.
Hindi in GPT-3?
However, with such huge datasets being collected for Indian languages, is it possible for developers to build an LLM (large language model) purely based on Hindi? To discuss this, Analytics India Magazine reached out to Raghu Dharmaraju, president of ARTPARK.
When asked why India lags when it comes to LLMs, Dharmaraju was of the opinion that not everything needs LLMs. For specific narrow domains, purpose-built models can be trained and may perform better than a generic LLM, not to mention their lower costs and ability to work in low-compute or low-bandwidth environments.
“But even ignoring the LLM approach, India lags behind on data that is good enough for training. The lack of data is perhaps most acute in spoken language – the way we speak daily with friends and family and in our community. But even in formal speech or text, it’s still low,” said Dharmaraju.
“Even for a language like Hindi, which is spoken by a large population in India, we have very little data that really captures the diversity of how it changes inter- and intra-state.”
But having data alone doesn’t solve the problem either; as per Dharmaraju, many other factors come into play. “The data should be discoverable for researchers and application builders like startups. Then they need the infrastructure, including computational power to train models. And ultimately the right infrastructure and capabilities to deploy these models. Indian government and industry recognise this well actually, and are making progress together,” said Dharmaraju.
Challenges devs face with Indian languages
While it is refreshing to see big corporations and governments work towards an inclusive internet for Indian users, developers face several challenges when working with Indian languages. For instance, a researcher developing new AI/ML models for English has many kinds of data to draw on. In fact, an application developer working in English can rely on offerings from companies and even some open-sourced AI/ML models, and so may get away without building models entirely from the ground up.
“Indian languages don’t have that,” said Dharmaraju. “We are still in infancy. And to enable many more companies to build solutions for mass consumption, it is important that we start building these open-sourced datasets and perhaps even some open-sourced foundational AI models for further research and innovation, now!”
Why is Google interested in Indic languages?
If there is one question that comes to mind after the recent Google for India conference, it is this: what is Google’s stake in building an inclusive internet for Indians?
Google, for one, is very vocal about its Next Billion Users initiative. As per Google, every week millions of people come online for the first time. “Everyone — no matter their location, language or digital literacy — deserves an internet that was made for them. The initiative conducts research and builds products for people around the world,” says its webpage.
As per Dharmaraju, there is a growing realisation across government, industry, and research of the importance of language AI for Indian languages, because digitisation of all kinds of services is happening all around us.
“Covid accelerated this shift in healthcare. UPI in payments. This digitisation has increased the sense of urgency about the need for language AI that can truly understand our diversity of languages – how we speak, and write,” said Dharmaraju.
Additionally, he said that while many people in India access the internet (over 700 million), very few, perhaps only about 50 million, are truly able to utilise it for their own progress. “And through our collective experience we have realised that the ability to interact in one’s own language can play a big role there,” said Dharmaraju.