In 1950, Jawaharlal Nehru, India’s first prime minister, initiated the National Sample Survey to gather granular data on India’s economy. In 1953, the Hindustan Times dubbed it “the biggest and most comprehensive sampling inquiry ever undertaken in any country in the world”.
India also has a long history of conducting successful population censuses. Its decennial census is often regarded as one of the biggest data collection exercises in the world, involving trained enumerators who visit every household in the country to collect data on various socio-economic parameters.
In today’s era of AI, India draws inspiration from these monumental endeavours as it gathers data to train AI models.
Collecting data to train AI models
Indian IT giant Tech Mahindra, as part of Project Indus, has developed a Hindi LLM with 539 million parameters, trained on 10 billion tokens of Hindi and its dialects.
The model can take instructions in 37 dialects of Hindi, such as Dogri (Jammu & Kashmir); Kinnauri, Kangri, Chambeli and Garhwali (Himachal); Kumaoni and Jaunsari (Uttar Pradesh); and Bhojpuri, Maithili and Magahi (Bihar), among others.
For Tech Mahindra too, the biggest challenge was data. “Despite various efforts, in India, datasets for languages other than Hindi are scarce and incomplete. Additionally, even Hindi data is fragmented,” Nikhil Malhotra, global head of Makers Lab at Tech Mahindra, told AIM.
So Malhotra, too, sent experts to different regions of India, especially the northern belt, where Hindi and its dialects are predominantly spoken.
“Our team went to Madhya Pradesh, Rajasthan, and some remote areas in Bihar, and their job was also to collect data by interacting with professors and speakers of these languages,” Malhotra explained.
Likewise, in Telangana, the Swecha open-source software movement played a key role in building the first Telugu small language model (SLM), ‘AI Chandamama Kathalu’, from scratch.
To collect data for the model, Swecha held datathons at different educational institutions in Telangana.
“Volunteers at Swecha collaborated with nearly 25-30 colleges, and over 10,000 students were involved in translating, correcting, and digitalising 40,000-45,000 pages of Telugu folk tales.
“Me and my R&D team and Ozonetel supported them with the graphics processing units (GPUs) to train the model,” Chaitanya Chokkareddy, co-founder and chief technology officer at Ozonetel Communications, told AIM.
Recently, Swecha also created a Telugu ASR (automatic speech recognition) dataset by sending volunteers to different parts of Telangana and Andhra Pradesh to speak with native speakers and collect voice samples.
The volunteers visited remote villages and schools and even collected data while on the road. In all, Swecha gathered 1.5 million voice samples, which were used to train a Telugu ASR model.
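Field-collected voice samples like these are typically paired with transcripts and metadata in a manifest file before ASR training. The sketch below illustrates that idea in Python; the file paths, field names, and JSONL layout are illustrative assumptions, not details of Swecha's actual pipeline.

```python
import io
import json

# Hypothetical examples of field-collected clips. Each entry pairs an
# audio file with its duration, transcript placeholder, and the district
# where it was recorded. None of these paths or fields come from Swecha.
samples = [
    {"audio": "clips/warangal_0001.wav", "duration_s": 4.2,
     "text": "...", "district": "Warangal"},
    {"audio": "clips/guntur_0002.wav", "duration_s": 6.8,
     "text": "...", "district": "Guntur"},
]

def write_manifest(samples, fp):
    """Write one JSON object per line (the common 'JSONL' manifest format)."""
    for s in samples:
        fp.write(json.dumps(s, ensure_ascii=False) + "\n")

buf = io.StringIO()
write_manifest(samples, buf)

# Totalling durations is how projects report corpus size in hours.
total_hours = sum(s["duration_s"] for s in samples) / 3600
print(f"{len(samples)} samples, {total_hours:.6f} hours")
```

A manifest like this lets a training job stream audio-transcript pairs without loading everything into memory, which matters at the scale of millions of clips.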
Similarly, under the Google-supported Project Vaani, the Indian Institute of Science (IISc) is gathering 150,000 hours of speech data spanning 773 districts across India.
To accomplish this, project workers travel to remote areas, show images to local residents, ask them to describe what they see, and record their responses.
Creating employment opportunities
In India, some entrepreneurs are also turning these data collection exercises into business opportunities that create rural employment. For example, Bengaluru-based Karya pays Indian citizens in rural and marginalised areas for data labelling and annotation.
“Our goal is to reach 100,000 rural Indians by the end of this fiscal year, 1.5 million rural Indians by next fiscal year and 100 million rural Indians by 2030,” Manu Chopra, co-founder at Karya, previously told AIM.
Likewise, NextWealth has set up a network of ten centres across India and has assembled a workforce of nearly 5,000 individuals.
Through these centres, the company delivers end-to-end support for AI/GenAI pipelines, including dataset labelling and annotation, human evaluation, and output testing for complex applications.
India needs good Indic datasets
Although open-source datasets exist for some popular Indian languages like Hindi, and initiatives are ongoing to enhance these datasets, many languages still lack adequate datasets. This poses a significant challenge in developing LLMs for these languages.
Popular models like the Llama series by Meta or the GPT series by OpenAI are predominantly trained on large English datasets scraped from the web.
Even though we have seen Indic LLMs like Tamil Llama and Telugu Llama, they, too, are predominantly trained on open-source datasets available on the web.
However, there is a need to gather more data and build even better datasets. “The current volume of this type of data is relatively small; we need to collect even more,” Vivek Raghavan, co-founder of Sarvam AI, told AIM.
While efforts are already underway, for this to happen on a larger scale, an ecosystem must develop in which different stakeholders, including researchers, startups and corporate houses, come together and work towards a common goal.