India Draws Inspiration from Census To Collect Data for AI

Illustration by Nikhil Kumar

In 1950, Jawaharlal Nehru, India’s first prime minister, initiated the National Sample Survey to gather granular data on India’s economy. In 1953, the Hindustan Times dubbed it “the biggest and most comprehensive sampling inquiry ever undertaken in any country in the world”.

Over centuries, India has demonstrated its expertise in conducting successful population censuses. The decennial census of India, too, is often regarded as one of the biggest data collection exercises in the world. 

India’s census efforts involved sending trained enumerators to every household in India and collecting data based on various socio-economic parameters. 

In today’s era of AI, India draws inspiration from these monumental endeavours as it gathers data to train AI models.

Collecting data to train AI models 

Indian IT giant Tech Mahindra, as part of Project Indus, has developed a Hindi LLM with 539 million parameters, trained on 10 billion tokens of Hindi and its dialects.

The model can take instructions in 37 different dialects of Hindi, such as Dongri (Jammu & Kashmir), Kinnauri, Kangri, Chambeli, Garhwali (Himachal), Kumaoni, Jaunsari (Uttar Pradesh), Bhojpuri, Maithili, and Magahi (Bihar), among others.

For Tech Mahindra too, the biggest challenge was data. “Despite various efforts, in India, datasets for languages other than Hindi are scarce and incomplete. Additionally, even Hindi data is fragmented,” Nikhil Malhotra, global head-Makers Lab, Tech Mahindra, told AIM.

Malhotra's team therefore sent experts to different parts of India, especially the northern belt, where Hindi and its dialects are predominantly spoken.

“Our team went to Madhya Pradesh, Rajasthan, and some remote areas in Bihar, and their job was also to collect data by interacting with professors and speakers of these languages,” Malhotra explained.

Likewise, in Telangana, the Swecha open-source software movement played a key role in building the first Telugu small language model (SLM), named ‘AI Chandamama Kathalu’, from scratch.

To collect data for the model, Swecha held datathons at different educational institutions in Telangana. 

“Volunteers at Swecha collaborated with nearly 25-30 colleges, and over 10,000 students were involved in translating, correcting, and digitalising 40,000-45,000 pages of Telugu folk tales. 

“Me and my R&D team and Ozonetel supported them with the graphics processing units (GPUs) to train the model,” Chaitanya Chokkareddy, co-founder and chief technology officer at Ozonetel Communications, told AIM. 

Recently, Swecha also created a Telugu ASR dataset by sending volunteers to different parts of Telangana and Andhra Pradesh to speak to native speakers and collect voice samples. 

The volunteers visited remote villages and schools and even collected data while on the road. Swecha gathered 1.5 million voice samples, which were used to train a Telugu ASR model.

Similarly, under the Google-supported Project Vaani, the Indian Institute of Science (IISc) is gathering 150,000 hours of speech data spanning 773 districts across India. 

To accomplish this, project workers travel to remote areas, show images to local residents, ask them to describe what they see, and record their responses.

Creating employment opportunities 

In India, some entrepreneurs also turn these data collection exercises into business opportunities and create rural employment. For example, Bengaluru-based Karya pays Indian citizens in rural and marginalised areas for data labelling and annotation.

“Our goal is to reach 100,000 rural Indians by the end of this fiscal year, 1.5 million rural Indians by next fiscal year and 100 million rural Indians by 2030,” Manu Chopra, co-founder at Karya, previously told AIM.

Likewise, NextWealth has set up a network of ten centres across India and has assembled a workforce of nearly 5,000 individuals.

Through these centres, the company supports AI/GenAI pipelines end to end, including desk-based human evaluation for some complex applications, labelling and annotation of datasets, and testing of outputs.

India needs good Indic datasets

Although open-source datasets exist for some popular Indian languages like Hindi, and initiatives are ongoing to enhance these datasets, many languages still lack adequate datasets. This poses a significant challenge in developing LLMs for these languages.

Popular models like the Llama series by Meta or the GPT series by OpenAI are predominantly trained on large English datasets scraped from the web.

Even Indic LLMs such as Tamil Llama and Telugu Llama are predominantly trained on open-source datasets available on the web.

However, there is a need to gather more data and build even better datasets. “The current volume of this type of data is relatively small; we need to collect even more,” Vivek Raghavan, co-founder of Sarvam AI, told AIM. 

While efforts are already underway, doing this on a larger scale requires an ecosystem in which different stakeholders, including researchers, startups and corporate houses, come together and work towards a common goal.

Pritam Bordoloi

I have a keen interest in creative writing and artificial intelligence. As a journalist, I deep dive into the world of technology and analyse how it’s restructuring business models and reshaping society.