[Exclusive] Tech Mahindra’s Nikhil Malhotra on Making Foundational Models for India

Tech Mahindra’s Project Indus aims to build LLMs from scratch for Hindi and its dialects, addressing the underrepresentation of Indic languages in existing AI models.

In a world of Kannada, Tamil, Telugu and Odia Llamas, and SarvamAI’s bilingual LLM OpenHathi supporting both Hindi and English, Tech Mahindra’s Project Indus is gearing up to build LLMs from the ground up. These will cater specifically to Hindi speakers across dialects such as Bhojpuri, Kangdi, Angika, Nagpuri, Khortha, Kudmali, Panch-Parganiya and Hadauti.

“Hindi is spoken by 615 million speakers, much more than anybody who considers English as their first language as native speakers as well,” said Nikhil Malhotra, Global Head at Makers Lab, Tech Mahindra and the brain behind Project Indus, sharing with AIM the details of the project, data challenges, tech stack, and the roadmap ahead in an exclusive interaction. 

Malhotra said that most of the models in India are built on top of Llama or use existing foundational model APIs. Those that seem to be building from scratch are yet to reveal what they are really built on—underlining that “it’s INDUS and BharatGPT, which are actually built from the ground up”.

Project Indus started in June last year to build India’s very own foundational models from scratch; there has been no looking back ever since. “When I came back from the US, the whole idea revolved around how to start the language revolution?” said Malhotra, sharing his interaction with CP Gurnani, who took Sam Altman’s words a little too seriously. 

Coincidentally, around that time, Malhotra was also working on a language model called BHAML, or Bharat Mark-up Language, an AI system that let kids code in a language of their choice. He was then roped in to build Project Indus.

“India did not have a foundational model at that point in time. So, we had different models. We had translational models like Bhashini, but there was no cut foundational modelling in the system,” shared Malhotra, speaking at length about the data challenges around low-resource Indic languages, particularly in Hindi (which has more than 49 dialects). 

He stressed the underrepresentation of Indic languages in existing AI systems and models, including GPT-4, Llama and others. “Most of these dialects are on the endangered list. And now, 80% of the Indians that do not speak English can also communicate,” said Malhotra.

He emphasised the significance of linguistic diversity, mentioning less common dialects like Angika, Nagpuri, Khortha, Kudmali, Panch-Parganiya, and Hadauti. “The aim is to serve languages spoken by 100,000 to 200,000 people, ensuring inclusivity across various linguistic groups.”

He further reiterated, “The vision was to become a platform that can serve every Indian and every business… It could be a housewife, a patient, a farmer, or even machines.” This inclusivity defines the versatile nature of Project Indus.

Data Collection and Language Diversity

The foundation of Indus is a dataset of 10 billion tokens, sourced predominantly from various parts of North India, as well as from Bhashini and Bhasha-dan, among others.

“I sent teams to the country’s northern belt. We went to Madhya Pradesh, Rajasthan, and some parts of Bihar.” The teams’ task was to collect Hindi and dialect data by interacting with professors and leveraging the Bhasha-dan portal available on ProjectIndus.in.

He also highlighted an initiative in which Tech Mahindra employees contribute everyday sentences, such as ‘Main ghar se bahar jata hoon’ (‘I go out of the house’), to the portal to gather diverse linguistic prompts.

Novel as these methods were, collecting a large, diverse, high-quality dataset this way would have taken years, so the team turned to the Falcon structure, incorporating data from the Pile, a free open-source dataset. They also translated some datasets into Hindi, which contributed significantly to the overall data volume.

Malhotra mentioned the development of toolkits to manage bias in the data, including a tool that identifies nine different types of bias in continuous text. Annotators labelled about 70,000 to 80,000 sentences to create a bias analogy or classification algorithm, which allows biases in the dataset to be identified and handled.

Model Architecture and Innovation

Project Indus uses transformers based on a decoder architecture. However, Malhotra noted that, “These are transformers, but they have tokenisers, which would be only in Hindi.”
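To give a rough sense of what a Hindi-only tokeniser could look like, here is a minimal sketch in Python. It is not Project Indus’s tokeniser: the interview does not say which algorithm the team uses, so byte-pair encoding (BPE) is assumed purely for illustration, as are the toy corpus, the function names and the rule of keeping only words written entirely in the Unicode Devanagari block.

```python
from collections import Counter

# Unicode Devanagari block (U+0900 to U+097F); using it as the filter is an assumption.
DEVANAGARI = range(0x0900, 0x0980)


def is_devanagari_word(word):
    """Return True if the word is non-empty and entirely Devanagari."""
    return bool(word) and all(ord(ch) in DEVANAGARI for ch in word)


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merges from the Devanagari-only words of a corpus."""
    # Every word starts out as a tuple of single characters.
    words = Counter(
        tuple(word)
        for line in corpus
        for word in line.split()
        if is_devanagari_word(word)
    )
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def tokenize(word, merges):
    """Split a word into subword tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Illustrative corpus (hypothetical); the English line is filtered out before training.
corpus = ["मैं घर से बाहर जाता हूँ", "मैं घर जाता हूँ", "I go home"]
merges = train_bpe(corpus, num_merges=10)
print(tokenize("जाता", merges))
```

Because non-Devanagari words are dropped before training, every learned merge, and therefore every multi-character token, is Hindi-script text. A production tokeniser would also need to handle digits, punctuation and mixed-script input, which this sketch ignores.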

Malhotra highlighted the significance of Hindi-only tokenisation, citing approximately “10 billion tokens and 539 million parameters”. The decoder stack is pre-trained and then integrated with Hugging Face’s Transformers library for open-source use.

The LLMOps system involves data collection, pretraining, fine-tuning, understanding model behaviour, monitoring, and production deployment. Malhotra mentioned using open-source tools like CommEt and Switch for LLMOps management.

Additionally, to address the challenge of computational efficiency and economies of scale, he said the team will use quantum-inspired tensor networks to optimise the performance of the model, which the team is expected to release in April. “We needed multi-parallel GPUs in terms of what we were doing. So we actually used C-DAC’s GPUs, about 48 large 40 GB GPUs, and trained them for at least four to five days,” said Malhotra.
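For a sense of scale, the figures quoted in this section can be combined in a quick back-of-the-envelope calculation. The sketch below uses only the numbers Malhotra cited (10 billion tokens, 539 million parameters, 48 GPUs with 40 GB each, four to five days); it assumes all 48 GPUs ran around the clock for the full period, which the interview does not confirm, and since he said “at least”, the GPU-hour figures are best read as lower bounds.

```python
# Figures quoted in the interview.
TOKENS = 10_000_000_000       # "10 billion tokens"
PARAMETERS = 539_000_000      # "539 million parameters"
NUM_GPUS = 48                 # "about 48 large 40 GB GPUs"
GPU_MEMORY_GB = 40
TRAINING_DAYS = (4, 5)        # "at least four to five days"

# Training tokens available per model parameter.
tokens_per_parameter = TOKENS / PARAMETERS

# GPU-hours, assuming every GPU was busy 24 hours a day (an assumption).
gpu_hours = tuple(NUM_GPUS * days * 24 for days in TRAINING_DAYS)

# Aggregate GPU memory across the cluster.
total_gpu_memory_gb = NUM_GPUS * GPU_MEMORY_GB

print(f"Tokens per parameter: {tokens_per_parameter:.1f}")
print(f"GPU-hours: {gpu_hours[0]:,} to {gpu_hours[1]:,}")
print(f"Aggregate GPU memory: {total_gpu_memory_gb:,} GB")
```

On these assumptions, the run works out to roughly 18.6 tokens per parameter and between 4,608 and 5,760 GPU-hours on a cluster with 1,920 GB of aggregate GPU memory.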

What’s Next? 

Project Indus is gearing up for diverse applications, including rural finance, agri-tech and media and entertainment. It aims to empower rural communities with a language model that understands their dialects, reducing the reliance on call centres. 

The team said the model will be offered in open-source and enterprise formats, catering to innovators and enterprise applications. One proposal involves making tractors conversational to address farmers’ challenges. Malhotra said that after the Hindi model, the 14-member team will shift its focus to Bangla, which is spoken by up to 400 million people.

Furthermore, he emphasised India’s potential to lead in AI research and development. With abundant data, talent, and unique use cases, he urged the Indian ecosystem to focus on defining new algorithms for AI. The call is to leapfrog in terms of innovation and sustainability, ensuring that AI serves the diverse needs of the Indian populace.

“We’ve caught on to the trend of LLMs, but we are still not leading the trend. I think it’s time for Indian researchers and the ecosystem as a whole to lead the trend,” concluded Malhotra, outlining the key factors that position India uniquely for AI advancements: an abundant data source and a large talent pool. 

Shyam Nandan Upadhyay

Shyam is a tech journalist with expertise in policy and politics, and exhibits a fervent interest in scrutinising the convergence of AI and analytics in society. In his leisure time, he indulges in anime binges and mountain hikes.