AI4Bharat Rolls Out IndicLLMSuite for Building LLMs in Indian Languages

It covers 22 languages with 251 billion tokens and 74.8 million instruction-response pairs. 

Even though India’s contribution to Indic LLMs has skyrocketed in the last year, the lack of open-source pipelines for low- and mid-resource languages hinders their representation in LLM training datasets.

To address this, AI4Bharat has created IndicLLMSuite, a collection of resources for Indic LLMs covering 22 languages with 251 billion tokens and 74.8 million instruction-response pairs. Let’s take a look at some of the kit’s key resources. 

Sangraha

This is the pre-training dataset. It contains 251B tokens in total across 22 languages, extracted from curated URLs, existing multilingual corpora, and large-scale translations.

Setu

This is a Spark-based distributed pipeline, customized for Indian languages, for extracting content from websites, PDFs, and videos. It has built-in stages for cleaning, filtering, toxicity removal, and deduplication.
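To make the four stages concrete, here is a minimal single-machine sketch in plain Python. It is not Setu's code or API: the function names, the `BLOCKLIST` word list, and the `min_words` threshold are all hypothetical, and exact-match hashing stands in for whatever deduplication method the real Spark pipeline uses.

```python
import hashlib
import re
import unicodedata

# Hypothetical blocklist for illustration only; a real pipeline would
# rely on per-language word lists or a trained classifier.
BLOCKLIST = {"badword"}


def clean(text: str) -> str:
    """Cleaning: normalize Unicode to NFC and collapse whitespace runs."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def passes_filter(text: str, min_words: int = 3) -> bool:
    """Filtering: drop documents with too few words (assumed threshold)."""
    return len(text.split()) >= min_words


def is_toxic(text: str) -> bool:
    """Toxicity removal: flag documents containing a blocklisted word."""
    return any(word in BLOCKLIST for word in text.lower().split())


def run_pipeline(docs):
    """Apply clean -> filter -> toxicity check -> exact deduplication."""
    seen = set()
    kept = []
    for raw in docs:
        text = clean(raw)
        if not passes_filter(text) or is_toxic(text):
            continue
        # Deduplication: hash the cleaned text and skip repeats.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

Calling `run_pipeline` on a list of raw strings returns the cleaned documents that survive every stage, in their original order. A distributed version would express the same steps as transformations over a Spark dataset rather than a Python loop.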

IndicAlign-Instruct

It offers a varied set of 74.7 million prompt-response pairs in 20 languages. 

These pairs were gathered using four methods: compiling existing Instruction Fine-Tuning (IFT) datasets, translating English datasets into 14 Indian languages with an open-source translation model, generating discussions from India-centric Wikipedia articles using open-source LLMs, and setting up a crowdsourcing platform named Anudesh for prompt collection. The team has also introduced a new IFT dataset, drawn from the vocabulary resource IndoWordNet, to teach the model language and grammar.

IndicAlign – Toxic

Finally, there is IndicAlign – Toxic, which consists of 123K pairs of toxic prompts and non-toxic responses, generated using open-source English LLMs and translated into 14 Indian languages, for the safety alignment of Indic LLMs.

You can access the data and code here.

Earlier this month, Sarvam AI, along with AI4Bharat and IIT Madras, unveiled IndicVoices, a comprehensive speech dataset adhering to an inclusive diversity wishlist with fair representation of demographics, domains, languages, and applications. The IndicVoices dataset comprises 7,348 hours of natural and spontaneous speech from 16,237 speakers across 145 Indian districts and 22 languages.

Shritama Saha

Shritama (she/her) is a technology journalist at AIM who is passionate about exploring the influence of AI on different domains, including fashion, healthcare and banking.