IIT Madras Develops AI Models to Process Text In 11 Indian Regional Languages

Faculty at the Indian Institute of Technology Madras (IIT Madras) have developed artificial intelligence models and datasets to process text in 11 Indian regional languages. The project was undertaken jointly with AI4Bharat, a platform for building AI solutions to problems of relevance to India.

The researchers from IIT Madras and AI4Bharat released AI models and datasets for the following languages: Tamil, Hindi, Malayalam, Telugu, Kannada, Punjabi, Bengali, Odia, Assamese, Gujarati, and Marathi. The multilingual AI models and datasets developed through this initiative will provide essential building blocks for students, faculty, start-ups and industry to work on Indian language tools and push the frontiers of technology.

The faculty have made these cutting-edge resources open-source and completely free of cost, so anyone can access them. The models can be downloaded from a GitHub repository. An accompanying research paper describing the methodology and evaluation has been accepted at EMNLP-Findings.

Elaborating on this initiative, Dr Mitesh M. Khapra, Assistant Professor, Department of Computer Science and Engineering, IIT Madras, said, “We have a vibrant diversity of languages in our country. As we move towards a digital economy, our languages must find a space online. This requires a lot of innovation in creating input tools, datasets, and AI models for Indian languages.”

For example, imagine a learner who posts a question on an e-learning platform in Tamil, Hindi or any of the numerous other Indian regional languages. There is a need for tools that can automatically process such questions written in Indian languages and classify them into specific topics.

“While such tools are available for English and other foreign languages, there are hardly any tools for Indian languages, and this is the critical gap that we are trying to address through this initiative. These models are available free of cost as we want the entire country to benefit from them,” added Dr Mitesh Khapra.
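
To make the e-learning scenario above concrete, here is a minimal sketch of topic classification for questions written in Hindi. It is not the IIT Madras and AI4Bharat models: it assumes a simple naive Bayes classifier with add-one smoothing, whitespace tokenisation, and a tiny, made-up training set, purely to illustrate what "classifying a question into a topic" means. All names in it (`TRAINING_DATA`, `train`, `classify`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set (an assumption for illustration only):
# Hindi questions labelled with a topic.
TRAINING_DATA = [
    ("त्रिभुज का क्षेत्रफल क्या है", "math"),
    ("भिन्न को दशमलव में कैसे बदलें", "math"),
    ("पौधे भोजन कैसे बनाते हैं", "science"),
    ("पानी का क्वथनांक क्या है", "science"),
]


def tokenize(text):
    """Split on whitespace; a real system would use a proper tokenizer."""
    return text.split()


def train(examples):
    """Count words per topic, documents per topic, and the overall vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, topic in examples:
        tokens = tokenize(text)
        word_counts[topic].update(tokens)
        doc_counts[topic] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the topic with the highest naive Bayes log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, float("-inf")
    for topic in doc_counts:
        total_words = sum(word_counts[topic].values())
        # Log prior: share of training questions that belong to this topic.
        score = math.log(doc_counts[topic] / total_docs)
        for token in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[topic][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


model = train(TRAINING_DATA)
print(classify("त्रिभुज का परिमाप क्या है", model))
print(classify("पौधे पानी कैसे पीते हैं", model))
```

A word-count model like this only works when a new question shares words with the training examples. Pre-trained language models of the kind described in this article are meant to reduce that dependence by learning word and sentence meanings from much larger amounts of text.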

AI4Bharat is an initiative co-founded by Dr Mitesh M. Khapra and Dr Pratyush Kumar of IIT Madras that works to solve India-specific problems in a community-driven, open-source manner. Both Dr Mitesh Khapra and Dr Pratyush Kumar are also associated with the Robert Bosch Centre for Data Science and Artificial Intelligence.

Speaking about the technology behind this initiative, Dr Anoop Kunchukuttan, a volunteer at AI4Bharat and the lead researcher on this project, said, “We have an urgent responsibility to take the rapid advances of AI and make them accessible to the common man. One way of achieving this is to improve the interactions between humans and machines. That is where the field of natural language processing comes in. NLP is a branch of AI that deals with the interaction between computers and humans using natural language.”

Adding on, Dr Pratyush Kumar, Assistant Professor, Department of Computer Science and Engineering, IIT Madras, said, “This initiative is one of the few attempts in academia to develop and publicly release such large scale multilingual AI models containing millions of parameters trained on billions of tokens from 11 Indian languages, completely free and open-source.”

For the past year, a team of researchers comprising students, faculty and volunteers from IIT Madras and AI4Bharat worked on collecting data and training powerful models for processing text written in Indian languages. The models take advantage of the similarities between Indian languages to make efficient use of data. With these models, the researchers have been able to advance the state of the art in Indian language processing on several tasks, such as document classification, sentiment analysis, semantic matching and paraphrase detection.

Highlighting the work done on natural language processing, Dr Kumar said, “Modern NLP systems are driven by deep learning. A fundamental piece of these systems is language models, which capture meanings of words and sentences and their relations and require a large amount of data to train. The unavailability of such data has prevented the development of such models for Indian languages. As a result, Indian NLP has not been able to progress at the rate at which it should.”

Dr Anoop Kunchukuttan added, “We really hope that start-ups and social initiatives working on Indian language technologies will be able to take our pre-trained models and adapt them to specific use cases by collecting smaller amounts of in-domain data.”

The research team hopes that this initiative will serve as a ‘call to action’ for academia, government and industry to come together and develop bigger and more diverse datasets for Indian languages. Data drives AI technology, and it is time to make a serious investment in building datasets for Indian languages.

Natural language processing is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyse large amounts of natural language data.

Sejuti Das

Sejuti currently works as Associate Editor at Analytics India Magazine (AIM). Reach out at sejuti.das@analyticsindiamag.com