Besides the 22 major languages recognised in the Indian Constitution, 19,569 dialects are spoken as mother tongues. According to UNESCO, around 192 of these languages are classified as vulnerable or endangered. Now, Microsoft—through Project ELLORA—wants to leverage the power of AI and help preserve these languages, which have limited written resources, let alone any digital presence.
“The project is about enabling language communities with technology. We want to put out a whole series of tools and pipelines so that communities can build technologies for themselves, to a certain extent at least,” Kalika Bali, Principal Researcher at Microsoft Research India, told AIM.
An open-source framework
Project ELLORA aims to prevent these languages from lagging behind in the current advancements in language technology facilitated by artificial intelligence (AI) and sophisticated natural language models.
“Our entire purpose in this project is to enable the community to build the technology. So, these are the communities that kind of came to us and sought our help regarding the same,” Bali said.
Microsoft is working primarily on three Indic languages: Gondi, which is spoken across Andhra Pradesh, Telangana, Madhya Pradesh, Maharashtra, and Chhattisgarh; Mundari, an Austro-Asiatic language spoken in Jharkhand, Odisha and West Bengal; and Idu Mishmi, which is spoken in Arunachal Pradesh.
“We are going to put out the framework and open source it so that the communities themselves are able to create these technologies,” Bali shared.
While Bali and her team have extensively worked on these three languages so far, there are plans to expand to other languages as well.
“For example, the dictionary framework that we have come up with for the digital dictionary for Idu Mishmi has intrigued other language communities in Arunachal Pradesh. Further, there are communities in Bengal which are interested in creating those kinds of dictionaries.”
“With the work that we are doing with Mundari, we hope to put out all the models that we built to show the communities how they can use these tools to their requirements,” Bali said.
Fulfilling community requirements with technology
To aid the communities, Microsoft has developed the Interactive Neural Machine Translation (INMT) tool, built on the existing open-source machine translation framework OpenNMT.
The INMT tool gives human translators real-time suggestions and recommendations, which speeds up the end-to-end translation process and improves the quality of the resulting translations.
To address low or non-existent connectivity and enhance accessibility for mobile-only users, Microsoft has created INMT-Lite, a mobile-based offline version of INMT.
For the Gondi language, Microsoft is partnering with CGNet Swara, a citizen journalism portal that collaborates with the Gondi-speaking tribal population in central India. The aim of the collaboration is to create a Hindi–Gondi translation system to provide Hindi content to the Gondi-speaking community.
Similarly, for Idu Mishmi, the community had a very specific demand, Bali said. “According to the Arunachal Pradesh government, Idu Mishmi can now be taught in primary school, but there is no content to teach from. There are no supporting resources for the children to learn Idu Mishmi in schools.”
For Mundari, too, the requirements were similar. The community wanted to create datasets which could be used to educate children as there are very few resources available.
A project like ELLORA has great potential because, today, most of the content available online and elsewhere is in English. But in India, only 10% of the population can understand the language. Among Indic languages, most of the content is available in Hindi but not in the numerous other languages spoken in India.
Initiatives such as AI4Bharat and Syspin are also building datasets of Indic languages; however, their focus is on the 22 major languages recognised by the Constitution. With Project ELLORA, by contrast, Microsoft shifts the focus towards language communities that are not covered by AI4Bharat and similar initiatives.
Besides helping access the content on the internet in Indic languages, Project ELLORA could also be hugely beneficial in delivering government services and schemes.
Recently, it was reported that the Ministry of Electronics and Information Technology (MeitY) is building a chatbot using the GPT-3.5 architecture, the same language model series that powers ChatGPT, with WhatsApp to deliver key government schemes.
The chatbot is being developed to help Indian rural farmers access information in Indic languages. Initially, the chatbot is set to be available in Hindi, English, Tamil, Telugu, Bengali, Marathi, Kannada, Assamese, and Odia.
However, more languages are expected to be added later, and this is where a project like ELLORA could come in. MeitY could use the datasets already built for these three languages to train its chatbot and deliver government schemes and services in those languages.