
Meet the Creators of महामराठी 

Driven by a shared passion for Marathi and a desire to make generative AI accessible to non-English speakers, the team started working on the Marathi LLM.


According to the 2011 census, India is home to about 83 million native Marathi speakers, most of them in Maharashtra, making Marathi the third most-spoken Indian language after Hindi and Bengali. The state, a major hub for startups and large tech firms, accounts for 12.92% of India’s GDP and has a per capita income higher than the national average.

Understanding this importance, a team of US-based Indian researchers came up with MahaMarathi 7B: Aakash Patil, a postdoctoral researcher at Stanford University; Mrunmayee Shende, co-founder of CourtEasy AI; and Niraj Kumar Singh, an ML engineer at Inbound Health. Joining the league of Indic LLMs like Telugu, Malayalam, Tamil, and Odia Llama, MahaMarathi is built with seven billion parameters. It is domain-adapted, continually pre-trained, and instruction fine-tuned on top of the Llama 2 and Mistral frameworks.
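
For readers who want to experiment, below is a minimal sketch of loading a 7B checkpoint like MahaMarathi with Hugging Face’s transformers library; the repository id and prompt are assumptions for illustration, not the team’s official instructions.

```python
# Hedged sketch: load a 7B causal LM such as MahaMarathi and generate Marathi text.
# The repository id below is an assumed placeholder; substitute the actual model id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "marathi-llm/MahaMarathi-7B"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the 7B model fits on one GPU
    device_map="auto",
)

prompt = "महाराष्ट्राची राजधानी"  # "The capital of Maharashtra"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```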

The Inception 

“When GPT-4 came out, we realised the importance of doing something for Indian languages. Although it wasn’t our primary focus initially, the release of newer models like Meta’s Llama motivated us to build on top of them. The idea for MahaMarathi started in June last year, initially just as a concept,” Patil told AIM, sharing the inception story of the model. 

He said that MahaMarathi would not have been possible without the help of CourtEasy AI, the Microsoft for Startups-backed legal tech startup, which provided them with all the computing resources and data to train the model. The model was trained on NVIDIA A100 GPUs procured by the startup under the programme.

“Why Marathi? It is our mother tongue, and we have spoken it since childhood, so we thought we could contribute to the Indic LLM space,” added Shende. All of them were born and raised in Maharashtra. Patil is from Akola, Shende is from Satara, and Singh is from Nagpur. 

“Our shared passion for Marathi and the desire to bridge technological gaps motivated us to work on this project,” commented Patil, adding to what Shende said.

Challenges Galore 

The making of MahaMarathi was not a cakewalk, as the team was hindered by challenges around computing power and data availability. Even though Patil had access to powerful supercomputers because of his research, he could not use them for personal projects.

However, things started to change around June last year when he discovered that Shende and Singh shared his vision of building an indigenous Marathi LLM.

Shende’s CourtEasy AI primarily operates in the legal sector, developing AI tools for lawyers, paralegals, and law firms in India. “Initially focusing on English, we soon recognised the necessity of including Indian languages, as numerous cases in lower courts are conducted in these languages,” said the co-founder. 

So, Shende had been collecting data for quite some time, and by the end of December, she had amassed a significant corpus.

Datasets and Training Method

The researchers compiled a large corpus of about five million words in Marathi for their language dataset. This corpus, sourced over six months from publicly available content on websites, blogs, media, and news outlets, formed the basis for the initial pre-training of their model.

The team developed a new tokeniser for the Marathi language to manage this large corpus. After creating the tokeniser and expanding the vocabulary size, they pre-trained the model for next-token prediction tasks.
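
To handle Devanagari text efficiently, a common recipe is to train a SentencePiece vocabulary on the new corpus, merge it into the base model’s tokeniser, and resize the embedding layer before continued pre-training. The sketch below illustrates that recipe under assumed file paths and an assumed base checkpoint; it is not the team’s exact code.

```python
# Hedged sketch of tokeniser expansion before continued pre-training.
# Paths and the base checkpoint id are assumptions for illustration.
import sentencepiece as spm
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1. Train a Marathi SentencePiece vocabulary on the raw corpus.
spm.SentencePieceTrainer.train(
    input="marathi_corpus.txt",      # assumed path to the scraped corpus
    model_prefix="marathi_sp",
    vocab_size=16000,
    character_coverage=1.0,          # retain every Devanagari character
)

# 2. Add the new pieces to the base tokeniser's vocabulary.
sp = spm.SentencePieceProcessor(model_file="marathi_sp.model")
base_id = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
tokenizer.add_tokens([p for p in new_pieces if p not in tokenizer.get_vocab()])

# 3. Resize the embedding matrix so the expanded vocabulary can be learned
#    during next-token-prediction pre-training on the Marathi corpus.
model = AutoModelForCausalLM.from_pretrained(base_id)
model.resize_token_embeddings(len(tokenizer))
```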

Once pre-training produced satisfactory results on next-token and next-sequence prediction, the team shifted their focus to fine-tuning. For this phase, they used datasets from Stanford Alpaca and Microsoft Orca, which were translated into Marathi and then cleaned to ensure accurate and contextually appropriate translations.
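
In practice, instruction fine-tuning on such translated pairs amounts to formatting each instruction and response into a single training sequence and continuing causal-language-model training. The sketch below assumes an Alpaca-style JSON file with Marathi "instruction" and "output" fields; the file name, prompt template, and hyperparameters are illustrative, not the team’s.

```python
# Hedged sketch of instruction fine-tuning on translated Alpaca-style pairs.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_id = "meta-llama/Llama-2-7b-hf"               # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_id)

dataset = load_dataset("json", data_files="alpaca_marathi.json")["train"]  # assumed file

def to_text(example):
    # Fold instruction and response into one prompt/answer training sequence.
    return {"text": f"### सूचना:\n{example['instruction']}\n\n### उत्तर:\n{example['output']}"}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=1024)

tokenized = dataset.map(to_text).map(
    tokenize, remove_columns=dataset.column_names + ["text"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mahamarathi-sft",
                           per_device_train_batch_size=2,
                           num_train_epochs=1,
                           learning_rate=2e-5,
                           fp16=True),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```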

Satisfied with the model’s sequence prediction and context generation capability, they further refined the fine-tuning process using various datasets, including translations done with IndicTrans by AI4Bharat. IndicTrans is notable as the first open-source transformer-based multilingual NMT model supporting high-quality translations across all 22 scheduled Indic languages.

When asked about the primary databases used, Patil explained that they stored the tokenised data on a hard disk, while MongoDB Atlas was used to store the fine-tuning pairs, which amounted to approximately 60,000. Since vectorised databases are unsuitable for storing large corpora, the team primarily relied on Amazon S3 due to its extensive storage capacity.
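
The division of labour between the two stores is straightforward: structured instruction/response pairs go into a document database, while bulky corpus files sit in object storage. Below is a hedged sketch with placeholder connection strings, bucket and collection names.

```python
# Hedged sketch of the storage split described above: instruction pairs in
# MongoDB Atlas, raw corpus shards in Amazon S3. All names and credentials
# below are placeholders.
import boto3
from pymongo import MongoClient

# Instruction/response pairs (roughly 60,000 of them) in MongoDB Atlas.
client = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
pairs = client["mahamarathi"]["finetune_pairs"]
pairs.insert_one({
    "instruction": "महाराष्ट्राची राजधानी कोणती?",   # "What is the capital of Maharashtra?"
    "output": "महाराष्ट्राची राजधानी मुंबई आहे.",     # "The capital of Maharashtra is Mumbai."
})

# Raw pre-training corpus shards in S3, which scales better than a vector
# database for bulk text storage.
s3 = boto3.client("s3")
s3.upload_file("marathi_corpus_shard_000.txt",   # local file
               "mahamarathi-corpus",             # assumed bucket name
               "pretrain/shard_000.txt")         # object key
```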

The researchers chose not to use alternatives like the BLOOM family of open-source models or other multilingual models. Instead, they opted for a combination of the Mistral framework and the Llama 2 architecture. This approach involved enhancing the Llama architecture by initialising its transformer layers with Mistral’s weights and modifying the MergeKit library to improve efficiency and multilingual support.
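
Because Llama 2 and Mistral 7B share parameter names and, with matching configuration values, tensor shapes, Mistral’s weights can be loaded directly into a Llama-architecture model. The sketch below shows that idea in plain transformers code; it illustrates the general technique only, not the team’s actual MergeKit-based pipeline.

```python
# Hedged sketch: initialise a Llama-architecture model with Mistral-7B weights.
# Parameter names line up; only Mistral's sliding-window attention is dropped.
from transformers import AutoModelForCausalLM, LlamaConfig, LlamaForCausalLM

mistral = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# A Llama config that mirrors Mistral-7B's dimensions.
cfg = LlamaConfig(
    vocab_size=32000,
    hidden_size=4096,
    intermediate_size=14336,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,            # grouped-query attention, as in Mistral
    max_position_embeddings=32768,
)
llama = LlamaForCausalLM(cfg)

# Copy Mistral's weights into the Llama-architecture model.
missing, unexpected = llama.load_state_dict(mistral.state_dict(), strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```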

What’s Next? 

In the three weeks since its release, feedback for MahaMarathi has been overwhelmingly positive, especially from the Marathi community in the Bay Area and from large enterprises, Intel among them, Shende explained.

Depending on the feedback for the base model, the trio will also release supervised fine-tuned (SFT) and direct preference optimisation (DPO) versions. However, the team has also noticed a need to educate the public on using pre-trained models and fine-tuning them. They plan to release collaborative notebooks to assist smaller businesses and medium-scale enterprises in integrating AI into their operations.

Shritama Saha

Shritama (she/her) is a technology journalist at AIM who is passionate about exploring the influence of AI on different domains, including fashion, healthcare and banking.