Meet the Creator of ଓଡ଼ିଆ Llama 

Odia Generative AI recently created an AI tutor named Acharya for the CBSE Board.


Illustration by Nikhil Kumar

When ChatGPT was introduced a year ago, Shantipriya Parida, the creator of Odia Llama, was disappointed that it did not understand cultural context related to Odisha.

Cut to the present: he has built Odia Llama (a Llama 2 model fine-tuned for the Odia language) and started an open-source project called Odia Generative AI. Odia is spoken by over 35 million people and has a rich literary tradition and a distinct identity.

“I attempted to ask ChatGPT about local contexts. For example, I asked, ‘Can you tell me the recipe for rasgulla?’ It couldn’t provide an answer. It missed all the local context, and the answers weren’t even correct,” said Parida.

Parida is currently working in Finland as a senior AI scientist at Silo.ai, which recently released its own LLM named Poro. “If Europeans can build an LLM that can provide better answers than OpenAI’s ChatGPT in their local language, I thought why can’t we do it for our own Indic languages,” said Parida. “That’s the reason we are working hard, and it feels good when people appreciate our effort,” he added.

The Journey of Odia Llama 

To build Odia Llama, Parida formulated a three-step plan. “We’ll begin with a fine-tuned model, then move on to a pre-trained model, and finally, proceed to app deployment. Once we have the fine-tuned and pre-trained models ready, app development will be relatively straightforward,” he explained.

Odia Llama, currently available on Hugging Face, is the fine-tuned version. Parida noted that his team of researchers is now working on the pre-trained version.

“All the fine-tuned models have some limitations. In the foundation model, if you have only 0.5% or 2% of data in your local language, then, after a certain point, no matter how much you fine-tune it, it will get stuck,” he said.

To train the pre-trained model, Parida’s team is collecting data. “We are collecting a lot of tokens. I think we have already collected around 30 million tokens, and we are targeting at least 40 to 50 million tokens. This way, we can expedite the release of our first pre-trained model,” said Parida.

The data is sourced from various online platforms, including blogs, Wikipedia, Odia newspapers, local textbooks, literature, magazines, and government websites. Parida said that they have also developed in-house tools, named Olive Scraper and Olive Farm. Olive Scraper is a web scraping tool for extracting Odia content from various sources (e.g., websites, PDF, DOC, etc.), while Olive Farm generates LLM instruction sets in Indic languages.

Olive Farm currently supports Hindi and Odia, and is designed to scale to additional languages.
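One small part of collecting Odia text from mixed-language sources is deciding which lines are actually written in Odia. The following is a minimal sketch of that idea, not Olive Scraper's actual implementation: it keeps lines in which most characters fall in the Unicode block for Odia script (U+0B00 to U+0B7F). The function names and the 0.5 threshold are assumptions chosen for illustration.

```python
import re

# Odia script occupies the Unicode block U+0B00..U+0B7F.
ODIA_CHAR = re.compile(r"[\u0B00-\u0B7F]")


def odia_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Odia block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    odia = sum(1 for c in chars if ODIA_CHAR.match(c))
    return odia / len(chars)


def keep_odia_lines(lines, threshold=0.5):
    """Keep lines whose Odia ratio meets the threshold (hypothetical default)."""
    return [line for line in lines if odia_ratio(line) >= threshold]


# Illustrative input: one Odia line and one English line.
sample = ["ଓଡ଼ିଆ ଭାଷା", "Hello world"]
kept = keep_odia_lines(sample)
print(kept)
```

A character-ratio filter like this is cheap and needs no language model, but it cannot tell good Odia prose from noise in the right script, so a real pipeline would likely add deduplication and quality checks on top.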

For compute and infrastructure, Parida said the project received support from E2E Networks. For fine-tuning, he added, the team has ample resources, since it consists of independent researchers who have access to GPUs for their own research.

AI Tutor 

Odia Generative AI recently created an AI tutor named Acharya. The tutor facilitates self-learning in Hindi for students, offering real-time doubt resolution. Acharya was developed using a fine-tuned LLM (Mistral-7B Hindi) and retrieval-augmented generation (RAG).

“For example, if you’re a tutor and want to create a comprehensive lesson plan on a specific topic, this can assist with that. Similarly, if you want to evaluate, for instance, create a set of questions and assess them, it can be helpful. So, it has multiple use cases,” said Parida.
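To make the RAG idea concrete, here is a minimal sketch of the retrieval step: find the passage most similar to the student's question and place it in the prompt sent to the LLM. This is not Acharya's code. The bag-of-words cosine similarity stands in for the embedding search a real system would more likely use, the tokenizer is an English-only toy, and the passages and prompt wording are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # English-only toy tokenizer; a real system would use embeddings instead.
    return re.findall(r"\w+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=1):
    """Return the k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages, k=1):
    """Assemble a prompt that grounds the answer in retrieved text."""
    context = "\n".join(retrieve(question, passages, k))
    return (
        f"Answer using only this context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up textbook snippets, for illustration only.
passages = [
    "Photosynthesis is the process by which green plants make food using sunlight.",
    "Friction is a force that opposes motion between two surfaces in contact.",
]
prompt = build_prompt("What is friction?", passages)
print(prompt)
```

The resulting prompt would then be passed to the fine-tuned model, so that its answer draws on the retrieved textbook passage rather than only on what the model memorised during training.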

Acharya operates as a client-server web application, with the client built in JavaScript and the server in Python using FastAPI. Initial assessment scores are 0.72 for BERT (F1) and 0.72 for RAGAS Answer Relevancy.

The current demonstration version caters to Class 8 subjects of the CBSE board. Acharya will soon support more languages, cover a wider range of subjects, and be freely accessible.

What’s Next? 

Given that there are now several Indic LLMs out in the market, like OpenHathi, Airavat, Krutrim and BharatGPT, Parida wants to create an Indic LLM benchmark next. 

“We are planning to build an LLM benchmark. You go, choose your model, and it will automatically tell you your model’s accuracy per task. It can be a fair comparison for anybody who wants to pick a model for research or any other purpose,” said Parida.
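The core of such a benchmark, reporting accuracy per task, can be sketched in a few lines. This is only an illustration of the idea Parida describes, not the planned benchmark: the task names, the exact-match scoring rule, and the sample records are all assumptions made up for the example.

```python
from collections import defaultdict


def accuracy_per_task(records):
    """Compute exact-match accuracy for each task.

    records: iterable of (task, prediction, reference) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, prediction, reference in records:
        total[task] += 1
        if prediction.strip() == reference.strip():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Made-up model outputs and references, for illustration only.
records = [
    ("translation", "नमस्ते", "नमस्ते"),
    ("translation", "धन्यवाद", "शुक्रिया"),
    ("qa", "Bhubaneswar", "Bhubaneswar"),
]
scores = accuracy_per_task(records)
print(scores)  # {'translation': 0.5, 'qa': 1.0}
```

Exact match is the simplest possible scoring rule; generation tasks such as translation would normally need a softer metric, which is one reason a shared benchmark with fixed metrics per task makes comparisons fairer.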

Along the lines of the AI tutor, Parida wants to build more AI apps focused on government budgets and policies, which would make it easier for citizens to get information in their native languages.

“As a citizen, many times one wants to know about government policies and what the government is trying to do in your area. But you don’t know exactly whom to ask. Nowadays, information is available in the public domain, so it’s easy to build an AI app using an open-source model and using RAG,” said Parida. 

“We are not a company, and we are building without the intention of selling anything. We started with one objective: to ensure that our Odia language does not lag behind. So, whatever we are building is solely for the benefit of the people,” he concluded.


Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.