Meta’s contribution to the open-source community is undeniably one of the greatest. Now that the company has dropped Llama 3, its latest LLM, the developers who have been building ‘Indic Llamas’ can shift from Llama 2 to Llama 3. But all of this comes with certain twists and riders.
For starters, the model is available in 8B and 70B parameter versions and has been trained on over 15 trillion tokens, a dataset roughly seven times larger than Llama 2’s. This gives it stronger reasoning and coding capabilities, but it does not necessarily improve how the model tokenises Indic languages.
Moreover, it is reportedly only available in English right now.
“It’s going to be hard to adapt Llama 3 for Indic languages, in my opinion,” said Adithya S Kolavi, the founder of CognitiveLab and the creator of the Indic LLM Leaderboard.
Even though initial tests show better performance with Devanagari compared to Llama 2, it struggles with other languages like Kannada, Malayalam, and Tamil. More testing is needed to fully assess Llama 3’s performance with these languages.
He explained in his blog that Llama 3 uses a TikToken-based tokeniser, which struggles to tokenise Indic languages efficiently even with a vocabulary of 128k tokens. Moreover, unlike models that use SentencePiece tokenisation, Llama 3 may find it difficult to expand its vocabulary to better handle the wide variety of Indic languages.
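A rough, stdlib-only sketch of why this matters (this is not Llama 3’s actual tokeniser, just an illustration of the underlying pressure): byte-level BPE tokenisers fall back toward raw UTF-8 bytes for text their vocabulary covers poorly, and Indic scripts cost three bytes per character in UTF-8, so poorly covered Indic text can fragment into roughly three times as many base units as visually comparable Latin text.

```python
# Worst-case proxy for byte-level tokeniser fragmentation: count raw
# UTF-8 bytes per character. Devanagari and Kannada code points each
# take 3 bytes in UTF-8, while ASCII takes 1, so text the vocabulary
# does not cover well decomposes into far more base units.
# (This counts bytes, not real BPE merges.)

def utf8_bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per character of `text`."""
    return len(text.encode("utf-8")) / len(text)

samples = {
    "English": "Hello, how are you?",
    "Hindi":   "नमस्ते, आप कैसे हैं?",      # Devanagari script
    "Kannada": "ನಮಸ್ಕಾರ, ಹೇಗಿದ್ದೀರಾ?",  # Kannada script
}

for label, text in samples.items():
    chars = len(text)
    nbytes = len(text.encode("utf-8"))
    print(f"{label}: {chars} chars -> {nbytes} UTF-8 bytes "
          f"({utf8_bytes_per_char(text):.2f} bytes/char)")
```

In practice a trained tokeniser merges frequent byte sequences back into single tokens, which is exactly why vocabulary coverage of Indic scripts, rather than raw vocabulary size, is the bottleneck Kolavi points to.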
Worth the wait?
Kurian Benoy, ML engineer at Sentient.io, expressed his disappointment with Llama 3 as he expected it to be multimodal. “Dear Zuck Bhai, I am sad. Promise me, you will do a better job for Llama4,” he said in a post on X.
He also posted a screenshot on LinkedIn testing Llama 3 on a few questions in the Indian context. “Not so bad, but still there is a lot of room to improve in my quick analysis,” he opined.
On the other hand, the biggest version of Llama 3, with 400 billion parameters, is still in training, and many expect it to be multimodal.
Having tried Llama 2, Gemma, Mistral, and other open source models, the Indic AI community had been desperately waiting to get their hands on Llama 3. Ramsri Goutham Golla, one of the creators of Telugu LLM Labs along with Ravi Theja Desetty, also shared his initial thoughts on the model.
He highlighted that more than 5% of Llama 3’s pre-training dataset consists of high-quality non-English data from over 30 languages, which include Indic languages as well.
Kolavi had also told AIM that the problem with Indic models is that they take more time than English models because the number of tokens is significantly higher. He said that CognitiveLab has been using Llama 2 and Mistral for a lot of internal work, but according to him, the most versatile model in the Indic LLM landscape is Gemma because it comes with pre-built context of Indian languages.
“What I observed was that you need to re-train the model for it to perform well,” Kolavi explained, adding that Llama needs to be trained on at least 5-10 billion tokens to perform well. “The model doesn’t really adapt the vocabulary that well, but it is a good alternative,” he added.
Llama 3’s tokeniser now has a vocabulary of 128k tokens, four times larger than the 32k tokeniser in Llama 2. The model is also trained on 15 trillion tokens, significantly more than Llama 2’s 2 trillion and the 6 trillion used for Google’s Gemma. This means there is a possibility that Llama 3 may actually turn out to be a really good model for Indic languages.
What’s in a name?
Here comes another twist to the story. Meta has highlighted in its licence that any model built on top of Llama 3 should include “Llama 3” at the beginning of its name. Moreover, Meta has also forbidden using the outputs generated by the models to train any AI model other than Llama 3 derivatives.
“This is some BS,” said Pratik Desai, the creator of KissanAI who also created the Dhenu model on top of Llama 2. But since Meta has been giving away the model to everyone, it seems like a fair ask. Meanwhile, Desai also confirmed that Dhenu Llama 3 would be coming soon.
Now, it will be interesting to see how companies such as Sarvam AI, which built OpenHathi on top of Llama 2, adapt to this new rule.
Meanwhile, Meta AI chief Yann LeCun has been quite impressed with the Indic Llama landscape. He applauded Kannada Llama on X, saying: “I love this. This is why open source AI platforms will win: it’s the only way for AI to cater to highly diverse languages, cultures, values, and centers of interest.”
It is now time for the Indic Llama developers to test out Llama 3 and create a bunch of Llama models in every Indic language.