
Meet தமிழ் Llama

The model boasts variants with 7 billion and 13 billion parameters.


Meta’s Llama 2 has reached the land of temples. Thanks to Kaggle Master and young tech thalaiva Abhinand Balachandran, Llama can now converse in Tamil, marking the introduction of Tamil-Llama (தமிழ் Llama).

In an exclusive interview with AIM, Balachandran said that he got the inspiration to build Tamil Llama from a Chinese model called Chinese-LLaMA-Alpaca. “Chinese is a bit of a complex language, but if they can make it work for Chinese, then surely we will also be able to make it work for Indian languages, right? So that was the motivation,” said Balachandran.

Balachandran shared that when he began working on Tamil Llama, there weren’t any language models for Indian languages. He started the project for research and later published a paper. “Since the model is still pretty young, it’s not really very, very great or something, but it is good. It can be used as a very good starting point,” Balachandran added.

Tamil Llama extends the LLaMA model with an additional 16,000 Tamil tokens added to its vocabulary, aiming for superior text generation and comprehension in the Tamil language, and uses the LoRA methodology for efficient training.

“That step was actually crucial because the original Llama model did not have enough words in its vocabulary to accurately represent or even understand any aspect of common language,” said Balachandran.
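
For a sense of what that looks like in practice, here is a minimal sketch of vocabulary extension plus LoRA with Hugging Face transformers and peft. The base model id, the token list, and the hyperparameters are illustrative assumptions, not the exact Tamil-Llama recipe.

```python
# Sketch: extend a Llama tokenizer with new Tamil tokens and attach LoRA
# adapters. Model id, tokens, and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint (gated on the Hub)

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# In the real recipe, ~16,000 new tokens come from a tokenizer trained on a
# Tamil corpus; two hypothetical pieces stand in for them here.
new_tamil_tokens = ["தமிழ்", "வணக்கம்"]
tokenizer.add_tokens(new_tamil_tokens)

# Grow the embedding matrices so the new token ids have rows to train.
model.resize_token_embeddings(len(tokenizer))

# LoRA: only small low-rank adapters train, plus the embeddings and LM head,
# which must learn the new tokens from scratch.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # new-token weights stay trainable
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```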

Advantage over ChatGPT 

Balachandran believes that Tamil Llama could be used in a RAG-based system to incorporate a variety of Tamil books or literature from common sources. “This way, it can be utilised for conversational purposes relevant to the specified period. Moreover, being bilingual, it can serve as a tool for learning English, particularly beneficial for individuals in remote areas lacking access to quality content,” he said.
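
A rough sketch of the RAG pattern he describes, retrieving relevant Tamil passages with a multilingual embedder and prepending them to the prompt; the embedding model, the passages, and the prompt template are assumptions for illustration.

```python
# Sketch of a tiny RAG loop over Tamil passages. The embedding model and the
# prompt template are assumptions, not Tamil-Llama's documented format.
import numpy as np
from sentence_transformers import SentenceTransformer

# A multilingual embedder that handles Tamil reasonably well.
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

passages = [
    "திருக்குறள் திருவள்ளுவரால் இயற்றப்பட்டது.",      # "The Thirukkural was composed by Thiruvalluvar."
    "சிலப்பதிகாரம் ஐம்பெரும் காப்பியங்களில் ஒன்று.",  # "Silappatikaram is one of the five great epics."
]
doc_vecs = embedder.encode(passages, normalize_embeddings=True)

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k passages most similar to the question (cosine similarity)."""
    q_vec = embedder.encode([question], normalize_embeddings=True)
    scores = doc_vecs @ q_vec.T
    top = np.argsort(-scores.ravel())[:k]
    return [passages[i] for i in top]

question = "திருக்குறளை எழுதியவர் யார்?"  # "Who wrote the Thirukkural?"
context = "\n".join(retrieve(question))

# The retrieved context is prepended to the question and sent to the model
# (check the model card for the exact chat template before real use).
prompt = f"பின்வரும் தகவலைப் பயன்படுத்தி பதிலளி:\n{context}\n\nகேள்வி: {question}\nபதில்:"
print(prompt)
```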

“It can even generate a bit of code and explain concepts in Tamil. Use cases that illustrate the kind of fine-tuning people are going to do with such models after they come out to the public will be really interesting,” he added.

Speaking of the advantage it has over GPT-4, Balachandran said that fine-tuning OpenAI’s models is very expensive, and that Tamil Llama is going to be a lot leaner. “You could even host it on your own systems, or maybe you could have a contextual version running on your laptop to interact with it for your day-to-day use case,” he said.

“These are 7-billion and 13-billion-parameter models, making them accessible to entry-level laptops. Even if you have something like 16 GB of RAM, or 8 GB of RAM with a dedicated GPU, it can easily run,” he added.
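
That laptop-scale setup typically means a quantised build. A sketch with llama-cpp-python and a 4-bit GGUF file; the filename is an assumption, since community quantisations of Tamil-Llama circulate under various names.

```python
# Sketch: run a 4-bit quantised Tamil-Llama on CPU with llama-cpp-python.
# The GGUF filename is an assumption; download a community quantisation first.
from llama_cpp import Llama

llm = Llama(
    model_path="tamil-llama-7b-instruct.Q4_K_M.gguf",  # assumed local file
    n_ctx=2048,    # context window
    n_threads=8,   # CPU threads; tune to your machine
)

out = llm(
    "வணக்கம்! தமிழ்நாட்டின் தலைநகரம் எது?",  # "Hello! What is the capital of Tamil Nadu?"
    max_tokens=128,
    temperature=0.7,
)
print(out["choices"][0]["text"])
```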

Moreover, Balachandran mentioned that other models available in the market, such as GPT-3.5 and GPT-4, are mostly English-centric even though they can generate text in multiple languages. “However, as robust as these models, like LLaMA and Mistral, might be, their proficiency in generating coherent text in Tamil and several other Indian languages remains noticeably deficient,” he said.

“A fundamental limitation lies in their minimal vocabulary of Tamil characters, which is essential for effective text encoding and generation,” he added.
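
The vocabulary gap shows up directly in token counts: without Tamil pieces, the base tokenizer falls back to byte-level fragments, inflating sequence length and cost. A quick comparison sketch; the Hub ids are assumptions, and the base checkpoint is gated.

```python
# Sketch: compare how many tokens one Tamil sentence costs under the base
# Llama tokenizer vs an extended Tamil-Llama tokenizer. Hub ids assumed.
from transformers import AutoTokenizer

sentence = "தமிழ் ஒரு செம்மொழி."  # "Tamil is a classical language."

base = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated
tamil = AutoTokenizer.from_pretrained("abhinand/tamil-llama-7b-instruct-v0.1")

# Without Tamil pieces in its vocabulary, the base tokenizer splits the
# script into many more, smaller pieces.
print("base  tokens:", len(base.tokenize(sentence)))
print("tamil tokens:", len(tamil.tokenize(sentence)))
```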

Jarvis Powers Tamil Llama 

Balachandran said that since the beginning of the project, he has tried to keep the cost as low as possible.

To train Tamil Llama, Balachandran used the services of a GPU cloud startup. “There is a startup called Jarvis Labs, and they provide GPUs at very low costs,” he said.

He ran the initial experimental stage there, which lasted approximately 30 minutes, debugging issues and monitoring the Jarvis Labs instance, periodically checking for any anomalies.

After the experimental stage, he transitioned to Microsoft Azure. “I shifted to Azure, utilising an A100 GPU instance with 80 GB of memory, all configured as spot instances. Fortunately, during this period, spot prices were exceptionally low, approximately $0.95 per hour for training,” he said.

“So, for both the 7-billion and 13-billion models, since it’s a spot instance, occasional deallocation becomes a challenge for the total cost. I had mechanisms in place to resume training even after the instance gets reallocated, allowing me to keep the cost significantly lower than opting for reserved instances,” he explained.
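
A minimal sketch of such a resume mechanism with the Hugging Face Trainer, checkpointing to durable storage and picking up the latest checkpoint after an eviction; the paths and save cadence are assumptions, not his actual configuration.

```python
# Sketch: survive spot-instance evictions by checkpointing to durable storage
# and resuming from the latest checkpoint on restart. Paths are assumptions.
import os
from transformers import Trainer, TrainingArguments

# `model` and `train_dataset` are assumed to be prepared already (e.g. the
# LoRA-wrapped model from the earlier sketch plus a tokenised Tamil corpus).
args = TrainingArguments(
    output_dir="/mnt/durable/tamil-llama-ckpts",  # storage that survives eviction
    save_steps=500,                               # checkpoint often; lose little work
    save_total_limit=3,                           # bound disk usage
    per_device_train_batch_size=4,
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)

# Fresh start: no checkpoint yet. After a spot eviction and restart, resume
# from the latest checkpoint instead of retraining from scratch.
has_checkpoint = os.path.isdir(args.output_dir) and any(
    d.startswith("checkpoint-") for d in os.listdir(args.output_dir)
)
trainer.train(resume_from_checkpoint=has_checkpoint)
```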

Main Challenges 

Balachandran explained that the model is trained in two stages. According to him, the primary challenges were related to data availability, especially for Tamil text that could be found online. 

“The first stage is pre-training, during which we expose the model to a substantial amount of data from the internet,” he said. For pre-training, he utilised a dataset called CulturaX, which offered a substantial Tamil corpus. “However, in terms of quality compared to English, it remains a challenge not only for Tamil but also for many Indian languages,” he added.
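
The Tamil portion of CulturaX can be streamed from the Hugging Face Hub without downloading the whole corpus; a brief sketch follows (access to the dataset may require accepting its terms on the Hub).

```python
# Sketch: stream the Tamil split of CulturaX for pre-training rather than
# downloading the full corpus up front.
from datasets import load_dataset

stream = load_dataset("uonlp/CulturaX", "ta", split="train", streaming=True)

# Peek at a few documents; each record carries raw text plus source metadata.
for i, doc in enumerate(stream):
    print(doc["text"][:200])
    if i == 2:
        break
```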

The second stage, instruction fine-tuning, posed challenges due to insufficient data. “I had to either translate existing content from English to Tamil or explore the possibility of using GPT-4 to generate more Tamil content. Dealing with the data aspect was particularly challenging in this phase,” explained Balachandran.
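
A sketch of the GPT-4 route, translating an English instruction pair into Tamil with the OpenAI API; the prompt wording and model name are illustrative, not his actual pipeline.

```python
# Sketch: use GPT-4 to turn an English instruction/response pair into Tamil
# fine-tuning data. Prompt wording and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

pair = {
    "instruction": "Explain what photosynthesis is in simple terms.",
    "response": "Photosynthesis is how plants turn sunlight into food.",
}

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "Translate the given instruction and response into "
                    "natural, fluent Tamil. Keep the JSON keys unchanged."},
        {"role": "user", "content": str(pair)},
    ],
)
print(resp.choices[0].message.content)
```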

Limitations 

Tamil Llama can correctly answer basic questions, including some related to Tamil culture. However, it still lacks the ability to fully comprehend Tamil culture. “Since it is just trained on internet documents and some translated instruction data, one of the main weaknesses is that it doesn’t know the cultural aspects of the region. So, that is also another thing I’m actually trying to improve in the next version,” said Balachandran.

Another limitation is that it can respond to malicious instructions. “I didn’t do the alignment step because it would have cost a little bit more, and that’s the reason I skipped that part. However, in the future, maybe I can implement that as well,” said Balachandran.

What’s Next?

Balachandran mentioned that he plans to build a multilingual LLM next. “I am currently experimenting with a multilingual model, which is also one of my goals going into 2024: to develop a multilingual LLM for Indian languages, a model that can work for Hindi, Telugu, and Tamil,” he said.

“There is a ‘Tamil Computing Conference’ happening in February 2024. I am also invited, and I’m excited for that,” he concluded. Balachandran will also be attending India’s biggest generative AI conference, MLDS 2024, on February 1-2, 2024, in Bengaluru.

Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.