Given the rise of Indic LLMs in Tamil, Telugu, Hindi, and Odia, it was only a matter of time before a Kannada language model arrived, and it’s finally here. Meet Kannada Llama, aka Kan-LLaMA, a 7-billion-parameter Llama 2 model LoRA pre-trained and fine-tuned on Kannada tokens, built by a Mumbai-based company called Tensoic.
“It all started when Meta’s Llama dropped and then we saw all these Indic language models coming up,” said Adarsh Shirawalmath, a second-year B.Tech student at Vellore Institute of Technology and the creator of Kan-LLaMA, in an exclusive interaction with AIM.
“Our college had a couple of GPU clusters and we got access to them, and started messing around with them,” Shirawalmath said. He and his co-founder, Raghav Ravishankar, started collaborating with people online and, determined to build their own language model, put the college GPUs to work.
That is also how they got in touch with Adithya Kamath and Bharat Shetty Barkur, who both contributed significantly to the project.
When the AWS Campus Fund was announced at VIT, Shirawalmath and Ravishankar were excited about getting funded to build AI models. But the minimum requirement was having a registered company. “I said that if we’re planning on building this venture, let’s just do it,” he said, recounting how they registered a company within 15 days.
“We randomly came up with the name Tensoic, which means ‘Tensor’ plus ‘Logic’, and we had no goals then, we were just fishing stuff.”
Bound to happen
The team is still working on the research paper as Shirawalmath said that he wants to make it perfect. “I don’t even know how to read Kannada,” he laughed. “We got experts of the Kannada language for the data curation part, while we worked on the model.”
Born in Davanagere in Karnataka, Shirawalmath has lived most of his life in Mumbai, and then started studying at VIT. “I used to do ethical hacking as a bug bounty hunter for Shopify and many other companies,” he added. “I have always been interested in AI but the ChatGPT frenzy got me wondering about the business use cases of generative AI and LLMs.”
ChatGPT is not a specialised system but a general-purpose conversational chatbot. The team saw many use cases for multilingual models. “Customising something like ChatGPT for a dynamic country like India was a very huge task,” he said, but the team decided to build it from scratch and open-source both the model and the dataset on Hugging Face.
The pre-training ran on a single NVIDIA A100 80GB instance, took approximately 50 hours, and cost an estimated $170. The resulting LoRA adapter came to around 1.1GB.
In their blog post, the researchers wrote that they pre-trained Llama 2 on approximately 600 million Kannada tokens from the well-known CulturaX dataset, which comprises de-duplicated multilingual dumps from popular web scrapes such as mC4 and OSCAR.
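As a quick sanity check on those figures, the reported numbers imply some useful rates. The constants below come straight from the article; the derived throughput and per-token cost are our own back-of-the-envelope arithmetic, not figures Tensoic has published:

```python
# Back-of-the-envelope rates implied by the reported Kan-LLaMA training run.
# Input constants are from the article; derived values are simple arithmetic.

TOKENS = 600_000_000   # ~600M Kannada tokens from CulturaX
HOURS = 50             # reported wall-clock time on one A100 80GB
COST_USD = 170         # reported total cost of the run

tokens_per_second = TOKENS / (HOURS * 3600)
cost_per_hour = COST_USD / HOURS
cost_per_million_tokens = COST_USD / (TOKENS / 1_000_000)

print(f"throughput: ~{tokens_per_second:,.0f} tokens/sec")
print(f"GPU rate:   ~${cost_per_hour:.2f}/hour")
print(f"cost:       ~${cost_per_million_tokens:.3f} per million tokens")
```

The implied ~$3.40/hour is in line with typical spot-market pricing for a single A100 80GB, which makes the $170 total plausible.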
More models coming soon
Currently, the model is built on top of Llama 2, but Shirawalmath said the team also plans to build it on top of Mistral’s models; for now, the dataset is a little messy and not yet ready for Indic models.
“We are also planning to build a Gujarati Llama soon, but it’s just the beginning phase,” he added about future plans and the possibility of releasing more Indic models in the coming months.
Shirawalmath said there are various use cases for Indic LLMs; the ones Tensoic is focused on are primarily in the healthcare and defence sectors.
“When Sam Altman said that it is impossible to compete with OpenAI, we decided to join the movement for Indic languages acceleration,” Shirawalmath added about his motivation, and how models like Tamil Llama, Telugu Llama, and Sarvam AI’s OpenHathi fuelled his enthusiasm even further.
Given the battle between the US and China over AI models, Shirawalmath said he was surprised that there were no models coming out of India. “I think we should follow what Japan is doing in terms of AI policies,” he added, on the need for less red tape and bureaucracy so that India can rapidly develop and ship LLMs.
“I think it’s crucial for India to have technologies like what the US has. That is the main motive that we have,” Shirawalmath concluded, saying that India’s generative AI moment is just getting started and the team is looking for further collaboration.