
Kannada Llama Finally Arrives 

"A Continually LoRA PreTrained & FineTuned 7B Indic model"


Mumbai-based software development company Tensoic has released Kannada Llama, aka Kan-LLaMA [ಕನ್-LLama], a 7B Llama-2 model that is LoRA pre-trained and fine-tuned on Kannada tokens.

The company said it expanded Llama-2’s existing linguistic capabilities for low-resource Indic languages, specifically Kannada, by continually pre-training on about 600 million Kannada tokens and subsequently fine-tuning on SOTA instruction datasets. It said it will release the models, code, datasets and, eventually, the paper under permissive licenses.
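For context, LoRA (low-rank adaptation) freezes each original weight matrix and trains only a pair of small low-rank factors alongside it. The sketch below counts the trainable parameters for a single square projection matrix. The hidden size of 4096 matches Llama-2 7B, but the rank of 64 is purely an illustrative assumption, not a setting reported by Tensoic, and the function names are hypothetical.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of weights in a dense d_out x d_in matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Weights in the two LoRA factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


hidden = 4096   # Llama-2 7B hidden size
rank = 64       # assumption: illustrative LoRA rank, not from the article

dense = full_params(hidden, hidden)
lora = lora_trainable_params(hidden, hidden, rank)
print(f"dense weights: {dense:,}")
print(f"LoRA weights:  {lora:,}")
print(f"fraction trained: {lora / dense:.4f}")
```

With these assumed values, only a small fraction of each adapted matrix is trained, which is what makes continual pre-training of a 7B model feasible on modest hardware.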

“We Continually Pre Train Llama-2 on ~600 Million Kannada Tokens from the popular CulturaX Dataset. The dataset consists of multiple de-duplicated Multilingual dumps from popular scrapes such as mC4 and OSCAR. We randomly select documents from the same, resulting in a text corpus of ~11GB for the pre-training step,” the company wrote in its blog post.
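The random document selection described in the quote can be sketched as follows. This is a minimal illustration, not Tensoic's actual pipeline: the function name `sample_documents`, the byte budget and the fixed seed are all assumptions, and a corpus of this size would in practice be streamed rather than held in memory.

```python
import random


def sample_documents(documents, byte_budget, seed=0):
    """Randomly pick documents until their total UTF-8 size reaches byte_budget."""
    rng = random.Random(seed)          # fixed seed so the selection is reproducible
    order = list(range(len(documents)))
    rng.shuffle(order)

    selected, total = [], 0
    for idx in order:
        if total >= byte_budget:
            break
        doc = documents[idx]
        selected.append(doc)
        total += len(doc.encode("utf-8"))
    return selected, total


# Toy stand-in for a de-duplicated Kannada corpus (hypothetical data).
docs = ["ನಮಸ್ಕಾರ", "ಕನ್ನಡ ಭಾಷೆ", "ಬೆಂಗಳೂರು", "ಕರ್ನಾಟಕ"]
subset, size = sample_documents(docs, byte_budget=40)
print(len(subset), "documents,", size, "bytes")
```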

Expanding the model begins with the tokeniser, the component that splits text into smaller units, or tokens. The vocabulary is increased from Llama-2’s existing 32K tokens to a total of 48K tokens so that Kannada text is processed more efficiently.

To achieve this, a SentencePiece tokeniser with a vocabulary size of 20K is trained on the same Kannada text corpus used for pre-training. This new tokeniser is then merged with the existing Llama-2 tokeniser, resulting in improved processing capabilities for Kannada text.
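The merge step can be illustrated with a toy example. The sketch below uses plain Python lists in place of real SentencePiece models, and the function name `merge_vocab` is hypothetical. It also suggests a likely reason the merged vocabulary (48K) is smaller than the sum of the two (32K + 20K): pieces that already exist in the base vocabulary are not added a second time, and existing token IDs stay unchanged.

```python
def merge_vocab(base_vocab, new_vocab):
    """Append pieces from new_vocab that base_vocab lacks, keeping base IDs stable."""
    merged = list(base_vocab)
    seen = set(base_vocab)
    for piece in new_vocab:
        if piece not in seen:      # skip pieces the base tokeniser already has
            merged.append(piece)
            seen.add(piece)
    return merged


# Toy vocabularies (hypothetical); real ones hold 32K and 20K pieces.
base = ["<s>", "</s>", "▁the", "a", "ಕ"]
kannada = ["ಕ", "ನ್ನ", "ಡ", "a", "▁ಕನ್ನಡ"]

merged = merge_vocab(base, kannada)
token_to_id = {piece: i for i, piece in enumerate(merged)}
print(len(base), "+", len(kannada), "->", len(merged), "pieces after merging")
```

After such a merge, the model's embedding and output layers have to be resized to the new vocabulary size, and the added rows are learned during the continued pre-training.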

The subsequent fine-tuning phase is carried out on chat-optimised and translated datasets to enhance the model’s conversational capabilities. The curated datasets are released under various licenses, such as cc-by-4.0 and Apache 2.0, to encourage community contributions.

Fine-tuning is done with Axolotl, a tool that lets users configure fine-tuning runs for large language models (LLMs) through YAML files. The company showcases the resulting Kannada Llama model’s generation capabilities using quantized versions.

Here is the dataset of Kan-LLaMA [ಕನ್-LLama]


Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.