Mumbai-based software development company Tensoic has released Kannada Llama, aka Kan-LLaMA [ಕನ್-LLama], a 7B Llama-2 model that was LoRA pre-trained and fine-tuned on Kannada tokens.
The company said it expanded Llama-2’s existing linguistic capabilities to low-resource Indic languages, and specifically Kannada, by continually pre-training on 600 million Kannada tokens and subsequently fine-tuning on state-of-the-art instruction datasets. It added that it will release the models, code, datasets and, eventually, the paper under permissive licenses.
“We Continually Pre Train Llama-2 on ~600 Million Kannada Tokens from the popular CulturaX Dataset. The dataset consists of multiple de-duplicated Multilingual dumps from popular scrapes such as mC4 and OSCAR. We randomly select documents from the same, resulting in a text corpus of ~11GB for the pre-training step,” the company wrote in its blog post.
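The post does not include code, but a minimal sketch of this continual pre-training setup might look like the following, assuming the Hugging Face `datasets` and `peft` libraries and the `uonlp/CulturaX` Kannada split (`kn`); the sample size and LoRA hyperparameters here are illustrative placeholders, not Tensoic’s actual configuration.

```python
# Illustrative sketch of LoRA continual pre-training on Kannada text.
# Assumptions: the CulturaX Kannada data is at "uonlp/CulturaX" ("kn" config),
# and all hyperparameters below are placeholders, not Tensoic's real values.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Stream the Kannada portion of CulturaX and take a random sample of
# documents to build the ~11GB pre-training corpus described above.
dataset = load_dataset("uonlp/CulturaX", "kn", split="train", streaming=True)
sampled_docs = dataset.shuffle(seed=42, buffer_size=10_000).take(500_000)

# Wrap Llama-2 7B with LoRA adapters so only a small set of low-rank
# matrices is trained during continual pre-training.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=16,                      # placeholder rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```

Because LoRA updates only the small adapter matrices rather than all 7B weights, this kind of continual pre-training stays affordable, which matters for low-resource language work.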
The model expansion involves developing a new tokenizer, the component that splits text into smaller units, or tokens. The vocabulary grows from Llama-2’s existing 32K tokens to a total of 48K, with a specific emphasis on processing Kannada text efficiently.
To achieve this, a SentencePiece tokenizer with a vocabulary of 20K is trained on the same Kannada text corpus used for pre-training. This new tokenizer is then merged with the existing Llama-2 tokenizer, resulting in improved processing of Kannada text.
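Tensoic has not published the exact merge code in this section, but a common recipe for this kind of vocabulary merge (popularized by projects such as Chinese-LLaMA) manipulates the SentencePiece protobuf directly. A sketch, with placeholder file paths:

```python
# Sketch: train a 20K-vocab SentencePiece tokenizer on Kannada text and
# merge its pieces into the Llama-2 tokenizer. Paths and model_type are
# placeholders; this follows a common recipe, not Tensoic's exact code.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaTokenizer

# 1. Train a Kannada-only tokenizer on the pre-training corpus.
spm.SentencePieceTrainer.train(
    input="kannada_corpus.txt",   # same corpus used for pre-training
    model_prefix="kannada_sp",
    vocab_size=20_000,
    model_type="bpe",             # placeholder; BPE is a common choice
    character_coverage=1.0,       # keep all Kannada characters
)

# 2. Load both tokenizers' underlying protobuf models.
llama_tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama_model = sp_pb2.ModelProto()
llama_model.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
kannada_model = sp_pb2.ModelProto()
with open("kannada_sp.model", "rb") as f:
    kannada_model.ParseFromString(f.read())

# 3. Append Kannada pieces that Llama-2 does not already have. Overlapping
# pieces are skipped, which presumably explains why 32K + 20K lands at
# roughly 48K rather than 52K.
existing = {p.piece for p in llama_model.pieces}
for piece in kannada_model.pieces:
    if piece.piece not in existing:
        new_piece = sp_pb2.ModelProto().SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0.0
        llama_model.pieces.append(new_piece)

# 4. Save the merged tokenizer for use in continual pre-training.
with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_model.SerializeToString())
```

The practical payoff is that Kannada words map to far fewer tokens than under the original Llama-2 vocabulary, cutting both training and inference cost on Kannada text.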
The subsequent fine-tuning phase is carried out on chat-optimised and translated datasets to enhance the model’s conversational capabilities. The curated datasets are released under permissive licenses such as CC-BY-4.0 and Apache 2.0 to encourage community contributions.
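The post does not specify the record schema of these datasets, but translated instruction data of this kind is typically stored in an Alpaca-style format; a hypothetical Kannada record might look like:

```python
# Hypothetical Alpaca-style record; the actual schema of Tensoic's
# translated instruction datasets may differ.
example = {
    "instruction": "ಭಾರತದ ರಾಜಧಾನಿ ಯಾವುದು?",  # "What is the capital of India?"
    "input": "",
    "output": "ಭಾರತದ ರಾಜಧಾನಿ ನವದೆಹಲಿ.",     # "The capital of India is New Delhi."
}
```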
Axolotl is employed for the fine-tuning phase, providing an easy yet powerful environment for fine-tuning large language models (LLMs) through YAML configs. The resulting Kannada Llama model demonstrates its generation capabilities through quantized versions.
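This section does not name the exact release artifacts, but quantized checkpoints of this kind are commonly distributed in GGUF form and run locally with `llama-cpp-python`; the model filename below is a placeholder, not an actual file from Tensoic’s release.

```python
# Sketch of running a quantized Kannada Llama checkpoint locally with
# llama-cpp-python. The GGUF filename is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="kan-llama-7b.Q4_K_M.gguf", n_ctx=2048)

# Prompt the model in Kannada ("Tell me about Karnataka").
output = llm(
    "ಕರ್ನಾಟಕದ ಬಗ್ಗೆ ಹೇಳಿ",
    max_tokens=256,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```

Running a 4-bit quantized 7B model like this needs only around 4 to 5 GB of memory, which is what makes local generation demos of this kind practical.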
The dataset for Kan-LLaMA [ಕನ್-LLama] is available here.