Mumbai-based software development company Tensoic has released Kannada Llama, aka Kan-LLaMA [ಕನ್-LLama], a 7B Llama-2 model that was LoRA pre-trained and fine-tuned on Kannada tokens.
The company said it expanded Llama-2’s existing linguistic capabilities to low-resource Indic languages, and specifically Kannada, by continually pre-training on 600 million Kannada tokens and subsequently fine-tuning on state-of-the-art instruction datasets. It added that it will release the models, code, datasets and, eventually, the paper under permissive licenses.
“We Continually Pre Train Llama-2 on ~600 Million Kannada Tokens from the popular CulturaX Dataset. The dataset consists of multiple de-duplicated Multilingual dumps from popular scrapes such as mC4 and OSCAR. We randomly select documents from the same, resulting in a text corpus of ~11GB for the pre-training step,” the company wrote in its blog post.
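The post does not include code, but a minimal sketch of this continual pre-training setup might look like the following, assuming the Hugging Face `datasets` and `peft` libraries and the `uonlp/CulturaX` Kannada split (`kn`); the sample size and LoRA hyperparameters here are illustrative placeholders, not Tensoic’s actual configuration.

```python
# Illustrative sketch of LoRA continual pre-training on Kannada text.
# Assumptions: the CulturaX Kannada data is at "uonlp/CulturaX" ("kn" config),
# and all hyperparameters below are placeholders, not Tensoic's real values.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Stream the Kannada portion of CulturaX and take a random sample of
# documents to build the ~11GB pre-training corpus described above.
dataset = load_dataset("uonlp/CulturaX", "kn", split="train", streaming=True)
sampled_docs = dataset.shuffle(seed=42, buffer_size=10_000).take(500_000)

# Wrap Llama-2 7B with LoRA adapters so only a small set of low-rank
# matrices is trained during continual pre-training.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=16,                      # placeholder rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```

Because LoRA updates only the small adapter matrices rather than all 7B weights, this kind of continual pre-training stays affordable, which matters for low-resource language work.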
The model expansion involves developing a new tokenizer, the component that splits text into smaller units, or tokens. The vocabulary grows from Llama-2’s existing 32K tokens to a total of 48K, with a specific emphasis on processing Kannada text efficiently.
To achieve this, a SentencePiece tokenizer with a vocabulary of 20K is trained on the same Kannada text corpus used for pre-training. This new tokenizer is then merged with the existing Llama-2 tokenizer, resulting in improved processing of Kannada text.
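Tensoic has not published the exact merge code in this section, but a common recipe for this kind of vocabulary merge (popularized by projects such as Chinese-LLaMA) manipulates the SentencePiece protobuf directly. A sketch, with placeholder file paths:

```python
# Sketch: train a 20K-vocab SentencePiece tokenizer on Kannada text and
# merge its pieces into the Llama-2 tokenizer. Paths and model_type are
# placeholders; this follows a common recipe, not Tensoic's exact code.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaTokenizer

# 1. Train a Kannada-only tokenizer on the pre-training corpus.
spm.SentencePieceTrainer.train(
    input="kannada_corpus.txt",   # same corpus used for pre-training
    model_prefix="kannada_sp",
    vocab_size=20_000,
    model_type="bpe",             # placeholder; BPE is a common choice
    character_coverage=1.0,       # keep all Kannada characters
)

# 2. Load both tokenizers' underlying protobuf models.
llama_tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama_model = sp_pb2.ModelProto()
llama_model.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
kannada_model = sp_pb2.ModelProto()
with open("kannada_sp.model", "rb") as f:
    kannada_model.ParseFromString(f.read())

# 3. Append Kannada pieces that Llama-2 does not already have. Overlapping
# pieces are skipped, which presumably explains why 32K + 20K lands at
# roughly 48K rather than 52K.
existing = {p.piece for p in llama_model.pieces}
for piece in kannada_model.pieces:
    if piece.piece not in existing:
        new_piece = sp_pb2.ModelProto().SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0.0
        llama_model.pieces.append(new_piece)

# 4. Save the merged tokenizer for use in continual pre-training.
with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_model.SerializeToString())
```

The practical payoff is that Kannada words map to far fewer tokens than under the original Llama-2 vocabulary, cutting both training and inference cost on Kannada text.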
The subsequent fine-tuning phase is carried out on chat-optimised and translated datasets to enhance the model’s conversational capabilities. The curated datasets are released under permissive licenses such as CC-BY-4.0 and Apache 2.0 to encourage community contributions.
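The post does not specify the record schema of these datasets, but translated instruction data of this kind is typically stored in an Alpaca-style format; a hypothetical Kannada record might look like:

```python
# Hypothetical Alpaca-style record; the actual schema of Tensoic's
# translated instruction datasets may differ.
example = {
    "instruction": "ಭಾರತದ ರಾಜಧಾನಿ ಯಾವುದು?",  # "What is the capital of India?"
    "input": "",
    "output": "ಭಾರತದ ರಾಜಧಾನಿ ನವದೆಹಲಿ.",     # "The capital of India is New Delhi."
}
```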
Axolotl is employed for the fine-tuning phase, providing an easy yet powerful environment for fine-tuning large language models (LLMs) through YAML configs. The resulting Kannada Llama model demonstrates its generation capabilities through quantized versions.
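This section does not name the exact release artifacts, but quantized checkpoints of this kind are commonly distributed in GGUF form and run locally with `llama-cpp-python`; the model filename below is a placeholder, not an actual file from Tensoic’s release.

```python
# Sketch of running a quantized Kannada Llama checkpoint locally with
# llama-cpp-python. The GGUF filename is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="kan-llama-7b.Q4_K_M.gguf", n_ctx=2048)

# Prompt the model in Kannada ("Tell me about Karnataka").
output = llm(
    "ಕರ್ನಾಟಕದ ಬಗ್ಗೆ ಹೇಳಿ",
    max_tokens=256,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```

Running a 4-bit quantized 7B model like this needs only around 4 to 5 GB of memory, which is what makes local generation demos of this kind practical.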
The dataset for Kan-LLaMA [ಕನ್-LLama] is available here.