
CognitiveLab Unveils Ambari, Bilingual Language Models in Kannada-English

Its inaugural models, Ambari-7B-base-v0.1 and Ambari-7B-Instruct-v0.1, achieve impressive results despite being trained in multiple stages on a compact 1-billion-token dataset.


CognitiveLab has introduced Ambari, an open-source series of bilingual Kannada-English large language models (LLMs). The initiative tackles the challenges posed by the fast-moving LLM landscape, with a primary focus on bridging the linguistic gap between Kannada and English.

Its inaugural models, Ambari-7B-base-v0.1 and Ambari-7B-Instruct-v0.1, achieve impressive results despite being trained in multiple stages on a compact 1-billion-token dataset. Both models are available on Hugging Face.

In the blog post, CognitiveLab shares insights into the purpose behind Ambari and the meticulous approach taken during its development. The project is driven by the need to pioneer language adaptability within LLMs, pushing the boundaries of efficiency by training and fine-tuning on a modest 1-billion-token dataset.

Ambari’s training process involves distinct stages, including pre-training, bilingual next-token prediction/translation, instruct fine-tuning, and more. Efficient tokenization, a critical component, is achieved through a specialized SentencePiece model that addresses how poorly Kannada text is segmented by the tokenizers of existing open-source LLMs.
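As an illustration of how such a tokenizer could be built, the sketch below trains a small SentencePiece model on Kannada text and merges its vocabulary into a Llama tokenizer; the corpus file name, vocabulary size, and base checkpoint are assumptions for illustration, not details confirmed in the blog.

```python
# A minimal sketch of extending a Llama tokenizer with Kannada pieces.
# "kannada_corpus.txt" and all hyperparameters are hypothetical stand-ins.
import sentencepiece as spm
from transformers import AutoTokenizer

# Train a small SentencePiece model on raw Kannada text.
spm.SentencePieceTrainer.train(
    input="kannada_corpus.txt",
    model_prefix="kannada_sp",
    vocab_size=16000,
    character_coverage=1.0,   # Kannada script needs full character coverage
    model_type="bpe",
)

# Load the learned pieces and add the ones missing from the base tokenizer.
sp = spm.SentencePieceProcessor(model_file="kannada_sp.model")
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
added = tokenizer.add_tokens(
    [p for p in new_pieces if p not in tokenizer.get_vocab()]
)
print(f"Added {added} Kannada tokens; the model's embeddings must be resized to match.")
```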

The blog highlights continual pre-training on a curated dataset of 500 million tokens, and the fully fine-tuned model weights are available on Hugging Face, underscoring the commitment to open-source knowledge sharing.
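A minimal sketch of what continual pre-training on such a corpus could look like with Hugging Face's Trainer is shown below; the corpus file, sequence length, and hyperparameters are illustrative assumptions rather than the values CognitiveLab reports.

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Base checkpoint and corpus file are hypothetical stand-ins.
base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# Plain-text corpus standing in for the curated 500-million-token dataset.
dataset = load_dataset("text", data_files={"train": "kannada_english_corpus.txt"})["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ambari-continual-pretrain",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=dataset,
    # mlm=False gives standard causal (next-token) language modelling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```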

A pivotal addition to the training strategy is a bilingual next-token-prediction phase, inspired by the Hathi series. The blog acknowledges challenges in translation and fine-tuning, emphasizing the commitment to refining Ambari's bilingual capabilities.
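One plausible way to assemble bilingual next-token-prediction samples is to pair each sentence with its translation, so the model must continue one language in the other; the column names, file name, and alternation scheme below are assumptions for illustration.

```python
from datasets import load_dataset

def to_bilingual_text(example, idx):
    # Alternate the direction so the model sees both Kannada -> English and
    # English -> Kannada continuations.
    kn, en = example["kn"], example["en"]
    return {"text": f"{kn}\n{en}" if idx % 2 == 0 else f"{en}\n{kn}"}

# Hypothetical parallel corpus with "kn" and "en" columns.
parallel = load_dataset("json", data_files="kn_en_parallel.jsonl")["train"]
bilingual = parallel.map(
    to_bilingual_text,
    with_indices=True,
    remove_columns=parallel.column_names,
)
print(bilingual[0]["text"])  # a Kannada sentence followed by its English translation
```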

The blog details supervised fine-tuning with low-rank adaptation (LoRA), introducing a chat template structure for bilingual instruct fine-tuning. The final phase explores Direct Preference Optimization (DPO) using the Anthropic/hh-rlhf dataset, whose impact on performance is still being evaluated.
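A rough sketch of the LoRA setup and a simple instruction-style chat template is given below; the template wording, dataset fields, and LoRA hyperparameters are illustrative assumptions, not CognitiveLab's exact recipe.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical bilingual chat template; instruction/response pairs may be in
# Kannada, English, or a mix of both.
CHAT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def format_example(example):
    return {"text": CHAT_TEMPLATE.format(**example)}

# Base checkpoint is a stand-in for the continually pre-trained model.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights train
```

The wrapped model can then be fine-tuned with the same Trainer setup as in the continual pre-training sketch, with a DPO pass on Anthropic/hh-rlhf (for example via TRL's DPOTrainer) following as the final stage described in the blog.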

Learnings and observations include occasional hallucinations, translation nuances, and the trade-offs of full-weight fine-tuning. The future roadmap for Ambari includes incorporating Romanized Kannada, refining the data pipelines, and scaling up the training dataset for continuous learning and model improvement.

Interestingly, this is the second Kannada-focused LLM. Recently, Mumbai-based software development company Tensoic released Kannada Llama, also known as Kan-LLaMA [ಕನ್-LLama], a 7B Llama 2 model LoRA pre-trained and fine-tuned on Kannada tokens.


Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.