
CognitiveLab Unveils Ambari, Bilingual Language Models in Kannada-English

Its inaugural models, Ambari-7B-base-v0.1 and Ambari-7B-Instruct-v0.1, achieve impressive results despite being trained in multiple stages on a compact 1-billion-token dataset.


CognitiveLab has introduced Ambari, an open-source series of bilingual Kannada-English large language models (LLMs). The initiative tackles the challenges posed by the fast-moving LLM landscape, with a primary focus on bridging the linguistic gap between Kannada and English.

Its inaugural models, Ambari-7B-base-v0.1 and Ambari-7B-Instruct-v0.1, achieve impressive results despite being trained in multiple stages on a compact 1-billion-token dataset. Both models are available on Hugging Face.

In the blog post, CognitiveLab shares insights into the purpose behind Ambari and the meticulous approach taken during its development. The project is driven by the need to pioneer language adaptability within LLMs, pushing the boundaries of efficiency by training and fine-tuning on a modest 1-billion-token dataset.

Ambari’s training process involves distinct stages, including pre-training, bilingual next-token prediction/translation, instruct fine-tuning, and more. Efficient tokenization, a critical component, is achieved through a specialized SentencePiece model that addresses how poorly Kannada text is segmented by the tokenizers of existing open-source LLMs.
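As an illustration of how such a tokenizer could be built, the sketch below trains a small SentencePiece model on Kannada text and merges its vocabulary into a Llama tokenizer; the corpus file name, vocabulary size, and base checkpoint are assumptions for illustration, not details confirmed in the blog.

```python
# A minimal sketch of extending a Llama tokenizer with Kannada pieces.
# "kannada_corpus.txt" and all hyperparameters are hypothetical stand-ins.
import sentencepiece as spm
from transformers import AutoTokenizer

# Train a small SentencePiece model on raw Kannada text.
spm.SentencePieceTrainer.train(
    input="kannada_corpus.txt",
    model_prefix="kannada_sp",
    vocab_size=16000,
    character_coverage=1.0,   # Kannada script needs full character coverage
    model_type="bpe",
)

# Load the learned pieces and add the ones missing from the base tokenizer.
sp = spm.SentencePieceProcessor(model_file="kannada_sp.model")
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
added = tokenizer.add_tokens(
    [p for p in new_pieces if p not in tokenizer.get_vocab()]
)
print(f"Added {added} Kannada tokens; the model's embeddings must be resized to match.")
```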

The blog highlights continual pre-training on a curated dataset of 500 million tokens, and the fully fine-tuned model weights are available on Hugging Face, underscoring the commitment to open-source knowledge sharing.
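A minimal sketch of what continual pre-training on such a corpus could look like with Hugging Face's Trainer is shown below; the corpus file, sequence length, and hyperparameters are illustrative assumptions rather than the values CognitiveLab reports.

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Base checkpoint and corpus file are hypothetical stand-ins.
base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# Plain-text corpus standing in for the curated 500-million-token dataset.
dataset = load_dataset("text", data_files={"train": "kannada_english_corpus.txt"})["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ambari-continual-pretrain",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=dataset,
    # mlm=False gives standard causal (next-token) language modelling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```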

A pivotal addition to the training strategy is a bilingual next-token-prediction phase, inspired by the Hathi series. The blog acknowledges challenges in translation and fine-tuning, emphasizing the commitment to refining Ambari's bilingual capabilities.
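One plausible way to assemble bilingual next-token-prediction samples is to pair each sentence with its translation, so the model must continue one language in the other; the column names, file name, and alternation scheme below are assumptions for illustration.

```python
from datasets import load_dataset

def to_bilingual_text(example, idx):
    # Alternate the direction so the model sees both Kannada -> English and
    # English -> Kannada continuations.
    kn, en = example["kn"], example["en"]
    return {"text": f"{kn}\n{en}" if idx % 2 == 0 else f"{en}\n{kn}"}

# Hypothetical parallel corpus with "kn" and "en" columns.
parallel = load_dataset("json", data_files="kn_en_parallel.jsonl")["train"]
bilingual = parallel.map(
    to_bilingual_text,
    with_indices=True,
    remove_columns=parallel.column_names,
)
print(bilingual[0]["text"])  # a Kannada sentence followed by its English translation
```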

The blog details supervised fine-tuning with low-rank adaptation (LoRA), introducing a chat template structure for bilingual instruct fine-tuning. The final phase explores Direct Preference Optimization (DPO) using the Anthropic/hh-rlhf dataset, whose impact on performance is still being evaluated.
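A rough sketch of the LoRA setup and a simple instruction-style chat template is given below; the template wording, dataset fields, and LoRA hyperparameters are illustrative assumptions, not CognitiveLab's exact recipe.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical bilingual chat template; instruction/response pairs may be in
# Kannada, English, or a mix of both.
CHAT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def format_example(example):
    return {"text": CHAT_TEMPLATE.format(**example)}

# Base checkpoint is a stand-in for the continually pre-trained model.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights train
```

The wrapped model can then be fine-tuned with the same Trainer setup as in the continual pre-training sketch, with a DPO pass on Anthropic/hh-rlhf (for example via TRL's DPOTrainer) following as the final stage described in the blog.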

Learnings and observations include occasional hallucinations, translation nuances, and the trade-offs of full-weight fine-tuning. The future roadmap for Ambari includes incorporating Romanized Kannada, refining the data pipelines, and scaling up the training dataset for continuous learning and model improvement.

Interestingly, this is the second Kannada-focused LLM. Recently, Mumbai-based software development company Tensoic released Kannada Llama, also known as Kan-LLaMA [ಕನ್-LLama], a 7B Llama 2 model LoRA pre-trained and fine-tuned on Kannada tokens.


Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.