Llama 2 is currently one of the most sought-after open-source models globally, with 757K downloads on Hugging Face. Its free availability for research and commercial purposes makes it attractive to a broad user base, including researchers, developers, and hobbyists.
It has gained popularity in India and other countries worldwide, where users have created ‘Local Llamas’ tuned to the language of their region. Models like GPT-3.5 face a ‘tokenisation’ challenge: Indic-language text is not represented efficiently, so the same content consumes far more tokens than its English equivalent.
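The inefficiency is easy to see at the byte level. The snippet below is an illustration (not GPT-3.5’s actual tokenizer): when a vocabulary lacks dedicated Indic entries, byte-fallback tokenization splits each Devanagari character into its UTF-8 bytes, inflating token counts.

```python
# Illustration only: a byte-fallback tokenizer with no Hindi merges
# can spend one token per UTF-8 byte on Devanagari text.
english = "Namaste"   # 7 ASCII characters -> 7 bytes, typically 1-2 tokens
hindi = "नमस्ते"        # the same word in Devanagari: 6 code points

print(len(english.encode("utf-8")))  # 7
print(len(hindi.encode("utf-8")))    # 18 -> up to 18 tokens in the worst case
```

Each Devanagari code point occupies three bytes in UTF-8, which is why extending the vocabulary with dedicated Indic tokens pays off so quickly.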
Compared to massive LLMs like GPT-4 or 3.5, Llama 2 requires less computational power and training data. This makes it more feasible to train and run on local hardware, even in areas with limited infrastructure.
Moreover, Llama 2 can be easily fine-tuned for specific tasks and domains using smaller datasets of local language text. This allows developers to tailor the model to their specific language and needs, improving its performance and relevance.
OpenHathi
Recently in India, Sarvam AI introduced OpenHathi-Hi-v0.1, the first Hindi LLM in the OpenHathi series. Built cost-effectively on top of Llama2-7B, the model exhibits performance akin to GPT-3.5 on Indic languages.
OpenHathi extends Llama2-7B’s 32K-token tokenizer to 48K tokens and undergoes a two-phase training process: first embedding alignment, which trains the randomly initialized Hindi embeddings, and then bilingual language modelling, which teaches the model cross-lingual attention across tokens.
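The vocabulary-extension step can be sketched in NumPy. This is a toy-sized illustration under common assumptions (mean-initialization of new rows is a popular heuristic, not necessarily Sarvam AI’s exact recipe; real sizes would be 32,000 old tokens, 16,000 new ones, hidden dimension 4,096):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes standing in for a 32K vocab extended to 48K with hidden dim 4096.
old_vocab, added, dim = 32, 16, 8
old_emb = rng.normal(size=(old_vocab, dim))  # pretrained embedding table

# New Hindi rows start from the mean of the existing embeddings; the first
# training phase then tunes only these rows while the rest stay frozen.
new_rows = np.tile(old_emb.mean(axis=0), (added, 1))
extended = np.vstack([old_emb, new_rows])

print(extended.shape)  # (48, 8)
```

After this alignment phase, the second phase unfreezes the full model for bilingual language modelling.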
The model demonstrates robust performance across various Hindi tasks, comparable to, if not surpassing, GPT-3.5, while maintaining English proficiency. Sarvam AI’s evaluation includes non-academic, real-world tasks alongside standard Natural Language Generation (NLG) tasks. Evaluations against GPT-3.5 generations, with GPT-4 as the judge, revealed superior performance in Hindi, in both native and Romanised scripts.
Tamil Llama
Kaggle ML engineer and Kaggle Master Abhinand Balachandra recently introduced Tamil Llama, an Indic LLM engineered specifically to elevate the Tamil language domain. The model is built on top of Meta’s Llama 2.
It extends Llama 2’s original 32K-token vocabulary with an additional 16,000 Tamil tokens, aiming for superior text generation and comprehension in Tamil, and uses the LoRA methodology for efficient training. The model comes in four variants: Tamil LLaMA 7B, 13B, 7B Instruct, and 13B Instruct.
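The LoRA idea mentioned above can be shown in a few lines of NumPy. This is a minimal sketch with toy dimensions, not Tamil Llama’s actual training code: instead of updating a large frozen weight matrix, LoRA learns a low-rank correction whose up-projection starts at zero, so training begins exactly at the base model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 8        # rank r << d is what makes LoRA cheap to train

W = rng.normal(size=(d, d))   # frozen pretrained weight
A = rng.normal(size=(r, d))   # trainable down-projection
B = np.zeros((d, r))          # trainable up-projection, zero-initialized

def lora_forward(x):
    # Frozen path plus a scaled low-rank correction: x (W + (alpha/r) B A)^T
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(1, d))
# With B at zero, the adapted model reproduces the base model exactly.
print(np.allclose(lora_forward(x), x @ W.T))  # True
```

Only A and B are updated during fine-tuning, which is why adapting a 7B- or 13B-parameter model to a new language remains tractable on modest hardware.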
Telugu Llama
Ramsri Goutham Golla from Segmind.com is currently working on Telugu Llama. Trained on top of Llama 2, the popular language model, it showcases remarkable token-count efficiency for Telugu text.
Interestingly, according to Golla, Telugu text now consumes fewer tokens than the equivalent English text. This breakthrough not only promises faster, more cost-effective text generation in Telugu but also sets the stage for Llama 2 to excel in Indic languages, revolutionising the landscape of natural language processing.
Odia Generative AI
odia_llama2_7B_v1, created by Odia Generative AI, is based on Llama2-7b and fine-tuned with a 180k Odia instruction set. This set includes translated data from open-source resources and a purposefully crafted domain-knowledge instruction set. The result is a model that effectively understands Odia instructions and generates responses, demonstrating its practical utility for the nuances of the Odia language.
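Instruction sets like this are typically serialized into a fixed prompt template before fine-tuning. The template below is a generic, illustrative one (not necessarily Odia Generative AI’s exact format): each of the ~180k examples pairs an instruction with its target response.

```python
# Generic instruction-tuning record format (illustrative template only).
def format_example(instruction: str, response: str) -> str:
    return (
        "### Instruction:\n" + instruction.strip() + "\n\n"
        "### Response:\n" + response.strip()
    )

record = format_example(
    "Describe the classical dance form of Odisha.",  # hypothetical example
    "Odissi is a classical dance that originated in the temples of Odisha.",
)
print(record.splitlines()[0])  # ### Instruction:
```

During training the model learns to continue everything after the response marker, conditioned on the instruction.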
SeaLLMs – Large Language Models for Southeast Asia
Alibaba Group Holding’s research division, Damo Academy, has introduced LLMs specifically designed for Southeast Asian languages. SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning to better capture the intricacies of regional languages.
The Southeast Asia LLM (SeaLLM) was pre-trained on Vietnamese, Indonesian, Thai, Malay, Khmer, Lao, Tagalog, and Burmese datasets, and has outperformed other open-source models in linguistic and safety tasks. SeaLLMs exhibit remarkable proficiency in language understanding and generation, presenting a formidable challenge to dominant models like GPT-3.5, especially in Southeast Asian (SEA) languages.
VinaLLaMA: LLaMA-based Vietnamese Foundation Model
VinaLLaMA is a foundational LLM designed specifically for the Vietnamese language. Built on top of LLaMA-2, it represents a vital stride towards linguistic inclusivity in AI, adeptly addressing the syntactic and semantic intricacies of Vietnamese.
LLaMAntino: LLaMA 2 Models for Effective Text Generation in Italian Language
LLaMAntino is the Italian iteration of Llama 2, developed by researchers at the University of Bari Aldo Moro, Italy. By fine-tuning pre-trained LLaMA 2 models with 7 and 13 billion parameters on a substantial dataset of Italian text, they created the LLaMAntino-2-7b-hf-ITA and LLaMAntino-2-13b-hf-ITA models.
These models aim to provide Italian NLP researchers with a tool for tasks such as information extraction and closed-domain question answering.
Dutch Llama v2 13B
Dutch Llama v2 13B is a Dutch model based on Llama 2. It is a fine-tuned version of BramVanroy/llama2-13b-ft-mc4_nl_cleaned_tiny, trained on the BramVanroy/dutch_chat_datasets dataset with a context length of 4,096 tokens.
Depending on the input, the model can provide satisfactory outcomes, considering its 13B size and limited Dutch pretraining. However, it was not trained with human feedback and has no safeguards, so it may produce unexpected or offensive content depending on the query.