Language models in Indian languages are being developed rapidly. Although these languages are spoken by millions of people, their digital presence is surprisingly sparse.
Tamil Llama is a pioneering language model specifically designed to understand and interact in Tamil, demonstrating significant advances in computational linguistics, particularly for Indian languages.
At MLDS 2024, Abhinand Balachandran noted that one of Tamil Llama’s remarkable capabilities is its proficiency in handling complex coding tasks. He said, “It can efficiently write Python scripts for operations like uploading files to an S3 bucket, and even explaining what Boto3 is, which is pretty cool, right?”
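The task he describes is a standard use of Boto3, AWS’s Python SDK. As a rough illustration of the kind of script he means (this is not model output, and the bucket and file names below are hypothetical placeholders):

```python
import boto3
from botocore.exceptions import ClientError

def upload_to_s3(file_path: str, bucket: str, object_key: str) -> bool:
    """Upload a local file to an S3 bucket using Boto3."""
    # Credentials are picked up from the environment or ~/.aws configuration.
    s3 = boto3.client("s3")
    try:
        s3.upload_file(file_path, bucket, object_key)
        return True
    except ClientError as err:
        print(f"Upload failed: {err}")
        return False

# Hypothetical usage: bucket and file names are placeholders.
upload_to_s3("report.csv", "my-example-bucket", "data/report.csv")
```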
Tamil Llama’s emergence spurred a series of language models in Indian languages. Abhinand stressed the importance of their open-source status in promoting innovation and accessibility in AI. “The linguistic inclusivity and diversity of the language models that represent the Indian languages in spite of the challenges of finding enough datasets and hardware infrastructure are truly special,” he said.
Discussing the challenges, he said the performance of language models in non-English languages has historically been limited by a lack of representation in digital data. “One of the central challenges in its development has been adapting existing models to suit the linguistic intricacies of Indian languages,” he explained.
Expanding the vocabulary and optimising the tokenizer for Tamil is a critical step in enhancing Tamil Llama’s effectiveness. However, directly fine-tuning existing models on low-resource languages like Tamil comes with its own challenges, particularly in sentence formation and contextual understanding. He said, “I’ve opted not to significantly expand the vocabulary, as this approach minimises disruption to the model’s existing knowledge base, but this slows down the training process and the speed of text generation significantly.”
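This is not his team’s exact pipeline, but a minimal sketch of what such vocabulary expansion involves, assuming a SentencePiece-based approach on top of a Hugging Face Llama checkpoint (corpus path and model name are placeholders), helps show why it is a trade-off:

```python
import sentencepiece as spm
from transformers import LlamaTokenizer, LlamaForCausalLM

# Step 1: train a small SentencePiece model on a Tamil corpus
# (hypothetical file; real training would use a much larger corpus).
spm.SentencePieceTrainer.train(
    input="tamil_corpus.txt", model_prefix="tamil_sp", vocab_size=16000
)

# Step 2: load the base tokenizer and add the Tamil pieces it lacks.
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
sp = spm.SentencePieceProcessor(model_file="tamil_sp.model")
new_tokens = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
tokenizer.add_tokens([t for t in new_tokens if t not in tokenizer.get_vocab()])

# Step 3: grow the embedding matrix so the new token IDs have rows to train.
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.resize_token_embeddings(len(tokenizer))
# The new embedding rows start untrained, which is why aggressive expansion
# disturbs the model's existing knowledge until further pretraining adjusts
# them; skipping expansion avoids that, but Tamil text then splits into many
# more tokens, slowing both training and generation.
```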
Despite these challenges, he believes expanding the vocabulary will become beneficial as the scale increases. He further explained, “Currently, we’re training the model with only a few gigabytes of data, but as we increase this amount, the model will gain a broader understanding of the world, which is essential for direct fine-tuning.”
Interestingly, the performance of Tamil Llama in English has not been compromised by these adaptations. In some cases, it even shows enhanced performance, indicating the model’s robustness and adaptability.
Abhinand explained that another significant gap in the field is the lack of benchmarks for evaluating models in Indian languages. This absence makes it difficult to objectively measure the progress and effectiveness of such models.
Tamil Llama’s development process has also highlighted limitations stemming from translation loss and the models’ inability to respond in colloquial forms. “But this is only the beginning. As we collect more datasets and increase the scale, these models have immense potential for millions of people,” he concluded.