Last year, we curated a list of vernacular Llama-based models; among them was Telugu Llama. Back then, the model was still a work in progress. However, it was recently made available on Hugging Face by its creators, Ravi Theja and Ramsri Goutham Golla.
“The PR was slightly ahead of its time, so we had to catch up,” said Golla jokingly in an exclusive interview with AIM, hinting that our story served as a catalyst, inspiring him to expedite the development of Telugu Llama.
Telugu Llama is a passion project for both Golla and Theja. Just last week, they introduced Telugu-LLM-Labs, an independent collaborative effort under which they released translated and romanised Telugu datasets.
Next, they intend to release the TinyLlama-1.1B-Telugu-Romanization-Base and TinyLlama-1.1B-Telugu-Romanization-Instruct models.
Hyderabad-based Golla studied and worked in the US for almost eight years before returning to India in 2018. He describes himself as a builder/engineer and loves creating SaaS apps. Golla has successfully developed two AI SaaS apps, with a combined ARR of $100K. Additionally, he teaches AI courses on Udemy and his own platform.
On the other hand, Theja works as a developer advocate engineer at LlamaIndex. Before this role, he served as a senior ML engineer at Glance, where he worked on recommendation systems and GenAI applications.
Inspiration Behind Telugu Llama
“The end goal that Ravi and I had was to create Quora-level questions and answers,” said Golla, adding that Quora has regional pages like Hi.quora and Telugu.quora, where users engage with regional questions and answers.
Moreover, he said that open source models have caught up to the level of the initial versions of OpenAI’s models, such as GPT-3.5. “So now, building something for regional languages makes sense because the quality of output matches what people expect,” he added.
Also, he underscored the need for a culturally rooted LLM. “The festivals that we celebrate, cultural norms adopted in marriage, and even religious sentiments are different. So, we need regionally rooted LLMs to provide context-specific answers to queries,” he said.
Data Collection
Telugu LLM Labs recently released two Telugu datasets – a romanised Telugu pretraining dataset and an SFT (supervised fine-tuning) dataset in Telugu (native + romanised). The reason behind creating the romanised Telugu dataset is that much of the online conversation, such as WhatsApp chats or YouTube comments, happens in romanised Telugu. “Instead of typing ‘ఎలా ఉన్నారు?’ (How are you?), people type ‘ela unnaru?’ using a romanised script for most online interactions,” said Golla.
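To illustrate what romanisation involves, here is a deliberately naive sketch that maps individual Telugu characters to rough Roman equivalents. The character table and the `romanise` helper are purely illustrative assumptions, not the labs' actual pipeline: a real transliterator must also handle the virama and inherent-vowel rules of the script.

```python
# Toy Telugu-to-Roman transliteration sketch (illustrative only).
# A hand-built per-character map; it ignores virama and inherent-vowel
# rules, so output is only approximate (e.g. "elaa" rather than "ela").
TELUGU_TO_ROMAN = {
    "ఎ": "e",   # TELUGU LETTER E
    "ల": "la",  # TELUGU LETTER LA (inherent 'a' baked in)
    "ా": "a",   # TELUGU VOWEL SIGN AA
    "ఉ": "u",   # TELUGU LETTER U
    "న": "na",  # TELUGU LETTER NA
    "ర": "ra",  # TELUGU LETTER RA
    "ు": "u",   # TELUGU VOWEL SIGN U
    "్": "",    # virama (naively dropped here)
}

def romanise(text: str) -> str:
    """Map each Telugu character to a rough Roman equivalent."""
    return "".join(TELUGU_TO_ROMAN.get(ch, ch) for ch in text)
```

In practice one would reach for a dedicated transliteration library rather than a hand-rolled map, but the sketch shows why a romanised corpus is a distinct artefact from the native-script one.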
“We created these two additional datasets on top of English datasets, but with only one catch. We further filtered them with NLP classification systems to remove the rows that are ‘English language specific’ or ‘coding related’, so that the resultant dataset is cleaner and more comprehensive,” he added.
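The filtering step Golla describes could look something like the sketch below, which drops rows whose instructions are English-language-specific (grammar or spelling tasks that don't survive translation) or coding-related. The keyword patterns and the `keep_row`/`filter_rows` helpers are assumptions for illustration; the labs used NLP classification systems, not this simple heuristic.

```python
import re

# Illustrative row filter: remove instructions that are tied to the
# English language itself or to programming, since neither translates
# meaningfully into a Telugu instruction dataset.
ENGLISH_SPECIFIC = re.compile(r"\b(grammar|spelling|rhyme|synonym|antonym)\b", re.I)
CODE_RELATED = re.compile(r"\b(python|javascript|function|sql|code)\b|```", re.I)

def keep_row(instruction: str) -> bool:
    """Return True if the row survives filtering (safe to translate)."""
    return not (ENGLISH_SPECIFIC.search(instruction)
                or CODE_RELATED.search(instruction))

def filter_rows(rows):
    """Keep only translatable rows from an English instruction dataset."""
    return [r for r in rows if keep_row(r)]
```

A learned classifier would catch cases these keyword patterns miss, but the goal is the same: a cleaner source corpus before translation.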
Further, they took CulturaX and romanised the first 108k rows of the culturaX_telugu dataset. “This dataset is ideal if you want to do additional pre-training for CLM (causal language modelling/next-word prediction) on a tiny LLM like TinyLlama 1.1B,” said Theja.
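The CLM objective Theja mentions is simple to state: at every position, the model must predict the next token given everything before it. A minimal sketch, using whitespace tokenisation as a stand-in for the model's real subword tokenizer:

```python
# Minimal illustration of the causal-LM (next-word prediction) objective:
# each training pair is (context tokens so far, the token to predict).
# Whitespace splitting stands in for a real subword tokenizer.
def clm_pairs(text: str):
    tokens = text.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = clm_pairs("ela unnaru nenu bagunnanu")
# first pair: (["ela"], "unnaru") — predict "unnaru" from "ela"
```

Continued pre-training on the romanised corpus simply runs this objective over romanised Telugu text, teaching the base model the statistics of the new script.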
Additionally, Golla and Theja are building custom scrapers for the most popular news and TV channel websites, from which they collect relevant articles. “When the time and quality are right, we will release that. It will be one of the biggest contributions from Telugu LLM Labs,” said Golla.
From the computing perspective, Telugu Llama received support from Jarvislabs.ai and several other GPU providers, though it primarily relied on its own computing resources.
Golla highlighted that when they launched the initiative, they were ready to work with limited computing resources, ensuring that progress wouldn’t be hindered. Theja and Golla now plan to experiment with 3-billion-parameter models that will generate text in both Telugu and English.