Meet తెలుగు Llama 

Telugu LLM Labs recently released two Telugu datasets – a romanised Telugu pretraining dataset and a supervised fine-tuning (SFT) dataset in Telugu (native + romanised).


Illustration by Nikhil Kumar

Last year, we curated a list of vernacular Llama-based models, and among them was Telugu Llama. Back then, the model was still a work in progress. However, it was recently made available on Hugging Face by its creators, Ravi Theja and Ramsri Goutham Golla.

“The PR was slightly ahead of its time, so we had to catch up,” said Golla jokingly in an exclusive interview with AIM, hinting that our story served as a catalyst, inspiring him to expedite the development of Telugu Llama.

Telugu Llama is a passion project for both Golla and Theja. Just last week, they introduced Telugu-LLM-Labs, an independent collaborative effort under which they released datasets translated and romanised into Telugu.

Next, they intend to release the TinyLlama-1.1B-Telugu-Romanization-Base and TinyLlama-1.1B-Telugu-Romanization-Instruct models. 

Hyderabad-based Golla studied and worked in the US for almost eight years before returning to India in 2018. He describes himself as a builder/engineer and loves creating SaaS apps. Golla has developed two AI SaaS apps with a combined ARR of $100K. Additionally, he teaches AI courses on Udemy and on his own platform.

On the other hand, Theja works as a developer advocate engineer at LlamaIndex. Before this role, he served as a senior ML engineer at Glance, where he worked on recommendation systems and GenAI applications.

Inspiration Behind Telugu Llama 

“The end goal that Ravi and I had was to create Quora-level questions and answers,” said Golla, adding that Quora has regional pages like Hi.quora and Telugu.quora, where users engage with questions and answers in regional languages.

Moreover, he said that open-source models have caught up with the early versions of OpenAI’s models, such as GPT-3.5. “So now, building something for regional languages makes sense because the quality of output matches what people expect,” he added.

He also underscored the need for a culturally rooted LLM. “The festivals that we celebrate, cultural norms adopted in marriage, and even religious sentiments are different. So, we need regionally rooted LLMs to provide context-specific queries and answers,” he said.

Data Collection

Telugu LLM Labs recently released two Telugu datasets – a romanised Telugu pretraining dataset and a supervised fine-tuning (SFT) dataset in Telugu (native + romanised). The reason behind creating the romanised Telugu dataset is that much of the online conversation, such as WhatsApp messages or YouTube comments, happens in romanised Telugu. “Instead of typing ‘ఎలా ఉన్నారు?’ (How are you?), people type ‘ela unnaru?’ using a romanized script for most online interactions,” said Golla.
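For illustration, here is a minimal sketch of what such romanisation can look like in code, assuming the open-source indic_transliteration Python package as a stand-in; the article does not say which tooling or romanisation scheme the team actually used.

```python
# A minimal sketch of romanising native Telugu text, assuming the
# open-source indic_transliteration package as a stand-in; the article
# does not specify the team's actual tooling or romanisation scheme.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

telugu_text = "ఎలా ఉన్నారు?"  # "How are you?"

# Convert Telugu script to a Roman scheme (ITRANS here, chosen only for
# illustration).
romanised = transliterate(telugu_text, sanscript.TELUGU, sanscript.ITRANS)
print(romanised)  # prints something like "elA unnAru?"
```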

“We created these two additional datasets on top of English datasets, but with only one catch. We further filtered them with NLP classification systems to remove the rows that are ‘English language specific’ or ‘coding related’, so that the resultant dataset is cleaner and more comprehensive,” he added. 
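The classification systems themselves are not detailed in the article, but a rough sketch of this kind of row filtering might look like the following, with simple keyword heuristics standing in for the team's actual NLP classifiers.

```python
# A rough sketch of the row filtering described above; the keyword
# heuristics are placeholders for the team's actual NLP classification
# systems, which the article does not detail.
import re

CODE_MARKERS = ("def ", "class ", "import ", "</", "#include", "SELECT ")

def looks_code_related(text: str) -> bool:
    """Heuristic: does the row appear to contain source code?"""
    return any(marker in text for marker in CODE_MARKERS)

def looks_english_specific(text: str) -> bool:
    """Heuristic: is the row about English-language specifics?"""
    return bool(re.search(r"\b(grammar|spelling|synonym|rhyme|pronounce)\b", text, re.I))

def keep_row(instruction: str, response: str) -> bool:
    """Keep only rows that should translate cleanly into Telugu."""
    joined = f"{instruction}\n{response}"
    return not (looks_code_related(joined) or looks_english_specific(joined))
```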

Further, they took CulturaX and romanised the first 108k rows of the culturaX_telugu dataset. “This dataset is ideal if you want to do additional pre-training for CLM (causal language model/next-word prediction) for a tiny LLM like TinyLlama 1.1B,” said Theja.
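As a sketch of that step, the snippet below streams the Telugu portion of CulturaX from the Hugging Face Hub and takes the first 108k rows. The "uonlp/CulturaX" repository ID and "te" config are assumptions based on the public CulturaX release, and romanise() is a hypothetical placeholder for whichever transliterator is plugged in.

```python
# A sketch of the CulturaX step, assuming the public Hub release
# ("uonlp/CulturaX", Telugu config "te"); romanise() is a hypothetical
# placeholder for an actual Telugu-to-Roman transliterator.
from itertools import islice

from datasets import load_dataset

def romanise(text: str) -> str:
    # Placeholder: swap in a real transliterator here.
    return text

# Stream the Telugu split and keep the first 108k rows.
stream = load_dataset("uonlp/CulturaX", "te", split="train", streaming=True)
rows = [{"text": romanise(row["text"])} for row in islice(stream, 108_000)]
```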

Additionally, Golla and Theja are building custom scrapers for the most popular news and TV channel websites, from which they collect relevant articles. “When the time and quality is right, we will release that. It will be one of the biggest contributions from Telugu LLM Labs,” said Golla.
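For context, a bare-bones article scraper of the kind described might look like this; the target sites and extraction logic used by Telugu LLM Labs are not disclosed, so the URL handling and tag choices below are generic placeholders.

```python
# A bare-bones illustration of the kind of article scraper described
# above. The tag choices are generic placeholders; the actual sites and
# extraction logic are not disclosed in the article.
import requests
from bs4 import BeautifulSoup

def fetch_article(url: str) -> dict:
    """Download a page and pull out its headline and body text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    headline = soup.find("h1")
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return {
        "url": url,
        "title": headline.get_text(strip=True) if headline else "",
        "text": "\n".join(paragraphs),
    }
```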

On the compute front, Telugu Llama received support from Jarvislabs.ai and several other GPU providers, though the duo primarily relied on their own computing resources.

Golla highlighted that when they launched the initiative, they were prepared to work with limited computing resources so that progress wouldn’t be hindered. Theja and Golla now plan to experiment with 3-billion-parameter models that will generate text in Telugu and English.


Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.