Last updated January 27, 2023
In AI Origins & Evolution

The Hidden Cost of ChatGPT for Indian Languages

When we calculate the price of producing this article using the Ada model, the Hindi translation would require nearly 7X of tokens as well as 7X of the pricing in comparison to the same article produced in English. For a language like Kannada, the pricing is 11X more than what it is in English.

Published on January 27, 2023

by Lokesh Choudhary

Listen to this story

While everyone is thrilled about ChatGPT and the many wonders it seems to be doing, non-English speakers are curious to know whether it can be developed in their own languages. Indian users, for example, are trying hard to check if Indic languages can be implemented in ChatGPT.

Recently, Google launched project Vaani in collaboration with AI & Robotics Technology Park (ARTPARK) set up by the Indian Institute of Science (IISc). The project intended to gather extensive datasets of spoken languages and transcribed texts from every district in India.

Bhashini was another attempt by the Indian government to make AI and NLP (natural language processing) resources available to the larger public, including startups and developers, with the thought that it might provide an edge to the development of inclusive internet, which gives Indians easy access to the internet in their native languages.

Ramsri Goutham Golla, Founder, Questgen.ai, said that none of these projects, in India and beyond, highlights how expensive (and inefficient) it’d be to build a project like ChatGPT in non-English languages.

Tokenisation effect

In natural language processing models, the programme often splits paragraphs and sentences into smaller units so that it’s easier to assign meaning. This process is known as tokenisation. The first step of the NLP process is gathering the data (a sentence) and breaking it into understandable parts (words).

However, this changes with each unique language put into the system. For instance, when the model processes data in English, the tokens are typically low in count, thus requiring less computing power to generate response. In comparison, Hindi—which is a morphologically rich language—requires more tokens to represent the same amount of information. Additionally, since Hindi is written in Devanagari script, the number of tokens required for text representation goes up as each character is considered as a token.

As shown in the image below, ‘Lokesh Choudhary’ in English requires 7 tokens but the same input in Hindi requires 20 tokens—thus eliciting a significant disparity.

It is even more intriguing in the context of other vernacular languages in India, such as Kannada or Telugu. The tokens required for the same input go up to 35 (for Kannada) and 34 (for Telugu), as shown below.

These tokens inevitably affect the performance of the NLP model by slowing it down. However, to understand how slow exactly, AIM recorded a short video while chatting with ChatGPT in various languages. Initially when prompts were in Hinglish (i.e., in Hindi language but in English script), the model was able to understand the prompts and reply as fast as it does in the case of English prompts.

However, once the prompts were input in Devanagari script, the model began struggling, answering with a delay of nearly 30 seconds. Similar was the case with other languages, such as Bengali and Kannada. In contrast, similar queries in English generated responses almost immediately and in detail.

Calculating the price

Now that it’s sufficiently established that the tokenisation problem slows down computation in the case of non-English languages, one must wonder how expensive it can get when building an NLP model in Indic languages.

According to the data shared by OpenAI, the cost of training the ‘Ada’ model, which is the fastest among all of the models provided by OpenAI, the cost of training the dataset is at around $0.0004 per 1000 tokens. While the usage cost revolves around $0.1200 per 1000 tokens.

In contrast, the most powerful model among all—Davinci—takes around $0.03 per 1000 tokens to train while $0.12 per 1000 tokens for usage.

For example, when we calculate the price of producing this article using the Ada model, the Hindi translation would require nearly 7X of tokens as well as 7X of the pricing in comparison to the same article produced in English. For a language like Kannada, the pricing is 11X more than what it is in English.

So, while it took $1.2 to generate this article in English using the ‘Ada’ model, it’ll take around $8 to generate the same in Hindi and around $14.5 to generate in Kannada.

If we assume that the cost of training the GPT-3 is around $4.6 Million, using a Tesla V100 cloud, the cost of training the same model in Hindi language can be around $32 Million while the same in Kannada language would cost around $55 million dollars.

Hence, while the problem of collecting data of Indic languages might get solved by the efforts of ARTPARK, Google and Indian government with projects like Bhashini and Vaani, a new problem of developing a model which is 8X or 10X faster than GPT-3 and will cost exponentially more than it what it cost to build GPT-3 still lingers.

PS: The story was written using a keyboard.

Access all our open Survey & Awards Nomination forms in one place

Lokesh Choudhary

Tech-savvy storyteller with a knack for uncovering AI's hidden gems and dodging its potential pitfalls. 'Navigating the world of tech', one story at a time. You can reach me at: lokesh.choudhary@analyticsindiamag.com.

CORPORATE TRAINING PROGRAMS ON GENERATIVE AI

Generative AI Skilling for Enterprises

Our customized corporate training program on Generative AI provides a unique opportunity to empower, retain, and advance your talent.

3 Ways to Join our Community

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox

Recent Stories

India is Making its Own AI Servers

Pritam Bordoloi

PLI scheme marks the beginning of India ‘s manufacturing venture

GPT-5 Likely to be Released After the US Elections

Donna Eva

Generative AI Jobs in India can Fetch You up to Rs 1 Crore

Siddharth Jindal

Top Editorial Picks

Elon Musk Set to Meet Indian Spacetech Startups During Upcoming Visit

Shyam Nandan Upadhyay

Happiest Minds Technologies Acquires Macmillan Learning India, Expands Edutech Reach

Shritama Saha

Meta Releases Llama 3, Beats Claude 3 Sonnet and Gemini Pro 1.5

Mohit Pandey

Nothing Becomes the First Smartphone Company to Integrate OpenAI’s ChatGPT

Siddharth Jindal

Data Engineering Summit 2024

May 30 and 31, 2024 | 📍 Bangalore, India

Featured

Enhancing AI Integration through Optimal Data Management in the Global Convenience Food and Beverage Sector

Through the implementation of advanced data management methodologies, resilient data observability solutions, and cutting-edge AI frameworks, Course5 is spearheading the