Did she say “guma hua saamaan,” “Gumi’ā samāna” or “Iḻanta cāmāṉkal?”
In the U.S., “lost luggage” is a relatively easy phrase for an airline’s AI chatbot to understand before directing a customer’s call to an agent. This is not the case across India, with its 22 official languages — including Hindi, Punjabi and Kannada — and more than 19,500 dialects. A growing population of users consumes Indian-language content (print and digital, for government and businesses), and the rise of the large language model (LLM) is accelerating research and development of tools for Indian languages. LLMs are the first step toward building local language technologies on a common framework across the country’s many languages.
On track to surpass China as the world’s most populous country, India — with its diversity, massive stores of data, and need for more efficient, seamless communication — is a ripe test bed for deploying LLMs and natural language processing (NLP) at scale. The nation’s unique linguistic diversity has even prompted Indian researchers to coin a specific term for the challenge: Indic language understanding, or ILU.
The Indian government, the higher education and research (HER) segment, and the private sector are turning to these LLMs to improve customer service, refine operations, deliver greater business intelligence and create new business models – all while enterprises work to overcome labor shortages and optimize complex supply chains.
Indeed, natural language understanding (NLU), or having machines understand written words, works by mapping every word to a meaningful vector, and it brings huge benefits to businesses. Large or small, every organization has an abundance of text-based data – including legal documents, patient records, financial documents, code, websites and emails. LLMs make it possible to use that unstructured information to develop new services that help businesses stay competitive.
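The idea of mapping words to meaningful vectors can be made concrete with a toy sketch. The tiny 4-dimensional embedding table below is invented for illustration; real models learn vectors with hundreds or thousands of dimensions from data. The point is that words with related meanings end up pointing in similar directions:

```python
import math

# Hypothetical embedding table for illustration only; real LLMs learn
# these vectors during training rather than hand-assigning them.
EMBEDDINGS = {
    "lost":    [0.9, 0.1, 0.0, 0.2],
    "missing": [0.8, 0.2, 0.1, 0.3],
    "luggage": [0.1, 0.9, 0.7, 0.0],
    "invoice": [0.0, 0.1, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity: near 1.0 for similar directions, near 0.0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Related words score high; unrelated words score low.
print(cosine(EMBEDDINGS["lost"], EMBEDDINGS["missing"]))  # high similarity
print(cosine(EMBEDDINGS["lost"], EMBEDDINGS["invoice"]))  # low similarity
```

A chatbot comparing an incoming phrase against known intents in this vector space is what lets “missing bag” route the same way as “lost luggage.”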
LLMs and the India Opportunity
The Indian government, HER segment and private sector have huge datasets stored in various languages and formats (e.g., text, audio, video and more). These datasets need to be cleaned and labeled, and can become the foundation to train LLMs. Fundamentally, ILU research seeks to make it easier for India’s billion-plus citizens to communicate in different languages and use the power of AI for multimodal NLU/P.
To build technologies on a common language framework, India’s private and government sectors must first clean and label this data so it can serve as the foundation for training LLMs.
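What “cleaning” means in practice can be sketched minimally. The pass below — an illustrative baseline, not a production pipeline — normalizes Unicode (important for Indic scripts, where the same visible text can have multiple byte encodings), strips markup remnants, collapses whitespace and deduplicates records; real LLM data pipelines add language identification, PII removal and quality filtering on top:

```python
import re
import unicodedata

def clean_record(text: str) -> str:
    """Minimal cleaning pass: canonical Unicode form, no stray markup,
    collapsed whitespace."""
    text = unicodedata.normalize("NFC", text)  # one canonical form per character
    text = re.sub(r"<[^>]+>", " ", text)       # drop leftover HTML tags
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()

def dedupe(records):
    """Exact deduplication after cleaning, preserving first occurrence."""
    seen, out = set(), []
    for r in records:
        key = clean_record(r)
        if key and key not in seen:
            seen.add(key)
            out.append(key)
    return out

raw = ["<p>गुमा हुआ  सामान</p>", "गुमा हुआ सामान", "  lost   luggage  "]
print(dedupe(raw))
```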
India has been an important market in which tech giants have made breakthrough progress over the past decade. However, their success requires high-performance computing, rich data spanning diverse vocabularies and speakers, and state-of-the-art transformer models.
Use cases are myriad. They include using a smartphone camera to scan signs and generate text or voice output from one Indian language to another; multilingual voice translation and response by service bots and concierges; translation on a mobile device of voice in one language to screen text in another; as well as making government-to-citizen services multilingual and therefore more efficient.
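For several of these use cases, a cheap first routing step is simply detecting which script the user is writing in before handing the text to a language-specific model. The sketch below guesses the script from Unicode code-point blocks; the block ranges are standard Unicode assignments, but this heuristic is a simplification (a real system would use a trained language-ID model, since one script can serve several languages):

```python
def detect_script(text: str) -> str:
    """Guess the writing system from Unicode code-point ranges.
    A small subset of India's scripts, for illustration."""
    ranges = {
        "Devanagari": (0x0900, 0x097F),  # Hindi, Marathi, ...
        "Gurmukhi":   (0x0A00, 0x0A7F),  # Punjabi
        "Tamil":      (0x0B80, 0x0BFF),
        "Kannada":    (0x0C80, 0x0CFF),
    }
    counts = {name: 0 for name in ranges}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in ranges.items():
            if lo <= cp <= hi:
                counts[name] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] else "Latin/other"

print(detect_script("गुमा हुआ सामान"))  # Devanagari
print(detect_script("lost luggage"))    # Latin/other
```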
Bolting heavy-duty computational power onto these large datasets and LLMs — be it in a data center, in the cloud, at the edge or in a hybrid format — yields dramatic gains: it shortens data preparation time (the extract-transform-load, or ETL, pipeline), accelerates training, and enables faster deployment, inference and continuous improvements in model accuracy.
To summarize, a combination of existing big data, ever-larger and more accurate LLMs, scalable accelerated computing, and Indian research efforts can use AI to solve one of the country’s greatest challenges: achieving seamless, fast, and accurate multilingual and multimodal communication.
Massive Models Put Troves of Business Documents to Work
The size of LLMs has been increasing 10x annually for the last few years, with state-of-the-art models now containing hundreds of billions to trillions of parameters. As these models grow in complexity and size, their capabilities improve as well.
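To give a feel for what “hundreds of billions of parameters” means, here is a rough back-of-envelope estimate for a decoder-only transformer. The formula (about 12·d² weights per layer for attention plus feed-forward, plus the token embedding table) ignores biases and layer norms, and the layer count and hidden size below approximate published GPT-3 figures:

```python
def transformer_params(n_layers: int, d_model: int, vocab: int) -> int:
    """Back-of-envelope parameter count for a decoder-only transformer:
    ~4*d^2 for attention projections + ~8*d^2 for the MLP per layer,
    plus the embedding table. Biases and layer norms are ignored."""
    per_layer = 12 * d_model * d_model
    return n_layers * per_layer + vocab * d_model

# Roughly GPT-3-scale settings: 96 layers, hidden size 12288, ~50k vocab.
est = transformer_params(96, 12288, 50257)
print(f"{est / 1e9:.0f}B parameters")
```

The estimate lands near GPT-3’s widely cited 175 billion parameters, which shows why models at this scale demand accelerated, distributed training infrastructure.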
Fast, efficient and accurate ILU technology is not far from reality, given the recent evolution of language models – including OpenAI’s GPT-3; Switch Transformer, GLaM and PaLM from Google; Turing NLG from Microsoft; Gopher from DeepMind; and Jurassic-1 from AI21 Labs. Models like OpenAI’s GPT-3 and the new open-source BLOOM can perform specialized tasks such as answering questions, writing essays, summarizing text, translating languages and generating computer code. This enables enterprises to build customized NLP applications tailored to understand an enterprise’s unique vocabulary, customer relationships and the datasets on which their business runs.
Transformer-based LLMs are reshaping today’s AI landscape, and NVIDIA has upped the game with the NVIDIA NeMo Megatron framework for training language models with trillions of parameters. NeMo Megatron offers end-to-end capabilities from data curation to training to inference and evaluation. It opens doors for enterprises across the world to understand, develop and deploy their own LLMs to help them build domain-specific chatbots, personal assistants and other AI applications. With 530 billion parameters, the Megatron-Turing NLG model — one of the world’s most powerful transformer language models — was built with the NeMo Megatron framework, which was also used in the development of BLOOM, the world’s largest open-source multilingual LLM.
Advancing Healthcare with Language Understanding
LLMs can be useful across the healthcare industry, as well.
Some 80% of the information in electronic health records consists of unstructured clinical notes. NLP models can be used to easily surface relevant clinical data for downstream diagnostic and predictive tasks. These models also improve the patient experience with smarter chatbots and can reduce physician burnout through improved transcription and summarization tools, faster documentation, and better decision-support systems.
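Before reaching for a full NLP model, teams often prototype the “surface relevant clinical data” step with simple rules. The sketch below uses an invented note and naive regex patterns purely as a baseline illustration; it is not a substitute for a trained clinical NLP model, which handles the variability of real notes far better:

```python
import re

# Hypothetical clinical note, invented for illustration.
note = "Pt reports chest pain. BP 142/90 mmHg. Started metoprolol 25 mg BID."

def extract_vitals(text):
    """Pull a blood-pressure reading like '142/90' into structured fields."""
    m = re.search(r"BP\s+(\d{2,3})/(\d{2,3})", text)
    return {"systolic": int(m.group(1)), "diastolic": int(m.group(2))} if m else None

def extract_doses(text):
    """Find 'drug dose mg' patterns, e.g. 'metoprolol 25 mg'."""
    return re.findall(r"([a-z]+)\s+(\d+)\s*mg\b", text, flags=re.IGNORECASE)

print(extract_vitals(note))  # structured BP reading
print(extract_doses(note))   # drug/dose pairs
```

Rule-based extractors like this break quickly on abbreviations and free-form phrasing, which is exactly the gap LLM-based clinical NLP is meant to close.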
A Launch Pad for Learning More on Large Language Models
As LLMs are applied to new use cases and domains from communications to healthcare, their complexity and size have increased exponentially. Developing LLMs requires computing on a scale that goes beyond the mainstream, as well as software and frameworks built to connect and synthesize across all of the model’s layers. Deploying an LLM into production requires expertise, and customized labs are available to help enterprises take a step forward in putting language to work for their business.
This article is written by a member of the AIM Leaders Council. AIM Leaders Council is an invitation-only forum of senior executives in the Data Science and Analytics industry. To check if you are eligible for a membership, please fill out the form here.