How NLP Can Tackle The Challenge Of Multiple Languages

Published on May 6, 2020
by Vishal Chawla

Natural language processing (NLP) is disrupting various industries, making it easier for humans to communicate with computers. But given that there are more than 6,900 languages in the world, it can be incredibly difficult to build NLP models for all of them.

In India itself, there are different dialects of Hindi, which makes it a challenge for NLP professionals to build models that fit different languages and dialects. Depending on the availability of labelled data, different techniques may have to be applied to build multilingual business AI. However, it is hard for AI systems to adapt to so many languages.

But what if the data is in multiple languages, as is the case for enterprises that operate in different nations? For such an NLP project, you can build models from scratch using labelled datasets in each specific language. But that is not very efficient, especially for a country like India, where so many different languages are spoken across various geographies.

What Makes Multilingual NLP Challenging

While there are pre-trained word embeddings in different languages, they may all sit in different vector spaces. This means that words with similar meanings can have very different vector representations, because each set of embeddings is trained separately and reflects the natural characteristics of its language.
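To make the vector-space problem concrete, here is a minimal Python sketch. The 2-D vectors and the Hindi transliterations (kutta, billi, gaadi) are invented for illustration and do not come from any real embedding model. The assumption is that a separately trained Hindi space has the same internal geometry as the English one but a different orientation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))

def rotate(v, degrees):
    """Rotate a 2-D vector counter-clockwise by the given angle."""
    t = math.radians(degrees)
    x, y = v
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))

# Hypothetical English embeddings (toy values, not from a real model).
english = {"dog": (1.0, 0.2), "cat": (0.9, 0.4), "car": (-0.3, 1.0)}

# Assumption: the Hindi embeddings were trained independently, so the
# same geometry ends up in a different orientation (here, 90 degrees).
translations = {"dog": "kutta", "cat": "billi", "car": "gaadi"}
hindi = {translations[w]: rotate(v, 90) for w, v in english.items()}

# Within each language, related words stay close to each other.
within_en = cosine(english["dog"], english["cat"])
within_hi = cosine(hindi["kutta"], hindi["billi"])

# Across the two spaces, a word and its translation look unrelated.
across = cosine(english["dog"], hindi["kutta"])

print(f"dog vs cat (English space):   {within_en:.3f}")
print(f"kutta vs billi (Hindi space): {within_hi:.3f}")
print(f"dog vs kutta (across spaces): {across:.3f}")
```

In this toy setup each language is internally consistent, yet comparing raw vectors across the two spaces says nothing about meaning. That is why embeddings need a shared space before they can be compared.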

This makes multi-language NLP applications challenging. A typical NLP system takes in a lot of labelled data, processes the information, learns patterns and produces prediction models. When we need to build NLP on text containing different languages, we may look at multilingual word embeddings for NLP models that can scale effectively.

A major issue for NLP systems across the world is the number of languages that exist apart from English, and the dearth of data that can be used to train independent NLP models for them. The good news is that many languages, if not all, share similar structures, which can support transfer learning.

Universal Models Can Come To The Rescue

Multilingual models for new languages can be created using transfer learning and cross-lingual embeddings. Expanding NLP models to new languages typically involves annotating completely new datasets for each language, which is time- and resource-intensive.

To avoid these tedious and costly tasks, you can deploy cross-lingual embeddings to enable knowledge transfer from languages with sufficient training data to low-resource languages. Cross-lingual embeddings aim to represent words in multiple languages in a shared vector space by capturing semantic similarities across languages.
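One common way to obtain such a shared space is to learn a rotation that maps one language's embeddings onto another's from a small seed dictionary. The sketch below is a simplified 2-D illustration of that idea, not the method of any particular library. The vectors, the transliterated Hindi words, the `hidden_angle` and the helper names (`fit_rotation`, `rotate`) are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))

def rotate(v, theta):
    """Rotate a 2-D vector counter-clockwise by theta radians."""
    x, y = v
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

def fit_rotation(pairs):
    """Angle that best rotates source vectors onto target vectors.

    This is the closed-form least-squares rotation for the 2-D case.
    """
    dot = sum(s[0] * t[0] + s[1] * t[1] for s, t in pairs)
    cross = sum(s[0] * t[1] - s[1] * t[0] for s, t in pairs)
    return math.atan2(cross, dot)

# Hypothetical English embeddings (toy values).
english = {"dog": (1.0, 0.2), "cat": (0.9, 0.4),
           "car": (-0.3, 1.0), "bus": (-0.5, 0.9)}

# Assumption: the Hindi space is the English geometry rotated by an
# angle the learner does not know in advance.
hidden_angle = math.radians(70)
hindi = {"kutta": rotate(english["dog"], hidden_angle),
         "billi": rotate(english["cat"], hidden_angle),
         "gaadi": rotate(english["car"], hidden_angle),
         "bas": rotate(english["bus"], hidden_angle)}

# A small seed dictionary of known translations; "bas" is held out.
seed_dictionary = [("kutta", "dog"), ("billi", "cat"), ("gaadi", "car")]
theta = fit_rotation([(hindi[h], english[e]) for h, e in seed_dictionary])

# Map the held-out Hindi word into the English space and look up
# its nearest English neighbour by cosine similarity.
mapped = rotate(hindi["bas"], theta)
nearest = max(english, key=lambda w: cosine(mapped, english[w]))
print(nearest)
```

Real embeddings have hundreds of dimensions and are noisy, so in practice the mapping is a full orthogonal matrix estimated from thousands of word pairs. The principle of learning the map on a few known pairs and applying it to unseen words is the same.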

Recently, we have witnessed how innovation in deep learning has given rise to techniques that provide general-purpose multilingual representations, such as Multilingual BERT (M-BERT). Such systems hold tremendous potential for learning across various languages and for building better NLP applications that depend on reasoning about different levels of syntax or semantics across languages.

Research from the Department of Computer Science at Johns Hopkins University has shown that Multilingual BERT (M-BERT), released in 2018, is very effective at cross-lingual model transfer. Multilingual embeddings can also be used to scale NLP models to languages other than English. These can be built using semantic similarities between two languages, together with multilingual natural language understanding models.
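The idea of cross-lingual model transfer can be sketched in a few lines of Python. This is a toy stand-in, not M-BERT itself: it assumes the vectors already live in one shared multilingual space (in practice they would come from a multilingual encoder), and every value, label and helper name below is invented for illustration. A simple nearest-centroid classifier is trained only on labelled English examples and then applied to Hindi words it has never seen.

```python
import math

def centroids(examples):
    """Average the training vectors of each label."""
    sums, counts = {}, {}
    for (x, y), label in examples:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {label: (sx / counts[label], sy / counts[label])
            for label, (sx, sy) in sums.items()}

def predict(vec, cents):
    """Return the label whose centroid is closest to vec."""
    return min(cents, key=lambda label: math.hypot(
        vec[0] - cents[label][0], vec[1] - cents[label][1]))

# Labelled data exists only for English (hypothetical toy vectors).
train_en = [((0.9, 0.1), "animal"), ((0.8, 0.3), "animal"),
            ((0.1, 0.9), "vehicle"), ((0.2, 0.8), "vehicle")]

# Unlabelled Hindi words, assumed to be embedded in the same shared space.
test_hi = {"kutta": (0.85, 0.2), "gaadi": (0.15, 0.95)}

cents = centroids(train_en)
predictions = {word: predict(vec, cents) for word, vec in test_hi.items()}
print(predictions)
```

No Hindi labels are used at any point. The classifier transfers only because both languages share one vector space, which is the property multilingual representations aim to provide.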

Vishal Chawla

Vishal Chawla is a senior tech journalist at Analytics India Magazine and writes about AI, data analytics, cybersecurity, cloud computing, and blockchain. Vishal also hosts AIM's video podcast called Simulated Reality, featuring tech leaders, AI experts, and innovative startups of India.
