Overcoming The Language Barrier In NLP

It is important to build models that include the diverse range of languages spoken all over the world

Natural language processing is a transformative technology and has been generating a lot of buzz in the last few years for its large-scale impact. But most research and most of the models built focus on mechanisms that work for the English language. Even when models are built for other languages, they have mostly centred on a handful of widely spoken ones.

There are around 7,000 languages spoken in the world, with Asia accounting for the largest share of them. If we do not cater to this huge spectrum of languages, we leave a large section of the world out of the benefits of technological advancement. Speech recognition and other NLP models need to be developed for more languages to make the technology inclusive.

Difficult to build

Researchers and tech companies have realised that extending NLP to other languages would be valuable from both a business and a societal standpoint. But building models in those languages is difficult because obtaining the right dataset, in sufficient quantity, is a huge problem. Training and testing an NLP model requires a large dataset, and even when a language has many speakers, such datasets may not exist.


Even when only a small dataset is available, it must contain distinct patterns for a model to learn from. The data also needs to be cleaned: many languages use symbols and characters that not all computer systems recognise without modification, and making the text suitable for such systems can be time-consuming and costly. Since this is still an emerging area, companies that develop models for other languages should open-source them so that others can learn from the work and build on it.
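The cleaning step described above can be sketched with Python's standard library alone. The `clean_text` helper below is illustrative, not from any particular production pipeline: it normalises text to NFC so that visually identical strings compare equal, and strips invisible control and format characters while preserving the zero-width joiners that carry meaning in Indic scripts.

```python
import unicodedata

# Characters to keep even though Unicode classes them as control/format:
# newline, tab, and the zero-width (non-)joiners used in Indic scripts,
# where they affect how conjunct consonants are formed.
KEEP = {"\n", "\t", "\u200c", "\u200d"}

def clean_text(text: str) -> str:
    # NFC normalisation composes "e" + combining acute into a single "é",
    # so the same word scraped from different sources compares equal.
    text = unicodedata.normalize("NFC", text)
    # Drop other control/format characters (zero-width spaces, byte-order
    # marks, etc.) that often creep into scraped web text.
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch)[0] != "C"
    )

print(clean_text("cafe\u0301"))     # prints "café" (composed form)
print(clean_text("Swa\u200bhili"))  # prints "Swahili" (zero-width space removed)
```

Real pipelines for complex scripts need more careful rules than this sketch — over-aggressive stripping can change how conjuncts render — but normalisation of this kind is typically the first pass before any training.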


We have made some progress in the last few years in building models across a diverse spectrum of languages.

  • In 2020, Meta introduced M2M-100, a multilingual machine translation (MMT) model that translates between any pair of 100 languages without relying on English data. According to Meta, M2M-100 is trained on a total of 2,200 language directions, and the goal of building such a model is to improve translation quality worldwide, especially for speakers of low-resource languages.
  • University of Waterloo’s David R. Cheriton School of Computer Science introduced AfriBERTa, which uses deep-learning techniques to achieve state-of-the-art results for low-resource languages. AfriBERTa works specifically with 11 African languages, such as Amharic, Hausa, and Swahili, spoken cumulatively by more than 400 million people. According to the university, the model achieves results comparable to the best existing models despite learning from just one gigabyte of text.
  • In September, IIT Bombay launched Project Udaan, which helps translate textbooks and other study material in engineering and other streams from English into Hindi and other Indian languages. It is a donation-funded, AI-based translation ecosystem.

How will it help?

Natural language processing is used in a diverse range of areas, such as summarisation, question answering, sentence similarity, translation and token classification. If it can reach less widely supported languages, it will be immensely beneficial for:

  • Understanding and analysing sentiment in comments on social media platforms and e-commerce websites, where many people express themselves in their mother tongue rather than English. This can help businesses gather feedback and improve.
  • Better customer service and engagement, since customers generally prefer to talk to chatbots or virtual assistants in their native language.
  • Improving the accuracy and outcomes of the technology itself by expanding it to a more diverse set of languages.
  • Delivering the right content to users in their mother tongue, based on their choices and past patterns.
  • Benefiting society by extending technology to less widely supported languages.

We have to make sure that the benefits of technology are accessible to everyone for society to progress. A good start is extending new-age technologies across language boundaries.

Sreejani Bhattacharyya
I am a technology journalist at AIM. What gets me excited is deep-diving into new-age technologies and analysing how they impact us for the greater good. Reach me at sreejani.bhattacharyya@analyticsindiamag.com
