Natural language processing (NLP) is disrupting various industries, making it easier for humans to communicate with computers. But given that there are more than 6,900 languages in the world, it can be incredibly difficult to build NLP models for all of them.
In India alone, there are many dialects of Hindi, which makes it challenging for NLP professionals to build models that fit different languages and dialects. Depending on the availability of labelled data, different techniques may have to be applied to build multilingual business AI, and it remains hard for AI systems to adapt to so many languages.
But what if the data is in multiple languages, as it is for enterprises that operate across nations? For an NLP project, you could build models from scratch using labelled datasets in each specific language. But that is not very efficient, especially for a country like India, where so many different languages are spoken across its geographies.
What Makes Multilingual NLP Challenging
While there are pre-trained word embeddings for many languages, each set lives in its own vector space. This means that words with similar meanings end up with entirely different vector representations, shaped by the natural characteristics of each language.
This makes multilingual NLP applications challenging: a model consumes a lot of labelled data, processes the information, learns patterns and produces predictions, all within a single language's vector space. When we need to build NLP on text containing different languages, we may look to multilingual word embeddings to produce NLP models that scale effectively.
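The problem can be seen with a toy sketch (the vectors below are hand-picked illustrations, not real embeddings): within one embedding space, cosine similarity is meaningful, but comparing vectors drawn from two independently trained spaces is not, because the axes carry no shared meaning across languages.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy, hand-picked vectors: each language's embeddings are trained
# independently, so their coordinate axes mean different things.
en = {"cat": np.array([0.9, 0.1, 0.0]),
      "dog": np.array([0.8, 0.2, 0.1])}
hi = {"बिल्ली": np.array([0.0, 0.2, 0.9])}   # "cat" in Hindi

# Within one space, similarity reflects relatedness:
print(cos(en["cat"], en["dog"]))      # high: related words

# Across two independently trained spaces it does not:
print(cos(en["cat"], hi["बिल्ली"]))   # low, despite identical meaning
```

This is why raw monolingual embeddings cannot simply be mixed in one model: the spaces must first be brought into alignment.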
A major issue for NLP systems worldwide is the sheer number of languages other than English, and the dearth of data available to train independent NLP models for each of them. The good news is that many, if not all, languages share similar structures, which can enable transfer of learning.
Universal Models Can Come To The Rescue
Multilingual models for new languages can be created using transfer learning and cross-lingual embeddings. Expanding NLP models to new languages typically involves annotating completely new datasets for each language, which is expensive in both time and resources.
To avoid these tedious and costly tasks, you can deploy cross-lingual embeddings, which enable knowledge transfer from languages with sufficient training data to low-resource languages. Cross-lingual embeddings represent words from multiple languages in a shared vector space, capturing semantic similarities across languages.
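One common way to build such a shared space is orthogonal Procrustes alignment: given a small seed dictionary of word pairs, learn an orthogonal rotation that maps one language's vectors onto the other's. The sketch below uses synthetic data (the "Hindi" space is just a rotated copy of the "English" one plus noise) purely to illustrate the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Target" space (say, English) for 50 seed-dictionary words, and a
# "source" space (say, Hindi) that is the target rotated by a hidden
# orthogonal matrix plus a little noise.
X_en = rng.normal(size=(50, 4))
Q_hidden, _ = np.linalg.qr(rng.normal(size=(4, 4)))
X_hi = X_en @ Q_hidden + 0.01 * rng.normal(size=(50, 4))

# Orthogonal Procrustes: minimise ||X_hi W - X_en|| over orthogonal W.
# The closed-form solution comes from the SVD of X_hi^T X_en.
U, _, Vt = np.linalg.svd(X_hi.T @ X_en)
W = U @ Vt

aligned = X_hi @ W   # source vectors mapped into the target space
err = np.linalg.norm(aligned - X_en) / np.linalg.norm(X_en)
print(err)  # small residual: the two vocabularies now share one space
```

With the mapping learned from a few thousand seed pairs, the entire source vocabulary can be projected into the shared space, so models trained on the high-resource language apply directly to the low-resource one.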
Recently, we have witnessed how innovation in deep learning has given rise to techniques that produce general-purpose multilingual representations, such as mBERT. Such systems hold tremendous potential for learning across languages and for building better NLP applications that depend on reasoning about different levels of syntax or semantics across languages.
Research from the Department of Computer Science at Johns Hopkins University has shown that Multilingual BERT (mBERT), released in 2018, is remarkably effective at cross-lingual model transfer. Multilingual embeddings can likewise be used to scale NLP models to languages beyond English, by exploiting semantic similarities between languages within a shared representation space.
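The payoff of a shared space is zero-shot transfer: a classifier fit on labelled data in one language applies unchanged to another language whose vectors live in the same space. The sketch below uses synthetic 2-D "embeddings" and a simple nearest-centroid classifier to illustrate the workflow; no real language data is involved.

```python
import numpy as np

rng = np.random.default_rng(1)

# Labelled "English" points in a shared embedding space: positive
# examples cluster near one centre, negative examples near another.
pos_centre, neg_centre = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
X_en = np.vstack([pos_centre + 0.2 * rng.normal(size=(20, 2)),
                  neg_centre + 0.2 * rng.normal(size=(20, 2))])
y_en = np.array([1] * 20 + [0] * 20)

# "Train": a nearest-centroid classifier on English labels only.
c_pos = X_en[y_en == 1].mean(axis=0)
c_neg = X_en[y_en == 0].mean(axis=0)

def predict(x):
    """Assign the class whose centroid is nearest."""
    return 1 if np.linalg.norm(x - c_pos) < np.linalg.norm(x - c_neg) else 0

# "Hindi" test points in the same shared space, never seen in training.
x_hi_pos = pos_centre + 0.2 * rng.normal(size=2)
x_hi_neg = neg_centre + 0.2 * rng.normal(size=2)
print(predict(x_hi_pos), predict(x_hi_neg))
```

The classifier never sees a labelled Hindi example, yet classifies Hindi points correctly because the alignment step has already done the cross-lingual work; this is the essence of transfer via cross-lingual embeddings.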