Natural language processing (NLP) has made remarkable breakthroughs in recent years, powering a range of applications including optical character recognition, speech recognition, text simplification, question answering, machine translation, dialogue systems and more.
With the help of NLP, systems learn to identify spam emails, suggest medical articles or diagnoses related to a patient's symptoms, and more. NLP has also become a critical ingredient in high-stakes decision-making systems such as criminal justice, credit scoring, allocation of public resources and shortlisting job candidates, to name a few.
However, despite all these critical use cases, NLP still faces the problem of underrepresentation. The field also has significant limitations of its own: the ambiguity and imprecision of natural languages make NLP difficult for machines to implement.
In recent research, a team of researchers from Clarkson University and Iona College explored the typical NLP pipeline and took a critical look at how, even when a language is technically supported, substantial caveats remain that prevent full participation in the model. The researchers examined eight languages: English, Chinese, Urdu, Farsi, Arabic, French, Spanish and Wolof.
The researchers said, “Over 7000 languages are being spoken today by humans, and the typical NLP pipeline usually underrepresents speakers of most of the languages while amplifying the voices of speakers of other remaining languages.”
Lack of Pre-Trained Models
A standard NLP pipeline comprises several steps: gathering the corpora, processing the corpora into text format, identifying the key language elements, training the NLP models, and then using these models to answer predictive questions.
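The stages above can be sketched end-to-end with a toy example. Everything here is a hypothetical stand-in: the two-document corpus, the whitespace tokeniser and the unigram frequency table are minimal placeholders for the far larger resources each real stage requires.

```python
from collections import Counter

# Stage 1: gather a (toy) corpus -- real pipelines need large, clean corpora.
raw_corpus = ["NLP models need text.", "Text needs large corpora."]

# Stage 2: process the corpus into plain text tokens.
def tokenize(doc):
    return [word.strip(".,").lower() for word in doc.split()]

tokens = [tok for doc in raw_corpus for tok in tokenize(doc)]

# Stage 3: identify key language elements -- here, just the vocabulary.
vocab = sorted(set(tokens))

# Stage 4: "train" a model -- a unigram frequency table stands in for one.
model = Counter(tokens)

# Stage 5: use the model to answer a predictive question,
# e.g. which word is most likely overall.
most_likely = model.most_common(1)[0][0]
print(most_likely)
```

Each stage depends on the output of the one before it, which is why a gap early in the pipeline (say, no usable corpus for a language) propagates all the way to the final predictive step.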
For some languages, there are well-developed resources available throughout the stages of this pipeline. Also, for some languages, pre-trained models exist, allowing research or development teams to jump right to the last step.
According to the researchers, pre-training a model from scratch on the large corpora necessary for meaningful NLP results is costly, whereas fine-tuning is far less expensive. This is why teams that can download a pre-trained model usually do so, avoiding this substantial overhead.
This also makes NLP-based results accessible to a much wider range of people, but only if a pre-trained model is available for their specific language. When such easy-to-use pre-trained models exist for only a few spoken languages, the disparity in representation and participation is further aggravated.
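The cost asymmetry can be illustrated with a simple lookup. The coverage set below is hypothetical (not drawn from any real model hub); it only shows how a team's starting point collapses to whether a pre-trained model happens to exist for their language.

```python
# Hypothetical coverage: pre-trained models exist for only a few languages.
PRETRAINED_LANGUAGES = {"english", "chinese", "french", "spanish"}

def pipeline_entry_point(language):
    """Return which pipeline stage a team must start from."""
    if language.lower() in PRETRAINED_LANGUAGES:
        # Cheap path: download an existing model and fine-tune it.
        return "fine-tune"
    # Costly path: gather corpora and pre-train from scratch.
    return "pre-train from scratch"

print(pipeline_entry_point("French"))
print(pipeline_entry_point("Wolof"))
```

Teams working in the first group of languages skip straight to the inexpensive final step, while everyone else bears the full pre-training cost, which is the disparity the researchers describe.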
The researchers stated that NLP researchers often recognise the degree to which some languages are under-represented. Still, the ways in which this under-representation is magnified throughout the NLP tool chain come with substantial caveats that are less often discussed.
They added, “Despite huge investments in multilingual support in projects like BERT, Word2Vec, Natural Language Toolkit (NLTK), among others, we are still making NLP-guided decisions that systematically and dramatically underrepresent the voices of much of the world.”
Various circumstances have led to the issues of underrepresented languages. These issues can be mitigated by measures such as incorporating feedback, including non-technical feedback, into algorithms to strengthen their accountability, and lowering technical barriers to allow more participation in building ML classifiers.
Lack of Support in NLP-Tools
The researchers further investigated how the lack of representation in the early stages of the NLP pipeline is amplified through the tool chain, culminating in a reliance on easy-to-use pre-trained models that effectively prevents all but the most highly resourced teams from including diverse voices.
They stated that the vast majority of NLP tools are developed for English. Even when tools do support other languages, that support often lags in robustness, accuracy and efficiency. As an example, they traced the evolution of the popular NLP model BERT.
Released by Google in 2018, the original BERT models were English-only, but Chinese and multilingual BERT models were released soon after. However, single-language models are acknowledged to have advantages over the multilingual model, and when researchers make advancements, they are usually made available for English first.
The multilingual BERT model was also designed and evaluated using the XNLI dataset, a version of MultiNLI in which the dev and test sets have been translated into 15 languages. Across these 15 languages, the researchers stated, the model showed patterns of lower accuracy for some languages, and the individual English and Chinese models gave a ~3% advantage for those languages over the multilingual model.
The researchers said that a lack of representation at each stage of the pipeline compounds into a lack of representation at later stages. They also noted that even when tools technically support a given language, other obstacles, such as high error rates and a lack of testing, prevent full participation and representation. Overcoming these obstacles could substantially reduce the underrepresentation of languages that NLP faces.
Read the paper here.