Is There An Antidote To The Black Box Problem Of NLP?

Opacity makes large models such as T5 and GPT-3 particularly hard to train and deploy with confidence: their knowledge is hidden inside their weights, and they cannot back their claims with provenance.

The GPT-3 model released last year created a major buzz in the artificial intelligence field. Creating a language model of this magnitude was no mean feat. In the span of just a year, many other models have been introduced, each bigger and more powerful than the last. The underlying principle in building these models remains the same: deep monolithic architectures learn how language is used from text obtained in massive web crawls. During training, the model's parameters come to store both what it needs to perform language tasks and highly abstract representations of the facts, entities, and events those tasks may require.

NLP has made tasks such as question answering, summarisation, translation, and sentiment analysis much easier. That said, the black-box nature of these models is a major obstacle. The problem is most common in the widely used generative approach. A suitable alternative is the retrieval-based NLP model, which exhibits knowledge by searching directly for information in a text corpus. Such models retain the representational strength of language models while addressing several of their shortcomings. Good examples include REALM, RAG, Baleen, and ColBERT-QA.

The NLP Black Box Problem

Despite the success of large language models, they suffer from the following challenges:

  • Models have already crossed the trillion-parameter mark. Training at this scale carries a significant environmental cost, and the expense puts training or deploying such models out of reach for many smaller companies.
  • As if training and deploying these large models were not difficult enough, they are also static. In practice, adapting such a model requires expensive retraining or fine-tuning on a new corpus.
  • The models encode knowledge in their weights by synthesising what they memorise from training examples. This makes it difficult to trace the sources behind a specific prediction. The resulting opacity leaves the models prone to generating fluent yet untrue statements.
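
To give a rough sense of the scale behind the first point, the back-of-the-envelope sketch below estimates the memory needed just to hold the weights of a trillion-parameter model. The precisions (2 or 4 bytes per parameter) and the 80 GB accelerator size are illustrative assumptions, not figures from the article, and the estimate ignores optimizer state, activations, and everything else training requires.

```python
# Rough estimate of the memory needed only to store model weights.
# Assumptions (not from the article): 2 bytes/param for half precision,
# 4 bytes/param for single precision, and a hypothetical 80 GB accelerator.

TRILLION = 10**12
GB = 10**9


def weight_memory_bytes(num_params: int, bytes_per_param: int) -> int:
    """Bytes required to hold the weights alone."""
    return num_params * bytes_per_param


def devices_needed(total_bytes: int, device_bytes: int) -> int:
    """Ceiling division: devices needed just to fit the weights."""
    return -(-total_bytes // device_bytes)


half_precision = weight_memory_bytes(TRILLION, 2)
single_precision = weight_memory_bytes(TRILLION, 4)

print(half_precision / 10**12, "TB in half precision")
print(single_precision / 10**12, "TB in single precision")
print(devices_needed(half_precision, 80 * GB), "devices of 80 GB (half precision)")
```

Even under these generous assumptions the weights alone span dozens of devices, which is one reason smaller teams struggle to deploy models of this size.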

Retrieval Models to the Rescue

As the name suggests, retrieval-based NLP models solve a task by retrieving information from a plugged-in text corpus. This lets them leverage the representational strength of language models without the large architecture, while offering transparent provenance for claims and making updates and adaptation easy.

According to a Stanford AI Blog post, retrieval-based NLP models treat tasks as open-book exams. Knowledge is encoded explicitly in the form of a text corpus. The model learns to search for passages and to use the retrieved information to craft knowledgeable responses. This design decouples a model's capacity for understanding text from how it stores knowledge. These models offer three main advantages:

  • Retrieval-based models offer transparency. When the model produces an answer, the user can read the sources it retrieved and judge their relevance and credibility.
  • Retrieval-based models are generally smaller than their generative counterparts. Unlike in black-box language models, the parameters no longer need to store an ever-growing list of facts. Instead, they can be devoted to processing language and solving the actual task.
  • Retrieval-based models emphasise learning general techniques for finding and connecting information from available sources. The knowledge store can therefore be updated or expanded efficiently just by modifying the text corpus, without disturbing the model’s capacity for finding and using information.
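
The open-book idea and two of these advantages (visible sources, updates through the corpus) can be illustrated with a minimal sketch. It is not how REALM, RAG, or ColBERT-QA work internally: the class `TinyRetriever`, the function `answer_with_sources`, and the three-passage corpus are hypothetical, and a simple TF-IDF-style word-overlap score stands in for a learned neural retriever. The reader model that would turn evidence into an answer is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyRetriever:
    """Toy retriever (hypothetical); stands in for a learned one."""

    def __init__(self, passages):
        self.passages = dict(passages)  # passage id -> text

    def add(self, pid, text):
        # Updating knowledge only touches the corpus, never any model weights.
        self.passages[pid] = text

    def search(self, query, k=2):
        n = len(self.passages)
        counts = {pid: Counter(tokenize(t)) for pid, t in self.passages.items()}
        doc_freq = Counter()
        for c in counts.values():
            doc_freq.update(c.keys())
        scored = []
        for pid, c in counts.items():
            # Term frequency weighted by a smoothed inverse document frequency.
            score = sum(
                c[t] * math.log(1 + n / doc_freq[t])
                for t in set(tokenize(query))
                if t in c
            )
            if score > 0:
                scored.append((score, pid))
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [(pid, self.passages[pid]) for _, pid in scored[:k]]


def answer_with_sources(retriever, question, k=2):
    """Return the evidence together with the ids it came from (provenance)."""
    hits = retriever.search(question, k)
    return {
        "question": question,
        "sources": [pid for pid, _ in hits],
        "evidence": [text for _, text in hits],
    }


corpus = {
    "doc-1": "ColBERT-QA answers open-domain questions.",
    "doc-2": "Baleen solves tasks that need information from independent sources.",
}
retriever = TinyRetriever(corpus)
print(answer_with_sources(retriever, "Which model answers open-domain questions?"))

# New knowledge is added by editing the corpus; nothing is retrained.
retriever.add("doc-3", "REALM is another retrieval-based model.")
print(answer_with_sources(retriever, "What is REALM?"))
```

Because every response carries the ids of the passages it relied on, a reader can inspect those passages directly, which is the transparency the first point describes.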

“There are many issues with large black box models. The most impactful and widespread of them is the amount of resources it utilises. The sheer amount of carbon emitted for training such black-box models is five times that of what a mid-sized car will produce in its lifetime. The second issue is that since the information derived from such large black-box models is unreliable, their use is very limited and often infeasible in sensitive scenarios. There are many issues like this, but for the sake of providing a succinct and coherent answer, I’m limiting myself to these!” said Ujjawal K. Panchal, CTO, Memechat.

“Retrieval based NLP models, on the other hand, are much much smaller. An example of this is ColBERT-QA, a retrieval-based model that is 400x smaller than the black box GPT-3 model. In stark contrast to generative models, the model can pinpoint the source of information used to answer any particular question; hence the answers provided are more reliable,” he added.

Wrapping up

Stanford University has surveyed several emerging models in the NLP space. Prominent examples include ColBERT, for scaling up expressive retrieval to massive corpora; ColBERT-QA, for answering open-domain questions accurately by adapting high-recall retrieval to the task; and Baleen, for solving tasks that require information from independent sources using a condensed retrieval architecture.
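
The article does not describe how ColBERT scores passages. As background, ColBERT is known for "late interaction": each query token embedding is matched against its most similar passage token embedding, and those maxima are summed. The sketch below shows only that scoring rule, using hand-written two-dimensional toy vectors as an assumption; a real system would obtain the embeddings from a trained encoder.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, passage_vecs):
    """Late-interaction score: for each query token, take the best-matching
    passage token, then sum those maxima over the query tokens."""
    return sum(max(dot(q, p) for p in passage_vecs) for q in query_vecs)


# Toy embeddings (assumed for illustration, one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
passage_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first query token

ranked = sorted(
    [("passage_a", maxsim_score(query, passage_a)),
     ("passage_b", maxsim_score(query, passage_b))],
    key=lambda item: -item[1],
)
print(ranked)
```

Because passage embeddings do not depend on the query, they can be computed ahead of time, which is part of what makes this style of retrieval practical over large corpora.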


Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at
