
Is There An Antidote To The Black Box Problem Of NLP?

The opacity of models like T5 and GPT-3 makes them difficult to train, deploy, and trust: how they represent knowledge is hidden in their weights, and they cannot back their claims with provenance.

The GPT-3 model released last year created a major buzz in the artificial intelligence field. Creating a language model of this magnitude was no mean feat. In the span of just a year, many other models have been introduced, each bigger and more powerful than the last. The underlying principle behind these models remains the same: use deep monolithic architectures to learn how language is used in text obtained from massive web crawls. During training, the parameters come to store both the skills needed for language tasks and highly abstract representations of the facts, entities, and events the model may need to solve those tasks.

NLP has made tasks such as question answering, summarisation, translation, and sentiment analysis much easier. That said, the black-box nature of most NLP models is a big hindrance to achieving these goals, and the commonly used generative models are especially prone to it. A promising alternative is retrieval-based NLP, in which models directly search a text corpus for the knowledge they need. These models leverage the representational strength of language models while addressing the challenges discussed below. Good examples include REALM, RAG, Baleen, and ColBERT-QA.
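To make this retrieve-then-read pattern concrete, here is a minimal sketch in Python. The corpus, query, and helper names are hypothetical, and the TF-IDF scoring below merely stands in for the learned neural retrievers that systems like ColBERT-QA actually use.

```python
# Minimal retrieve-then-read sketch (illustrative only; real systems such
# as ColBERT-QA use learned neural encoders rather than TF-IDF).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical plug-in text corpus: the knowledge lives here, not in weights.
corpus = [
    "GPT-3 is a 175-billion-parameter language model released in 2020.",
    "ColBERT is a retrieval model that scores passages with late interaction.",
    "Baleen performs multi-hop retrieval over independent sources.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)  # index the corpus once

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the top-k passages most similar to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    return [corpus[i] for i in scores.argsort()[::-1][:k]]

# The retrieved passages double as provenance: a user can read them and
# judge their relevance before trusting any answer built on top of them.
for passage in retrieve("How many parameters does GPT-3 have?"):
    print(passage)
```

Because the supporting passages are returned verbatim, the provenance of any answer is immediately inspectable, which is precisely the transparency that generative black boxes lack.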

The NLP Black Box Problem

Despite the success of large language models, they suffer from the following challenges:

  • Current models have already crossed the trillion-parameter mark. This not only poses a significant environmental challenge but also, given the high cost of training, puts training and deploying such models beyond the reach of many smaller companies.
  • If training and deploying these large models were not difficult enough, they are also very static. In practice, adapting such a model requires expensive retraining and fine-tuning on a new corpus.
  • The models encode knowledge into their weights by synthesising what they memorise from training examples. This makes it difficult to trace the sources a model uses for a specific prediction, leaving these models opaque and prone to generating fluent yet untrue statements.

Taken together, these issues leave models like T5 and GPT-3 opaque in their knowledge representation and unable to back their claims with provenance, which makes training and deploying them responsibly extremely difficult.

Retrieval Models to the Rescue

As the name suggests, retrieval-based NLP models solve a task by retrieving information from a plugged-in text corpus. This lets them leverage the representational strength of language models without the massive architectures, while offering transparent provenance for claims and enabling easy updating and adaptation.

Image credit: Stanford AI Blog

As per a Stanford blog, retrieval-based NLP models treat tasks as open-book exams. Knowledge is encoded explicitly in the form of a text corpus, and the model learns to search for relevant passages and use the retrieved information to craft knowledgeable responses. This decouples a model's capacity for understanding text from how it stores knowledge. Such models offer three main advantages:

  • Retrieval-based models offer transparency. When the model produces an answer, the user can read the sources it retrieved and judge their relevance and credibility.
  • Retrieval-based models are generally much smaller than their generative counterparts. Unlike black-box language models, their parameters no longer need to store an ever-growing list of facts; they can instead be devoted to processing language and solving the actual task.
  • Retrieval-based models place much of their emphasis on learning general techniques for finding and connecting information from available sources. As the sketch below illustrates, the knowledge store can be updated or expanded efficiently just by modifying the text corpus, without disturbing the model's capacity for finding and using information.
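As a minimal, self-contained sketch of that last advantage, assume the retriever is a frozen encoder; here a fixed random projection over byte counts stands in for a trained neural network. Adding knowledge then means encoding new text and appending it to the index, and the model itself is never retrained.

```python
import numpy as np

# Hypothetical frozen encoder: a fixed random projection of byte counts
# stands in for the trained neural encoder a real retriever would use.
rng = np.random.default_rng(0)
projection = rng.normal(size=(256, 64))

def encode(text: str) -> np.ndarray:
    counts = np.zeros(256)
    for byte in text.encode("utf-8"):
        counts[byte] += 1.0
    vec = counts @ projection
    return vec / (np.linalg.norm(vec) + 1e-9)

corpus = ["ColBERT scales expressive retrieval to massive corpora."]
index = np.stack([encode(p) for p in corpus])  # the knowledge store

# Updating knowledge = append text and encode it; no retraining anywhere.
corpus.append("Baleen retrieves from multiple independent sources.")
index = np.vstack([index, encode(corpus[-1])])

query = encode("Which system retrieves from independent sources?")
print(corpus[int(np.argmax(index @ query))])  # best-matching passage
```

The same pattern holds when the encoder is a large trained model: because it is frozen, corpus edits are cheap index operations rather than training runs.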

“There are many issues with large black box models. The most impactful and widespread of them is the amount of resources it utilises. The sheer amount of carbon emitted for training such black-box models is five times that of what a mid-sized car will produce in its lifetime. The second issue is that since the information derived from such large black-box models is unreliable, their use is very limited and often infeasible in sensitive scenarios. There are many issues like this, but for the sake of providing a succinct and coherent answer, I’m limiting myself to these!” said Ujjawal K. Panchal, CTO, Memechat.

“Retrieval based NLP models, on the other hand, are much much smaller. An example of this is ColBERT-QA, a retrieval-based model that is 400x smaller than the black box GPT-3 model. In stark contrast to generative models, the model can pinpoint the source of information used to answer any particular question; hence the answers provided are more reliable,” he added.

Wrapping up

Stanford University has surveyed many emerging models in the NLP space. Prominent examples include ColBERT, which scales up expressive retrieval to massive corpora; ColBERT-QA, which uses high-recall retrieval to accurately answer open-domain questions; and Baleen, which uses a condensed retrieval architecture to solve tasks that require information from multiple independent sources.
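For a flavour of what makes ColBERT's retrieval "expressive", the sketch below implements its late-interaction (MaxSim) scoring in numpy: each query token is matched against its best document token, and the per-token maxima are summed. Only the scoring formula is taken from ColBERT; the random embeddings are placeholders for the learned per-token BERT encodings a real model produces.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_token_embeddings(num_tokens: int, dim: int = 128) -> np.ndarray:
    """Placeholder for ColBERT's per-token BERT encodings (L2-normalised)."""
    vecs = rng.normal(size=(num_tokens, dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late interaction: for each query token, take the maximum similarity
    against any document token, then sum over query tokens."""
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # best doc token per query token

query_emb = fake_token_embeddings(4)    # e.g. a 4-token query
doc_emb = fake_token_embeddings(50)     # e.g. a 50-token passage
print(maxsim_score(query_emb, doc_emb))
```

Because document token embeddings do not depend on the query, they can be computed offline and indexed, which is what lets this late-interaction design scale to massive corpora.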

PS: The story was written using a keyboard.

Shraddha Goled

I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at shraddha.goled@analyticsindiamag.com.