The GPT-3 model released last year created a major buzz in the artificial intelligence field. Creating a language model of this magnitude was no mean feat. In the span of just a year, many other models have been introduced, each bigger and more powerful than the last. The underlying principle in building these models remains the same: using deep monolithic architectures to learn how language is used in text obtained from massive web crawls. During training, the model's parameters come to store both the ability to perform language tasks and highly abstract representations of the facts, entities, and events that the model may need to solve them.
NLP has made tasks such as question answering, summarization, report translation, and sentiment analysis much easier. That said, the black-box nature of large language models is a major hindrance. The commonly used generative NLP models typically suffer from this problem. A suitable alternative is retrieval-based NLP, in which models exhibit knowledge by directly searching for information in a text corpus. These models leverage the representational strength of language models while addressing the challenges described below. Good examples include REALM, RAG, Baleen, and ColBERT-QA.
The NLP Black Box Problem
Despite their success, large language models suffer from the following challenges:
- The largest models have already crossed the trillion-parameter mark. This poses a significant environmental challenge and, given the high cost of training, puts training or deploying such models out of reach for many smaller companies.
- If training and deploying these large models were not difficult enough, these models are also very static. In practice, adapting any such model requires expensive retraining and fine-tuning on a new corpus.
- The models encode knowledge into their weights by synthesising what they memorise from training examples. This makes it difficult to trace the sources behind a specific prediction. The resulting opacity leaves these models prone to generating fluent yet untrue statements.
This opacity makes models like T5 and GPT-3 extremely difficult to train and deploy, since it is unclear how they represent knowledge and they cannot back their claims with provenance.
Retrieval Models to the Rescue
As the name suggests, retrieval-based NLP models solve a task by retrieving information from a plugged-in text corpus. This lets them leverage the representational strength of language models without requiring very large architectures, while offering transparent provenance for claims and enabling easy updates and adaptation.
Image credit: Stanford AI Blog
According to a Stanford AI Blog post, retrieval-based NLP models treat tasks as open-book exams. With these models, knowledge is explicitly encoded in the form of a text corpus. The model learns to search for passages and to use the retrieved information to craft knowledgeable responses. This design decouples a model's capacity for understanding text from how it stores knowledge. These models offer three main advantages:
- Retrieval-based models offer transparency. For example, when the model produces an answer, the user can read the sources it has retrieved and judge their relevance and credibility.
- Retrieval-based models are generally smaller than their generative counterparts. Unlike in black-box language models, the parameters no longer need to store an ever-growing list of facts. Instead, they can be devoted to processing language and solving actual tasks.
- Retrieval-based models emphasize learning general techniques for finding and connecting information from available sources. As a result, the knowledge store can be updated and expanded efficiently just by modifying the text corpus, without disturbing the model’s capacity for finding and using information.
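The open-book idea and the advantages listed above can be illustrated with a small sketch. This is not how REALM, RAG, or ColBERT-QA are implemented: those systems use learned neural retrievers, whereas the hypothetical `ToyRetriever` below scores passages by simple word overlap. The class, the `answer_with_sources` helper, and the three corpus sentences are all assumptions made up for illustration. The sketch shows only two things: an answer can carry the passages it was drawn from, and new knowledge can be added by editing the corpus rather than retraining anything.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyRetriever:
    """Hypothetical stand-in for a learned retriever (word overlap only)."""

    def __init__(self, passages):
        self.passages = list(passages)

    def add(self, passage):
        # Updating the knowledge store is a corpus edit; nothing is retrained.
        self.passages.append(passage)

    def search(self, query, k=2):
        """Return up to k (passage, score) pairs that share words with the query."""
        n = len(self.passages)
        token_sets = [set(tokenize(p)) for p in self.passages]
        # Document frequency: in how many passages each token appears.
        doc_freq = Counter(t for tokens in token_sets for t in tokens)
        query_tokens = set(tokenize(query))
        scored = []
        for idx, tokens in enumerate(token_sets):
            # Rarer shared words contribute more to the score.
            score = sum(math.log(1 + n / doc_freq[t]) for t in query_tokens & tokens)
            if score > 0:
                scored.append((score, idx))
        # Highest score first; ties keep corpus order.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(self.passages[i], score) for score, i in scored[:k]]


def answer_with_sources(retriever, question, k=2):
    """Return the question together with the passages retrieved for it."""
    hits = retriever.search(question, k)
    return {"question": question, "sources": [passage for passage, _ in hits]}


# Illustrative corpus (made-up sentences, not an actual dataset).
retriever = ToyRetriever([
    "ColBERT-QA answers open-domain questions using high-recall retrieval.",
    "Baleen solves tasks that require information from independent sources.",
])

before = answer_with_sources(retriever, "What is REALM?")
retriever.add("REALM is a retrieval-augmented language model.")
after = answer_with_sources(retriever, "What is REALM?")

print(before["sources"])
print(after["sources"])
```

Before the new passage is added, the retriever finds nothing relevant to the question; afterwards it returns the added passage as a source the reader can inspect. In a real system a reader model would then compose an answer from the retrieved passages, a step this sketch leaves out.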
“There are many issues with large black box models. The most impactful and widespread of them is the amount of resources it utilises. The sheer amount of carbon emitted for training such black-box models is five times that of what a mid-sized car will produce in its lifetime. The second issue is that since the information derived from such large black-box models is unreliable, their use is very limited and often infeasible in sensitive scenarios. There are many issues like this, but for the sake of providing a succinct and coherent answer, I’m limiting myself to these!” said Ujjawal K. Panchal, CTO, Memechat.
“Retrieval based NLP models, on the other hand, are much much smaller. An example of this is ColBERT-QA, a retrieval-based model that is 400x smaller than the black box GPT-3 model. In stark contrast to generative models, the model can pinpoint the source of information used to answer any particular question; hence the answers provided are more reliable,” he added.
Stanford University has surveyed many emerging models in the NLP space. Prominent examples include ColBERT, which scales up expressive retrieval to massive corpora; ColBERT-QA, which answers open-domain questions accurately using high-recall retrieval adapted to the task; and Baleen, which uses a condensed retrieval architecture to solve tasks that require information from several independent sources.