“Before about 2018, technologies that could answer questions tended to be brittle, hard to build, and even harder to scale. The advent of neural language models changed that,” write Khattab et al.
Modern-day search engines are powered by very large language models (e.g., BERT) that perform well in a question-and-answer format. Current language models have already breached the one-trillion-parameter mark and are poised to get bigger and smarter. These language models (LMs) are trained on massive quantities of unstructured data. One of the main reasons large pre-trained LMs are so successful is that they learn highly effective contextual representations.
But how do we know the results are right? It is true that the internet, thanks to ranking algorithms, has made it easier to retrieve information. But the lack of provenance for these search results allows misinformation to flourish.
A group of researchers from Stanford University has proposed models that enable AI systems to comprehend queries and respond as efficiently as possible. The researchers note that users usually ask follow-up questions to their initial queries to get at the actual facts. One might Google the year Stanford was established and then follow up by asking, “What is the source for this?”, which might bring up the Stanford University “About” page. This does help, but one might wonder what to make of the answer if the model had instead produced the text from Wikipedia.
Even if larger language models find that sweet spot between retrieval and contextualisation, they may still find information removal challenging. Regulatory authorities, and even individuals, can ask internet companies to get rid of sensitive data, and data obsolescence can render models useless. According to the researchers at Stanford, there is no reliable way to remove specific information from language models, a direct consequence of the way they store information. “Applying simple keyword filters on their outputs is not a panacea, as these models might still find ways of expressing the same content in new terms.”
This is where neural IR models come in handy. According to the Stanford researchers, models such as ColBERT not only make the whole Q&A process more transparent but also help the models stay effective when information is deleted. The neural IR paradigm is said to have been in use since at least 2019 by Google and Microsoft for their respective search engines, and it is still a hot area of research.
BERT significantly improved search precision, but it comes at a cost: the extra computation increases latency. To avoid this latency, researchers started incorporating traditional retrieval methods into the language models, but then precision became a challenge. This is why models like ColBERT were introduced: to get the best of both worlds, efficiency and contextualisation.
ColBERT is a ranking model based on contextualised late interaction over BERT. The model proposes a novel late interaction paradigm for estimating relevance between a query q and a document d. Under late interaction, the researchers wrote, q and d are separately encoded into two sets of contextual embeddings, and relevance is evaluated using cheap and pruning-friendly computations between the two sets.
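The late-interaction idea can be sketched in a few lines. The snippet below is a toy illustration, not ColBERT’s actual implementation: it assumes the per-token embeddings have already been produced (in ColBERT they come from BERT) and scores a document with the MaxSim operator, i.e. the sum, over query tokens, of each token’s maximum cosine similarity to any document token.

```python
import numpy as np

def late_interaction_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """MaxSim scoring: for each query token embedding, take its maximum
    cosine similarity over all document token embeddings, then sum."""
    # L2-normalise rows so that dot products become cosine similarities.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sim = q @ d.T                        # shape: (query_tokens, doc_tokens)
    return float(sim.max(axis=1).sum())  # best match per query token, summed
```

Because each document’s embeddings can be precomputed and indexed offline, only the cheap matrix product above runs at query time, which is what makes the paradigm pruning-friendly.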
Neural IR paradigm
- Uses a pre-trained LM (e.g., BERT) to encode documents and queries into numerical representations.
- A scoring function compares these representations to produce a ranked list of search results, much as in the classical paradigm.
- An extraction function then identifies relevant text spans to offer as direct evidence.
- Complex queries are interpreted more deeply than in the classical paradigm.
- The most relevant text from documents can be highlighted, and excerpts can be combined directly from multiple sources.
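The steps above can be sketched as a toy pipeline. The encoder here is a stand-in, a deterministic hashed bag-of-words, where a real system would use a pre-trained LM such as BERT; all function names are illustrative, not from any library.

```python
import zlib
import numpy as np

def encode(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in encoder: hash each token into a fixed-size unit vector.
    (A real neural IR system would use an LM such as BERT here.)"""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def rank(query: str, docs: list[str]) -> list[tuple[float, str]]:
    """Scoring function: compare representations, return a ranked list."""
    q = encode(query)
    return sorted(((float(q @ encode(d)), d) for d in docs), reverse=True)

def extract(query: str, doc: str, window: int = 5) -> str:
    """Extraction function: return the highest-scoring span as evidence."""
    words = doc.split()
    q = encode(query)
    spans = [" ".join(words[i:i + window])
             for i in range(max(1, len(words) - window + 1))]
    return max(spans, key=lambda s: float(q @ encode(s)))
```

In a real system the document vectors are computed once at indexing time, and the extracted span is what the interface would highlight as direct evidence for the answer.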
The researchers are betting on this work to push the NLP and IR frontiers. For example, neural IR methods like ColBERT and Baleen can be used to improve search results while preserving trust and reliability. The researchers believe this area has huge potential to revamp the way we use the internet. That said, they also admit that language models are not, and will never be, all-knowing “infallible oracles”, even if they seem to behave that way when you ask them questions.