The recent legal clash between The New York Times and OpenAI over copyright in AI models has thrust terms like “memorisation” and “plagiarisation” to the forefront. A term more closely tied to how these models actually work is ‘approximate retrieval’, and it may be all that OpenAI needs to win the case.
“We can have conversations with our dogs, or even a potato,” said Subbarao Kambhampati, professor at ASU, in a podcast, talking about ChatGPT and how information received from it cannot be exactly repeated. “The biggest problem is that it should be factual.” He likens it to an AI toothpaste, which holds all of human wisdom and knowledge within it and can be squeezed out in whatever form is needed.
LLMs are “not exactly repeating it but they are the kinds of things that you are likely to be talking about”, he added.
At the core of approximate retrieval lies the fact that LLMs don’t fit the mould of traditional databases, where precision and exact matches are paramount. Instead, they operate as n-gram models, injecting an element of unpredictability into the retrieval process.
Rather than functioning as keys to a structured database, prompts serve as cues for the model to generate the next token based on context. Kambhampati explained in a recent LinkedIn post that for the legal discourse surrounding the NYT lawsuit, this distinction becomes crucial.
LLMs don’t promise exact retrieval, blurring the lines between flexibility and unpredictability. They exist in a space that is neither a pure database nor a traditional Information Retrieval (IR) engine, prompting a closer examination of their characteristics.
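Kambhampati’s distinction between a database key and a prompt-as-cue can be sketched with a toy bigram model. This is a minimal illustration, not how GPT-4 works: the corpus, function names and seeds below are made up, but the mechanism is the same in spirit — each next token is sampled from what tends to follow the previous one, so the same prompt need not reproduce any stored record exactly.

```python
import random
from collections import defaultdict

# Toy bigram model ("n-gram" with n=2): record which word follows which.
corpus = "the cat sat on the mat the cat lay on the rug".split()
follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

def generate(prompt, length=4, seed=None):
    """Continue `prompt` by sampling each next word from the bigram counts."""
    rng = random.Random(seed)
    words = prompt.split()
    for _ in range(length):
        candidates = follows.get(words[-1])
        if not candidates:
            break
        words.append(rng.choice(candidates))
    return " ".join(words)

# The prompt is a cue, not a database key: different samples can diverge,
# so exact retrieval of any one source sentence is never guaranteed.
print(generate("the cat", seed=1))
print(generate("the cat", seed=2))
```

Scaled up to trillions of tokens and far longer contexts, the same sampling step is what makes retrieval “approximate” — and also what makes verbatim memorisation possible rather than guaranteed.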
A trade-off between better AI models and proper copyright attribution
The lawsuit itself revolves around the delicate issue of memorisation. While LLMs cannot guarantee verbatim reproduction, their extensive context window and robust network capacity open the door to potential memorisation, raising concerns about unintentional plagiarism.
This was visible in the lawsuit, where exact sentences could be reproduced by prompting the model again and again. Attempts to instil ‘thinking’ abilities into LLMs by fine-tuning them on planning problems merely turned those tasks into memory-based retrieval, with no proof of autonomous planning.
This was done alongside increasing the context length of LLMs, making the memorisation problem even worse. Prompting LLMs with hints raised concerns about the reliability of human-in-the-loop methods.
All in all, commercial LLM creators such as OpenAI find themselves strategically navigating both ends of the approximate retrieval spectrum.
In legal discussions, they can emphasise the models’ inability to achieve exact retrieval, framing it as a defence against copyright infringement. Simultaneously, when marketing LLMs for search applications, they highlight the memorisation capabilities as features.
What’s the end goal?
The truth is, there’s no foolproof strategy to control these dual behaviours. Attempts to curb memorisation might compromise the “LLM’s masquerade as a search engine”, leaving us in a perplexing conundrum: make the AI models better, or care about copyright.
For example, a user on X points out, “There is a dilemma particularly in news generation: if LLM is too creative, it generates fake news or at least inaccurate news; otherwise, copyright problem comes into play. There is a problem either way.”
Another user points out that the same is the case with AI image generators based on diffusion models, such as Midjourney, Stable Diffusion, and DALL-E, which also do not aim to reproduce existing images but end up creating very similar outputs. The better these models get, the closer their outputs come to what the user prompts, with no inherent mechanism for avoiding copyrighted material.
The emergence of the Retrieval-Augmented Generation (RAG) trend introduces an external IR component, blending LLMs with a more structured approach to information retrieval. It is a nuanced effort to strike a balance between the spontaneity of LLMs and the orderliness of traditional search methods, and was mostly introduced to reduce hallucinations in these models.
But Kambhampati explains that this increases the chance of an LLM such as GPT-4 retrieving the exact information from sources such as NYT, which are essentially added to the models as a vector database. That was the whole purpose of RAG, but it now works against the AI model creators when it comes to copyright.
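The retrieval step Kambhampati is pointing at can be sketched as follows. This is a toy illustration under loud assumptions: bag-of-words vectors stand in for learned embeddings, the article snippets are invented, and real RAG systems use dedicated vector databases. The point it shows is that whatever is retrieved is handed to the model verbatim, exact source text and all.

```python
import math
from collections import Counter

# Toy "vector database": each stored article is embedded as a bag-of-words
# vector, and retrieval returns the stored text closest to the query.
# (Hypothetical snippets; real systems embed with a learned model.)
articles = [
    "court filing describes verbatim reproduction of news paragraphs",
    "new benchmark evaluates planning abilities of language models",
]

def embed(text):
    return Counter(text.split())

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

index = [(embed(t), t) for t in articles]

def retrieve(query):
    qv = embed(query)
    return max(index, key=lambda pair: cosine(qv, pair[0]))[1]

# The nearest stored text comes back unaltered, ready to be quoted
# word-for-word in the model's answer.
print(retrieve("verbatim reproduction in a court filing"))
```

Unlike the base model’s sampled generation, this step is exact by design, which is why RAG cuts both ways in the copyright debate.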
“Because of the way n-gram models work, there is never any 100% guarantee that some stored record (be it a program or an NYT article) is retrieved unaltered. So why is NYT suing OpenAI?” asked Kambhampati. The case hinges on whether the base training dataset actually included NYT articles, which it obviously did, and whether OpenAI’s model put a dent in the publication’s revenue.
If LLM makers try to reduce “memorisation”, they will certainly see the ability of LLMs to masquerade as search engines, “which is already quite questionable, degrade even further”, concluded Kambhampati.