
RAG is Just Fancier Prompt Engineering

It is time to build a base set of prompts for RAG systems.


Every day, a new acronym pops up in the generative AI world. One of the latest buzzwords is RAG, which stands for retrieval-augmented generation. And it is not just another acronym; it represents a significant leap in the field of LLMs. But what exactly is it?

RAG has gained popularity because it combines the strengths of both retrieval-based and generative models. In essence, RAG attaches an external database to a base model, lets the model retrieve new information from it, and then generates responses grounded in that information.

This helps reduce hallucination in the model. The knowledge source is usually a vector database or, in certain cases such as GPT-4, the internet.

At Cypher 2023, Dhruv Motwani, founder and CEO at SpringtownAI, also discussed and demonstrated the use of RAG and explored its architecture and functionality. The participants deployed applications on their AWS accounts and tested the models for hallucinations, which were fewer when using RAG.

Is it really any good?

“The RAG that you see today is really glorified prompt engineering. The most common standard RAG flow we currently have is completely unaware of the context of your data. Without looking up in your data, it sends the lookup plus the original query to GPT,” Mark McQuade, co-founder of Arcee.ai, told AIM.

McQuade and his team have built an end-to-end RAG system called DALM, which sits on top of the main LLM. He said the best way to use it is to pair the system with an in-domain specialised model, instead of a larger model carrying tons of unnecessary data.

Before delving into RAG, it’s essential to understand the choices that AI developers have when working with AI models. They can either build a model from scratch, fine-tune an existing model, or employ retrieval-augmented generation. Each approach has its pros and cons, and the bigger the model, the higher the chances of hallucination.

Building from the ground up can be a costly and time-consuming endeavour. For instance, OpenAI invested over $100 million to train its GPT-4 model. On the other hand, fine-tuning existing models with additional data is a viable option, but it carries the risk of the model “forgetting” some of its original training data.

RAG combines retrieval-based and generative AI models to deliver context-aware responses. A retrieval model accesses information from existing knowledge sources, such as databases or online articles. The generative model then takes this retrieved information and synthesises it into coherent, contextually appropriate responses.
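To make that flow concrete, here is a minimal sketch of a RAG pipeline in Python. It is illustrative only: the word-overlap retriever is a toy stand-in for a real embedding-based vector store, and call_llm is a hypothetical placeholder for whatever generation API is actually used.

```python
# Minimal RAG sketch: retrieve relevant text, then generate from it.
# The word-overlap scorer stands in for a real vector store, and
# call_llm() is a hypothetical placeholder for an LLM API call.

def tokenize(text: str) -> set[str]:
    return set(text.lower().split())

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (toy retriever)."""
    scored = sorted(documents,
                    key=lambda d: len(tokenize(d) & tokenize(query)),
                    reverse=True)
    return scored[:k]

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: swap in a real model/provider call."""
    return f"<model answer conditioned on: {prompt[:60]}...>"

def rag_answer(query: str, documents: list[str]) -> str:
    context = "\n".join(retrieve(query, documents))
    prompt = ("Answer the question using only the context below.\n"
              f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
    return call_llm(prompt)

docs = [
    "RAG retrieves passages from a knowledge source before generating.",
    "Fine-tuning updates model weights on new data.",
    "Vector databases store embeddings for similarity search.",
]
print(rag_answer("How does RAG use a knowledge source?", docs))
```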

The key advantage of RAG is its ability to provide responses that are not only accurate but also unique, akin to human language, rather than simply summarising retrieved data.

At its core, RAG is essentially an advanced form of prompt engineering. It focuses on keeping the model fixed and optimising how it “stuffs” the context window with text to answer specific questions. This approach is particularly beneficial for prompt engineers who need to learn AI engineering skills and master a base set of prompts to build RAG/agent systems.
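In prompt terms, “stuffing” the context window is essentially templating: the model stays fixed, and only the retrieved text interpolated into a fixed prompt changes. A minimal sketch follows; the template wording is an assumption, not a standard.

```python
# A base RAG prompt: the model is fixed, only the stuffed context changes.
# The wording of this template is illustrative, not a standard.
RAG_PROMPT = """You are a helpful assistant.
Use ONLY the context below to answer. If the answer is not in the
context, say "I don't know" instead of guessing.

Context:
{context}

Question: {question}
Answer:"""

prompt = RAG_PROMPT.format(
    context="RAG stands for retrieval-augmented generation.",
    question="What does RAG stand for?",
)
print(prompt)
```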

In more complex RAG/agent systems, it’s not just about a single prompt; it involves a collection of prompts that work in harmony to provide accurate and context-aware responses.
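What such a “base set” of prompts might look like is sketched below; the stages (query condensing, answering, self-verification) and their wording are illustrative assumptions, not a fixed standard.

```python
# A sketch of a base set of prompts for a multi-step RAG flow.
# Stage names and wording are illustrative assumptions.
PROMPTS = {
    # Rewrite a follow-up question into a standalone search query.
    "condense": ("Given the chat history:\n{history}\n"
                 "Rewrite the follow-up question as a standalone "
                 "query: {question}"),
    # Answer strictly from the retrieved context.
    "answer": ("Context:\n{context}\n\nQuestion: {question}\n"
               "Answer using only the context above."),
    # Ask the model to check its own answer against the context.
    "verify": ("Context:\n{context}\n\nAnswer:\n{answer}\n"
               "Is every claim in the answer supported by the context? "
               "Reply yes or no."),
}

print(PROMPTS["answer"].format(
    context="RAG = retrieval-augmented generation.",
    question="What is RAG?",
))
```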

RAG extended

Some researchers argue that RAG might not be any more beneficial than a longer context window, as both offer similar results. A recent study titled ‘Retrieval meets Long Context Large Language Models’ compared RAG with long-context LLMs.

https://twitter.com/rohanpaul_ai/status/1710641374385594482

The paper found that open-source embedding models/retrievers outperformed OpenAI’s. Combining simple RAG with a 4K-context LLM could match the performance of a long-context LLM. Moreover, RAG paired with a 32K-context LLM outperformed providing the full context.

While RAG offers significant benefits, it’s important to consider the potential for bad responses when retrieving information. Philipp Schmid, tech lead at Hugging Face, asked whether we can teach LLMs to be more factually correct and self-reliant, and introduced Self-RAG, a novel approach that teaches models when to retrieve information and how to use it effectively.

Self-RAG involves creating a “critique” dataset with retrieval guidelines, which determines when retrieval is appropriate and what information is relevant. Developers then train a critique model on this synthetic dataset. Using prompts, the critique model, and a retriever, they can generate a RAG dataset.

The LLM is then trained on this RAG dataset, which includes special tokens that instruct the model on when to retrieve and when to generate. During inference, the model adaptively emits these special tokens based on the query to determine whether retrieval is necessary.
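A minimal sketch of that inference-time behaviour is below. The “[Retrieve]” token and both function stubs are assumptions made to illustrate the described mechanism; they are not the paper’s exact tokens or interface.

```python
# Sketch of Self-RAG-style adaptive retrieval at inference time.
# The "[Retrieve]" token and both stubs are illustrative assumptions.

RETRIEVE_TOKEN = "[Retrieve]"

def generate(prompt: str) -> str:
    """Hypothetical stub for a Self-RAG-trained model that can emit a
    retrieval token when the query needs external information."""
    if prompt.startswith("Context:"):
        return "answer grounded in the retrieved passage"
    return RETRIEVE_TOKEN if "latest" in prompt.lower() else "a direct answer"

def retrieve(query: str) -> str:
    """Hypothetical stub for the retriever."""
    return "retrieved passage about: " + query

def self_rag(query: str) -> str:
    draft = generate(query)
    if RETRIEVE_TOKEN in draft:
        # The model decided retrieval is necessary: fetch and regenerate.
        context = retrieve(query)
        return generate(f"Context: {context}\nQuestion: {query}")
    return draft

print(self_rag("What is the latest RAG research?"))
```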

It seems the hallucinations will continue for a while. But just like prompt engineering, it is time to learn the next AI engineering skill and catch up with the buzzword; this time, it is building a base set of prompts for RAG systems.

Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.