Recently, a SaaS company hired us to build Generative AI into their offering. After the explosion of interest in ChatGPT, their CEO became interested in using a Large Language Model (LLM) like OpenAI’s GPT-4 to go beyond delivering basic snapshot reports about a website's performance to delivering tailored, actionable recommendations based on the data in those reports.
At first, the CEO tried simply prompting GPT-4 with their existing reports. Unfortunately, this approach delivered only generic, inconsistent recommendations that their customers didn't find useful and their customer service team couldn't stand behind.
The problem, as many teams quickly discover, is that although GPT-4 has state-of-the-art reasoning capabilities, it doesn't come out of the box with access to your data. Nor does it have your domain expertise or understand your company's processes. Many generative AI projects eventually arrive at the same question: "How can Generative AI become conversant with our data?"
The solution is retrieval augmented generation, or RAG.
RAG is a technique that supplies generative AI with relevant external data at the time of the prompt, so its answers can draw on information it was never trained on.
This technique opens up the possibility of question answering, recommendations, and many other use cases. It also dramatically reduces "hallucinations," the confidently wrong answers that plague current models.
I've written this guide for executives and beginner engineers who want to understand how RAG works but don't want to fuss with the technical details*. If you stick with me, I promise that by the end you'll understand RAG from first principles, and that understanding will help you navigate everything from product ideation to development strategy. Let's go!
When working with LLMs such as OpenAI's GPT-4, the most basic unit of interaction is a prompt: an input paired with a response. Here's an example:
Now, in a simple prompt like this, the system knows the answer already, because in the large corpus of text it's been trained on (more or less, the entire internet) this question and answer have appeared enough times that it has learned what the proper answer is. But now consider a more complex prompt & response:
Unfortunately GPT-4 is already letting us down! This answer won’t do if we want to support a weather retrieval use case. But we can't actually blame GPT-4 since it was only trained on data up to April of 2023. It can’t possibly know what the weather is today. To get the correct answer, we create a hidden input which gives the answer to the LLM behind the scenes. Here’s how that works:
We've done two important things here. First, we've expanded the input & response model to include hidden inputs which are only visible to the LLM. Second, we've selected a relevant input that "grounds" or "adds context to" the prompt so the AI can answer correctly.
Without these hidden grounding inputs, LLMs are so unpredictable that they may simply make up an answer, or they may say they don't know the answer. That's both a feature and a bug of these systems: their stochasticity is part of what fuels their creativity. But often we don't want creativity, we want reliability. With the relevant weather report in hand, the model answers the question much more faithfully, on the order of a 95%+ success rate as opposed to 50%.
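If you're curious what that looks like in practice, here's a minimal sketch using OpenAI's Python client. The weather report string is a hypothetical stand-in for whatever your retrieval step actually returns.

```python
# A minimal sketch of grounding a prompt with a hidden input.
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment;
# the weather report below is hard-coded for illustration.
from openai import OpenAI

client = OpenAI()

weather_report = "San Francisco, today: 62°F, partly cloudy, light winds."  # hypothetical retrieved data

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        # The hidden input: context the user never sees, visible only to the model.
        {"role": "system", "content": f"Answer using this weather report:\n{weather_report}"},
        # The visible input: the user's actual question.
        {"role": "user", "content": "What's the weather in San Francisco today?"},
    ],
)
print(response.choices[0].message.content)
```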
In a nutshell, this method of adding a hidden prompt to guide the LLM is what retrieval augmented generation is all about. Everything past this point is simply additional techniques that refine this basic paradigm. If you take away one thing about RAG, take away this:
RAG grounds prompts with data so that responses are more accurate.
Okay, hopefully this all makes sense, but how did the RAG system know to incorporate the weather results into the prompt in the first place?
Let’s add some more detail into the basic picture.
Let’s expand our prompt model to include two more pieces: a function list, and an execution call. This may sound a little technical but I promise it’s worth understanding as it forms the basis for how your company’s proprietary data can be used a bit later.
A function list requires a little pre-work on our part. We compile a list of data the AI can access and then we tell the AI “when you encounter a need to get the weather, use this specific function." When the AI calls that function, behind the scenes it gets the weather from, say, the weather.com API and returns it to the AI to ground the prompt.
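Here's a rough sketch of what a function list and an execution call can look like using OpenAI's tool-calling interface. The get_weather function and its stub are hypothetical placeholders for your real data source.

```python
# Sketch of a "function list" (tools) and an "execution call" with OpenAI tool calling.
# get_weather is a hypothetical stand-in for a real call to, say, the weather.com API.
import json
from openai import OpenAI

client = OpenAI()

def get_weather(city: str) -> str:
    # Stand-in for a real round trip to a weather API.
    return f"{city}, today: 62°F, partly cloudy, light winds."

# 1. The function list: the data sources we tell the model it may use.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get today's weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in San Francisco today?"}]
first = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)

# 2. The execution call: the model decides it needs get_weather, so we run it.
#    (This sketch assumes the model did choose to call the tool.)
tool_call = first.choices[0].message.tool_calls[0]
city = json.loads(tool_call.function.arguments)["city"]
weather = get_weather(city)

# 3. Ground the prompt with the result and ask again.
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": weather})
final = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)
print(final.choices[0].message.content)
```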
So the full picture looks like this:
Et voila! The AI has decided on its own to use one of the pre-built functions we've provided, and in this example it made a round trip to the weather.com API. Now, this is a very simple example, but if you really understand it, you should be able to close your eyes, pause for a minute, and imagine some additional ways this paradigm could be used. I know, cheesy, but give it a try.
What did you come up with? The possibilities are truly endless.
We can pull in all kinds of data, which makes interfacing with the AI far more useful than it is out of the box.
Alright, so that works for a third party API, but what about my data?
Let's say you have a Notion knowledge base of 200 articles that covers 95% of your customers' support questions, and you want to create an AI assistant that uses that knowledge to reduce load on your support team. How do you make the AI conversant with the knowledge base?
The first and most obvious solution would be to replace the weather.com API call in the example above with a call to Notion's search API. That would look like this:
Here, we're using the same paradigm as above, but when a user asks "How can I reset my password?", we're searching through your knowledge base articles for relevant results instead of going to the weather.com API.
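A minimal sketch of that swap, assuming a Notion integration token in the environment and Notion's public search endpoint. (In a real system you'd also fetch and flatten the matching pages' content, which I'm leaving out for brevity.)

```python
# Sketch: the retrieval function now searches a Notion knowledge base instead of
# calling a weather API. Endpoint and version header follow Notion's public API;
# NOTION_TOKEN is an assumed environment variable.
import os
import requests

def search_knowledge_base(query: str) -> list[dict]:
    resp = requests.post(
        "https://api.notion.com/v1/search",
        headers={
            "Authorization": f"Bearer {os.environ['NOTION_TOKEN']}",
            "Notion-Version": "2022-06-28",
        },
        json={"query": query},
    )
    resp.raise_for_status()
    return resp.json()["results"]  # matching pages; their content would be fetched next

# The model would call this (via the function list) with keywords like:
results = search_knowledge_base("reset password")
```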
For simple cases, this works reasonably well, but there's a problem. What if the user asks something like "How do I log in if I can't remember my passcode?" In that case we'd search the Notion database for "remember passcode," but that wouldn't return any results, because throughout the knowledge base we refer to resetting a password, not remembering a passcode. Although the two phrases are conceptually related, a traditional keyword search catastrophically fails on the latter.
The problem compounds when you consider all the variations real-world users could say in an open-ended conversation:
As it turns out, natural language is just really messy, so it's hard to predict what words people will use to express their intent. The good news is that a major piece of the Generative AI revolution solves exactly this problem: vector embeddings.
In the olden days of chatbots (two years ago), there was a coffee mug that said:
What do we want!?
Chatbots!
When do we want 'em!?
I'm sorry, I do not understand that query.
These older systems were so brittle that they routinely failed to understand what a user really wanted, producing a frustrating experience.
Vector embeddings change all that. The core innovation is a way to turn a concept into numbers. It may sound simple, but it was actually incredibly difficult to pull off, and when it finally happened it sparked the entire Generative AI revolution. For the first time, we could transform messy, ambiguous natural language into a mathematically precise set of coordinates. And once you have a set of coordinates for a phrase or a concept, you can look for nearby, related phrases.
In our password reset example, a vector system can recognize that "reset password" and "forgot login code" are semantically similar, despite sharing no matching keywords. Here's the updated picture:
I didn’t have to change much in the example except for two key things. First, we’re calling a different function which does a vector search; and second, we’re sending the entire user query instead of selected keywords. In this case, it found and retrieved the entire knowledge base article related to the concept of "password resets" and placed it into the prompt as an input. The AI then reads that article before it answers the user's question, and, if the AI is any good, gives the correct response from that article.
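To make "semantically similar" concrete, here's a small sketch using OpenAI's embeddings endpoint. The cosine() helper just measures how close two sets of coordinates are; phrases that mean similar things score close to each other even when they share no keywords.

```python
# Sketch: two phrases with no keywords in common still land near each other in
# embedding space. Assumes the OpenAI Python client and one of its published
# embedding models.
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

reset = embed("How do I reset my password?")
print(cosine(reset, embed("I can't remember my login code")))  # high similarity
print(cosine(reset, embed("What's the weather in Boston?")))   # much lower
```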
For this approach to work, you have to pre-populate the vector database with the vector coordinates for each document that could be referenced:
After you have the vector embedding of the user's input, you simply look for its nearest neighbors among your documents. Below I'll have more to say about how this works under the hood, but let's move on to some problems with this approach.
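Continuing that sketch (and reusing its embed() and cosine() helpers), here's the whole loop in miniature: embed a few hypothetical articles once, then retrieve the nearest one for each incoming query. In production you'd use a real vector database rather than a Python list, but the mechanics are the same.

```python
# Tiny in-memory stand-in for a vector database. The article texts are hypothetical.
articles = [
    "How to reset your password: go to Settings > Security and click 'Reset password'...",
    "How to invite teammates to your workspace...",
    "How to export your billing history as a CSV...",
]

# One-time indexing step: store an (embedding, text) pair for every article.
index = [(embed(text), text) for text in articles]

def retrieve(query: str) -> str:
    query_vec = embed(query)
    # Nearest neighbor = the stored article whose coordinates are closest to the query's.
    return max(index, key=lambda item: cosine(item[0], query_vec))[1]

print(retrieve("I can't remember my login code"))  # returns the password-reset article
```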
The first problem is that although vector search often returns relevant results, it doesn't always work well. Consider an example: if you're searching for information on a Genesis GV80, you actually do want documents containing exactly those keywords, not concepts that are merely related to the Genesis GV80. If you own a GV80 you don't care about a Hyundai Elantra, even though the two are conceptually similar (Genesis is Hyundai's luxury spinoff). So while the vector approach alone is effective in many use cases, it doesn't cover all of them.
The second problem is that vector databases can be expensive to run. As I'll explain below, for question answering to work with vectors, you have to compute vector coordinates for each and every knowledge base article you have, or perhaps even worse, for every section of every knowledge base article. You also have to compute vector coordinates for every single user query. These embeddings are large, and the compute required to generate, store, and search them can become cost-prohibitive.
Both of these problems are manageable. When vector search alone isn't enough, there's a "Hybrid" strategy which combines vector and keyword searches, and careful planning and prompt engineering can usually keep costs within an acceptable range.
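For the curious, here's one illustrative way to blend the two signals, reusing embed(), cosine(), and index from the earlier sketches. The 50/50 weighting and the crude word-overlap score are arbitrary choices for this sketch; production systems typically use BM25 for the keyword side and more careful score fusion.

```python
# Sketch of a hybrid score: part vector similarity, part literal keyword overlap.
def keyword_score(query: str, text: str) -> float:
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / max(len(query_words), 1)

def hybrid_retrieve(query: str, weight: float = 0.5) -> str:
    query_vec = embed(query)
    return max(
        index,
        key=lambda item: weight * cosine(item[0], query_vec)
        + (1 - weight) * keyword_score(query, item[1]),
    )[1]

# An exact-model query like "Genesis GV80 towing capacity" now gets credit for
# literal matches on "GV80" as well as for conceptual similarity.
```

Now, let's pause and survey where we're at.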
We started by understanding the prompt input & response model, then expanded it into a special case which accommodates pulling third party data, and we’ve just expanded it into an even more special case to include vectorized data. As far as the basic paradigm goes, if you’ve made it this far, congratulations! You truly understand how RAG works. In the remaining section, I will go into more detail on how to optimize vector-based RAG, but it’s really just icing on the cake.
As far as icing goes though… you might think to ask, should I really dump the entire knowledge base document into the prompt and then have the AI answer it from there? Wouldn’t that be pretty expensive since costs are proportional to how many words are sent to the LLM?
Indeed, if vector RAG isn't optimized it can get very expensive. If you have 30,000 monthly active users asking questions five times a day, at today's prices you could easily blow through $10k a month.
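Here's a rough back-of-the-envelope version of that math. Every number in it, the context size per call and the per-token price, is an assumption for illustration, not a quote from any provider's price list.

```python
# Hypothetical cost sketch: all prices and token counts are illustrative assumptions.
users = 30_000
questions_per_user_per_day = 5
days_per_month = 30
calls_per_month = users * questions_per_user_per_day * days_per_month  # 4,500,000

tokens_per_call = 1_000           # assume ~750 words of grounding context per prompt
price_per_1k_input_tokens = 0.01  # assumed GPT-4-class input price, in dollars

monthly_input_cost = calls_per_month * tokens_per_call / 1_000 * price_per_1k_input_tokens
print(f"${monthly_input_cost:,.0f} per month on input tokens alone")  # $45,000 under these assumptions
```

Even if your real traffic and prices come in well below these assumptions, the bill clears $10k a month quickly, and that's before paying for a single output token.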
What's more, it's only a luxury of modern LLMs that they can answer questions from a long knowledge base document at all. In the old days (a few months ago), context windows (the number of words the LLM can process at one time) were so small that developers were forced to come up with clever tricks to get only the most important pieces of context into the prompt without making it too long.
One of those tricks is called chunking, and no, it does not involve regurgitating one's food.
In our knowledge base example, instead of retrieving the entire knowledge base document and inserting it into the prompt, a more efficient approach would be to extract only the most relevant portions of a document. By working with only portions of a document instead of the entire document itself, we can significantly reduce computational load and costs.
Conceptually, the key pieces here are a knowledge base, which is a collection of documents, which is a collection of chunks.
Chunking is just splitting a text up. Many RAG systems today simply chop a given text into chunks of around 750 words each and call it a day. For example, let's stipulate the knowledge base article looks like this:
If you simply chop the document into 750-word chunks, you'd end up with the first thought and about half of the second thought combined in one chunk. Let's call this crude splitting. What's slightly shocking is that even with incomplete information, this approach actually works pretty well in a lot of situations. Why? Because LLMs are really good at making stuff up, and because they were trained on essentially the entire internet, they often fill in the correct details. The problem, of course, is when the AI hallucinates incorrect details instead.
A more sophisticated approach is to split chunks along semantic fault lines. The best "chunks" are single, independent, complete thoughts. In our password reset example, the knowledge base article would ideally be divided into each of the bullet points. That could be achieved through a number of strategies: splitting at each header in the copy; an algorithm that groups sentences by tracking the keywords they share; or even asking GPT-4 to look at the document and split it up for you.
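Here's a sketch of both approaches: crude word-count splitting and header-based splitting. The 750-word cutoff and the "##" header convention are illustrative assumptions about your documents.

```python
# Two chunking strategies in miniature.
def crude_chunks(text: str, max_words: int = 750) -> list[str]:
    """Chop the document into fixed-size word-count chunks, ignoring meaning."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def header_chunks(markdown_text: str) -> list[str]:
    """Split at headers so each chunk is (roughly) one complete thought."""
    chunks, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("##") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```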
So what lesson can we draw here?
In these halcyon days of large context windows, there's a tradeoff between chunk size and accuracy. If you have huge chunks, such as an entire knowledge base document, then with advanced LLMs like GPT-4 you'll probably get highly accurate question answering, because the AI will pick out the right information. If you have smaller chunks, you'll lose some accuracy, because there's less context for the AI and you may not always provide the right piece of it. But your costs go down.
If you’ve made it this far, congratulations! You have a first principles understanding of RAG that exceeds that of most developers on the planet, because this is a very new technology. As an executive, you should now be prepared to understand the possibilities and pitfalls RAG can bring your product and your team. The cases we've covered here could easily be modified to create:
In Part 2, I'll be covering some advanced RAG strategies, some of which have emerged in just the last few weeks as of writing. What happens, for example, when you want to pull multiple chunks or documents into a single prompt? How do you order those to get the best results? Sign up to be notified when Part 2 comes out:
Of course, it takes a talented team, up to date on the latest developments, to build and deploy systems like these at scale. If you don't have that team in house yet, reach out to me at andy@emerge.haus – we are just that team for hire.
* For those already familiar with RAG, you will notice I have not only left out technical details but in some cases have told little white lies (for example, equating tokens to words) that make the text clearer but are technically incorrect. I stand by these as I believe elaborating them would only obfuscate, rather than clarify, first principles.