By now pretty much everyone has heard of Large Language Models (LLMs), be it OpenAI’s GPT, Google’s Gemini (formerly Bard), Anthropic’s Claude, or any of the other multi-million-dollar, multi-month-trained models. They’re great for helping you draft an email or answer basic questions. They’re even pretty solid at providing boilerplate code for your next software project. However, these corporation-funded projects are only as good as the information they’ve been trained on. I don’t expect ChatGPT to give me accurate information on current events, the stock of new t-shirts at my favorite clothing store, or the current Champions League standings. At least, not without the help of some plugins that the models might let you use.
But how do these plugins help? How do they provide new information to the model without retraining the massive network of weights and biases that is the LLM? Cue RAG, or Retrieval Augmented Generation. RAG lets you take some text data, transform it into a machine-readable format, and supplement the LLM’s “knowledge” without needing to rebuild the whole thing.
TL;DR: RAG is highly customizable, automated prompt engineering.
Basic Architecture of a RAG
Fundamentally, a RAG system consists of the LLM, a corpus (text documents, etc.), a vector database, some code around indexing and retrieval across said database, and the prompt the user kicks the whole process off with. How one chooses to implement these components is both art and science, as there is no agreed-upon standard. If you want to create your own, you’ll need to collect some text data, be it from PDFs, a collection of JSON files, or even SQL tables, and transform it into a consistent machine-readable format. That text is then broken into segments, then down to tokens, and finally embedded into a vector of numbers I defy any human to understand, but your LLM can. The vector is then put into a specialized database for retrieval later. Keep this embedding process in your back pocket, because we’re going to use it for your queries as well. Process your initial query the same way you did the supplemental documents. Now the query can be quickly compared to the other vectors by some distance function, and the most relevant results are retrieved to add more context to the prompt. Allow the LLM to work its magic on all of that, and your results are now current, better informed, and more accurate than the vanilla LLM’s.
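To make those moving pieces concrete, here’s a minimal sketch of that index-and-retrieve loop in Python. The embed() function is a toy stand-in (a simple hashing trick) so the example runs without any external service; in a real system you’d swap in an actual embedding model and a proper vector database, and the corpus, query, and top_k values below are purely illustrative.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy embedding: hash each token into a slot of a fixed-size vector."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        slot = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[slot] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# 1. Chunk the corpus and embed each chunk into the "vector database"
#    (here just a list of (chunk, vector) pairs).
corpus = [
    "Returns are accepted within 30 days of delivery.",
    "The blue t-shirt restock arrives in stores this Friday.",
    "Our Malaga store is open 9am to 8pm, Monday through Saturday.",
]
index = [(chunk, embed(chunk)) for chunk in corpus]

# 2. Embed the user's query the same way, then rank chunks by cosine
#    similarity (the distance function; vectors are unit-normalized,
#    so a plain dot product is enough).
query = "When do the new t-shirts come in?"
q_vec = embed(query)
ranked = sorted(index, key=lambda pair: float(np.dot(q_vec, pair[1])), reverse=True)

# 3. Stuff the most relevant chunks into the prompt as extra context.
top_k = 2
context = "\n".join(chunk for chunk, _ in ranked[:top_k])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # hand this augmented prompt to your LLM of choice
```

The important part is the shape of the flow: the documents and the query pass through the same embedding step, and a cheap similarity ranking decides which extra context rides along with the prompt.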
Why not fine-tuning?
You’ve been in this space for a while. You’re used to taking a pre-existing model and fine-tuning it. It only takes a few more epochs, it’s cheaper than retraining the whole model yourself, and surely it leads to better outcomes, right? Yeah, not so much. Remember, these LLMs have billions or even trillions of parameters! Doing it yourself is a pain, and models like GPT that let you fine-tune through their platform have other downsides as well. A fine-tuned model is much more likely to hallucinate an incorrect answer than one that leverages additional context from a vector database. LLMs are incredibly sensitive to the prompt itself, which is why RAG is more akin to automated prompt engineering. So the extra cost with RAG is just the added input tokens and your vector database.
Additionally, what if the extra information you’re leveraging gets updated? One does not simply retrain an LLM. RAG is far better suited to live, fast-changing information, as the embedding and indexing cycle is fairly cheap to rerun once it’s established. This is especially visible in projects that depend on a live database, such as the number of items in stock, or in queries that aim to synthesize answers about current events.
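As a rough illustration (reusing the toy embed() helper from the earlier sketch), keeping the index current is just a matter of re-embedding and upserting whichever documents changed; the model itself is never touched.

```python
# The dict stands in for a real vector store's upsert API.
vector_store: dict[str, tuple[str, np.ndarray]] = {}

def upsert(doc_id: str, text: str) -> None:
    """Re-embed a single document and overwrite its entry in the store."""
    vector_store[doc_id] = (text, embed(text))

upsert("sku-42", "Blue t-shirt: 120 units in stock.")
# An hour later the inventory changes; only this one entry is recomputed.
upsert("sku-42", "Blue t-shirt: 97 units in stock.")
```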
The best metaphor I’ve heard for this comparison is: “Fine-tuning is like memorizing the book before the exam, and using RAG is bringing the book with you”. The LLM has the ability to creatively put words together, effectively bringing the “knowledge”, but having the literal reference material at your fingertips is always going to lead to a better answer (if you have good indexes and similarity measures).
It’s a fast world in A.I., I thought RAG was dead
It’s true that models like Gemini and Claude now offer context windows running from hundreds of thousands to over a million tokens. However, these models charge by the token, and you don’t want to shove a tome into every asinine query your customers dream up for the poor e-commerce chatbot you hacked together. Large inputs also lead to slower response times. You’ll need to balance the amount of extra context returned from the database against the response time you’re expecting. In practice this isn’t a huge issue, but it’s worth mentioning.
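One rough way to feel out that trade-off, reusing the ranked list from the earlier sketch, is to vary how many chunks you retrieve and watch the prompt grow. The whitespace token count below is only an approximation; a real setup would use the model’s own tokenizer (tiktoken for GPT models, for instance).

```python
# Sweep the number of retrieved chunks and report the extra prompt size.
for top_k in (1, 2, 3):
    context = "\n".join(chunk for chunk, _ in ranked[:top_k])
    print(f"top_k={top_k}: ~{len(context.split())} extra tokens added to the prompt")
```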
Caveats of RAGs and the Future
Hold up now, it’s not all sunshine and rainbows just yet. Building your own RAG can have some unintended costs. With all this extra data added to your prompt, the number of tokens processed by the LLM is significantly larger, as I mentioned before. There are several scholarly articles focused solely on content summarization, in an effort to trim the fluff from both your extra context and your queries, but it’s not so straightforward (see this article from Salesforce).
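One hedged way to picture that summarization step, again reusing names from the earlier sketches, is to compress each retrieved chunk before it ever reaches the main prompt. The summarize() helper below is hypothetical; in practice it would call a small, cheap model or an extractive summarizer rather than truncate naively.

```python
def summarize(text: str, max_words: int = 20) -> str:
    # Placeholder standing in for a real summarization pass.
    return " ".join(text.split()[:max_words])

# Compress each retrieved chunk, then build the prompt from the shorter context.
compressed = "\n".join(summarize(chunk) for chunk, _ in ranked[:top_k])
prompt = f"Answer using only this context:\n{compressed}\n\nQuestion: {query}"
```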
Most of the time, you’ll be using some third party to store these extra context vectors. For you or your developers to understand the vectors being stored, you’ll probably want to attach at least some metadata. However, that metadata can itself become a security risk. Additionally, these hosted providers range in price from a cup of coffee per month to the rent on a pretty nice apartment. It’s up to the developer to mitigate all of that; do your research.
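For illustration only, a record for a hosted vector store might bundle the vector with just enough metadata to stay interpretable. The field names here are made up rather than any particular provider’s schema, and the PII flag is one example of screening content before it leaves your network.

```python
# Illustrative upsert payload: vector plus human-readable metadata.
record = {
    "id": "sku-42",
    "vector": embed("Blue t-shirt: 97 units in stock.").tolist(),
    "metadata": {
        "source": "inventory_db",
        "updated_at": "2024-04-30",
        "contains_pii": False,  # screen sensitive content before uploading
    },
}
```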
With such a new technology, there’s no settled standard for how to build a RAG. Surely in the coming months cloud platforms and smaller competitors will release low- to no-code solutions for building your own. For now, there are frameworks that can help you out, like LangChain and LlamaIndex, though both require a deeper understanding than I’m willing to go into in this article.
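Still, to give a taste of what those frameworks buy you, here’s LlamaIndex’s well-known starter example, roughly as documented for its 0.10.x releases; the import paths have moved between versions, so check the docs for the version you actually install. It assumes an OPENAI_API_KEY in your environment and a ./data folder of documents to index.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # load and parse the corpus
index = VectorStoreIndex.from_documents(documents)     # chunk, embed, and store
query_engine = index.as_query_engine()                 # retrieval plus the LLM call
print(query_engine.query("What's new in stock this week?"))
```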
Learn More!
If you’ve found this topic interesting, please join us at J: On the Beach this year in sunny Malaga, May 8th-10th. Moustafa Eshra and David Leconte of Datastax will be giving a workshop on Wednesday the 8th and a presentation on Friday the 10th. Also on the 10th, Stéphanie Marchesseau from Medida will be presenting on the safety of LLMs.