Stunning image by NASA, courtesy of Unsplash
Prelude
If you’re feeling adventurous and want to dive straight into the code, here’s the link.
Just know — I’ll be slightly disappointed if you skip the read.
Introduction
Traditional Large Language Models (LLMs) generate content based only on what they learned from the tokens observed during training. While this is quite powerful, it runs the risk of hallucination and untraceable knowledge generation. Even the most capable, cutting-edge LLMs can produce false information, because the response is conditioned on nothing other than the tokens already generated.
Retrieval-Augmented Generation (RAG) systems, on the other hand, retrieve the documents most relevant to the query and generate a response conditioned on those documents. This keeps hallucination to a minimum and makes it much easier to trace where the generated knowledge came from, since the retrieved documents can be displayed as actual citations. This makes RAG particularly appealing for building domain-specific systems, as they can be tailored to each domain based on the available documents.
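To make that flow concrete, here is a tiny, purely illustrative Python sketch of the idea: retrieve first, then generate an answer that cites the retrieved sources. The retrieve and generate functions below are throwaway stand-ins, not code from the project.

```python
from typing import List

# Hypothetical stand-ins for a real retriever and a local LLM.
def retrieve(query: str, top_k: int) -> List[str]:
    corpus = [
        "RAG systems ground generation in retrieved documents.",
        "Local LLMs trade some quality for privacy and zero vendor lock-in.",
    ]
    return corpus[:top_k]

def generate(prompt: str) -> str:
    return f"(model output conditioned on a prompt of {len(prompt)} characters)"

def answer(query: str, top_k: int = 2) -> str:
    documents = retrieve(query, top_k=top_k)  # 1. fetch the most relevant documents
    context = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(documents))
    prompt = (
        "Answer using only the numbered sources below and cite them as [1], [2], ...\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return generate(prompt)  # 2. generate, conditioned on the retrieved sources

print(answer("Why do RAG systems hallucinate less?"))
```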
I came across this concept during one of my master’s courses and was intrigued by the existing RAG systems. Most of them consist of two main components — a retriever (usually vector-based, for fast semantic search instead of plain keyword matching) that fetches relevant documents, and a generator, a language model (often a large one like GPT-4) that summarizes the retrieved content. This naturally made me curious to build a system that was fully local and vendor-independent — albeit on a smaller scale. Running language models locally is, and will likely remain, computationally expensive, so I chose to work with small LLMs (a relative term, of course).
Apologies for the lengthy introduction. Since this is a deep topic, I’ve split the discussion into a multi-part series — feel free to go through them in order. In this first part, we’ll take a high-level look at how the project is structured, and in the upcoming parts, I’ll dive into each component in more detail.
System Overview
At the end of the day, I had one objective in mind — the RAG system should be fully capable of running locally, with no vendor lock-in required at any step of the process. Every component is designed accordingly. The following are the crucial modules that power the full system.
UI + FastAPI Server
The main interface through which the user submits their query and the number of results the response should be constructed from. It is a single page, built completely from scratch with HTML, CSS and JS (with dark mode support!). The only heavy elements are two external resources — FeatherIcon CSS for displaying certain icons and Google Fonts for, well, fonts.
The UI is served via FastAPI, which felt like a natural choice to tie the system together. It not only hosts the web page but also handles user queries — receiving the input, forwarding it to the RAG engine, and returning the generated response back to the frontend.
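As a rough sketch of that wiring (not the project's actual code), a FastAPI app along these lines can serve the page and expose a query endpoint; the route names, request fields, and the rag_answer stand-in are my own assumptions.

```python
from typing import List, Tuple

from fastapi import FastAPI
from fastapi.responses import FileResponse
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    query: str
    num_results: int = 5  # how many retrieved documents to build the answer from

def rag_answer(query: str, num_results: int) -> Tuple[str, List[str]]:
    """Hypothetical stand-in for the retriever + generator pipeline."""
    return f"Summary for: {query}", [f"source-{i + 1}" for i in range(num_results)]

@app.get("/")
def index() -> FileResponse:
    # The hand-built HTML/CSS/JS page is assumed to live in a local static/ folder.
    return FileResponse("static/index.html")

@app.post("/api/query")
def run_query(request: QueryRequest) -> dict:
    # Forward the query to the RAG pipeline and return the cited summary to the frontend.
    summary, citations = rag_answer(request.query, request.num_results)
    return {"summary": summary, "citations": citations}
```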
Retriever
The “R” in RAG — and arguably the more critical half of the system. That’s because the quality of the final response directly hinges on the quality and relevance of the documents retrieved.
While the majority of RAG implementations focus solely on vector-based retrieval, I chose to go with a hybrid approach. Depending on the configuration (which we’ll revisit later), the system first attempts to retrieve documents from a local vector store. If the results are insufficient or if the store is disabled, it gracefully falls back to a conventional API fetch using the same query.
There’s also an option to cache API-fetched documents back into the vector store. This allows the system to “learn” over time — growing smarter and more useful with each search, as its local knowledge base gradually expands.
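Here is a minimal sketch of that retrieval logic, assuming a generic vector store wrapper and a stand-in api_search function; none of these names come from the actual codebase.

```python
from typing import List, Optional

def api_search(query: str, limit: int) -> List[str]:
    """Hypothetical stand-in for a conventional (keyword-based) search API."""
    return [f"API result {i + 1} for '{query}'" for i in range(limit)]

class VectorStore:
    """Hypothetical minimal vector store wrapper."""

    def __init__(self) -> None:
        self._docs: List[str] = []

    def search(self, query: str, top_k: int) -> List[str]:
        # A real implementation would embed the query and rank documents by similarity.
        return self._docs[:top_k]

    def add(self, docs: List[str]) -> None:
        self._docs.extend(docs)

def retrieve(query: str, top_k: int, store: Optional[VectorStore], cache: bool = True) -> List[str]:
    # 1. Try the local vector store first, if it is enabled.
    docs = store.search(query, top_k) if store is not None else []

    # 2. Fall back to the conventional API when results are missing or insufficient.
    if len(docs) < top_k:
        docs = api_search(query, top_k)
        # 3. Optionally cache API results so the local knowledge base grows over time.
        if cache and store is not None:
            store.add(docs)
    return docs
```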
Generator
The “G” in RAG, this component loads one of the locally configured LLMs (as defined in the config file) and uses a prompt template, also configurable, to generate the cited summary.
I experimented with a few prompt formats, and while prompt engineering remains a work in progress, the current setup works well enough to complete the end-to-end pipeline. That, in itself, felt like a win.
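For illustration, here is one way the generator step could look, using Hugging Face transformers as the local backend; the model name, prompt template, and config dictionary are assumptions for this sketch, not the project's actual settings.

```python
# Sketch of the generator: load a small local LLM and fill a configurable prompt template.
from transformers import pipeline

CONFIG = {
    "model_name": "Qwen/Qwen2.5-0.5B-Instruct",  # any small instruction-tuned model that runs locally
    "prompt_template": (
        "Answer the question using only the numbered sources, citing them as [1], [2], ...\n\n"
        "Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    ),
    "max_new_tokens": 256,
}

generator = pipeline("text-generation", model=CONFIG["model_name"])

def generate_summary(question: str, documents: list[str]) -> str:
    # Number the retrieved documents so the model can cite them.
    sources = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))
    prompt = CONFIG["prompt_template"].format(sources=sources, question=question)
    output = generator(prompt, max_new_tokens=CONFIG["max_new_tokens"], return_full_text=False)
    return output[0]["generated_text"]
```

In the real system, swapping models or prompt formats would just mean editing the config file rather than touching this code.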
Pit Stop
To keep things short and hopefully not boring, we’ll pause here. In the next part, I’ll dive into the technical design decisions that shape this system. And after that, I’ll walk through how it all comes together — running live, end-to-end, in a cloud environment.