
Live Demonstration

This demonstration was run on a cloud instance with an RTX A5000 GPU, using Microsoft’s Phi model as the generator. If you plan to use Mistral as the language model instead, note that it requires a Hugging Face API key, as its weights are not publicly accessible. Phi and GPT models can be used without a key by selecting them in config.yml.
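For reference, swapping generators might look something like the following in config.yml. The keys here are illustrative assumptions about the schema, not the project’s actual file:

```yaml
# Hypothetical config.yml sketch: key names are illustrative assumptions,
# not the project's actual schema.
llm:
  model_name: microsoft/phi-2   # Phi runs without a token
  # model_name: mistralai/Mistral-7B-Instruct-v0.2   # gated: needs a token
  hf_api_key: ""                # set only when using a gated model
```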

The first run will take longer, as the model weights and ChromaDB’s embedding function are downloaded on first use and cached thereafter. Below is a video of the system in action. (The response time is noticeably slow due to GPU limitations.)
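For a sense of where that first-run cost comes from, here is a minimal Python sketch of the two downloads involved, assuming the Hugging Face transformers library and ChromaDB’s default embedding function. The model and collection names are illustrative, not taken from the project:

```python
# Both calls below download artefacts on first use and hit a local
# cache on subsequent runs. Names here are illustrative assumptions.
import chromadb
from chromadb.utils import embedding_functions
from transformers import AutoModelForCausalLM, AutoTokenizer

# Several GB of generator weights, fetched once into the Hugging Face cache.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", device_map="auto"  # device_map requires `accelerate`
)

# ChromaDB's default embedding function (a small MiniLM model) is also
# downloaded the first time it is used.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    "papers", embedding_function=embedding_functions.DefaultEmbeddingFunction()
)
```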

Video

[Video: the system handling a live query end to end]

What Went Well…

  • Surprisingly good output from Phi: The Phi model generated coherent summaries with proper inline citations for the given query after multiple rounds of prompt tuning; no fine-tuning was involved.
  • Vector store growth: Documents retrieved from the API were successfully vectorised and stored in ChromaDB for future use; a sketch of this step follows the list.
  • Responsive UI: The interface remained snappy and interactive throughout the demonstration.
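As referenced above, here is a minimal sketch of the vector-store growth step, assuming ChromaDB’s Python client. The collection name and paper fields are hypothetical:

```python
import chromadb

# Hypothetical caching step: collection name and paper fields are
# assumptions, not lifted from the project's code.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("papers")

def cache_results(papers: list[dict]) -> None:
    """Embed (via the collection's embedding function) and persist newly
    fetched ArXiv results so future queries can hit the local store."""
    collection.add(
        ids=[p["arxiv_id"] for p in papers],
        documents=[p["abstract"] for p in papers],
        metadatas=[{"title": p["title"]} for p in papers],
    )
```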

…And What Didn’t

  • Slow inference: The time to generate a summary was painfully long. This could be mitigated with more powerful or distributed hardware, but that’s beyond the scope of this portfolio project.
  • Inconsistent summarisation: Citation formatting varied. While some responses were crisp and relevant, others veered off and generated excess tokens — a clear sign that better results would require fine-tuning on a domain-specific dataset.
  • Naive document ranking: A simple distance threshold was used to filter local results before falling back to the ArXiv API; a sketch of this logic follows the list. While functional, more advanced re-ranking techniques could improve precision.
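A sketch of that threshold-and-fallback retrieval, assuming ChromaDB for the local query and the arxiv Python package for the fallback. The threshold value and field names are assumptions:

```python
# Illustrative threshold-and-fallback retrieval. The cutoff value and
# the use of the `arxiv` package are assumptions, not the project's code.
import arxiv  # pip install arxiv

DISTANCE_THRESHOLD = 0.4  # assumed cutoff; lower distance = closer match

def retrieve(collection, query: str, k: int = 5) -> list[str]:
    # Query the local vector store first.
    res = collection.query(query_texts=[query], n_results=k)
    docs = [
        doc
        for doc, dist in zip(res["documents"][0], res["distances"][0])
        if dist < DISTANCE_THRESHOLD
    ]
    if docs:
        return docs
    # Nothing close enough locally: fall back to the ArXiv API.
    search = arxiv.Search(query=query, max_results=k)
    return [result.summary for result in arxiv.Client().results(search)]
```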

Conclusion

Despite its limitations, the system delivered a complete, working end-to-end RAG pipeline running in the cloud. Seeing it in action underscored both the promise and the computational demands of even small-scale LLMs. Document retrieval typically took ~11 seconds, while summarisation took ~60 seconds — both of which could be improved with stronger infrastructure. Still, as a self-contained, vendor-free solution built from the ground up, this was a very satisfying personal milestone.