Stunning image by Fabrizio Magoni, courtesy of Unsplash
Serving Models using Lightning Serve
Prelude
If you would like to jump straight into the code, here’s the link to the project.
Introduction
In the previous blog post, I explained how I fine-tuned a t5-small model for summarisation using LoRA (Low-Rank Adaptation). This post focuses on how I served that trained model efficiently using LitServe — a lightweight, production-ready serving framework built on top of FastAPI.
I came across LitServe and was curious to try it out for this particular project. While the same thing can be achieved manually with FastAPI and Uvicorn, LitServe (which uses both under the hood!) takes a “batteries-included” approach that removes a lot of the boilerplate code.
Benefits of LitServe
The following are the benefits I observed while using LitServe.
- Rapid prototyping: Not that it is necessarily good practice, but LitServe makes it easy enough that you can write the entire serving application in a single class.
- Built-in batching: With a couple of server arguments, you can have requests automatically batched together for better model throughput, instead of serving them sequentially.
- Clean hooks for setup and prediction: LitServe provides setup() and predict() methods to override with our custom logic for loading the model on startup and handling incoming requests respectively; a minimal sketch follows this list. This also spares you from writing custom batching logic by hand, which can be error-prone.
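To make these hooks concrete, here is a minimal sketch of what such a serving class could look like for this project. It is an illustration rather than the project’s exact code: it assumes the fine-tuned t5-small checkpoint (with the LoRA adapters already merged) lives in the local model/ directory as a regular Hugging Face checkpoint, that requests arrive as {"input": "..."} as in the example request further below, and that batching is enabled on the server (shown in the launch sketch later); the class and variable names are mine.

```python
import litserve as ls
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


class SummaryAPI(ls.LitAPI):
    def setup(self, device):
        # Runs once per worker at startup: load the fine-tuned checkpoint
        # from the local model/ directory and move it to the assigned device.
        self.tokenizer = AutoTokenizer.from_pretrained("model")
        self.model = AutoModelForSeq2SeqLM.from_pretrained("model").to(device)
        self.device = device

    def decode_request(self, request):
        # Pull the raw article text out of the JSON body: {"input": "..."}.
        return request["input"]

    def predict(self, texts):
        # With batching enabled, LitServe hands predict() a list of decoded
        # inputs; tokenise them together and generate one summary per input.
        encoded = self.tokenizer(
            ["summarize: " + t for t in texts],
            return_tensors="pt",
            padding=True,
            truncation=True,
        ).to(self.device)
        output_ids = self.model.generate(**encoded, max_new_tokens=64)
        return self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)

    def encode_response(self, summary):
        # Called once per request after LitServe splits the batch back up.
        return {"output": summary}
```

The decode_request() and encode_response() hooks map between the JSON payload and plain strings, so predict() only deals with text in and summaries out; the "summarize: " prefix is the usual T5 convention and should match whatever prompt format the model was fine-tuned with.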
Project
The project is organised as follows:
- app/ contains the main logic for serving and loading our fine-tuned model.
- model/ contains the weights of the fine-tuned model. While the usual practice is to store them in a remote bucket and fetch them during build or startup, I chose to commit them directly into the repository to keep things simple and cheap for me as a hobbyist.
- A Dockerfile is also included to containerise the application, so it can be deployed in any compatible environment to serve requests.
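For completeness, here is a rough sketch of how an entry point inside app/ might wire everything together. The module name, batch size, and timeout below are illustrative values, not the project’s actual settings.

```python
import litserve as ls

# Hypothetical entry point (e.g. app/server.py); SummaryAPI is the class
# sketched in the benefits section above.
from api import SummaryAPI

if __name__ == "__main__":
    server = ls.LitServer(
        SummaryAPI(),
        accelerator="auto",   # use a GPU if one is available, otherwise CPU
        max_batch_size=8,     # group up to 8 concurrent requests per forward pass
        batch_timeout=0.05,   # wait up to 50 ms to fill a batch
    )
    server.run(port=8080)     # exposes POST /predict, as used below
```

Passing max_batch_size and batch_timeout here is what switches on the automatic batching mentioned earlier, and port=8080 matches the example request below.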
Example Request
Once the server is running:
curl -X POST http://localhost:8080/predict \
-H "Content-Type: application/json" \
-d '{"input": "The CNN/Daily Mail dataset is a collection of news articles used for summarisation tasks..."}'
Returns
{
"output": "A dataset of news articles used for summarisation."
}
Conclusion
LitServe turned out to be a convenient tool for model serving: it let me focus on the model-loading and inference logic while it handled the REST interface and batching for me. While this setup currently runs locally, it is already containerised and ready to deploy on any compatible cloud platform.