
Summarisation Fine-Tuning

Prelude

If you would like to jump straight into the code, here’s the link to the project.

Introduction

The task of summarisation warrants no explanation. Given a large body of text, we (or in this case the systems we build) try to summarise the important points into a concise paragraph.

Historically, summarisation systems have been categorised in numerous ways, but they essentially fall into two broad types:

Extractive summarisation treats the task as a classification problem: each sentence is a binary classification instance, where 0 means the sentence is excluded from the final summary and 1 means it is included. This is one of the simpler ways of producing a summary, as it does not suffer from hallucination, but it is very limited in terms of capability.

Abstractive summarisation, on the other hand, tries to generate a completely new paragraph based on the “meaning” derived from the original body of text. While it is far more expressive, it is also far more complicated to implement and suffers from a number of potential pitfalls. However, language models with an Encoder-Decoder architecture are usually good at this task, because summarisation is naturally a conditional sequence transduction problem: we convert a long sequence of text (the input) into a shorter sequence of text (the summary) conditioned on that input.

Objective

In this project, I fine-tune the small variant of T5 (the Text-to-Text Transfer Transformer, an Encoder-Decoder model; more info here). The small variant has few enough parameters to fit on a cheap cloud instance, while remaining expressive enough to perform our task comfortably well.
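As a rough illustration of the starting point, the pretrained checkpoint and its tokenizer can be pulled from the Hugging Face Hub like this (the example text is a placeholder; the exact loading code in main.ipynb may differ):

```python
# Minimal sketch: load the pretrained t5-small checkpoint and its tokenizer.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# T5 is trained with task prefixes, so summarisation inputs start with "summarize: ".
text = "summarize: " + "Long article text goes here ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(**inputs, max_length=128)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```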

Since fine-tuning the entire model would be time-consuming (the full model has about 60M parameters, after all), I fine-tune it using a handy technique I learnt in my course called LoRA. Low-Rank Adaptation (LoRA) introduces two small low-rank matrices (hence the name) into specific weights, typically the query and value projections of the model's attention blocks, and only these matrices are updated during fine-tuning. Since the LoRA parameters are a minute fraction of the full parameter count, training is far more efficient than full-fledged fine-tuning. This is crucial in low-resource environments, and it is often more flexible and powerful than simply unfreezing specific layers of the model. In the notebook main.ipynb we can see that we train only 0.48% of the parameters!
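Continuing from the model loaded above, the adapter setup with the peft library looks roughly like this (the rank, alpha, and dropout values are illustrative assumptions, not necessarily the exact settings used in main.ipynb):

```python
# Sketch: wrap the base T5 model with LoRA adapters on the attention projections.
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # T5 is an encoder-decoder (seq2seq) model
    r=8,                              # rank of the low-rank update matrices (assumed)
    lora_alpha=32,                    # scaling factor for the LoRA update (assumed)
    lora_dropout=0.1,
    target_modules=["q", "v"],        # query and value projections in the attention blocks
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # reports the tiny trainable-parameter fraction
```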

Side Note: When to use full fine-tuning vs LoRA?

Full fine-tuning updates all the parameters of a pretrained model, making it memory and computation intensive, which does not scale well with model size. LoRA, on the other hand, freezes the base model and only introduces small trainable matrices into specific layers, typically in the attention blocks. This allows us to train effectively while only updating a minute fraction of the parameters.

It’s ideal for cases where compute or memory is limited, or when we need to fine-tune multiple models efficiently. However, because LoRA reduces parameter updates to a low-rank subspace, it may underperform in tasks that require high expressivity or when the rank is too low. In such cases, full fine-tuning might still be necessary. LoRA seems to work especially well in transformer-based architectures because attention layers are naturally suited for low-rank updates without heavily affecting performance.

Implementation

Everything is done end-to-end with Hugging Face: fetching the pretrained model and the dataset (CNN/Daily Mail news summaries), fine-tuning with LoRA, and even the training and checkpointing loops. This allows for fast prototyping. While it black-boxes the actual work underneath, most real-world tasks can be accomplished on a similar pipeline by swapping the model and the types of inputs and outputs.
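A condensed sketch of that pipeline, reusing the tokenizer and peft_model from the snippets above, might look as follows (the batch size, learning rate, and other training arguments are assumptions rather than the exact values in main.ipynb):

```python
# Sketch: tokenise CNN/Daily Mail and fine-tune the LoRA-wrapped model.
from datasets import load_dataset
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

dataset = load_dataset("cnn_dailymail", "3.0.0")

def preprocess(batch):
    # Truncate inputs to 512 tokens and summaries to 128 tokens.
    inputs = tokenizer(
        ["summarize: " + article for article in batch["article"]],
        max_length=512, truncation=True,
    )
    labels = tokenizer(text_target=batch["highlights"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-lora-summarisation",  # assumed output path
    per_device_train_batch_size=8,             # assumed
    learning_rate=1e-4,                        # assumed
    num_train_epochs=1,
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=peft_model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=peft_model),
)
trainer.train()
```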

Project

Thanks to the Hugging Face libraries, the project footprint is small: main.ipynb runs the required steps, and utils.py provides the helper methods I moved out to tidy up the notebook.

Results

While the metric most commonly used to measure summarisation performance is Recall-Oriented Understudy for Gisting Evaluation (ROUGE), it does not capture the entire picture, particularly for abstractive summarisation. This is because ROUGE directly compares n-gram overlap and therefore gives no credit for synonyms the model may have used during generation.

Therefore I opted to use BERTScore F1, which generates BERT contextual embeddings for both the candidate summary (generated by our system) and the reference summary (the gold-standard label) and computes the maximum similarity achievable between the two sets of embeddings. The same metric, on the same test set, was then measured for the vanilla model (what Hugging Face returns by default) and our LoRA fine-tuned version.
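For reference, BERTScore F1 is straightforward to compute with the evaluate library; the candidate and reference lists below are placeholders:

```python
# Sketch: compute BERTScore F1 for generated summaries against gold references.
import evaluate

bertscore = evaluate.load("bertscore")
candidates = ["summary generated by the model ..."]   # placeholder
references = ["gold-standard reference summary ..."]  # placeholder

scores = bertscore.compute(predictions=candidates, references=references, lang="en")
mean_f1 = sum(scores["f1"]) / len(scores["f1"])
print(f"BERTScore F1: {mean_f1:.4f}")
```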

Tag             | BERTScore F1
Vanilla Model   | 0.8594
LoRA Fine-Tuned | 0.8665

As you can see, fine-tuning only the small set of extra parameters introduced by LoRA, and only for one epoch, already shows a noticeable improvement in the F1 metric. This can be improved further by training for more epochs, or by using a bigger model and a longer context.

Concerns

While the project covers most aspects of a typical fine-tuning workflow, it is not ideal for several reasons, which I explain to the best of my ability below.

  1. Fine-tuned on CNN/Daily Mail: The base t5-small model is already pre-trained on a massive multitask corpus (including summarisation data), so fine-tuning it again on CNN/DailyMail doesn’t test domain transfer. However, this project is intended to showcase the process of adapter-based fine-tuning — the same code can be reused with domain-specific or custom summarisation datasets.

  2. Fine-Tuned for only one epoch: This was a deliberate trade-off to keep compute costs low. The model was trained on an A5000 GPU and the entire fine-tuning process cost ~$1 in cloud credits. Even within these constraints, it showed a 0.71-point improvement in BERTScore F1 (0.8594 to 0.8665). Further improvement is likely with 2–3 epochs or a switch to t5-base.

  3. Not optimised for inference speed or deployment: The primary focus here was model fine-tuning, not deployment. However, Hugging Face makes it straightforward to export the model or LoRA weights and run inference via transformers, ONNX, or Accelerate.

  4. Restricted Input to 512 tokens and output to 128 tokens: The input was truncated to 512 tokens and the output to 128 tokens to ensure memory efficiency and prevent out-of-memory errors during training on the cloud. These limits are primarily a training-time constraint. At inference time, longer inputs can be processed by chunking the text into 512-token segments and summarising them sequentially; the resulting summaries can then be stitched together in post-processing (a rough sketch of this idea follows below). Alternatively, longer-context models or sliding-window techniques could be explored for better coherence across chunks.
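To make the chunking idea in point 4 concrete, here is a rough sketch, reusing the tokenizer and peft_model names from the earlier snippets; the chunking strategy itself is an illustration, not the project's code:

```python
def summarise_long_text(text, max_input_tokens=512, max_summary_tokens=128):
    # Tokenise the whole article once, then slice it into fixed-size windows.
    token_ids = tokenizer(text, truncation=False)["input_ids"]
    summaries = []
    for start in range(0, len(token_ids), max_input_tokens):
        chunk = tokenizer.decode(token_ids[start:start + max_input_tokens],
                                 skip_special_tokens=True)
        inputs = tokenizer("summarize: " + chunk, return_tensors="pt",
                           truncation=True, max_length=max_input_tokens)
        output_ids = peft_model.generate(**inputs, max_length=max_summary_tokens)
        summaries.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    # Naive stitching; coherence across chunk boundaries is not guaranteed.
    return " ".join(summaries)
```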