This Part 2 article follows up on my recently published Part 1 on automatic text summarization: the field of NLP concerned with using computers to summarize documents without any human involvement.
I undertook this project for my MSc thesis, and that first article gave the preliminary background and some of my early thoughts. This post follows up with my main findings and some concluding thoughts.
What did I set out to achieve?
In conjunction with The Data Analysis Bureau, I identified two areas of opportunity within the field of text summarization:
Evaluation: automatically evaluating summaries is not straightforward. The prevailing approaches are the ROUGE automatic metrics, which are insensitive and biased, and human evaluation, which is expensive and impractical. My goal was to identify better automatic evaluation metrics.
Long document summarization: Transformer-based sequence-to-sequence models are the best way of summarizing short documents, but their space and time complexity grows quadratically with input sequence length. My goal here was to show that approximate versions of the Transformer (with lower space and time complexity) can outperform the leading models on long document summarization tasks.
How did I test these?
The first goal was to find a better set of automatic evaluation metrics. A number of modern “model-based” alternatives using Transformers have recently been proposed, but these have not been rigorously tested in a summarization setting.
For the experiments I chose four promising recent metrics: MoverScore, BLEURT, BERTScore, and my own metric, BARTScore. Each was compared to ROUGE using five rigorous tests spanning three different datasets. For example, one of our tests assessed how well each metric correlated with the gold standard (human judgement) when evaluating summaries. The hope was that the model-based metrics would outperform the ROUGE metrics and could therefore be proposed as the prevailing metrics going forward.
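To make the comparison concrete, here is a minimal sketch of scoring a single candidate summary with ROUGE and with BERTScore, using the open-source `rouge-score` and `bert-score` Python packages (the texts are made up for illustration, and this is not the exact evaluation harness used in the thesis):

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The committee approved the new climate policy after a lengthy debate."
candidate = "After long discussions, the panel agreed to the updated climate plan."

# ROUGE: n-gram overlap, so paraphrases with little lexical overlap score poorly.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

# BERTScore: token similarity in contextual embedding space, so it can reward paraphrases.
P, R, F1 = bert_score([candidate], [reference], lang="en", verbose=False)
print("BERTScore F1:", round(F1.item(), 3))
```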
The second goal focused on architectures for long document summarization. The bottleneck with Transformers is their self-attention layer, which scales quadratically with input sequence length, so the architecture quickly becomes prohibitively expensive for longer documents. Several models tackling this issue have recently been proposed, and I focus mainly on the Longformer Encoder Decoder (LED). This model is essentially BART, a leading summarization model, but with the normal “dense” self-attention layer replaced by a “sparse”, lower-complexity self-attention layer. This reduces the space and time complexity from quadratic to linear, thereby making the model usable for longer documents.
My hypothesis was that the LED would perform similarly to BART for short documents and outperform it for long documents. The reason is that the standard workaround when using Transformers such as BART for long document summarization is to truncate the document at roughly 1,000 tokens. This obviously causes large information loss; the LED needs to truncate far less of the document and should therefore perform better.
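To illustrate what that looks like in practice, here is a minimal sketch of summarizing a long document with the LED via the Hugging Face `transformers` library; `long_document` is a placeholder, and the generation settings are illustrative rather than those used in my experiments:

```python
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

long_document = "..."  # placeholder: the full text of a research paper

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

# BART would have to truncate the input at ~1,000 tokens; the LED accepts up to 16,384.
inputs = tokenizer(long_document, max_length=16384, truncation=True, return_tensors="pt")

# The LED's sparse attention requires global attention on at least one token;
# the convention is to place it on the first (<s>) token.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    max_length=256,
    num_beams=4,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```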
My first set of experiments therefore compared BART and the LED on the standard summarization task: given articles from the CNN and DailyMail websites, the goal is to reproduce a short summary (written by the author of the article) conditioned on the source article. The second set of experiments tested how well the LED and BART perform on longer documents, so we used the PubMed and arXiv datasets. These are open-access repositories of research papers from medicine and maths/science/computer science respectively. The goal here is to reproduce the paper's abstract conditioned on the full paper.
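For reference, all three corpora are available through the Hugging Face `datasets` library; the sketch below shows one way to load them (the dataset identifiers are those published on the Hub and may differ from the exact versions I used):

```python
from datasets import load_dataset

# Short-document news summarization: article -> author-written highlights.
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="validation[:1%]")
print(cnn_dm[0]["highlights"])

# Long-document summarization: full research paper -> abstract.
pubmed = load_dataset("scientific_papers", "pubmed", split="validation[:1%]")
arxiv = load_dataset("scientific_papers", "arxiv", split="validation[:1%]")
print(len(arxiv[0]["article"].split()), "words in the first arXiv paper")
```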
What were the major findings?
I can happily report that the results of our experiments were largely what we were expecting! The model-based metrics outperformed ROUGE on almost every occasion and showed much higher correlation with human judgement. BARTScore and BERTScore performed best on most tasks, so I would recommend using either of these in place of ROUGE to automatically evaluate summaries.

Regarding the long document summarization models, our results showed only a very modest (and never statistically significant) drop in performance when comparing the LED with BART on short document summarization. For the long document summarization tasks, the results were mixed. On the arXiv dataset, the LED outperformed BART (and, incidentally, PEGASUS, the state-of-the-art approach at the time of writing); this was expected, as the LED truncates after the 4,000th word whereas BART truncates after the 1,000th. However, the two models performed similarly on the PubMed dataset. I attribute this to two factors: in PubMed documents the important information tends to be clustered at the beginning, and arXiv documents are generally much longer than PubMed documents (over 60% longer).
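As a footnote on how the correlation with human judgement can be measured, here is a minimal sketch using Spearman's rank correlation from `scipy`; the scores below are invented purely for illustration and are not results from the thesis:

```python
from scipy.stats import spearmanr

# Hypothetical per-summary scores: human ratings alongside two automatic metrics.
human = [4.5, 2.0, 3.5, 1.0, 4.0]
rouge_l = [0.42, 0.39, 0.41, 0.30, 0.38]
bertscore = [0.91, 0.78, 0.86, 0.70, 0.89]

# A higher rank correlation means the metric orders summaries more like a human would.
rho_rouge, _ = spearmanr(rouge_l, human)
rho_bert, _ = spearmanr(bertscore, human)
print(f"ROUGE-L vs human: {rho_rouge:.2f}, BERTScore vs human: {rho_bert:.2f}")
```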
Thanks for reading and I hope you enjoyed the article. This will likely be my last post on text summarization, but I will follow it up with any interesting future work I get stuck into! As ever, please get in touch if you have any questions or would like to find out more!