How to use LLMs for text summarization and abstraction

In recent years, there has been a tremendous improvement in the field of natural language processing (NLP) with the introduction of large language models (LLMs) like GPT-3, BERT, and T5. These models have revolutionized various NLP tasks, including text summarization and abstraction.

Text summarization is the process of condensing a long document into a concise summary that captures its essential information. Text abstraction (abstractive summarization), on the other hand, generates the summary in new words, paraphrasing and restructuring the source text rather than copying its sentences verbatim.
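
To make the distinction concrete, here is a quick zero-shot sketch using the Transformers summarization pipeline, before any fine-tuning. The model choice and example text are placeholders for illustration only, not the setup used later in this tutorial:

from transformers import pipeline

# Zero-shot abstractive summarization with a pretrained T5 checkpoint
summarizer = pipeline('summarization', model='t5-base')

article = ("The city council voted on Tuesday to expand the bike-lane network, "
           "citing a sharp rise in cycling over the past two years and pressure "
           "from residents to improve road safety.")

result = summarizer(article, max_length=30, min_length=5, do_sample=False)
print(result[0]['summary_text'])
# An abstractive model tends to rephrase the source in new words rather than
# copy its sentences verbatim.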

In this tutorial, we will explore how to use LLMs for text summarization and abstraction using the Hugging Face Transformers library in Python. We will walk through the steps of preprocessing the data, fine-tuning the LLM on a summarization dataset, and generating summaries and abstractions from new text inputs.

Prerequisites

Before we get started, ensure that you have the following prerequisites installed on your system:

  • Python 3.6 or higher
  • pip package manager
  • virtualenv (optional, but recommended)

To install the necessary libraries, run the following commands (sentencepiece is required by the T5 tokenizer and jsonlines by the preprocessing script used later):

pip install transformers
pip install torch
pip install sentencepiece
pip install jsonlines

Once you have the prerequisites installed, we can proceed with the tutorial.

Preprocessing the data

Text summarization and abstraction models often require large amounts of preprocessed data for training. In this tutorial, we will use the CNN/DailyMail dataset, a popular benchmark dataset for text summarization. The dataset consists of news articles paired with bullet point summaries.

To download the dataset, run the following command:

wget https://s3.amazonaws.com/datasets.huggingface.co/summarization/cnn_dm_v2.tgz
tar -xzvf cnn_dm_v2.tgz

This will download and extract the dataset into the current directory.
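
If the direct download link is unavailable, the same data can also be loaded through the Hugging Face datasets library (install it with pip install datasets). A minimal sketch, assuming the cnn_dailymail dataset name on the Hub, which exposes the same article and highlights fields used in the preprocessing script below:

from datasets import load_dataset

# Load the CNN/DailyMail training split from the Hugging Face Hub
dataset = load_dataset('cnn_dailymail', '3.0.0', split='train')

print(dataset[0]['article'][:200])
print(dataset[0]['highlights'])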

Next, let’s preprocess the data by converting it into a format suitable for fine-tuning our LLM. We will create a Python script called preprocess.py and add the following code:

import jsonlines

input_file = 'cnn_dm_v2.0/cnn_dm.jsonl'
output_file = 'preprocessed_cnn_dm.txt'

with open(output_file, 'w') as f:
    with jsonlines.open(input_file) as reader:
        for obj in reader:
            # Collapse newlines so each example fits on a single output line
            text = obj['article'].replace('\n', ' ')
            summary = obj['highlights'].replace('\n', ' ')
            f.write(f'summary: {summary}  text: {text}\n')

Save the file and run it using the following command:

python preprocess.py

This will preprocess the dataset and create a file named preprocessed_cnn_dm.txt, where each line has the format summary: <summary>  text: <text>.
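
As a quick sanity check, you can print the first preprocessed line to confirm the format before moving on:

# Optional: inspect the first preprocessed example
with open('preprocessed_cnn_dm.txt', 'r') as f:
    print(f.readline()[:300])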

Fine-tuning the LLM

Now that we have preprocessed our data, we can fine-tune an LLM on the preprocessed dataset. For this tutorial, we will use the T5 model, which has shown excellent performance for text summarization and abstraction tasks.

First, we need to import the necessary libraries and define some constants:

import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

MODEL_NAME = 't5-base'
MODEL_PATH = 't5_finetuned_summarization_model'
TOKENIZER_PATH = 't5_tokenizer'

Next, let’s load the preprocessed dataset and tokenizer:

with open('preprocessed_cnn_dm.txt', 'r') as f:
    data = f.readlines()

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)

We will now split each preprocessed line back into its article text and reference summary, tokenize both, and prepare the input and label tensors for training:

# Recover the article and reference summary from each preprocessed line
texts, summaries = [], []
for line in data:
    summary_part, text_part = line.split('  text: ', 1)
    summaries.append(summary_part.replace('summary: ', '', 1).strip())
    texts.append(text_part.strip())

inputs = tokenizer([f'summarize: {text}' for text in texts],
                   max_length=512, truncation=True, padding='longest',
                   return_tensors='pt')

labels = tokenizer(summaries,
                   max_length=150, truncation=True, padding='longest',
                   return_tensors='pt')

input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']
labels = labels['input_ids']
# Ignore padding positions when computing the loss
labels[labels == tokenizer.pad_token_id] = -100

Note that we prepend the "summarize: " prefix to each training example to let the LLM know that it is a summarization task.
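
If you want to verify that the prefix and the 512-token truncation behave as expected, you can decode one tokenized example back into text:

# Optional: decode the first training input back to text
print(tokenizer.decode(input_ids[0], skip_special_tokens=True)[:200])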

Now, let’s load the pretrained model for fine-tuning. We use from_pretrained so that training starts from the pretrained T5 weights rather than from a randomly initialized model:

model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

Next, we will define a training function to fine-tune the model:

def train(model, dataloader, optimizer, device):
    model.train()
    total_loss = 0

    # TensorDataset batches come back as plain tuples of tensors
    for input_ids, attention_mask, labels in dataloader:
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask,
                        labels=labels)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()

    return total_loss / len(dataloader)

Now, let’s create a dataloader and train the model:

BATCH_SIZE = 8
EPOCHS = 3

dataset = torch.utils.data.TensorDataset(input_ids, attention_mask, labels)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCHS):
    train_loss = train(model, dataloader, optimizer, device)
    print(f'Epoch {epoch+1}/{EPOCHS} - Train Loss: {train_loss}')

model.save_pretrained(MODEL_PATH)
tokenizer.save_pretrained(TOKENIZER_PATH)

After training, the fine-tuned model and tokenizer will be saved in the t5_finetuned_summarization_model and t5_tokenizer directories, respectively, in a format that from_pretrained can load directly.

Generating summaries and abstractions

Now that we have a fine-tuned model, we can generate summaries and abstractions for new text inputs. To do this, we can load the model and tokenizer from the saved files.

First, let’s define a function to generate summaries:

def generate_summary(text, model, tokenizer, device):
    model.eval()

    inputs = tokenizer([f'summarize: {text}'],
                       max_length=512, truncation=True, padding='longest',
                       return_tensors='pt')

    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)

    output = model.generate(input_ids=input_ids,
                            attention_mask=attention_mask,
                            max_length=150,  # change according to desired summary length
                            num_beams=4,
                            early_stopping=True)

    summary = tokenizer.decode(output[0], skip_special_tokens=True)

    return summary

Now, let’s define a function to generate abstractions. Keep in mind that T5 only responds meaningfully to task prefixes it has seen during training, so the "abstract: " prefix below will only behave differently from "summarize: " if the model has also been fine-tuned on abstraction examples with that prefix:

def generate_abstraction(text, model, tokenizer, device):
    model.eval()

    inputs = tokenizer([f'abstract: {text}'],
                       max_length=512, truncation=True, padding='longest',
                       return_tensors='pt')

    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)

    output = model.generate(input_ids=input_ids,
                            attention_mask=attention_mask,
                            max_length=150,  # change according to desired abstraction length
                            num_beams=4,
                            early_stopping=True)

    abstraction = tokenizer.decode(output[0], skip_special_tokens=True)

    return abstraction

With these functions in place, we can now load the fine-tuned model and tokenizer and use them to generate summaries and abstractions:

model = T5ForConditionalGeneration.from_pretrained(MODEL_PATH)
tokenizer = T5Tokenizer.from_pretrained(TOKENIZER_PATH)
model.to(device)

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus ultrices dapibus urna ac commodo."
summary = generate_summary(text, model, tokenizer, device)
abstraction = generate_abstraction(text, model, tokenizer, device)

print('Summary:', summary)
print('Abstraction:', abstraction)

Make sure to replace text with the actual input text you want to summarize or abstract.
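
In practice you will usually feed in full articles. As a rough sketch, here is how you might run the fine-tuned model over the first few examples from the preprocessed file and compare the output against the reference summaries (the line parsing mirrors the format produced by preprocess.py above):

# Sketch: summarize the first three preprocessed articles and show the references
with open('preprocessed_cnn_dm.txt', 'r') as f:
    for _, line in zip(range(3), f):
        reference, article = line.split('  text: ', 1)
        generated = generate_summary(article.strip(), model, tokenizer, device)
        print('Reference:', reference.replace('summary: ', '', 1).strip())
        print('Generated:', generated)
        print('---')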

And that’s it! You have now learned how to use LLMs for text summarization and abstraction using the Hugging Face Transformers library in Python.

Conclusion

LLMs like T5 have revolutionized various NLP tasks, including text summarization and abstraction. In this tutorial, we explored how to use the Hugging Face Transformers library to fine-tune an LLM on a summarization dataset and generate summaries and abstractions from new text inputs. We also discussed the preprocessing steps and trained the model using the CNN/DailyMail dataset.

By leveraging the power of LLMs, you can now build powerful text summarization and abstraction systems that can condense and rephrase lengthy texts, opening up possibilities for automated content generation, information retrieval, and more.
