{"id":4245,"date":"2023-11-04T23:14:10","date_gmt":"2023-11-04T23:14:10","guid":{"rendered":"http:\/\/localhost:10003\/how-to-use-llms-for-text-summarization-and-abstraction\/"},"modified":"2023-11-05T05:47:55","modified_gmt":"2023-11-05T05:47:55","slug":"how-to-use-llms-for-text-summarization-and-abstraction","status":"publish","type":"post","link":"http:\/\/localhost:10003\/how-to-use-llms-for-text-summarization-and-abstraction\/","title":{"rendered":"How to use LLMs for text summarization and abstraction"},"content":{"rendered":"
In recent years, there has been a tremendous improvement in the field of natural language processing (NLP) with the introduction of large language models (LLMs) like GPT-3, BERT, and T5. These models have revolutionized various NLP tasks, including text summarization and abstraction.<\/p>\n
Text summarization is the process of condensing a long document into a concise summary that captures the essential information. Text abstraction, on the other hand, generates a summary by rephrasing the original text in new wording and sentence structures rather than copying sentences verbatim.<\/p>\n
In this tutorial, we will explore how to use LLMs for text summarization and abstraction using the Hugging Face Transformers library in Python. We will walk through the steps of preprocessing the data, fine-tuning the LLM on a summarization dataset, and generating summaries and abstractions from new text inputs.<\/p>\n
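If you just want to see abstractive summarization in action before building anything, the Transformers pipeline API can produce a summary out of the box. The snippet below is a minimal sketch, assuming the transformers<\/code> and torch<\/code> packages listed in the prerequisites are already installed; t5-small<\/code> is simply one small, publicly available checkpoint chosen for a quick demo.<\/p>\nfrom transformers import pipeline\n\n# Load a ready-made summarization pipeline with a small pretrained checkpoint.\nsummarizer = pipeline('summarization', model='t5-small')\n\narticle = ('The Eiffel Tower is 324 metres tall, about the same height as an 81-storey '\n           'building, and was the tallest man-made structure in the world for 41 years.')\n\n# The pipeline returns a list of dicts, each with a 'summary_text' key.\nprint(summarizer(article, max_length=40, min_length=5, do_sample=False)[0]['summary_text'])\n<\/code><\/pre>\nFine-tuning, which the rest of this tutorial walks through, is how you adapt such a model to a specific domain and summary style.<\/p>\n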
Before we get started, ensure that you have the following prerequisites installed on your system:<\/p>\n
Python 3.7 or later<\/li>\npip<\/code> package manager<\/li>\nvirtualenv<\/code> (optional, but recommended)<\/li>\n<\/ul>\nTo install the necessary libraries, run the following commands:<\/p>\n
pip install transformers\npip install torch\npip install jsonlines\n<\/code><\/pre>\nOnce you have the prerequisites installed, we can proceed with the tutorial.<\/p>\n
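As a quick optional check that everything is installed correctly (and whether a GPU is visible to PyTorch), you can run a few lines like the following:<\/p>\nimport torch\nimport transformers\n\n# Print library versions and whether a CUDA-capable GPU is available.\nprint('transformers:', transformers.__version__)\nprint('torch:', torch.__version__)\nprint('CUDA available:', torch.cuda.is_available())\n<\/code><\/pre>\n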
Preprocessing the data<\/h2>\n
Text summarization and abstraction models often require large amounts of preprocessed data for training. In this tutorial, we will use the CNN\/DailyMail dataset, a popular benchmark dataset for text summarization. The dataset consists of news articles paired with bullet point summaries.<\/p>\n
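As an aside, the same corpus is also available through the Hugging Face datasets<\/code> library. The sketch below shows that route purely for exploration; it assumes an extra pip install datasets<\/code> and is not required for the rest of this tutorial, which uses the archive downloaded next.<\/p>\nfrom datasets import load_dataset\n\n# Load a small slice of the CNN\/DailyMail corpus to inspect its structure.\ndataset = load_dataset('cnn_dailymail', '3.0.0', split='train[:1%]')\n\nexample = dataset[0]\nprint(example['article'][:300])   # the news article\nprint(example['highlights'])      # the reference summary\n<\/code><\/pre>\n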
To download the dataset, run the following command:<\/p>\n
wget https:\/\/s3.amazonaws.com\/datasets.huggingface.co\/summarization\/cnn_dm_v2.tgz\ntar -xzvf cnn_dm_v2.tgz\n<\/code><\/pre>\nThis will download and extract the dataset into the current directory.<\/p>\n
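Because the exact layout of the extracted archive can vary, it is worth confirming what was actually unpacked before preprocessing. The short sketch below simply walks the current directory; the file names it prints depend on the archive version you downloaded.<\/p>\nimport os\n\n# List what the archive unpacked so the input path used in preprocess.py can be verified.\nfor root, dirs, files in os.walk('.'):\n    for name in files:\n        print(os.path.join(root, name))\n<\/code><\/pre>\nIf the extracted directory or file names differ from those assumed in the next step, adjust the input_file<\/code> path accordingly.<\/p>\n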
Next, let’s preprocess the data by converting it into a format suitable for fine-tuning our LLM. We will create a Python script called preprocess.py<\/code> and add the following code:<\/p>\nimport jsonlines\n\n# The paths below assume the archive extracted to cnn_dm_v2.0\/; adjust if your layout differs.\ninput_file = 'cnn_dm_v2.0\/cnn_dm.jsonl'\noutput_file = 'preprocessed_cnn_dm.txt'\n\nwith open(output_file, 'w') as f:\n    with jsonlines.open(input_file) as reader:\n        for obj in reader:\n            # Flatten newlines so each example stays on a single output line.\n            text = obj['article'].replace('\\n', ' ')\n            summary = obj['highlights'].replace('\\n', ' ')\n            f.write(f'summary: {summary} text: {text}\\n')\n<\/code><\/pre>\nSave the file and run it using the following command:<\/p>\n
python preprocess.py\n<\/code><\/pre>\nThis will preprocess the dataset and create a file named preprocessed_cnn_dm.txt<\/code>. Each line in this file follows the format summary: <summary> text: <text><\/code>.<\/p>\nFine-tuning the LLM<\/h2>\n
Now that we have preprocessed our data, we can fine-tune an LLM on the preprocessed dataset. For this tutorial, we will use the T5 model, which has shown excellent performance for text summarization and abstraction tasks.<\/p>\n
First, we need to import the necessary libraries and define some constants:<\/p>\n
import torch\nfrom transformers import T5Tokenizer, T5ForConditionalGeneration\n\nMODEL_NAME = 't5-base'\n# Directories where the fine-tuned model and tokenizer will be saved with save_pretrained().\nMODEL_PATH = 't5_finetuned_summarization_model'\nTOKENIZER_PATH = 't5_tokenizer'\n<\/code><\/pre>\nNext, let’s load the preprocessed dataset and tokenizer:<\/p>\n
with open('preprocessed_cnn_dm.txt', 'r') as f:\n    data = f.readlines()\n\ntokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)\n<\/code><\/pre>\nWe will now tokenize the data and prepare it for training:<\/p>\n
texts, summaries = [], []\nfor line in data:\n    # Split each preprocessed line back into the reference summary (label) and article text (input).\n    summary_part, text_part = line.split(' text: ', 1)\n    summaries.append(summary_part.replace('summary: ', '', 1).strip())\n    texts.append(text_part.strip())\n\ninputs = tokenizer([f'summarize: {text}' for text in texts],\n                   max_length=512, truncation=True, padding='longest',\n                   return_tensors='pt')\n\ntargets = tokenizer(summaries,\n                    max_length=150, truncation=True, padding='longest',\n                    return_tensors='pt')\n\ninput_ids = inputs['input_ids']\nattention_mask = inputs['attention_mask']\nlabels = targets['input_ids']\n\n# Padding positions in the labels are set to -100 so the loss ignores them.\nlabels[labels == tokenizer.pad_token_id] = -100\n<\/code><\/pre>\nNote that we prepend the \"summarize: \"<\/code> prefix to each article to let the LLM know that it is a summarization task, and that the training labels are the reference summaries rather than the full preprocessed lines.<\/p>\nNow, we will load the pretrained model and set up the optimizer for fine-tuning:<\/p>\n
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)\n\ndevice = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\nmodel.to(device)\n\noptimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)\n<\/code><\/pre>\nNote that we load the pretrained t5-base<\/code> weights with from_pretrained<\/code> instead of instantiating the model from a bare configuration, since fine-tuning only makes sense when starting from pretrained weights.<\/p>\nNext, we will define a training function to fine-tune the model:<\/p>\n
def train(model, dataloader, optimizer, device):\n    model.train()\n    total_loss = 0\n\n    for batch in dataloader:\n        # The TensorDataset below yields (input_ids, attention_mask, labels) tuples.\n        input_ids, attention_mask, labels = [t.to(device) for t in batch]\n\n        optimizer.zero_grad()\n        outputs = model(input_ids=input_ids, attention_mask=attention_mask,\n                        labels=labels)\n        loss = outputs.loss\n        total_loss += loss.item()\n        loss.backward()\n        optimizer.step()\n\n    return total_loss \/ len(dataloader)\n<\/code><\/pre>\nNow, let’s create a dataloader and train the model:<\/p>\n
BATCH_SIZE = 8\nEPOCHS = 3\n\ndataset = torch.utils.data.TensorDataset(input_ids, attention_mask, labels)\ndataloader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)\n\nfor epoch in range(EPOCHS):\n    train_loss = train(model, dataloader, optimizer, device)\n    print(f'Epoch {epoch+1}\/{EPOCHS} - Train Loss: {train_loss:.4f}')\n\nmodel.save_pretrained(MODEL_PATH)\ntokenizer.save_pretrained(TOKENIZER_PATH)\n<\/code><\/pre>\nAfter training, the fine-tuned model and tokenizer will be saved in the t5_finetuned_summarization_model<\/code> and t5_tokenizer<\/code> directories, respectively, so both can be reloaded later with from_pretrained<\/code>.<\/p>\nGenerating summaries and abstractions<\/h2>\n
Now that we have a fine-tuned model, we can generate summaries and abstractions for new text inputs. To do this, we load the model and tokenizer from the directories we saved them to.<\/p>\n
First, let’s define a function to generate summaries:<\/p>\n
def generate_summary(text, model, tokenizer, device):\n    model.eval()\n\n    inputs = tokenizer([f'summarize: {text}'],\n                       max_length=512, truncation=True, padding='longest',\n                       return_tensors='pt')\n\n    input_ids = inputs['input_ids'].to(device)\n    attention_mask = inputs['attention_mask'].to(device)\n\n    output = model.generate(input_ids=input_ids,\n                            attention_mask=attention_mask,\n                            max_length=150,  # change according to desired summary length\n                            num_beams=4,\n                            early_stopping=True)\n\n    summary = tokenizer.decode(output[0], skip_special_tokens=True)\n\n    return summary\n<\/code><\/pre>\nNow, let’s define a similar function to generate abstractions. Keep in mind that our model was fine-tuned only on the summarize:<\/code> prefix, so the abstract:<\/code> prefix will only produce meaningfully different output if you also include abstraction-style examples with that prefix in the fine-tuning data.<\/p>\n
def generate_abstraction(text, model, tokenizer, device):\n    model.eval()\n\n    inputs = tokenizer([f'abstract: {text}'],\n                       max_length=512, truncation=True, padding='longest',\n                       return_tensors='pt')\n\n    input_ids = inputs['input_ids'].to(device)\n    attention_mask = inputs['attention_mask'].to(device)\n\n    output = model.generate(input_ids=input_ids,\n                            attention_mask=attention_mask,\n                            max_length=150,  # change according to desired abstraction length\n                            num_beams=4,\n                            early_stopping=True)\n\n    abstraction = tokenizer.decode(output[0], skip_special_tokens=True)\n\n    return abstraction\n<\/code><\/pre>\nWith these functions in place, we can now load the fine-tuned model and tokenizer and use them to generate summaries and abstractions:<\/p>\n
model = T5ForConditionalGeneration.from_pretrained(MODEL_PATH)\ntokenizer = T5Tokenizer.from_pretrained(TOKENIZER_PATH)\nmodel.to(device)\n\ntext = \"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus ultrices dapibus urna ac commodo.\"\nsummary = generate_summary(text, model, tokenizer, device)\nabstraction = generate_abstraction(text, model, tokenizer, device)\n\nprint('Summary:', summary)\nprint('Abstraction:', abstraction)\n<\/code><\/pre>\nMake sure to replace text<\/code> with the actual input text you want to summarize or abstract.<\/p>\nAnd that’s it! You have now learned how to use LLMs for text summarization and abstraction using the Hugging Face Transformers library in Python.<\/p>\n
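One practical caveat before we wrap up: the tokenizer calls above truncate inputs at 512 tokens, so very long articles lose content before the model ever sees it. A simple workaround is to summarize a long document in pieces and then summarize the concatenation of those partial summaries. The helper below is a rough sketch of that idea; the chunk_size<\/code> of 400 words is an arbitrary choice, and it reuses the generate_summary<\/code> function defined earlier.<\/p>\ndef summarize_long_text(text, model, tokenizer, device, chunk_size=400):\n    # Split the document into roughly chunk_size-word pieces.\n    words = text.split()\n    chunks = [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]\n\n    # Summarize each piece, then summarize the combined partial summaries.\n    partial_summaries = [generate_summary(chunk, model, tokenizer, device) for chunk in chunks]\n    return generate_summary(' '.join(partial_summaries), model, tokenizer, device)\n<\/code><\/pre>\n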
Conclusion<\/h2>\n
LLMs like T5 have revolutionized various NLP tasks, including text summarization and abstraction. In this tutorial, we explored how to use the Hugging Face Transformers library to fine-tune an LLM on a summarization dataset and generate summaries and abstractions from new text inputs. We also discussed the preprocessing steps and trained the model using the CNN\/DailyMail dataset.<\/p>\n
By leveraging the power of LLMs, you can now build powerful text summarization and abstraction systems that can condense and rephrase lengthy texts, opening up possibilities for automated content generation, information retrieval, and more.<\/p>\n","protected":false},"excerpt":{"rendered":"
In recent years, there has been a tremendous improvement in the field of natural language processing (NLP) with the introduction of large language models (LLMs) like GPT-3, BERT, and T5. These models have revolutionized various NLP tasks, including text summarization and abstraction. Text summarization is the process of condensing a Continue Reading<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_import_markdown_pro_load_document_selector":0,"_import_markdown_pro_submit_text_textarea":"","footnotes":""},"categories":[1],"tags":[1848,39,451,245,41,40,1573,1847,1358]}