{"id":4178,"date":"2023-11-04T23:14:07","date_gmt":"2023-11-04T23:14:07","guid":{"rendered":"http:\/\/localhost:10003\/how-to-use-llms-for-text-summarization-and-compression\/"},"modified":"2023-11-05T05:47:57","modified_gmt":"2023-11-05T05:47:57","slug":"how-to-use-llms-for-text-summarization-and-compression","status":"publish","type":"post","link":"http:\/\/localhost:10003\/how-to-use-llms-for-text-summarization-and-compression\/","title":{"rendered":"How to use LLMs for text summarization and compression"},"content":{"rendered":"
## Introduction

In recent years, language models have revolutionized a wide range of natural language processing tasks, including text summarization and compression. Large Language Models (LLMs) leverage vast amounts of textual data to generate coherent, concise summaries of longer texts. In this tutorial, we will explore how to use LLMs for text summarization and compression.
## Prerequisites

To follow along with this tutorial, you should have a basic understanding of natural language processing (NLP) concepts and some experience with the Python programming language. Additionally, you will need to install the following Python libraries:

- `transformers`
- `torch`
- `nltk`
- `numpy`
You can install these libraries using pip, as shown below:
```bash
pip install transformers torch nltk numpy
```

## Dataset
To demonstrate the text summarization and compression techniques, we will use a sample dataset of news articles. You can obtain a similar dataset from various sources, including the News Aggregator Dataset, which provides news articles from different publishers. For this tutorial, we assume that you have a dataset in CSV format, where each row represents a news article. The CSV file should contain two columns: `text` and `summary`, where `text` holds the full article text and `summary` holds the corresponding human-written summary.
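As a minimal sketch of what loading such a file might look like, the snippet below reads the two columns with Python's standard `csv` module. The file name `news_articles.csv` is a placeholder for wherever your dataset lives:

```python
import csv

# Read (text, summary) pairs from the assumed two-column CSV file.
# "news_articles.csv" is a placeholder path, not a real dataset file.
with open('news_articles.csv', newline='', encoding='utf-8') as f:
    rows = [(row['text'], row['summary']) for row in csv.DictReader(f)]

print(f"Loaded {len(rows)} articles")
```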
## Preprocessing

Before training an LLM for text summarization and compression, it is necessary to preprocess the dataset. The preprocessing steps include tokenization, removing stop words, and converting the text into numerical representations.
### Tokenization
Tokenization is the process of splitting text into individual words or subwords, often referred to as tokens. We can use the `nltk` library in Python to tokenize the text. The following code snippet shows how to tokenize a sentence:

```python
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize

text = "This is an example sentence."
tokens = word_tokenize(text)

print(tokens)
```

The output of the above code will be:
```
['This', 'is', 'an', 'example', 'sentence', '.']
```

### Removing Stop Words
Stop words are commonly used words in a language that carry little standalone meaning and can be removed to simplify the text. The `nltk` library provides stop-word lists for several languages. We can remove stop words from our tokens using the following code:

```python
nltk.download('stopwords')

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

print(filtered_tokens)
```

The output of the above code will be:
```
['example', 'sentence', '.']
```

### Converting Text to Numerical Representations
LLMs require numerical representations of text in order to process it. A classical way to represent text numerically is the Bag-of-Words (BoW) model, which records how often each word occurs; modern transformer models instead map subword tokens to integer IDs with their own tokenizer, as we will see in the next section. The `nltk` library provides a `FreqDist` class that can be used to build a simple BoW representation:

```python
from nltk.probability import FreqDist

freq_dist = FreqDist(filtered_tokens)
bow_representation = freq_dist.most_common()

print(bow_representation)
```

The output of the above code will be:
```
[('example', 1), ('sentence', 1), ('.', 1)]
```
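For contrast, here is a short sketch of the numerical representation a transformer model actually consumes. The GPT-2 tokenizer from the `transformers` library maps text to a sequence of subword token IDs rather than to word counts:

```python
from transformers import GPT2Tokenizer

# The pretrained tokenizer maps text to subword token IDs; these integer
# sequences, not bag-of-words counts, are what the model consumes.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
token_ids = tokenizer.encode("This is an example sentence.")

print(token_ids)                               # a list of integer subword IDs
print(tokenizer.convert_ids_to_tokens(token_ids))
```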
## Training a Large Language Model (LLM)

Now that we have preprocessed our dataset, we can train an LLM for text summarization and compression. We will use the Hugging Face `transformers` library in Python, which provides pre-trained models such as GPT-2 and BERT.
### Fine-tuning a Pre-trained Model
Fine-tuning is the process of further training a pre-trained model on a specific task or dataset. In our case, we will fine-tune GPT-2, a model pre-trained on a large corpus of text. The following steps show how to load the model and generate a summary; a sketch of an actual fine-tuning loop follows after the steps.
1. Load the pre-trained model.
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
```

2. Tokenize the text and convert it into numerical representations.
```python
inputs = tokenizer.encode(text, return_tensors='pt')
```

3. Generate a summary with the model.
```python
outputs = model.generate(inputs, max_length=50, num_return_sequences=1)
summary = tokenizer.decode(outputs[0])
```

The `max_length` argument controls the maximum length of the generated sequence, and the `num_return_sequences` argument determines how many alternative outputs to produce.
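The steps above only load the pre-trained weights and sample from them; they do not update the model on our dataset. Below is a minimal, hedged sketch of what fine-tuning GPT-2 on the article/summary pairs might look like with the `transformers` `Trainer` API. It assumes `rows` is the list of `(text, summary)` pairs loaded earlier; the output directory and hyperparameters are illustrative, not tuned:

```python
import torch
from transformers import (Trainer, TrainingArguments,
                          DataCollatorForLanguageModeling)

# GPT-2 has no padding token by default; reuse the end-of-text token.
tokenizer.pad_token = tokenizer.eos_token

class SummaryDataset(torch.utils.data.Dataset):
    """Casts summarization as language modeling: 'article TL;DR: summary'."""
    def __init__(self, pairs):
        self.examples = [
            tokenizer(f"{text} TL;DR: {summary}", truncation=True, max_length=512)
            for text, summary in pairs
        ]
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, i):
        return self.examples[i]

training_args = TrainingArguments(
    output_dir='gpt2-summarizer',        # illustrative checkpoint directory
    num_train_epochs=1,
    per_device_train_batch_size=2,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=SummaryDataset(rows),  # (text, summary) pairs from the CSV
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Joining article and summary with a `TL;DR:` marker is one common convention for casting summarization as plain language modeling; at generation time, you prompt the fine-tuned model with the article followed by `TL;DR:`.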
### Compression Techniques

In addition to generating summaries, LLMs can also be used for text compression: reducing the length of a text while preserving its meaning. Two commonly used approaches are extractive compression and abstractive compression.
#### Extractive Compression
In extractive compression, we select and concatenate the most important sentences (or subword spans) from the original text to form a compressed version. We can use an LLM to identify the important sentences based on how likely they are under the model. Note that a generation pipeline, as shown below, produces new text rather than literally extracting sentences; a sketch of a genuinely extractive approach follows it.
```python
from transformers import pipeline

# GPT-2 is a causal LM, so the matching pipeline task is "text-generation"
# (not "text2text-generation"). This continues the input text rather than
# extracting sentences from it.
generation_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)
generation_pipeline(text, max_new_tokens=50, num_return_sequences=1)
```
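For an actually extractive variant, one simple sketch, following the likelihood idea above, is to score each sentence by its mean token log-likelihood under GPT-2 and keep the highest-scoring sentences in their original order. The `sentence_score` helper is illustrative, not a standard API:

```python
import torch
from nltk.tokenize import sent_tokenize

def sentence_score(sentence):
    """Mean token log-likelihood of a sentence under the language model."""
    ids = tokenizer.encode(sentence, return_tensors='pt')
    with torch.no_grad():
        # The model returns the mean negative log-likelihood as `loss`.
        loss = model(ids, labels=ids).loss
    return -loss.item()

sentences = sent_tokenize(text)
scores = [sentence_score(s) for s in sentences]

# Keep the two highest-scoring sentences, preserving their original order.
top = sorted(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:2])
compressed = ' '.join(sentences[i] for i in top)
```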
#### Abstractive Compression

In abstractive compression, we generate new text that conveys the meaning of the original, which allows more flexibility than extractive compression. We can use the same model and generate outputs of different lengths to achieve different compression ratios. With GPT-2, appending a `TL;DR:` prompt to the input is a common way to elicit a summary-like continuation:
```python
inputs = tokenizer.encode(text + " TL;DR:", return_tensors='pt')
outputs = model.generate(inputs, max_new_tokens=50, num_return_sequences=1)
compression = tokenizer.decode(outputs[0])
```

### Evaluation
To evaluate the performance of the LLM, we can compare the generated summaries or compressed versions against the human-written references. Two popular evaluation metrics for text summarization are ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). The `nltk` library implements BLEU; ROUGE is available from the separate `rouge` package (`pip install rouge`). The following code snippet demonstrates evaluation with both:

```python
from nltk.translate.bleu_score import sentence_bleu
from rouge import Rouge

reference = "This is a reference summary."
generated_summary = "This is a generated summary."

# ROUGE: n-gram recall/precision/F1 against the reference
rouge = Rouge()
scores = rouge.get_scores(generated_summary, reference)

# BLEU: n-gram precision against the tokenized reference
bleu = sentence_bleu([reference.split()], generated_summary.split())

print(scores)
print(bleu)
```

## Conclusion
Text summarization and compression are important tasks in natural language processing, and with the advent of large language models these tasks have seen significant improvements. In this tutorial, we covered the preprocessing steps, fine-tuning a pre-trained model, extractive and abstractive compression techniques, and evaluation metrics. You should now have the knowledge and tools to apply LLMs to text summarization and compression in your own projects.