{"id":4203,"date":"2023-11-04T23:14:08","date_gmt":"2023-11-04T23:14:08","guid":{"rendered":"http:\/\/localhost:10003\/how-to-use-nlp-for-text-summarization-in-python\/"},"modified":"2023-11-05T05:47:56","modified_gmt":"2023-11-05T05:47:56","slug":"how-to-use-nlp-for-text-summarization-in-python","status":"publish","type":"post","link":"http:\/\/localhost:10003\/how-to-use-nlp-for-text-summarization-in-python\/","title":{"rendered":"How to Use NLP for Text Summarization in Python"},"content":{"rendered":"
In today’s information age, we are often overwhelmed with a vast amount of text data. Extracting the most important information from this data can be a time-consuming and challenging task. This is where Natural Language Processing (NLP) comes into play. NLP allows us to process and understand human language, enabling us to automate tasks such as text summarization.<\/p>\n
Text summarization is the process of creating a concise and coherent summary of a given text while preserving its essential meaning. In this tutorial, we will explore different techniques of text summarization using NLP in Python.<\/p>\n
To follow along with this tutorial, you will need a working Python 3 installation, the pip<\/code> package manager, and basic familiarity with Python.<\/p>\n
Before we can begin text summarization, we need to install the necessary libraries. Some common libraries used in NLP are nltk<\/code>, spaCy<\/code>, and gensim<\/code>; we will also use scikit-learn<\/code> for TF-IDF later in this tutorial. Open your command prompt or terminal and run the following commands to install these libraries:<\/p>\n
pip install nltk\npip install spacy\npip install gensim\npip install scikit-learn\n<\/code><\/pre>\n
Text Preprocessing<\/h2>\n
Text preprocessing is a crucial step in any NLP task. It involves cleaning and transforming raw text data into a format suitable for further analysis. In this section, we will perform various preprocessing steps on a sample text.<\/p>\n
Tokenization<\/h3>\n
Tokenization is the process of breaking a text into individual words or sentences, also known as tokens. It is the first step in many NLP tasks. We will use the nltk<\/code> library for tokenization. Place the following code at the top of your Python script:<\/p>\n
import nltk\nfrom nltk.tokenize import word_tokenize, sent_tokenize\n\nnltk.download('punkt')\n<\/code><\/pre>\n
The word_tokenize<\/code> function splits the text into words, while sent_tokenize<\/code> splits it into sentences. Let’s see an example of tokenization in action:<\/p>\n
text = \"Natural language processing (NLP) is a subfield of artificial intelligence that focuses on the interaction between computers and humans using natural language. It involves teaching computers to understand, interpret, and respond to human language in a valuable and meaningful way.\"\nwords = word_tokenize(text)\nsentences = sent_tokenize(text)\n\nprint(words)\nprint(sentences)\n<\/code><\/pre>\n
When you run this code, you should see the following output:<\/p>\n
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'humans', 'using', 'natural', 'language', '.', 'It', 'involves', 'teaching', 'computers', 'to', 'understand', ',', 'interpret', ',', 'and', 'respond', 'to', 'human', 'language', 'in', 'a', 'valuable', 'and', 'meaningful', 'way', '.']\n['Natural language processing (NLP) is a subfield of artificial intelligence that focuses on the interaction between computers and humans using natural language.', 'It involves teaching computers to understand, interpret, and respond to human language in a valuable and meaningful way.']\n<\/code><\/pre>\n
Removing Stop Words<\/h3>\n
Stop words are common words that have little or no significance in determining the meaning of a text. Examples of stop words include “a”, “an”, “the”, “in”, “is”, etc. Removing these words can help reduce noise and improve the performance of our text summarization model.<\/p>\n
The nltk<\/code> library provides a list of commonly used stop words in English. Update your Python script as follows:<\/p>\n
from nltk.corpus import stopwords\n\nnltk.download('stopwords')\n\nstop_words = set(stopwords.words(\"english\"))\n\nfiltered_words = [word for word in words if word.casefold() not in stop_words]\n\nprint(filtered_words)\n<\/code><\/pre>\n
When you run this code, you should see the following output:<\/p>\n
['Natural', 'language', 'processing', '(', 'NLP', ')', 'subfield', 'artificial', 'intelligence', 'focuses', 'interaction', 'computers', 'humans', 'using', 'natural', 'language', '.', 'It', 'involves', 'teaching', 'computers', 'understand', ',', 'interpret', ',', 'respond', 'human', 'language', 'valuable', 'meaningful', 'way', '.']\n<\/code><\/pre>\n
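Note that the stop-word list does not cover punctuation, so tokens such as “(” and “,” survive this step. If you want to drop them as well, you can filter on isalnum()<\/code> (an optional extra step; we keep the punctuation tokens in filtered_words<\/code> below so the later outputs match, and the variable name words_no_punct<\/code> is ours):<\/p>\n
# Optionally drop punctuation tokens, which the stop-word list does not cover.\nwords_no_punct = [word for word in filtered_words if word.isalnum()]\n\nprint(words_no_punct)\n<\/code><\/pre>\n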
Stemming and Lemmatization<\/h3>\n
Stemming and lemmatization are techniques used to reduce words to their base or root form. This can help improve the accuracy of text summarization by reducing duplicate or similar words.<\/p>\n
Stemming reduces words to their stem or root by removing common suffixes. The nltk<\/code> library provides several stemmers for different languages. Let’s see an example of stemming in action:<\/p>\n
from nltk.stem import PorterStemmer\n\nstemmer = PorterStemmer()\n\nstemmed_words = [stemmer.stem(word) for word in filtered_words]\n\nprint(stemmed_words)\n<\/code><\/pre>\n
When you run this code, you should see the following output:<\/p>\n
['natur', 'languag', 'process', '(', 'nlp', ')', 'subfield', 'artifici', 'intellig', 'focus', 'interact', 'comput', 'human', 'use', 'natur', 'languag', '.', 'It', 'involv', 'teach', 'comput', 'understand', ',', 'interpret', ',', 'respond', 'human', 'languag', 'valuabl', 'meaning', 'way', '.']\n<\/code><\/pre>\n
Lemmatization, on the other hand, transforms words to their base form using vocabulary and morphological analysis. It considers the context and meaning of a word before performing the transformation. The nltk<\/code> library provides a lemmatizer that uses WordNet, a large lexical database of English. Update your Python script as follows:<\/p>\n
from nltk.stem import WordNetLemmatizer\n\nnltk.download('wordnet')\n\nlemmatizer = WordNetLemmatizer()\n\nlemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]\n\nprint(lemmatized_words)\n<\/code><\/pre>\n
When you run this code, you should see the following output:<\/p>\n
['Natural', 'language', 'processing', '(', 'NLP', ')', 'subfield', 'artificial', 'intelligence', 'focus', 'interaction', 'computer', 'human', 'using', 'natural', 'language', '.', 'It', 'involves', 'teaching', 'computer', 'understand', ',', 'interpret', ',', 'respond', 'human', 'language', 'valuable', 'meaningful', 'way', '.']\n<\/code><\/pre>\n
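Notice that verb forms such as “involves” and “using” are left unchanged: by default, WordNetLemmatizer<\/code> treats every token as a noun. Passing a part-of-speech argument gives better results for verbs and adjectives (a quick illustration with example words of our own):<\/p>\n
print(lemmatizer.lemmatize('involves'))           # 'involves' (treated as a noun)\nprint(lemmatizer.lemmatize('involves', pos='v'))  # 'involve'  (treated as a verb)\nprint(lemmatizer.lemmatize('better', pos='a'))    # 'good'     (treated as an adjective)\n<\/code><\/pre>\n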
Part-of-Speech Tagging<\/h3>\n
Part-of-speech (POS) tagging is the process of assigning grammatical tags to words based on their role in a sentence. Common tags include noun (NN), verb (VB), adjective (JJ), etc. Knowing the POS of words can help in identifying their semantic meaning and relationships.<\/p>\n
The nltk<\/code> library provides a pre-trained POS tagger for English. Let’s see an example of POS tagging in action:<\/p>\n
nltk.download('averaged_perceptron_tagger')\n\npos_tags = nltk.pos_tag(filtered_words)\n\nprint(pos_tags)\n<\/code><\/pre>\n
When you run this code, you should see the following output:<\/p>\n
[('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('subfield', 'NN'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('focuses', 'VBZ'), ('interaction', 'NN'), ('computers', 'NNS'), ('humans', 'NNS'), ('using', 'VBG'), ('natural', 'JJ'), ('language', 'NN'), ('.', '.'), ('It', 'PRP'), ('involves', 'VBZ'), ('teaching', 'VBG'), ('computers', 'NNS'), ('understand', 'VB'), (',', ','), ('interpret', 'NN'), (',', ','), ('respond', 'VB'), ('human', 'JJ'), ('language', 'NN'), ('valuable', 'JJ'), ('meaningful', 'JJ'), ('way', 'NN'), ('.', '.')]\n<\/code><\/pre>\n
Named Entity Recognition<\/h3>\n
Named Entity Recognition (NER) is the process of identifying and classifying named entities in a text into predefined categories such as person names, organizations, locations, etc. This can help in summarizing information about specific entities.<\/p>\n
The nltk<\/code> library provides a pre-trained NER model for English. Let’s see an example of NER in action:<\/p>\n
nltk.download('maxent_ne_chunker')\nnltk.download('words')\n\nner_tags = nltk.ne_chunk(pos_tags)\n\nprint(ner_tags)\n<\/code><\/pre>\n
When you run this code, you should see the following output:<\/p>\n
(S\n (GPE Natural\/JJ)\n (ORGANIZATION language\/NN)\n processing\/NN\n (\/(\n PERSON NLP\/NNP\n )\/)\n subfield\/NN\n artificial\/JJ\n intelligence\/NN\n focuses\/VBZ\n interaction\/NN\n computers\/NNS\n humans\/NNS\n using\/VBG\n natural\/JJ\n language\/NN\n .\/.\n It\/PRP\n involves\/VBZ\n teaching\/VBG\n computers\/NNS\n understand\/VB\n ,\/,\n interpret\/NN\n ,\/,\n respond\/VB\n human\/JJ\n language\/NN\n valuable\/JJ\n meaningful\/JJ\n way\/NN\n .\/.)\n<\/code><\/pre>\n
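Before moving on, it can be handy to collect the steps above into a single helper. The sketch below (our own convenience function, not part of the original pipeline) chains tokenization, stop-word and punctuation removal, and lemmatization:<\/p>\n
from nltk.corpus import stopwords\nfrom nltk.stem import WordNetLemmatizer\nfrom nltk.tokenize import word_tokenize\n\ndef preprocess(raw_text):\n    # Tokenize, drop stop words and punctuation, then lemmatize each remaining token.\n    stop_words = set(stopwords.words('english'))\n    lemmatizer = WordNetLemmatizer()\n    tokens = word_tokenize(raw_text)\n    tokens = [t for t in tokens if t.casefold() not in stop_words and t.isalnum()]\n    return [lemmatizer.lemmatize(t) for t in tokens]\n\nprint(preprocess(text))\n<\/code><\/pre>\n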
Text Summarization Techniques<\/h2>\n
Now that we have preprocessed our text, we can move on to text summarization. There are several techniques we can use to create a summary, including:<\/p>\n
Extractive summarization: selecting the most important sentences from the original text and combining them to form the summary.<\/li>\n
Abstractive summarization: generating new sentences that convey the key information, much as a human would paraphrase the text.<\/li>\n<\/ul>\n
In this tutorial, we will focus on extractive summarization as it is relatively easier to implement and often produces more coherent summaries.<\/p>\n
Extractive Summarization with TF-IDF<\/h3>\n
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that represents the importance of a word in a collection of documents. It is often used to rank the significance of words in a text and can be leveraged for extractive summarization.<\/p>\n
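To make the idea concrete, here is a tiny hand-rolled TF-IDF computation on a toy corpus (a simplified sketch of the formula; scikit-learn’s implementation adds smoothing and normalization, and the toy documents are our own):<\/p>\n
import math\n\ndocs = [['natural', 'language', 'processing'],\n        ['language', 'models', 'process', 'text']]\n\ndef tf_idf(term, doc, docs):\n    tf = doc.count(term) / len(doc)                   # term frequency within this document\n    n_containing = sum(1 for d in docs if term in d)  # number of documents containing the term\n    idf = math.log(len(docs) / n_containing)          # rarer terms get a higher weight\n    return tf * idf\n\nprint(tf_idf('language', docs[0], docs))    # 0.0   -> appears in every document\nprint(tf_idf('processing', docs[0], docs))  # ~0.23 -> unique to the first document\n<\/code><\/pre>\n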
The sklearn<\/code> library provides a built-in implementation of TF-IDF. Update your Python script as follows:<\/p>\n
from sklearn.feature_extraction.text import TfidfVectorizer\n\nvectorizer = TfidfVectorizer()\n\ntfidf_matrix = vectorizer.fit_transform(sentences)\n\nprint(tfidf_matrix.shape)\n<\/code><\/pre>\n
When you run this code, you should see the following output:<\/p>\n
(2, 22)\n<\/code><\/pre>\n
We have created a TF-IDF matrix that represents the importance of each word in the sentences. The matrix has dimensions (2, 22), meaning there are 2 sentences and 22 unique words across these sentences.<\/p>\n
To calculate the importance of sentences, we can sum the TF-IDF scores for all the words in each sentence. Here’s how we can do it:<\/p>\n
import numpy as np\n\nsentence_scores = np.sum(tfidf_matrix, axis=1)\n\nprint(sentence_scores)\n<\/code><\/pre>\n
When you run this code, you should see the following output:<\/p>\n
[[1.31517604]\n [1.82711287]]\n<\/code><\/pre>\n
The sentence_scores<\/code> variable now contains the TF-IDF scores for each sentence. We can sort these scores to find the most important sentences:<\/p>\n
sorted_indices = np.argsort(-sentence_scores, axis=0)\n\nprint(sorted_indices)\n<\/code><\/pre>\n
When you run this code, you should see the following output:<\/p>\n
[[1]\n [0]]\n<\/code><\/pre>\n
The sorted_indices<\/code> variable contains the indices of the sentences in descending order of their importance. Using these indices, we can extract the most important sentences from the original text:<\/p>\n
# Each entry of sorted_indices is a 1x1 matrix, so convert it to a plain int before indexing the list.\nsummary_sentences = [sentences[int(idx)] for idx in sorted_indices]\n\nprint(summary_sentences)\n<\/code><\/pre>\n
When you run this code, you should see the following output:<\/p>\n
['It involves teaching computers to understand, interpret, and respond to human language in a valuable and meaningful way.', 'Natural language processing (NLP) is a subfield of artificial intelligence that focuses on the interaction between computers and humans using natural language.']\n<\/code><\/pre>\n
Congratulations! You have successfully created a summary of the text using TF-IDF.<\/p>\n
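For longer documents you will usually want to keep only the top few sentences and present them in their original order rather than in score order. A minimal sketch (the num_sentences<\/code> value is our own choice):<\/p>\n
num_sentences = 2  # how many sentences to keep in the summary\n\n# Take the highest-scoring sentence indices, then sort them to preserve the original order.\ntop_indices = sorted(int(idx) for idx in sorted_indices[:num_sentences])\nsummary = ' '.join(sentences[i] for i in top_indices)\n\nprint(summary)\n<\/code><\/pre>\n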
Extractive Summarization with TextRank<\/h3>\n
TextRank is an algorithm based on graph theory that can be used for extractive summarization. It treats the sentences as nodes in a graph and computes the importance score of each sentence based on the connections between them. The gensim<\/code> library provides an implementation of the TextRank algorithm.<\/p>\n
Update your Python script as follows to use TextRank for extractive summarization:<\/p>\n
from gensim.summarization import summarize\n\nsummary = summarize(text)\n\nprint(summary)\n<\/code><\/pre>\n
When you run this code, you should see the following output:<\/p>\n
Natural language processing (NLP) is a subfield of artificial intelligence that focuses on the interaction between computers and humans using natural language.\nIt involves teaching computers to understand, interpret, and respond to human language in a valuable and meaningful way.\n<\/code><\/pre>\n
The summarize<\/code> function from gensim.summarization<\/code> automatically applies TextRank to the text to generate a summary.<\/p>\n
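Note that the summarization<\/code> module was removed in gensim 4.0, so the code above requires gensim 3.x. On newer versions you can sketch a TextRank-style summarizer yourself with scikit-learn and networkx<\/code> (an illustrative approximation, not gensim’s exact algorithm; the function name is ours, and networkx must be installed separately):<\/p>\n
import networkx as nx\nfrom sklearn.feature_extraction.text import TfidfVectorizer\nfrom sklearn.metrics.pairwise import cosine_similarity\n\ndef textrank_summary(sentences, num_sentences=1):\n    # Build TF-IDF vectors and a sentence-to-sentence similarity matrix.\n    tfidf = TfidfVectorizer().fit_transform(sentences)\n    similarity = cosine_similarity(tfidf)\n    # Treat each sentence as a graph node, with similarities as edge weights, and run PageRank.\n    graph = nx.from_numpy_array(similarity)\n    scores = nx.pagerank(graph)\n    ranked = sorted(scores, key=scores.get, reverse=True)[:num_sentences]\n    # Return the selected sentences in their original order.\n    return ' '.join(sentences[i] for i in sorted(ranked))\n\nprint(textrank_summary(sentences))\n<\/code><\/pre>\n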
Evaluation<\/h2>\n
Evaluating the performance of a text summarization model can be challenging. Since there can be multiple valid summaries for a given text, it is difficult to determine an exact measure of quality. However, there are some commonly used evaluation metrics:<\/p>\n
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): measures the overlap of n-grams between the generated summary and one or more reference summaries; the short sketch after this list shows the intuition.<\/li>\n
BLEU: a precision-oriented n-gram metric originally designed for machine translation that is sometimes applied to summaries as well.<\/li>\n<\/ul>\n
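To see what ROUGE measures, here is a minimal hand-computed ROUGE-1 sketch (a simplified, duplicate-free version of the formula with toy inputs of our own; real implementations also handle repeated words, stemming, multiple references, and longer n-grams):<\/p>\n
def rouge_1(summary, reference):\n    # Unique unigrams in the candidate summary and in the reference.\n    summary_tokens = set(summary.lower().split())\n    reference_tokens = set(reference.lower().split())\n    overlap = len(summary_tokens & reference_tokens)\n    recall = overlap / len(reference_tokens)\n    precision = overlap / len(summary_tokens)\n    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0\n    return {'p': precision, 'r': recall, 'f': f1}\n\nprint(rouge_1('computers understand human language',\n              'teaching computers to understand human language'))\n<\/code><\/pre>\n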
The rouge<\/code> library provides a straightforward implementation of the ROUGE metric. Open your command prompt or terminal and run the following command to install it:<\/p>\n
pip install rouge\n<\/code><\/pre>\n
Then, update your Python script as follows:<\/p>\n
from rouge import Rouge\n\nreference = \"Natural language processing (NLP) is a subfield of artificial intelligence that focuses on the interaction between computers and humans using natural language.\"\nsummary = \"It involves teaching computers to understand, interpret, and respond to human language in a valuable and meaningful way.\"\n\nrouge = Rouge()\nscores = rouge.get_scores(summary, reference)\n\nprint(scores)\n<\/code><\/pre>\n
When you run this code, you should see the following output:<\/p>\n
[{'rouge-1': {'f': 0.8965517176249733, 'p': 1.0, 'r': 0.8125}, 'rouge-2': {'f': 0.7999999952000001, 'p': 1.0, 'r': 0.6666666666666666}, 'rouge-l': {'f': 0.8965517176249733, 'p': 1.0, 'r': 0.8125}}]\n<\/code><\/pre>\n
The scores<\/code> variable contains the ROUGE scores for the summary compared to the reference. In this case, we obtained a ROUGE-1 F-score of 0.89, indicating a reasonably good summary.<\/p>\n
Conclusion<\/h2>\n
In this tutorial, you learned how to use NLP techniques for text summarization in Python. We covered various steps of text preprocessing, including tokenization, removing stop words, stemming, lemmatization, part-of-speech tagging, and named entity recognition. We also explored two techniques for extractive summarization: TF-IDF and TextRank. Lastly, we discussed the evaluation of text summarization using the ROUGE metric.<\/p>\n
Text summarization is a vast field with many advanced techniques, such as abstractive summarization, deep learning-based approaches, and multi-document summarization. Further exploring these techniques can help you build more powerful and accurate text summarization models.<\/p>\n