Text analysis is the process of extracting meaningful information from a given text. It involves tasks such as tokenization, part-of-speech tagging, named entity recognition, sentiment analysis, and more. Natural Language Toolkit (NLTK) is a powerful library in Python that provides various tools and resources for text analysis.
In this tutorial, we will learn how to use NLTK for text analysis in Python. We will cover the following topics:
- Installing NLTK
- Tokenization
- Part-of-Speech Tagging
- Named Entity Recognition
- Sentiment Analysis
- Text Classification
Let’s get started!
1. Installing NLTK
NLTK is available on the Python Package Index (PyPI). You can install it using pip, which is the package installer for Python. Open your terminal and run the following command:
pip install nltk
This installs the NLTK code and its dependencies. Note that the corpora and pre-trained models NLTK relies on are downloaded separately with nltk.download(), as shown in the sections below.
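To confirm the installation, import the library and print its version (the exact version will depend on when you install):
import nltk
print(nltk.__version__)  # e.g. 3.8.1; your version may differ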
2. Tokenization
Tokenization is the process of splitting a given text into individual words or tokens. NLTK provides various tokenizers to accomplish this task.
2.1 Word Tokenization
Let’s start by tokenizing a sentence into words. Open a Python environment and import the nltk module:
import nltk
Download the punkt tokenizer models using the download function:
nltk.download('punkt')
(On recent NLTK releases the tokenizer data ships as punkt_tab, so you may need nltk.download('punkt_tab') instead.)
Now, we can use the word_tokenize function to tokenize a sentence:
from nltk.tokenize import word_tokenize
sentence = "NLTK is a powerful library for natural language processing."
tokens = word_tokenize(sentence)
print(tokens)
Running the above code will give the following output:
['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'natural', 'language', 'processing', '.']
The sentence is tokenized into individual words.
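Different tokenizers make different choices. For example, word_tokenize follows Penn Treebank conventions for contractions, while wordpunct_tokenize splits on every punctuation boundary. A quick comparison (outputs shown as comments):
from nltk.tokenize import word_tokenize, wordpunct_tokenize

text = "Don't hesitate!"
print(word_tokenize(text))       # ['Do', "n't", 'hesitate', '!']
print(wordpunct_tokenize(text))  # ['Don', "'", 't', 'hesitate', '!']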
2.2 Sentence Tokenization
Sentence tokenization is the process of splitting a given text into individual sentences. NLTK provides a sentence tokenizer that can be used for this purpose.
Let’s tokenize a paragraph into sentences using the sent_tokenize function:
from nltk.tokenize import sent_tokenize
paragraph = "NLTK is a powerful library for natural language processing. It provides various tools and resources for text analysis. This tutorial covers the basics of NLTK."
sentences = sent_tokenize(paragraph)
print(sentences)
Running the above code will give the following output:
['NLTK is a powerful library for natural language processing.', 'It provides various tools and resources for text analysis.', 'This tutorial covers the basics of NLTK.']
The paragraph is tokenized into individual sentences.
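The underlying punkt model is trained to recognize common abbreviations, so it usually avoids false sentence breaks. A small sketch (the exact splits can vary with the punkt model shipped in your NLTK version):
from nltk.tokenize import sent_tokenize

text = "Dr. Smith went to Washington. He gave a talk at 3 p.m. and left."
print(sent_tokenize(text))
# Expected: ['Dr. Smith went to Washington.', 'He gave a talk at 3 p.m. and left.']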
3. Part-of-Speech Tagging
Part-of-speech (POS) tagging is the process of assigning grammatical labels (noun, verb, adjective, and so on) to the words in a given text. NLTK provides a pre-trained POS tagger that can be used for this purpose.
3.1 POS Tagging
Let’s start by POS tagging a sentence. Import the necessary modules and download the pre-trained tagger model:
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')
sentence = "NLTK is a powerful library for natural language processing."
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)
print(pos_tags)
Running the above code will give the following output:
[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'), ('library', 'NN'), ('for', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('.', '.')]
Each word is paired with its corresponding POS tag.
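If you are unsure what a tag such as NNP or VBZ means, NLTK can print the Penn Treebank tag definitions. This assumes the tagsets documentation resource (the resource name may differ on newer NLTK releases):
import nltk

nltk.download('tagsets')       # tag documentation
nltk.help.upenn_tagset('NNP')  # prints: NNP: noun, proper, singular ...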
3.2 POS Tagging with Tagset
NLTK provides different tagsets for POS tagging. By default, it uses the Penn Treebank tagset. You can specify a different tagset if needed.
Let’s tag a sentence using the Universal tagset (mapping to it requires the universal_tagset resource):
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
nltk.download('universal_tagset')
sentence = "NLTK is a powerful library for natural language processing."
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens, tagset='universal')
print(pos_tags)
Running the above code will give the following output:
[('NLTK', 'NOUN'), ('is', 'VERB'), ('a', 'DET'), ('powerful', 'ADJ'), ('library', 'NOUN'), ('for', 'ADP'), ('natural', 'ADJ'), ('language', 'NOUN'), ('processing', 'NOUN'), ('.', '.')]
Each word is paired with its corresponding POS tag based on the Universal tagset.
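The coarse universal tags make it easy to filter words by broad category, for example pulling out just the nouns:
from nltk import pos_tag
from nltk.tokenize import word_tokenize

tokens = word_tokenize("NLTK is a powerful library for natural language processing.")
nouns = [word for word, tag in pos_tag(tokens, tagset='universal') if tag == 'NOUN']
print(nouns)  # ['NLTK', 'library', 'language', 'processing']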
4. Named Entity Recognition
Named Entity Recognition (NER) is the process of identifying and classifying named entities (people, organizations, locations, and so on) in a given text. NLTK provides a pre-trained NER chunker that can be used for this purpose.
4.1 NER Tagging
Let’s start by tagging named entities in a sentence. Import the necessary modules and download the chunker resources; note that pos_tag is needed as well, because ne_chunk operates on POS-tagged tokens:
import nltk
from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize
nltk.download('maxent_ne_chunker')
nltk.download('words')
sentence = "Apple Inc. was founded in Cupertino, California."
tokens = word_tokenize(sentence)
ner_tags = ne_chunk(pos_tag(tokens))
print(ner_tags)
Running the above code will give the following output:
(S
  (ORGANIZATION Apple/NNP Inc./NNP)
  was/VBD
  founded/VBN
  in/IN
  (GPE Cupertino/NNP)
  ,/,
  (GPE California/NNP)
  ./.)
Named entities are identified and tagged with their corresponding entity types.
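The result returned by ne_chunk is an nltk.Tree. Continuing from the snippet above, here is a minimal sketch for collecting the entities programmatically (the expected output assumes the tree shown earlier):
from nltk import Tree

entities = []
for chunk in ner_tags:
    if isinstance(chunk, Tree):  # labeled entity chunks are subtrees
        name = " ".join(token for token, pos in chunk.leaves())
        entities.append((name, chunk.label()))
print(entities)
# [('Apple Inc.', 'ORGANIZATION'), ('Cupertino', 'GPE'), ('California', 'GPE')]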
5. Sentiment Analysis
Sentiment analysis is the process of determining the sentiment or emotion expressed in a given text. It can be done at the document, sentence, or even word level. NLTK ships with VADER (Valence Aware Dictionary and sEntiment Reasoner), a pre-trained, rule-based sentiment analyzer that can be used for this purpose.
5.1 Sentiment Analysis on Sentences
Let’s start by performing sentiment analysis on a sentence. Import the analyzer and download the VADER lexicon it depends on:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sentence = "NLTK is a powerful library for natural language processing."
sid = SentimentIntensityAnalyzer()
sentiment_scores = sid.polarity_scores(sentence)
print(sentiment_scores)
Running the above code will give the following output:
{'neg': 0.0, 'neu': 0.569, 'pos': 0.431, 'compound': 0.63}
The neg, neu, and pos scores are proportions between 0 and 1 that sum to 1, indicating how much of the text reads as negative, neutral, and positive. The compound score is a normalized overall score ranging from -1 (most negative) to 1 (most positive).
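For contrast, a clearly negative sentence yields a nonzero neg score and a negative compound score. Continuing with the sid analyzer from above:
print(sid.polarity_scores("This movie was terrible and boring."))
# expect a large 'neg' share and a negative 'compound' score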
5.2 Sentiment Analysis on Documents
You can also analyze a longer document by averaging the per-sentence scores that VADER produces. Let’s analyze the sentiment of a document:
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize
document = "NLTK is a powerful library for natural language processing. It provides various tools and resources for text analysis. This tutorial covers the basics of NLTK."
sid = SentimentIntensityAnalyzer()
sentences = sent_tokenize(document)
total_sentiment_scores = {'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}
for sentence in sentences:
    sentiment_scores = sid.polarity_scores(sentence)
    for k in sentiment_scores:
        total_sentiment_scores[k] += sentiment_scores[k]
num_sentences = len(sentences)
for k in total_sentiment_scores:
    total_sentiment_scores[k] /= num_sentences
print(total_sentiment_scores)
Running the above code will give output similar to the following (the second and third sentences contain no sentiment-bearing words, so they score as fully neutral):
{'neg': 0.0, 'neu': 0.856, 'pos': 0.144, 'compound': 0.21}
The per-sentence scores are averaged over the document; note that the averaged neg, neu, and pos values still sum to 1.
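Averaging per-sentence scores is only one option; polarity_scores also accepts the whole document in a single call. Since VADER was tuned on short, sentence-length texts, per-sentence scoring followed by aggregation is often the more faithful way to use it:
# Scoring the entire document at once; the result will generally
# differ from the per-sentence average computed above
print(sid.polarity_scores(document))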
6. Text Classification
Text classification is the process of assigning predefined categories or labels to a given text. It is commonly used in tasks like spam detection, sentiment analysis, and topic classification. NLTK provides various classifiers that can be used for text classification.
6.1 Text Classification with Naive Bayes
Let’s start by performing text classification using the Naive Bayes classifier. Import the necessary modules:
from nltk import classify
from nltk import NaiveBayesClassifier
from nltk.tokenize import word_tokenize
train_data = [
    ("Great movie!", "positive"),
    ("The movie was awful.", "negative"),
    ("The acting was excellent.", "positive"),
    ("A really bad movie overall.", "negative")
]
features = []
for sentence, sentiment in train_data:
    words = word_tokenize(sentence)
    # NaiveBayesClassifier expects each featureset to be a dict,
    # so encode word presence as {word: True} rather than a raw token list
    features.append(({word: True for word in words}, sentiment))
train_set = features[:2]
test_set = features[2:]
classifier = NaiveBayesClassifier.train(train_set)
accuracy = classify.accuracy(classifier, test_set)
print("Accuracy:", accuracy)
Running the above code prints the classifier's accuracy on the two held-out sentences, for example:
Accuracy: 0.5
With just two training and two test sentences, the number is essentially noise: the classifier has seen almost no vocabulary, so treat this as a demonstration of the API rather than a meaningful evaluation. Real text classification requires a much larger labeled dataset.
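Once trained, the classifier can label new sentences and report its most discriminative features. Continuing from the snippet above (the new sentence must be encoded the same way as the training data):
new_sentence = "An excellent film!"
new_features = {word: True for word in word_tokenize(new_sentence)}
print(classifier.classify(new_features))      # 'positive' or 'negative'
classifier.show_most_informative_features(5)  # top weighted word features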
6.2 Text Classification with Sentiment Analyzer
The VADER analyzer introduced earlier can also act as a simple rule-based classifier by thresholding its compound score. Let’s classify the sentiment of a sentence:
from nltk.sentiment import SentimentIntensityAnalyzer
sentence = "The movie was great!"
sid = SentimentIntensityAnalyzer()
sentiment_scores = sid.polarity_scores(sentence)
if sentiment_scores['compound'] >= 0.05:
    sentiment = "positive"
elif sentiment_scores['compound'] <= -0.05:
    sentiment = "negative"
else:
    sentiment = "neutral"
print("Sentiment:", sentiment)
Running the above code will give the following output:
Sentiment: positive
The sentence is classified as positive because its compound score clears the 0.05 threshold; the ±0.05 cutoffs follow the convention recommended by VADER's authors.
Conclusion
NLTK is a powerful library in Python that provides various tools and resources for text analysis. In this tutorial, we learned how to perform tokenization, part-of-speech tagging, named entity recognition, sentiment analysis, and text classification using NLTK. You can explore more functionalities of NLTK and apply them to your own text analysis tasks.
Remember to install NLTK with pip install nltk and download the necessary resources with nltk.download() before running the code.
Happy text analysis!