Lee Hodg

NLP Pipelines with NLTK

Often in Natural Language Processing (NLP) applications, a pipeline is useful for taking raw text, processing it, and extracting relevant features before feeding them into a machine learning (ML) algorithm.

Normalization

From the standpoint of an ML algorithm, it may not make much sense to differentiate between different cases of a word – “Tree”, “tree” and “TREE” for example all have the same meaning. For this reason, we may want to convert all text to lower case.

Equally, depending on your NLP task, you may want to remove special characters such as punctuation.

In Python we can achieve both of these objectives very simply:

import re
text = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())
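
For example, applied to a short made-up string:

text = "Hello, World!"
print(re.sub(r'[^a-zA-Z0-9]', ' ', text.lower()))
# hello  world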

Tokenization

The name “token” is just a fancy word for “symbol”. In the case of NLP, our symbols are usually words, so “tokenization” just means splitting our text into the individual words.

We could consider doing this in Python simply with “text.split()”, but the NLTK word_tokenize function handles this task in a smarter way.

To use NLTK, first install it (e.g. with pip install nltk) and download the data it needs:

import nltk
nltk.download()

Now import the word_tokenize and sent_tokenize functions:

from nltk import word_tokenize, sent_tokenize

Now tokenizing the text into words can be done easily:

# Split text into words using NLTK
word_tokenize(text)

or into sentences with

# Split text into sentences
sent_tokenize(text)
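
To see why word_tokenize is smarter than a plain split, compare the two on a sentence containing a contraction (the sample sentence is made up for illustration, and the outputs shown are what NLTK's Treebank-style tokenizer should produce):

print("I can't do it.".split())
# ['I', "can't", 'do', 'it.']
print(word_tokenize("I can't do it."))
# ['I', 'ca', "n't", 'do', 'it', '.']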

Tweet tokenizer

NLTK also comes with a tokenizer designed for Twitter tweets:

from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)

This tokenizer is aware of Twitter handles, hashtags and emoticons. With the strip_handles=True and reduce_len=True arguments, it will remove handles and replace repeated character sequences of length 3 or greater with sequences of length 3.
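
For example, on a made-up tweet it should produce something like:

tknzr.tokenize("@remy This is waaaaayyyy too much for you!!!!!!")
# ['This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']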

Removing stop words

Stop words are uninformative words like “is”, “are”, “the” in English or “de”, “los”, “por” in Spanish. We may want to remove them to reduce the vocabulary we have to deal with and the complexity of NLP tasks.

NLTK includes stopword corpora for various languages.

from nltk.corpus import stopwords

The stop words in English can be examined with

print(stopwords.words("english"))
>>>['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', ...]

They can be removed from the text with

stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
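
As a quick sanity check on a made-up sentence (assuming the text has already been lower-cased and tokenized with word_tokenize):

stop_words = set(stopwords.words('english'))
words = word_tokenize("the quick brown fox jumps over the lazy dog")
print([w for w in words if w not in stop_words])
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']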

Part of speech tagging

We may want to tag the parts of a sentence depending on whether they are a noun, a verb, etc. NLTK makes this pretty easy with pos_tag.

from nltk import pos_tag
pos_tag(word_tokenize("I am eating cheese"))

Outputs:

>>> [('I', 'PRP'), ('am', 'VBP'), ('eating', 'VBG'), ('cheese', 'NN')]

To see what each of these symbols means we can use the handy help feature:

nltk.help.upenn_tagset('VBP')

which in this case tells us that

>>>VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
    appear tend stray glisten obtain comprise detest tease attract
    emphasize mold postpone sever return wag ...

Named Entity Recognition

Named entities are typically noun phrases that refer to some specific object, person or place. NLTK has the ne_chunk function to label named entities in text. Note that first the text must be tokenized and tagged.

from nltk import ne_chunk
print(ne_chunk(pos_tag(word_tokenize("Richard Feynman was a professor at Caltech"))))

Outputs:

>>>(S
  (PERSON Richard/NNP)
  (PERSON Feynman/NNP)
  was/VBD
  a/DT
  professor/NN
  at/IN
  (ORGANIZATION Caltech/NNP))
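
The result is an nltk Tree. If we just want a flat list of entities and their labels, one way (a minimal sketch using the same example sentence) is to walk the tree and keep only the labelled subtrees:

tree = ne_chunk(pos_tag(word_tokenize("Richard Feynman was a professor at Caltech")))
# Plain (word, tag) tuples have no label; only entity chunks do
entities = [(" ".join(word for word, tag in subtree.leaves()), subtree.label())
            for subtree in tree if hasattr(subtree, "label")]
print(entities)
# [('Richard', 'PERSON'), ('Feynman', 'PERSON'), ('Caltech', 'ORGANIZATION')]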

Stemming and Lemmatization

Stemming

Stemming is the process of reducing a word to its stem or root form.

For example, “fishing”, “fished”, and “fisher” reduce to the stem “fish”.

Stemming just applies rules (e.g. dropping suffixes like “ing” and “ed”) to remove the last few characters of a word, replacing a set of related words with a common stem. The resulting stem may not be a meaningful word in the given language. For example, the Porter algorithm reduces “argue”, “argued”, “argues”, “arguing”, and “argus” to the stem “argu”.

The purpose of stemming is to reduce complexity whilst retaining the essence of words.

Once again, NLTK makes this process easy for us:

from nltk.stem.porter import PorterStemmer

# Reduce words to their stems
stemmed = [PorterStemmer().stem(w) for w in words]
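
For instance, running the stemmer on the “argue” family mentioned above:

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["argue", "argued", "argues", "arguing"]])
# ['argu', 'argu', 'argu', 'argu']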

Lemmatization

Lemmatization is another technique to reduce words to a normalized form, but in this case the transformation actually uses a dictionary to map different variants of a word back to its root. For example “is”, “was”, “were” may all become “be”.

The default lemmatizer in NLTK uses the WordNet database, a large lexical database of English in which nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets).

Implementing lemmatization with NLTK looks like:

from nltk.stem.wordnet import WordNetLemmatizer

# Reduce words to their root form
lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]

The default is to assume the pos (“Part Of Speech”) of each word is a noun. For example, the verb “boring” would not be reduced to “bore”, so we may want to run the lemmatizer again with pos='v' (meaning verbs) to ensure verbs also get lemmatized appropriately:

lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in lemmed]
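
To see the difference on the example above:

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("boring"))           # 'boring' (treated as a noun by default)
print(lemmatizer.lemmatize("boring", pos='v'))  # 'bore'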

Bag of words

The bag of words representation treats each document as an unordered collection or “bag” of words.

To obtain a bag of words from a piece of raw text, first apply text processing steps: cleaning, normalizing, tokenizing, stemming, lemmatization etc. Next turn each document into a vector of numbers representing how many times each word appears in a document.

First collect all unique words in the corpus to form the vocabulary, arrange them in some order to define the vector components, and then count the occurrences of each word in each document. In the example below the documents are four (stemmed) titles, and each count has additionally been divided by the number of documents in which the word appears, which down-weights words that are common across the corpus:

| Document | littl | hous | prairi | mari | lamb | silenc | twinkl | star |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| “Little House on the Prairie” | 1/3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| “Mary had a Little Lamb” | 1/3 | 0 | 0 | 1 | 1/2 | 0 | 0 | 0 |
| “The Silence of the Lambs” | 0 | 0 | 0 | 0 | 1/2 | 1 | 0 | 0 |
| “Twinkle Twinkle Little Star” | 1/3 | 0 | 0 | 0 | 0 | 0 | 2 | 1 |
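
In practice we rarely build such a table by hand. A minimal sketch with scikit-learn's CountVectorizer is shown below; note that, unlike the hand-built table above, this applies no stemming, stop-word removal or document-frequency weighting, so the vocabulary and counts will differ slightly:

from sklearn.feature_extraction.text import CountVectorizer

# The four example "documents" (titles) from the table above
corpus = ["Little House on the Prairie",
          "Mary had a Little Lamb",
          "The Silence of the Lambs",
          "Twinkle Twinkle Little Star"]

# Learn the vocabulary and count how often each word occurs in each document
count_vect = CountVectorizer()
counts = count_vect.fit_transform(corpus)
print(count_vect.vocabulary_)  # maps each word to its column index
print(counts.toarray())        # raw word counts per document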

This idea of down-weighting words that appear in many documents is formalized by the TF-IDF transform, which is simply the product of 2 weights:

$$\text{tfidf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D)$$

where $tf$ is the term frequency: the number of times a term $t$ appears in the given document $d$ divided by the total number of terms in $d$. The term $idf$ is the inverse document frequency: the total number of documents in the corpus $D$ divided by the number of documents in which the term $t$ appears. Commonly the logarithm of this ratio is taken (to smooth the resulting values; in practice a constant is often also added to the counts to avoid division by zero), and we have

$$\text{idf}(t, D) = \log{(|D| / |\{d \in D : t \in d\}|)}$$

This can be done with the scikit-learn TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

# initialize tf-idf vectorizer object
vectorizer = TfidfVectorizer()
# compute bag of word counts and tf-idf values for the corpus of titles defined above
X = vectorizer.fit_transform(corpus)
# convert sparse matrix to numpy array to view
X.toarray()
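
To see which column of the resulting matrix corresponds to which word, the learned vocabulary can be inspected (get_feature_names_out is available in scikit-learn 1.0+; older versions use get_feature_names):

# Vocabulary learned from the corpus, in column order
print(vectorizer.get_feature_names_out())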

Word Embeddings

We could just represent words with one-hot encoding, but as the size of the vocabulary grows the dimension of the embedding space would also grow extremely large.

What we’d like is to represent words in some fixed-size vector space, and ideally in such a way that when 2 words are close in meaning they are also close in the vector space (for example as measured by cosine similarity).
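
Cosine similarity itself is straightforward to compute; a minimal numpy sketch (the function name here is just illustrative):

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: close to 1 when they point in the same direction
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))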

Word2Vec

Word2Vec is one of the most popular embeddings used in practice.

Word2Vec uses a neural-network model that is able to predict a word given its neighbouring words, or vice versa to predict the neighbouring words given the word.

the quick brown fox <jumps> over the lazy dog

The continuous bag of words (CBoW) model takes the neighbouring words and tries to predict the missing word, whereas the Skip-gram model takes the middle word and tries to predict the neighbouring words.

This prediction task is only an ancillary task: what we really want is the representation of the word gleaned from the hidden layer of the network, and it is this hidden-layer representation that forms the word embedding.
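
Training a Word2Vec model is usually done with a separate library such as gensim rather than NLTK. The snippet below is a toy sketch, assuming gensim 4+ is installed, that uses the example sentence above as a one-sentence corpus:

from gensim.models import Word2Vec

# A toy corpus: in practice you would train on many tokenized sentences
sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]

# sg=0 trains the CBoW model, sg=1 the Skip-gram model
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

vector = model.wv["fox"]             # the learned embedding for "fox"
print(model.wv.most_similar("fox"))  # words closest to "fox" by cosine similarity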

Pipeline

We can put all of this together into an sklearn Pipeline with the following code:

import nltk
nltk.download()

import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def load_data():
    """
    Load your data
    .
    .
    .
    """
    return X, y


def tokenize(text):
    """
    Clean and tokenize
    """
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens



def main():
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', RandomForestClassifier())
    ])

    # train classifier
    pipeline.fit(X_train, y_train)

    # predict on test data
    y_pred = pipeline.predict(X_test)

    # display results
    accuracy = (y_pred == y_test).mean()
    print("Accuracy:", accuracy)


if __name__ == '__main__':
    main()
