Word Tokenization with Python NLTK

This is a demonstration of the various tokenizers provided by NLTK 3.6.2.

How Text Tokenization Works

Tokenization is a way to split text into tokens. These tokens can be paragraphs, sentences, or individual words. NLTK provides a number of tokenizers in its tokenize module, and this demo shows how 5 different tokenizers handle the same text.

The text is first tokenized into sentences using the PunktSentenceTokenizer. Then each sentence is tokenized into words using 3 different word tokenizers (a sketch of this pipeline follows the list):

  1. TreebankWordTokenizer
  2. WordPunctTokenizer
  3. WhitespaceTokenizer
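
A minimal sketch of this two-step pipeline, using the tokenizer classes named above from nltk.tokenize (the sample text is only illustrative, not the demo's own example):

```python
from nltk.tokenize import (
    PunktSentenceTokenizer,
    TreebankWordTokenizer,
    WordPunctTokenizer,
    WhitespaceTokenizer,
)

text = "I can't wait. The demo starts soon!"  # illustrative sample text

# Step 1: split the raw text into sentences with the Punkt sentence tokenizer.
# (Instantiated here with default parameters; the demo may use a pretrained model.)
sentences = PunktSentenceTokenizer().tokenize(text)

# Step 2: run each sentence through the three word tokenizers.
word_tokenizers = {
    "TreebankWordTokenizer": TreebankWordTokenizer(),
    "WordPunctTokenizer": WordPunctTokenizer(),
    "WhitespaceTokenizer": WhitespaceTokenizer(),
}

for sentence in sentences:
    for name, tokenizer in word_tokenizers.items():
        print(name, tokenizer.tokenize(sentence))
```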

The spaCy tokenizer and pattern tokenizer do their own sentence and word tokenization, and are included to show how these libraries tokenize text before further parsing.
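
As a rough sketch, the same text could be run through spaCy and pattern along these lines (this assumes spaCy's en_core_web_sm model and the pattern library are installed; exact behavior may vary by version):

```python
import spacy
from pattern.en import tokenize as pattern_tokenize

text = "I can't wait. The demo starts soon!"  # illustrative sample text

# spaCy tokenizes into sentences and words in a single pipeline pass.
nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
for sent in nlp(text).sents:
    print([token.text for token in sent])

# pattern's tokenize() returns one string per sentence, with tokens separated by spaces.
for sentence in pattern_tokenize(text):
    print(sentence.split())
```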

The initial example text provides 2 sentences that demonstrate how each word tokenizer handles non-ASCII characters and the simple punctuation of contractions.
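
As an indication of what to expect, here is how the three NLTK word tokenizers typically split a contraction (the sentence below is illustrative, not the demo's example text; the outputs in the comments reflect the usual behavior):

```python
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer, WhitespaceTokenizer

sentence = "I can't wait."  # illustrative contraction

print(TreebankWordTokenizer().tokenize(sentence))
# Treebank splits the contraction at the clitic: ['I', 'ca', "n't", 'wait', '.']

print(WordPunctTokenizer().tokenize(sentence))
# WordPunct splits on every punctuation character: ['I', 'can', "'", 't', 'wait', '.']

print(WhitespaceTokenizer().tokenize(sentence))
# Whitespace splits on whitespace only, keeping punctuation attached: ['I', "can't", 'wait.']
```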