Tagging, Chunking & Named Entity Recognition with NLTK

This is a demonstration of NLTK part-of-speech taggers and chunkers, running on NLTK 3.6.2. The taggers assign a part-of-speech tag to each word in your text. The chunkers identify phrases (chunks) and named entities.

How Part of Speech Tagging, Phrase Chunking, and NER Work

Trained Part of Speech Taggers

The default part-of-speech tagger is a classifier-based tagger trained on the Penn Treebank corpus. The Penn Treebank sample distributed with NLTK is composed of news articles from the Wall Street Journal. That means the tagger is more likely to be correct on text that looks like a news article, and less accurate on text that doesn't, such as chat messages or tweets.

Similarly, spaCy taggers have been trained on written text such as news, blogs and media in the following languages:

English
German
French
Spanish
Portuguese
Italian
Dutch
Greek

All the other taggers have been trained on part-of-speech tagged NLTK corpora using train_tagger.py from nltk-trainer. These NLTK taggers cover the following languages:

Dutch
English
Portuguese
Spanish
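
To make the idea of training on a tagged corpus concrete, here is a minimal sketch. It is not what train_tagger.py does internally; it assumes a simple UnigramTagger with a DefaultTagger backoff, and the three training sentences are made-up toy data rather than a real NLTK corpus. It also illustrates why a tagger does worse on text unlike its training data: unseen words all fall through to the backoff tag.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Toy tagged "corpus" (hypothetical data, Penn Treebank-style tags).
# A real run would load tagged sentences from an NLTK corpus instead.
train_sents = [
    [("the", "DT"), ("market", "NN"), ("fell", "VBD")],
    [("the", "DT"), ("bank", "NN"), ("raised", "VBD"), ("rates", "NNS")],
    [("shares", "NNS"), ("fell", "VBD")],
]

# Words never seen in training are tagged NN by the backoff tagger.
backoff = DefaultTagger("NN")
tagger = UnigramTagger(train=train_sents, backoff=backoff)

# Words from the training domain get the tags learned from the corpus.
in_domain = tagger.tag(["the", "bank", "fell"])

# Unseen words all fall back to NN, which is wrong for most of them.
out_of_domain = tagger.tag(["lol", "that", "rocks"])

print(in_domain)
print(out_of_domain)
```

A unigram tagger is far simpler than a classifier-based tagger, but the dependence on the training corpus is the same in both cases.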

Trained Phrase Chunkers and Named Entity Recognizers

The default chunker is a classifier-based chunker trained on the ACE corpus. This means it recognizes noun phrases and named entities, such as locations, names, organizations, and more. Because a chunker works from part-of-speech tags, it will only work well with an English tagger, and will work best with the default tagger.
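
As a rough illustration of how a chunker builds on tagger output, the sketch below groups already-tagged tokens into noun phrases. It swaps in NLTK's rule-based RegexpParser rather than the classifier-based chunker described above, so that it needs no trained model. The grammar and the tagged sentence are assumptions chosen for the example.

```python
from nltk import RegexpParser

# Pre-tagged input (hypothetical sentence, Penn Treebank-style tags).
tagged = [
    ("The", "DT"), ("European", "NNP"), ("Union", "NNP"),
    ("fined", "VBD"),
    ("a", "DT"), ("large", "JJ"), ("bank", "NN"),
]

# One rule: optional determiner, any adjectives, then one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
parser = RegexpParser(grammar)

# parse() returns an nltk Tree whose NP subtrees are the chunks.
tree = parser.parse(tagged)

noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

Because the rule matches on tags rather than words, a wrong tag from the tagger leads directly to a wrong or missing chunk, which is why the choice of tagger matters.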

All other chunkers have been trained on chunked or parsed NLTK corpora using train_chunker.py from nltk-trainer. These NLTK chunkers cover the following languages:

Dutch
English
Portuguese
Spanish
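
Training a chunker on a chunked corpus can be sketched in the same spirit. This is not the train_chunker.py implementation; it assumes the simple approach of learning a chunk (IOB) tag for each part-of-speech tag with a UnigramTagger, and the two training sentences are invented toy data.

```python
from nltk.chunk import conlltags2tree
from nltk.tag import DefaultTagger, UnigramTagger

# Toy chunked "corpus" as (word, pos, iob) triples (hypothetical data).
train_sents = [
    [("the", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("sat", "VBD", "O")],
    [("a", "DT", "B-NP"), ("big", "JJ", "I-NP"), ("dog", "NN", "I-NP"),
     ("barked", "VBD", "O")],
]

# Learn the most likely chunk tag for each part-of-speech tag.
# POS tags never seen in training are treated as outside any chunk ("O").
train_data = [[(pos, iob) for _, pos, iob in sent] for sent in train_sents]
chunk_tagger = UnigramTagger(train=train_data, backoff=DefaultTagger("O"))


def chunk(tagged_sentence):
    """Turn a list of (word, pos) pairs into a chunk tree."""
    pos_tags = [pos for _, pos in tagged_sentence]
    iob_tags = [iob for _, iob in chunk_tagger.tag(pos_tags)]
    triples = [(word, pos, iob)
               for (word, pos), iob in zip(tagged_sentence, iob_tags)]
    return conlltags2tree(triples)


tree = chunk([("the", "DT"), ("old", "JJ"), ("bank", "NN"), ("closed", "VBD")])
chunks = [
    " ".join(word for word, pos in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(chunks)
```

The chunker only ever sees part-of-speech tags, so it inherits the tag set and the errors of whichever tagger produced them.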

Natural Language Tagging and Phrase Extraction APIs

If you'd like to use this through an API, please see the API docs for Tagging & Chunking and Phrase Extraction & Named Entity Recognition. For higher limits and premium API access, sign up for the Text-Processing RapidAPI.