Stemming and Lemmatization with Python NLTK

This is a demonstration of stemming and lemmatization for the 18 languages supported by the NLTK 3.6.2 stem package.

How Stemming and Lemmatization Work

Stemming is a process of removing and replacing word suffixes to arrive at a common root form of the word.
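To make the idea concrete, here is a deliberately naive suffix-stripping sketch in plain Python. The rule table and the naive_stem helper are hypothetical and written only for illustration; they are not how any NLTK stemmer is implemented, and real algorithms apply many more rules and conditions.

```python
# Toy (suffix, replacement) rules, checked in order; illustrative only.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def naive_stem(word):
    """Apply the first matching suffix rule, keeping at least two characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "walking", "walked", "cats"]:
    print(w, "->", naive_stem(w))
```

Note that a result such as "poni" is not a dictionary word, which is typical of stems.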

English Stemmers and Lemmatizers

For stemming English words with NLTK, you can choose between the PorterStemmer and the LancasterStemmer. The Porter stemming algorithm is the oldest stemming algorithm supported in NLTK; it was written in 1979 and published in 1980. The Lancaster stemming algorithm is newer, published in 1990, and tends to be more aggressive than the Porter algorithm.
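A minimal sketch for comparing the two English stemmers side by side, assuming NLTK is installed. The word list is arbitrary and chosen only for illustration; neither stemmer needs any extra data downloads.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Arbitrary sample words; swap in your own.
words = ["running", "maximum", "happily", "organization"]

# Map each word to its (Porter stem, Lancaster stem) pair.
comparison = {w: (porter.stem(w), lancaster.stem(w)) for w in words}

for word, (porter_stem, lancaster_stem) in comparison.items():
    print(f"{word:>12}  porter={porter_stem:<12} lancaster={lancaster_stem}")
```

Where the two disagree, the Lancaster stem is usually the shorter one.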

The WordNet Lemmatizer uses the WordNet database to look up lemmas. Lemmas differ from stems in that a lemma is a canonical form of the word, while a stem may not be a real word.
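The difference between stems and lemmas can be seen with a short sketch like the following. It assumes the WordNet corpus has already been fetched with nltk.download("wordnet"); if it has not, the lookup raises LookupError, which the sketch catches. The example words and part-of-speech tags are arbitrary.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# (word, WordNet part-of-speech tag): "n" noun, "v" verb, "a" adjective.
examples = [("mice", "n"), ("running", "v"), ("better", "a")]

stems = [stemmer.stem(word) for word, _ in examples]

try:
    lemmas = [lemmatizer.lemmatize(word, pos=pos) for word, pos in examples]
except LookupError:
    # WordNet data is missing; run nltk.download("wordnet") once and retry.
    lemmas = None

print("stems: ", stems)
print("lemmas:", lemmas)
```

The lemmatizer treats a word as a noun unless a pos argument is given, so passing the right tag matters for verbs and adjectives.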

Non-English Stemmers

Stemming for Portuguese is available in NLTK with the RSLPStemmer and also with the SnowballStemmer. Arabic stemming is supported with the ISRIStemmer.
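A rough sketch of calling these stemmers, assuming NLTK is installed. The sample words are arbitrary. The RSLPStemmer loads rule files that must be fetched once with nltk.download("rslp"), so the sketch falls back gracefully when they are missing; the ISRIStemmer needs no extra data.

```python
from nltk.stem import ISRIStemmer, RSLPStemmer, SnowballStemmer

portuguese_words = ["correndo", "livros", "felizmente"]

# Snowball's Portuguese stemmer ships with NLTK itself.
snowball_pt = SnowballStemmer("portuguese")
snowball_stems = [snowball_pt.stem(w) for w in portuguese_words]

try:
    rslp = RSLPStemmer()  # requires nltk.download("rslp") beforehand
    rslp_stems = [rslp.stem(w) for w in portuguese_words]
except LookupError:
    rslp_stems = None

# Arabic stemming with ISRI; the sample word is arbitrary.
arabic_stem = ISRIStemmer().stem("يكتبون")

print("Snowball:", snowball_stems)
print("RSLP:    ", rslp_stems)
print("ISRI:    ", arabic_stem)
```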

Snowball Stemmers

Snowball is a small language for writing stemmers. Support for it was added to NLTK in version 2.0b9 as the SnowballStemmer class. The NLTK Snowball stemmer currently supports the following languages:

Arabic
Danish
Dutch
English
Finnish
French
German
Hungarian
Italian
Norwegian
Porter (the original Porter algorithm for English, not a separate language)
Portuguese
Romanian
Russian
Spanish
Swedish
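The list above corresponds to the SnowballStemmer.languages attribute, and a stemmer is selected by passing one of those names in lowercase. The sketch below assumes NLTK is installed; the sample words are arbitrary, and the last two lines compare the "english" and "porter" entries on the same word.

```python
from nltk.stem import SnowballStemmer

# Tuple of lowercase language names accepted by the constructor.
print(SnowballStemmer.languages)

# Arbitrary sample words, one per language.
samples = {
    "english": "running",
    "german": "Autobahnen",
    "french": "continuellement",
    "spanish": "corriendo",
}

stems = {}
for language, word in samples.items():
    stemmer = SnowballStemmer(language)
    stems[language] = stemmer.stem(word)
    print(f"{language:>8}: {word} -> {stems[language]}")

# "english" is the newer Snowball English stemmer; "porter" is the original algorithm.
english_stem = SnowballStemmer("english").stem("generously")
porter_stem = SnowballStemmer("porter").stem("generously")
print(english_stem, porter_stem)
```

Passing a name that is not in SnowballStemmer.languages raises a ValueError.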

Natural Language Stemming API

If you'd like to use this through an API, please see the Stemming API Docs. For higher limits and premium API access, sign up for the Text-Processing RapidAPI.