- Hands-On Natural Language Processing with Python
- Rajesh Arumugam Rajalingappaa Shanmugamani
- 2021-08-13 16:01:43
Stemming
Stemming is a text preprocessing step that reduces related variants of a word (such as walking) to a common base form (walk), on the assumption that the variants share the same core meaning. One of the simplest stemming transformations reduces a plural to its singular form: apples becomes apple, for example. More complex transformations also exist. We will use the popular Porter stemmer, devised by Martin Porter, to illustrate this, as shown in the following code:
>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem("enjoying")
'enjoy'
>>> stemmer.stem("enjoys")
'enjoy'
>>> stemmer.stem("enjoyable")
'enjoy'
In this case, stemming has reduced the verb forms (enjoying, enjoys) and the adjective form (enjoyable) of a word to the same base form, enjoy. The Porter algorithm used by the stemmer applies a series of language-specific rules (in this case, for English) to arrive at the stem. One of these rules removes suffixes such as ing from the word, as seen in the preceding example. Stemming does not always produce a stem that is a word in its own right, as shown in the following example:
>>> stemmer.stem("variation")
'variat'
>>> stemmer.stem("variate")
'variat'
Here, variat is not itself an English word. The nltk.stem.snowball module includes Snowball stemmers for other languages, such as French, Spanish, and German. Snowball is a small language for writing stemming rules, which can be used to define standard stemmers for different languages. Just as with tokenizers, we can create custom stemmers using regular expressions, as follows:
>>> from nltk.stem import RegexpStemmer
>>> regexp_stemmer = RegexpStemmer("able$|ing$", min=4)
>>> regexp_stemmer.stem("flyable")
'fly'
>>> regexp_stemmer.stem("flying")
'fly'
The regex pattern, able$|ing$, removes the suffixes able and ing if either is present at the end of a word. The min argument specifies the minimum length a word must have before it is stemmed at all; shorter words are returned unchanged.
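To make the role of the length threshold more concrete, here is a minimal sketch of a regular-expression stemmer written with only the standard re module. It is a simplified stand-in, not NLTK's actual implementation: the function name regexp_stem and the parameter name min_length are hypothetical, and the sketch assumes the threshold is compared against the length of the input word, which is how the NLTK documentation describes min.

```python
import re


def regexp_stem(word, pattern, min_length=0):
    """Strip every match of `pattern` from `word`.

    Hypothetical helper for illustration only. Words shorter than
    `min_length` are returned unchanged (assumed to mirror `min`).
    """
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


SUFFIXES = r"able$|ing$"

# Long enough to be stemmed: the suffix is removed.
print(regexp_stem("flying", SUFFIXES, min_length=4))   # fly
print(regexp_stem("flyable", SUFFIXES, min_length=4))  # fly

# "ring" has 4 letters, so with min_length=4 it is still stemmed,
# which leaves only "r".
print(regexp_stem("ring", SUFFIXES, min_length=4))     # r

# Raising the threshold to 5 leaves short words such as "ring" untouched.
print(regexp_stem("ring", SUFFIXES, min_length=5))     # ring
```

The ring case shows why the threshold matters: a purely suffix-based rule cannot tell a real suffix from letters that happen to end a short word, so a higher threshold trades some coverage for fewer damaged words.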