- Hands-On Natural Language Processing with Python
- Rajesh Arumugam, Rajalingappaa Shanmugamani
- 2021-08-13 16:01:43
Stemming
Stemming is a text preprocessing task that reduces related variants of a word (such as walking) to a common base form (walk), on the assumption that the variants share the same core meaning. One of the most basic stemming transformations is reducing a plural word to its singular form: apples is reduced to apple, for example. This is a very simple transformation, but more complex ones exist. We will use the popular Porter stemmer, devised by Martin Porter, to illustrate this, as shown in the following code:
>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem("enjoying")
'enjoy'
>>> stemmer.stem("enjoys")
'enjoy'
>>> stemmer.stem("enjoyable")
'enjoy'
In this case, stemming has reduced the different verb forms (enjoying, enjoys) and the adjective form (enjoyable) of a word to its base form, enjoy. The Porter algorithm used by the stemmer applies a set of language-specific rules (in this case, for English) to arrive at the stem. One of these rules removes suffixes such as ing from the word, as seen in the preceding example. Stemming does not always produce a stem that is a word by itself, as shown in the following example:
>>> stemmer.stem("variation")
'variat'
>>> stemmer.stem("variate")
'variat'
Here, variat itself is not an English word. The nltk.stem.snowball module includes Snowball stemmers for other languages, such as French, Spanish, and German. Snowball is a small string-processing language designed for writing stemming rules for different languages. Just as with tokenizers, we can create custom stemmers using regular expressions, as follows:
>>> from nltk.stem import RegexpStemmer
>>> regexp_stemmer = RegexpStemmer("able$|ing$", min=4)
>>> regexp_stemmer.stem("flyable")
'fly'
>>> regexp_stemmer.stem("flying")
'fly'
The regex pattern, able$|ing$, removes the suffixes able and ing if either is present at the end of a word, and min specifies the minimum length a word must have in order to be stemmed; shorter words are returned unchanged.
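To make the interaction between the pattern and a length threshold concrete, here is a minimal sketch that uses only the standard library re module rather than NLTK. The function name simple_regexp_stem and its min_length parameter are hypothetical, introduced only for illustration; in this sketch, a word shorter than min_length is returned untouched, and otherwise every match of the pattern is stripped.

```python
import re


def simple_regexp_stem(word, pattern, min_length=0):
    """Strip every match of `pattern` from `word`.

    Hypothetical helper for illustration only (not part of NLTK).
    Words shorter than `min_length` are returned unchanged.
    """
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


# Same suffix pattern as in the example above.
SUFFIXES = r"able$|ing$"

for word in ["flyable", "flying", "ring", "table"]:
    # Compare a low threshold (4) with a higher one (6) for each word.
    low = simple_regexp_stem(word, SUFFIXES, min_length=4)
    high = simple_regexp_stem(word, SUFFIXES, min_length=6)
    print(word, low, high)
```

With a threshold of 4, ring and table are cut down to r and t, because the pattern knows nothing about word structure. Raising the threshold to 6 leaves both words intact while flyable and flying are still reduced to fly. A length threshold is therefore only a blunt guard against overstemming, which is one reason rule-based stemmers such as Porter's attach extra conditions to each suffix rule.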