
Stemming

Stemming is a text preprocessing task that transforms related variants of a word (such as walking) into a common base form (walk), since they share the same meaning. One of the most basic stemming transformations reduces a plural word to its singular form: apples becomes apple, for example. This is a very simple transformation, but more complex ones exist. We will use the popular Porter stemmer, by Martin Porter, to illustrate this, as shown in the following code:

>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemming = PorterStemmer()
>>> stemming.stem("enjoying")
'enjoy'
>>> stemming.stem("enjoys")
'enjoy'
>>> stemming.stem("enjoyable")
'enjoy'

In this case, stemming has reduced the different verb forms (enjoying, enjoys) and the adjective form (enjoyable) of a word to its base form, enjoy. The Porter algorithm used by the stemmer applies various language-specific rules (in this case, for English) to arrive at the stem. One of these rules removes suffixes such as ing from the word, as seen in the preceding example code. Stemming does not always produce a stem that is a word by itself, as shown in the following example:

>>> stemming.stem("variation")
'variat'
>>> stemming.stem("variate")
'variat'

Here, variat itself is not an English word. The nltk.stem.snowball module includes Snowball stemmers for other languages, such as French, Spanish, and German. Snowball is a small language for writing stemming algorithms, and it can be used to define standard stemming rules for different languages. Just as with tokenizers, we can create custom stemmers using regular expressions, as follows:

>>> from nltk.stem import RegexpStemmer
>>> regexp_stemmer = RegexpStemmer("able$|ing$", min=4)
>>> regexp_stemmer.stem("flyable")
'fly'
>>> regexp_stemmer.stem("flying")
'fly'

The regex pattern able$|ing$ removes the suffixes able and ing, if present at the end of a word, and min specifies the minimum length a word must have before stemming is attempted; shorter words are returned unchanged.
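
The effect of min is easiest to see on very short words. The following is a minimal sketch, assuming NLTK's RegexpStemmer compares min against the length of the input word (as its current implementation does); the sample words are chosen only for illustration:

```python
from nltk.stem import RegexpStemmer

# Same pattern as in the example above; min=4 is the length threshold.
regexp_stemmer = RegexpStemmer("able$|ing$", min=4)

# Shorter than min: the word is returned unchanged, no stemming is attempted.
short = regexp_stemmer.stem("ing")

# Exactly min characters long: the trailing "ing" is stripped.
exact = regexp_stemmer.stem("ring")

# Longer than min: "able" is stripped, even though very little remains.
longer = regexp_stemmer.stem("table")

print(short, exact, longer)
```

Because the check applies to the input rather than to the result, a regular expression stemmer can leave very short fragments behind, so the pattern and the threshold usually need to be tuned together.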

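The Snowball stemmers mentioned earlier are used in much the same way as the Porter stemmer. The following is a small sketch, assuming the SnowballStemmer class from nltk.stem.snowball with its default settings; the German sample word is only an illustration:

```python
from nltk.stem.snowball import SnowballStemmer

# Names of the languages for which a Snowball stemmer is available.
supported = SnowballStemmer.languages
print(supported)

# Choose a language by name, then stem a word with it.
german_stemmer = SnowballStemmer("german")
german_stem = german_stemmer.stem("Autobahnen")
print(german_stem)
```

Passing a language name that is not in SnowballStemmer.languages raises a ValueError, so it is worth checking the tuple before constructing a stemmer from user input.
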