Stemming

Stemming is a text preprocessing task that reduces related variants of a word (such as walking) to a common base form (walk), since the variants share the same core meaning. One of the most basic stemming transformations reduces a plural word to its singular form: apples becomes apple, for example. This is a very simple transformation, and more complex ones also exist. We will use the popular Porter stemmer, by Martin Porter, to illustrate this, as shown in the following code:

>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemming = PorterStemmer()
>>> stemming.stem("enjoying")
'enjoy'
>>> stemming.stem("enjoys")
'enjoy'
>>> stemming.stem("enjoyable")
'enjoy'

In this case, stemming has reduced the different verb (enjoying, enjoys) and adjective (enjoyable) forms of a word to its base form, enjoy. The Porter algorithm used by the stemmer applies language-specific rules (in this case, for English) to arrive at the stem. One of these rules removes suffixes such as ing from the word, as seen in the preceding example code. Stemming does not always produce a stem that is a word by itself, as shown in the following example:

>>> stemming.stem("variation")
'variat'
>>> stemming.stem("variate")
'variat'

Here, variat itself is not an English word. The nltk.stem.snowball module includes Snowball stemmers for other languages, such as French, Spanish, and German. Snowball is a small language for writing stemming algorithms, which makes it possible to define standard stemming rules for different languages. Just as with tokenizers, we can create custom stemmers using regular expressions:

>>> from nltk.stem import RegexpStemmer
>>> regexp_stemmer = RegexpStemmer("able$|ing$", min=4)
>>> regexp_stemmer.stem("flyable")
'fly'
>>> regexp_stemmer.stem("flying")
'fly'

The regex pattern, able$|ing$, removes the suffixes able and ing if they appear at the end of a word, and min specifies the minimum length a word must have to be stemmed; shorter words are returned unchanged.
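To make the role of the pattern and the length threshold concrete, here is a minimal sketch of the same idea in plain Python, using only the re module. The helper name regexp_stem and its min_length parameter are made up for illustration; this is not NLTK's implementation, only an approximation of the behavior described above.

```python
import re

def regexp_stem(word, pattern, min_length=0):
    """Remove every match of `pattern` from `word` (hypothetical helper)."""
    # Words shorter than min_length are returned unchanged.
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

pattern = r"able$|ing$"
print(regexp_stem("flyable", pattern, min_length=4))  # fly
print(regexp_stem("flying", pattern, min_length=4))   # fly
print(regexp_stem("ing", pattern, min_length=4))      # ing (too short to stem)
print(regexp_stem("king", pattern, min_length=4))     # k (over-stemmed)
```

The last call shows the main weakness of such a simple rule: the pattern knows nothing about the word itself, so king loses its ing just as flying does. Rule sets such as Porter's add conditions on the remaining stem to limit this kind of over-stemming.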

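The Snowball stemmers mentioned earlier can be tried in much the same way as the Porter stemmer. The following is a small sketch, assuming NLTK is installed; the sample words are arbitrary choices for illustration, and their stems are printed rather than stated here, since a stem need not be a dictionary word.

```python
from nltk.stem.snowball import SnowballStemmer

# Language names accepted by SnowballStemmer.
print(SnowballStemmer.languages)

english = SnowballStemmer("english")
print(english.stem("running"))  # run

# Try stemmers for a few other languages (sample words chosen arbitrarily).
samples = [("french", "mangeons"), ("spanish", "corriendo"), ("german", "laufen")]
for language, word in samples:
    stemmer = SnowballStemmer(language)
    print(language, word, "->", stemmer.stem(word))
```

Each stemmer is created by passing a language name to SnowballStemmer, so switching languages only requires changing that one argument.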