Stemming words

Stemming is a technique to remove affixes from a word, ending up with the stem. For example, the stem of cooking is cook, and a good stemming algorithm knows that the ing suffix can be removed. Stemming is most commonly used by search engines for indexing words. Instead of storing all forms of a word, a search engine can store only the stems, greatly reducing the size of the index while increasing retrieval accuracy.
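To make the indexing idea concrete, here is a minimal sketch of an inverted index keyed by stems rather than by surface forms. The toy_stem function is a hypothetical stand-in that strips only a few suffixes; it is not the Porter algorithm, and any real stemmer's stem() method could be passed in its place. The documents and helper names are invented for illustration.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stand-in stemmer: strips a few common suffixes only."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs, stem=toy_stem):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[stem(token)].add(doc_id)
    return index

docs = {
    1: "cooking with cast iron",
    2: "she cooked dinner",
    3: "two cooks in one kitchen",
}
index = build_index(docs)
# A query is stemmed the same way before lookup.
print(sorted(index[toy_stem("cooking")]))
```

Because cooking, cooked, and cooks all collapse to one key, the index holds fewer entries than there are distinct words, and a query for any one form finds all three documents.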

One of the most common stemming algorithms is the Porter stemming algorithm by Martin Porter. It is designed to remove and replace well-known suffixes of English words, and its usage in NLTK will be covered in the next section.

Note

The resulting stem is not always a valid word. For example, the stem of cookery is cookeri. This is a feature, not a bug.

How to do it...

NLTK comes with an implementation of the Porter stemming algorithm, which is very easy to use. Simply instantiate the PorterStemmer class and call the stem() method with the word you want to stem:

>>> from nltk.stem import PorterStemmer
>>> stemmer = PorterStemmer()
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookeri'

How it works...

The PorterStemmer class knows a number of regular word forms and suffixes and uses this knowledge to transform your input word to a final stem through a series of steps. The resulting stem is often a shorter word, or at least a common form of the word, which has the same root meaning.
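As a rough illustration of what "a series of steps" means, the sketch below implements simplified versions of two rules from the published Porter algorithm: step 1a (plural endings) and step 1c (a final y becomes i when the rest of the word contains a vowel). It is a teaching sketch under stated assumptions, not NLTK's implementation: the function names are made up here, only a, e, i, o, and u count as vowels, and all remaining steps are omitted.

```python
VOWELS = set("aeiou")  # simplification: Porter also treats some y's as vowels

def step_1a(word):
    """Simplified Porter step 1a: normalise plural endings."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word

def step_1c(word):
    """Simplified Porter step 1c: final y -> i if the rest has a vowel."""
    if word.endswith("y") and any(ch in VOWELS for ch in word[:-1]):
        return word[:-1] + "i"
    return word

def toy_porter(word):
    """Apply the two steps in order; the real algorithm has several more."""
    for step in (step_1a, step_1c):
        word = step(word)
    return word

for w in ("caresses", "ponies", "cats", "cookery", "sky"):
    print(w, "->", toy_porter(w))
```

The y-to-i rule is why cookery comes out as cookeri in the earlier example: the rules aim for a consistent stem, not a dictionary word.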

There's more...

There are other stemming algorithms besides the Porter stemming algorithm, such as the Lancaster stemming algorithm, developed at Lancaster University. NLTK includes it as the LancasterStemmer class. At the time of writing this book, there is no definitive research demonstrating the superiority of one algorithm over the other. However, the Porter stemming algorithm is generally the default choice.

All the stemmers covered next inherit from the StemmerI interface, which defines the stem() method. The following is an inheritance diagram that explains this:

The LancasterStemmer class

The LancasterStemmer class is used just like the PorterStemmer class, but it can produce slightly different results. The Lancaster algorithm is known to be slightly more aggressive than the Porter algorithm:

>>> from nltk.stem import LancasterStemmer
>>> stemmer = LancasterStemmer()
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookery'
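To see where the two stemmers differ on your own vocabulary, a small side-by-side comparison can help. This sketch assumes NLTK is installed; the word list and the compare helper are illustrative only, and no output is shown because results may vary between NLTK versions.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

def compare(words):
    """Return (word, porter_stem, lancaster_stem) for each input word."""
    porter = PorterStemmer()
    lancaster = LancasterStemmer()
    return [(w, porter.stem(w), lancaster.stem(w)) for w in words]

rows = compare(["cooking", "cookery", "maximum", "running"])
for word, p, l in rows:
    # Flag the rows where the two algorithms disagree.
    marker = "" if p == l else "  <- differs"
    print(f"{word:10} {p:10} {l:10}{marker}")
```

Running this over a sample of your real data is a reasonable way to decide whether the more aggressive stems suit your application.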

The RegexpStemmer class

You can also construct your own stemmer using the RegexpStemmer class. It takes a single regular expression (either compiled or as a string) and removes any part of the word that matches the expression, whether that is a prefix, a suffix, or something in between:

>>> from nltk.stem import RegexpStemmer
>>> stemmer = RegexpStemmer('ing')
>>> stemmer.stem('cooking')
'cook'
>>> stemmer.stem('cookery')
'cookery'
>>> stemmer.stem('ingleside')
'leside'

The RegexpStemmer class should only be used in very specific cases that the PorterStemmer and LancasterStemmer classes do not cover, because it handles only the patterns you give it and is not a general-purpose algorithm.
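The ingleside result above shows that an unanchored pattern matches anywhere in the word. The sketch below uses only Python's re module to show the effect of anchoring the pattern with $ and of enforcing a minimum word length. The regexp_stem helper is a hypothetical stand-in written for this illustration; it mimics the behavior described above rather than reproducing NLTK's code.

```python
import re

def regexp_stem(pattern, word, min_length=0):
    """Remove every match of pattern from word, unless the word is too short."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

# Unanchored: 'ing' is removed wherever it occurs.
print(regexp_stem("ing", "ingleside"))            # leside
# Anchored: only a trailing 'ing' is removed.
print(regexp_stem("ing$", "ingleside"))           # ingleside
print(regexp_stem("ing$", "cooking"))             # cook
# A minimum length protects short words such as 'sing'.
print(regexp_stem("ing$", "sing", min_length=5))  # sing
```

The same anchored pattern string, 'ing$', can be passed to RegexpStemmer, which also accepts an optional min argument for the minimum word length.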

The SnowballStemmer class

The SnowballStemmer class supports 13 non-English languages. It also provides two English stemmers: the original Porter algorithm as well as the newer English stemming algorithm. To use the SnowballStemmer class, create an instance with the name of the language you are using and then call the stem() method. Here is a list of all the supported languages and an example using the Spanish SnowballStemmer class:

>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer.languages
('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
>>> spanish_stemmer = SnowballStemmer('spanish')
>>> spanish_stemmer.stem('hola')
'hol'

See also

In the next recipe, we will cover lemmatization, which is quite similar to stemming, but subtly different.
