
Tokenizing sentences into words

In this recipe, we'll split a sentence into individual words. The simple task of creating a list of words from a string is an essential part of all text processing.

How to do it...

Basic word tokenization is very simple; use the word_tokenize() function:

>>> from nltk.tokenize import word_tokenize
>>> word_tokenize('Hello World.')
['Hello', 'World', '.']

How it works...

The word_tokenize() function is a wrapper function that calls tokenize() on an instance of the TreebankWordTokenizer class. It's equivalent to the following code:

>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize('Hello World.')
['Hello', 'World', '.']

It works by splitting the text on whitespace and punctuation. As you can see, it does not discard the punctuation; each punctuation mark is returned as its own token, leaving you to decide what to do with it.
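Because punctuation comes back as ordinary tokens, discarding it is a simple filter. The following is a minimal sketch that uses only the standard library; the helper name strip_punctuation is hypothetical (it is not part of NLTK), and the token list is the output shown above.

```python
import string

def strip_punctuation(tokens):
    """Return only the tokens that are not made up entirely of ASCII punctuation."""
    punct = set(string.punctuation)
    return [tok for tok in tokens if not all(ch in punct for ch in tok)]

tokens = ['Hello', 'World', '.']  # output of word_tokenize('Hello World.')
print(strip_punctuation(tokens))  # ['Hello', 'World']
```

Tokens that merely contain punctuation, such as "n't", are kept, since only tokens consisting wholly of punctuation characters are dropped.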

There's more...

Besides the self-explanatory WhitespaceTokenizer and SpaceTokenizer, there are two other word tokenizers worth looking at: PunktWordTokenizer and WordPunctTokenizer. They differ from TreebankWordTokenizer in how they handle punctuation and contractions, but they all inherit from TokenizerI. The inheritance tree is shown in the following diagram:
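The inheritance relationship can also be inspected directly in code. This is a rough sketch that assumes NLTK is installed and that TokenizerI is importable from nltk.tokenize.api; PunktWordTokenizer is left out because it may not be available in the installed NLTK version. The intermediate classes that get printed depend on the NLTK release, so no output is shown.

```python
from nltk.tokenize import (
    SpaceTokenizer,
    TreebankWordTokenizer,
    WhitespaceTokenizer,
    WordPunctTokenizer,
)
from nltk.tokenize.api import TokenizerI

classes = [TreebankWordTokenizer, WordPunctTokenizer,
           WhitespaceTokenizer, SpaceTokenizer]

for cls in classes:
    # Walk the method resolution order, keeping only tokenizer classes,
    # to show the path from each tokenizer up to TokenizerI.
    chain = [c.__name__ for c in cls.__mro__ if issubclass(c, TokenizerI)]
    print(' -> '.join(chain))
```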

Separating contractions

The TreebankWordTokenizer class uses conventions found in the Penn Treebank corpus. This corpus is one of the most widely used corpora for natural language processing, and was created in the 1980s by annotating articles from the Wall Street Journal. We'll be using it later in Chapter 4, Part-of-speech Tagging, and Chapter 5, Extracting Chunks.

One of the tokenizer's most significant conventions is to separate contractions. For example, consider the following code:

>>> word_tokenize("can't")
['ca', "n't"]

If you find this convention unacceptable, then read on for alternatives, and see the next recipe for tokenizing with regular expressions.
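One lightweight workaround, before switching tokenizers, is to glue the contraction pieces back together after tokenizing. The sketch below is an illustration only: merge_contractions is a hypothetical helper, not an NLTK function, and it assumes that any token equal to "n't", or starting with an apostrophe followed by letters, belongs to the preceding token. That assumption can misfire on words that legitimately begin with an apostrophe.

```python
def merge_contractions(tokens):
    """Glue Treebank-style contraction pieces back onto the preceding token."""
    merged = []
    for tok in tokens:
        # "n't" and clitics such as "'s", "'ll", "'re" count as contraction pieces
        is_piece = tok == "n't" or (
            tok.startswith("'") and len(tok) > 1 and tok[1:].isalpha()
        )
        if is_piece and merged:
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged

print(merge_contractions(['ca', "n't"]))  # ["can't"]
```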

PunktWordTokenizer

An alternative word tokenizer is PunktWordTokenizer. It splits on punctuation, but keeps it with the word instead of creating separate tokens. Note that NLTK releases from 3.0 onward no longer include this class, so the following code applies to the NLTK 2.x series:

>>> from nltk.tokenize import PunktWordTokenizer
>>> tokenizer = PunktWordTokenizer()
>>> tokenizer.tokenize("Can't is a contraction.")
['Can', "'t", 'is', 'a', 'contraction.']

WordPunctTokenizer

Another alternative word tokenizer is WordPunctTokenizer. It splits all punctuation into separate tokens:

>>> from nltk.tokenize import WordPunctTokenizer
>>> tokenizer = WordPunctTokenizer()
>>> tokenizer.tokenize("Can't is a contraction.")
['Can', "'", 't', 'is', 'a', 'contraction', '.']
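The same behavior can be approximated with a single regular expression, which makes the splitting rule explicit. This is a minimal standard-library sketch; the pattern is the one WordPunctTokenizer is understood to be built on (runs of word characters, or runs of non-word, non-space characters), and word_punct_tokenize is a hypothetical stand-in name rather than the NLTK API.

```python
import re

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (that is, punctuation).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Split text into alphanumeric tokens and punctuation tokens."""
    return WORD_PUNCT.findall(text)

print(word_punct_tokenize("Can't is a contraction."))
# ['Can', "'", 't', 'is', 'a', 'contraction', '.']
```

Because the second alternative matches a run of characters, adjacent punctuation marks such as "..." stay together as one token rather than being split character by character.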

See also

For more control over word tokenization, you'll want to read the next recipe to learn how to use regular expressions and the RegexpTokenizer for tokenization. And for more on the Penn Treebank corpus, visit http://www.cis.upenn.edu/~treebank/.
