
Tokenization

Word tokens are the basic units of text involved in any NLP task. The first step when processing text is to split it into tokens. NLTK provides different types of tokenizers for doing this. We will look at how to tokenize Twitter comments from the Twitter samples corpus available in NLTK. From now on, all of the illustrated code can be run in the standard Python interpreter on the command line:

>>> import nltk
>>> from nltk.corpus import twitter_samples as ts
>>> ts.fileids()
['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']
>>> samples_tw = ts.strings('tweets.20150430-223406.json')
>>> samples_tw[20]
"@B0MBSKARE the anti-Scottish feeling is largely a product of Tory press scaremongering. In practice most people won't give a toss!"
>>> from nltk.tokenize import word_tokenize as wtoken
>>> wtoken(samples_tw[20])
['@', 'B0MBSKARE', 'the', 'anti-Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', '.', 'In', 'practice', 'most', 'people', 'wo', "n't", 'give', 'a', 'toss', '!']
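The same function can be mapped over the whole file to tokenize every tweet rather than a single string. The following is a minimal sketch that assumes the same interpreter session as above (samples_tw and wtoken already defined); the variable names are our own:

>>> tokenized_tweets = [wtoken(tweet) for tweet in samples_tw]   # one token list per tweet
>>> total_tokens = sum(len(tokens) for tokens in tokenized_tweets)   # corpus-wide token count

Note that word_tokenize relies on the punkt tokenizer models; if NLTK reports that resource as missing, running nltk.download('punkt') will fetch it.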

To split text based on punctuation and whitespace, NLTK provides the wordpunct_tokenize tokenizer. It also returns the punctuation characters as separate tokens. This step is illustrated in the following code:

>>> from nltk.tokenize import wordpunct_tokenize
>>> samples_tw[20]
"@B0MBSKARE the anti-Scottish feeling is largely a product of Tory press scaremongering. In practice most people won't give a toss!"
>>> wordpunct_tokenize(samples_tw[20])
['@', 'B0MBSKARE', 'the', 'anti', '-', 'Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', '.', 'In', 'practice', 'most', 'people', 'won', "'", 't', 'give', 'a', 'toss', '!']

As you can see, compared to word_tokenize, the hyphenated word is split into its parts and the remaining punctuation marks become separate tokens. We can build custom tokenizers using NLTK's regular expression tokenizer, as shown in the following code:

>>> from nltk import regexp_tokenize
>>> patn = r'\w+'
>>> regexp_tokenize(samples_tw[20],patn)
['B0MBSKARE', 'the', 'anti', 'Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', 'In', 'practice', 'most', 'people', 'won', 't', 'give', 'a', 'toss']

In the preceding code, we used a simple regular expression (regexp) that matches runs of word characters (letters, digits, and the underscore). As another example, we will use a regular expression that detects words along with a few punctuation characters:

>>> patn = r'\w+|[!,\-]'
>>> regexp_tokenize(samples_tw[20],patn)
['B0MBSKARE', 'the', 'anti', '-', 'Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', 'In', 'practice', 'most', 'people', 'won', 't', 'give', 'a', 'toss', '!']

By changing the regexp pattern to include a few punctuation marks, those characters are now kept as separate tokens, which is apparent from the ! and - entries in the resulting Python list.
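For repeated use, the same idea can be packaged with NLTK's RegexpTokenizer class, which is the class behind regexp_tokenize and compiles the pattern once. The sketch below is only an illustration: the pattern, which additionally keeps Twitter handles and hashtags intact, and the variable names are our own choices rather than anything prescribed by NLTK:

>>> from nltk.tokenize import RegexpTokenizer
>>> tweet_patn = r'@\w+|#\w+|\w+|[!\-]'   # handles, hashtags, words, and selected punctuation
>>> tweet_tokenizer = RegexpTokenizer(tweet_patn)
>>> tweet_tokenizer.tokenize(samples_tw[20])[:4]
['@B0MBSKARE', 'the', 'anti', '-']

Note that the order of the alternatives matters: the @\w+ and #\w+ branches must come before \w+ so that a handle or hashtag is matched as a single token instead of being split.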
