
Tokenization

Word tokens are the basic units of text involved in any NLP task. The first step when processing text is to split it into tokens. NLTK provides different types of tokenizers for doing this. We will look at how to tokenize tweets from the twitter_samples corpus available in NLTK. From now on, all of the illustrated code can be run in the standard Python interpreter on the command line:

>>> import nltk
>>> from nltk.corpus import twitter_samples as ts
>>> ts.fileids()
['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']
>>> samples_tw = ts.strings('tweets.20150430-223406.json')
>>> samples_tw[20]
"@B0MBSKARE the anti-Scottish feeling is largely a product of Tory press scaremongering. In practice most people won't give a toss!"
>>> from nltk.tokenize import word_tokenize as wtoken
>>> wtoken(samples_tw[20])
['@', 'B0MBSKARE', 'the', 'anti-Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', '.', 'In', 'practice', 'most', 'people', 'wo', "n't", 'give', 'a', 'toss', '!']
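
Notice that word_tokenize splits the Twitter handle into two tokens, '@' and 'B0MBSKARE'. For tweet-specific text, NLTK also provides a TweetTokenizer class that is designed to keep handles, hashtags, and emoticons intact. The following is a minimal sketch reusing the same samples_tw list (output not shown):

>>> from nltk.tokenize import TweetTokenizer
>>> tweet_tokenizer = TweetTokenizer()
>>> tweet_tokenizer.tokenize(samples_tw[20])  # keeps '@B0MBSKARE' as a single token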

To split text based on punctuation and whitespace, NLTK provides the wordpunct_tokenize tokenizer, which splits off every punctuation character as a separate token. This step is illustrated in the following code:

>>> samples_tw[20]
"@B0MBSKARE the anti-Scottish feeling is largely a product of Tory press scaremongering. In practice most people won't give a toss!"
>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize(samples_tw[20])
['@', 'B0MBSKARE', 'the', 'anti', '-', 'Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', '.', 'In', 'practice', 'most', 'people', 'won', "'", 't', 'give', 'a', 'toss', '!']

As you can see, compared to word_tokenize, hyphenated words are split into separate tokens, and the remaining punctuation marks (such as the apostrophe in won't) are tokenized as well. We can also build custom tokenizers using NLTK's regular expression tokenizer, as shown in the following code:

>>> from nltk import regexp_tokenize
>>> patn = r'\w+'
>>> regexp_tokenize(samples_tw[20],patn)
['B0MBSKARE', 'the', 'anti', 'Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', 'In', 'practice', 'most', 'people', 'won', 't', 'give', 'a', 'toss']

In the preceding code, we used a simple regular expression (regexp) that matches words made up only of alphanumeric characters (and underscores). As another example, we will use a regular expression that captures words along with a few punctuation characters:

>>> patn = r'\w+|[!,\-]'
>>> regexp_tokenize(samples_tw[20],patn)
['B0MBSKARE', 'the', 'anti', '-', 'Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', 'In', 'practice', 'most', 'people', 'won', 't', 'give', 'a', 'toss', '!']

By extending the regexp pattern to include punctuation marks, we were able to keep those characters as tokens, which is apparent from the ! and - tokens present in the resulting Python list.
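
The regexp_tokenize function is a convenience wrapper around the RegexpTokenizer class, which can be instantiated once and reused, and which also accepts a gaps argument for matching the separators rather than the tokens themselves. A minimal sketch, reusing the same tweet (output not shown):

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+|[!,\-]')
>>> tokenizer.tokenize(samples_tw[20])   # same tokens as regexp_tokenize above
>>> splitter = RegexpTokenizer(r'\s+', gaps=True)
>>> splitter.tokenize(samples_tw[20])    # splits on whitespace, so punctuation stays attached to words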
