- Hands-On Natural Language Processing with Python
- Rajesh Arumugam, Rajalingappaa Shanmugamani
Tokenization
Word tokens are the basic units of text involved in any NLP task. The first step when processing text is to split it into tokens. NLTK provides different types of tokenizers for doing this. We will look at how to tokenize Twitter comments from the twitter_samples corpus available in NLTK. From now on, all of the illustrated code can be run with the standard Python interpreter on the command line:
>>> import nltk
>>> from nltk.corpus import twitter_samples as ts
>>> ts.fileids()
['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']
>>> samples_tw = ts.strings('tweets.20150430-223406.json')
>>> samples_tw[20]
"@B0MBSKARE the anti-Scottish feeling is largely a product of Tory press scaremongering. In practice most people won't give a toss!"
>>> from nltk.tokenize import word_tokenize as wtoken
>>> wtoken(samples_tw[20])
['@', 'B0MBSKARE', 'the', 'anti-Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', '.', 'In', 'practice', 'most', 'people', 'wo', "n't", 'give', 'a', 'toss', '!']
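If the Twitter samples corpus or the tokenizer models are not installed yet, they can be fetched once through nltk.download; a minimal sketch (the exact resource names may vary slightly between NLTK releases):
>>> import nltk
>>> nltk.download('twitter_samples')  # the corpus read above
>>> nltk.download('punkt')            # the Punkt models that word_tokenize relies on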
To split text based on punctuation and whitespace, NLTK provides the wordpunct_tokenize tokenizer, which also emits the punctuation characters as separate tokens. This is illustrated in the following code:
>>> samples_tw[20]
"@B0MBSKARE the anti-Scottish feeling is largely a product of Tory press scaremongering. In practice most people won't give a toss!"
>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize(samples_tw[20])
['@', 'B0MBSKARE', 'the', 'anti', '-', 'Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', '.', 'In', 'practice', 'most', 'people', 'won', "'", 't', 'give', 'a', 'toss', '!']
As you can see, compared with word_tokenize, the hyphenated words are split into separate tokens, as are the remaining punctuation marks.
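Under the hood, wordpunct_tokenize is essentially a regular-expression tokenizer with the pattern \w+|[^\w\s]+, which is why the hyphen and the apostrophe appear as tokens of their own. A minimal sketch of the equivalent call, assuming the RegexpTokenizer class from nltk.tokenize, would be:
>>> from nltk.tokenize import RegexpTokenizer
>>> RegexpTokenizer(r'\w+|[^\w\s]+').tokenize(samples_tw[20])  # should reproduce the token list above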
We can also build custom tokenizers using NLTK's regular expression tokenizer, as shown in the following code:
>>> from nltk import regexp_tokenize
>>> patn = r'\w+'
>>> regexp_tokenize(samples_tw[20],patn)
['B0MBSKARE', 'the', 'anti', 'Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', 'In', 'practice', 'most', 'people', 'won', 't', 'give', 'a', 'toss']
In the preceding code, we used a simple regular expression (regexp) that keeps only runs of word characters, so all of the punctuation disappears from the output.
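regexp_tokenize can also be told to treat the pattern as describing the separators between tokens rather than the tokens themselves, through its gaps keyword argument; a small sketch that splits on runs of whitespace (so punctuation stays attached to the neighbouring word) might look like this:
>>> regexp_tokenize(samples_tw[20], r'\s+', gaps=True)  # e.g. 'toss!' would remain a single token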
As another example, we will use a regular expression that detects words along with a few punctuation characters:
>>> patn = r'\w+|[!,\-]'
>>> regexp_tokenize(samples_tw[20],patn)
['B0MBSKARE', 'the', 'anti', '-', 'Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', 'In', 'practice', 'most', 'people', 'won', 't', 'give', 'a', 'toss', '!']
By changing the regexp pattern to include punctuation marks, we were able to keep those characters as tokens, which is apparent from the ! and - entries in the resulting Python list.
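Finally, since we are dealing with tweets, it is worth mentioning that NLTK also ships a tokenizer tuned for this kind of text, TweetTokenizer, which keeps @-handles, hashtags, and emoticons together as single tokens; a brief sketch (output not shown) would be:
>>> from nltk.tokenize import TweetTokenizer
>>> tweet_tokenizer = TweetTokenizer()
>>> tweet_tokenizer.tokenize(samples_tw[20])  # '@B0MBSKARE' should come out as one token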