- Hands-On Natural Language Processing with Python
- Rajesh Arumugam, Rajalingappaa Shanmugamani
Tokenization
Word tokens are the basic units of text involved in any NLP task. The first step in processing text is to split it into tokens. NLTK provides different types of tokenizers for doing this. We will look at how to tokenize Twitter comments from the Twitter samples corpus available in NLTK. From now on, all of the illustrated code can be run in the standard Python interpreter on the command line:
>>> import nltk
>>> from nltk.corpus import twitter_samples as ts
>>> ts.fileids()
['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']
>>> samples_tw = ts.strings('tweets.20150430-223406.json')
>>> samples_tw[20]
"@B0MBSKARE the anti-Scottish feeling is largely a product of Tory press scaremongering. In practice most people won't give a toss!"
>>> from nltk.tokenize import word_tokenize as wtoken
>>> wtoken(samples_tw[20])
['@', 'B0MBSKARE', 'the', 'anti-Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', '.', 'In', 'practice', 'most', 'people', 'wo', "n't", 'give', 'a', 'toss', '!']
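If NLTK's data packages have not been downloaded yet, the corpus reader and word_tokenize will raise a LookupError. A one-time download of the two resources used here fixes that; the resource names below are the standard NLTK identifiers (newer NLTK releases may additionally ask for punkt_tab):
>>> import nltk
>>> nltk.download('twitter_samples')
>>> nltk.download('punkt')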
To split text based on punctuation and whitespace, NLTK provides the wordpunct_tokenize tokenizer, which also emits punctuation characters as separate tokens. This is illustrated in the following code:
>>> from nltk.tokenize import wordpunct_tokenize
>>> samples_tw[20]
"@B0MBSKARE the anti-Scottish feeling is largely a product of Tory press scaremongering. In practice most people won't give a toss!"
>>> wordpunct_tokenize(samples_tw[20])
['@', 'B0MBSKARE', 'the', 'anti', '-', 'Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', '.', 'In', 'practice', 'most', 'people', 'won', "'", 't', 'give', 'a', 'toss', '!']
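Since these are tweets, it is worth noting that NLTK also provides a dedicated TweetTokenizer, which keeps Twitter handles such as @B0MBSKARE together as a single token instead of splitting off the @ character. A minimal sketch on the same sample (only the first two tokens are shown):
>>> from nltk.tokenize import TweetTokenizer
>>> tweet_tok = TweetTokenizer()
>>> tweet_tok.tokenize(samples_tw[20])[:2]
['@B0MBSKARE', 'the']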
As you can see, compared to word_tokenize, hyphenated words are split into separate tokens, and other punctuation marks such as the apostrophe are tokenized as well. We can build custom tokenizers using NLTK's regular expression tokenizer, as shown in the following code:
>>> from nltk import regexp_tokenize
>>> patn = r'\w+'
>>> regexp_tokenize(samples_tw[20],patn)
['B0MBSKARE', 'the', 'anti', 'Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', 'In', 'practice', 'most', 'people', 'won', 't', 'give', 'a', 'toss']
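regexp_tokenize is a convenience function; when the same pattern is applied to many strings, the equivalent RegexpTokenizer class can be instantiated once and reused. A minimal sketch producing the same tokens as above (only the first five are shown):
>>> from nltk.tokenize import RegexpTokenizer
>>> word_only = RegexpTokenizer(r'\w+')
>>> word_only.tokenize(samples_tw[20])[:5]
['B0MBSKARE', 'the', 'anti', 'Scottish', 'feeling']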
In the preceding code, we used a simple regular expression (regexp) that matches runs of word characters (alphanumerics and the underscore), so all punctuation is dropped. As another example, we will use a regular expression that detects words along with a few punctuation characters:
>>> patn = r'\w+|[!,\-]'
>>> regexp_tokenize(samples_tw[20],patn)
['B0MBSKARE', 'the', 'anti', '-', 'Scottish', 'feeling', 'is', 'largely', 'a', 'product', 'of', 'Tory', 'press', 'scaremongering', 'In', 'practice', 'most', 'people', 'won', 't', 'give', 'a', 'toss', '!']
By changing the regexp pattern to include punctuation marks, we were able to keep them as tokens, which is apparent from the ! and - tokens present in the resulting Python list.
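The regexp tokenizer can also work the other way around: if gaps=True is passed, the pattern describes the separators rather than the tokens themselves. The following sketch splits the tweet on runs of whitespace, so punctuation stays attached to the words (only the first four tokens are shown):
>>> regexp_tokenize(samples_tw[20], r'\s+', gaps=True)[:4]
['@B0MBSKARE', 'the', 'anti-Scottish', 'feeling']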