Tokenization
Given a text sequence, tokenization is the task of breaking it into fragments, which can be words, characters, or sentences. Certain characters, such as punctuation marks, digits, and emoticons, are usually removed. The resulting fragments are the so-called tokens used for further processing. Moreover, tokens composed of one word are also called unigrams in computational linguistics; bigrams are composed of two consecutive words; trigrams of three consecutive words; and n-grams of n consecutive words. Here is an example of tokenization:

We can implement word-based tokenization using the word_tokenize function from NLTK. We will use the input text '''I am reading a book. It is Python Machine Learning By Example, 2nd edition.''', spread over three lines, as an example, as shown in the following commands:
>>> from nltk.tokenize import word_tokenize
>>> sent = '''I am reading a book.
... It is Python Machine Learning By Example,
... 2nd edition.'''
>>> print(word_tokenize(sent))
['I', 'am', 'reading', 'a', 'book', '.', 'It', 'is', 'Python', 'Machine', 'Learning', 'By', 'Example', ',', '2nd', 'edition', '.']
Word tokens are obtained.
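From these word tokens, we can also form the bigrams mentioned earlier. Here is a minimal sketch using NLTK's ngrams utility (only the first three bigrams are printed):
>>> from nltk.util import ngrams
>>> print(list(ngrams(word_tokenize(sent), 2))[:3])
[('I', 'am'), ('am', 'reading'), ('reading', 'a')]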
You might think word tokenization is simply a matter of splitting a sentence by spaces and punctuation. Here is an interesting example showing that tokenization is more complex than you might think:
>>> sent2 = 'I have been to U.K. and U.S.A.'
>>> print(word_tokenize(sent2))
['I', 'have', 'been', 'to', 'U.K.', 'and', 'U.S.A', '.']
The tokenizer accurately recognizes 'U.K.' and 'U.S.A' as tokens, rather than splitting them into 'U' and '.' followed by 'K', for example.
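For comparison, a naive split based on spaces and punctuation (a simple regular expression baseline, not part of NLTK) would break these abbreviations apart:
>>> import re
>>> print(re.findall(r'\w+|[^\w\s]', sent2))
['I', 'have', 'been', 'to', 'U', '.', 'K', '.', 'and', 'U', '.', 'S', '.', 'A', '.']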
spaCy also has an outstanding tokenization feature. It uses an accurately trained model that is constantly updated. To install its English model, en_core_web_sm (assuming spaCy itself is already installed), we can run the following command:
python -m spacy download en_core_web_sm
Then, we'll load the en_core_web_sm model and parse the sentence using this model:
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> tokens2 = nlp(sent2)
>>> print([token.text for token in tokens2])
['I', 'have', 'been', 'to', 'U.K.', 'and', 'U.S.A.']
We can also segment text into sentences. For example, on the same input text, we can use the sent_tokenize function from NLTK, as shown in the following commands:
>>> from nltk.tokenize import sent_tokenize
>>> print(sent_tokenize(sent))
['I am reading a book.', 'It is Python Machine Learning By Example,\n2nd edition.']
Two sentence-based tokens are returned, as there are two sentences in the input text, regardless of the newline following the comma.
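spaCy can also perform sentence segmentation through the sents attribute of a parsed document. Here is a minimal sketch reusing the en_core_web_sm model loaded earlier; it typically yields the same two sentences, although the exact split depends on the model:
>>> doc = nlp(sent)
>>> print([sentence.text for sentence in doc.sents])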