
Tokenization

Given a text sequence, tokenization is the task of breaking it into fragments, which can be words, characters, or sentences. Certain characters, such as punctuation marks, digits, and emoticons, are sometimes removed in the process. The resulting fragments are the so-called tokens used for further processing. Moreover, tokens composed of one word are also called unigrams in computational linguistics; bigrams are composed of two consecutive words; trigrams of three consecutive words; and n-grams of n consecutive words. Here is an example of tokenization:

We can implement word-based tokenization using the word_tokenize function in NLTK. We will use the input text 'I am reading a book. It is Python Machine Learning By Example, 2nd edition.', spread over three lines, as an example, as shown in the following commands:

>>> from nltk.tokenize import word_tokenize
>>> sent = '''I am reading a book.
... It is Python Machine Learning By Example,
... 2nd edition.'''
>>> print(word_tokenize(sent))
['I', 'am', 'reading', 'a', 'book', '.', 'It', 'is', 'Python', 'Machine', 'Learning', 'By', 'Example', ',', '2nd', 'edition', '.']

Word tokens are obtained.

The word_tokenize function keeps punctuation marks and digits, and only discards whitespace characters, such as spaces and newlines.
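Recall the n-gram terminology introduced earlier. As a quick sketch (using NLTK's ngrams helper from nltk.util, which is not part of the original example), we can build bigrams and trigrams from the word tokens we just obtained:

>>> from nltk.util import ngrams
>>> tokens = word_tokenize(sent)
>>> print(list(ngrams(tokens, 2))[:3])
[('I', 'am'), ('am', 'reading'), ('reading', 'a')]
>>> print(list(ngrams(tokens, 3))[:2])
[('I', 'am', 'reading'), ('am', 'reading', 'a')]

Each bigram is a pair of consecutive tokens, and each trigram is a triple of consecutive tokens, exactly as defined above.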

You might think word tokenization is simply splitting a sentence by spaces and punctuation. Here's an interesting example showing that tokenization is more complex than you might think:

>>> sent2 = 'I have been to U.K. and U.S.A.'
>>> print(word_tokenize(sent2))
['I', 'have', 'been', 'to', 'U.K.', 'and', 'U.S.A', '.']

The tokenizer accurately recognizes 'U.K.' and 'U.S.A' as single tokens, instead of splitting them into separate tokens such as 'U', '.', and 'K'.
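To see why naive splitting falls short, here is a minimal sketch comparing it with word_tokenize; the regular expression below is just one hypothetical naive splitter and is not part of the original example:

>>> import re
>>> print(re.findall(r'\w+', sent2))
['I', 'have', 'been', 'to', 'U', 'K', 'and', 'U', 'S', 'A']

The naive approach breaks the abbreviations apart, while word_tokenize keeps 'U.K.' and 'U.S.A' intact.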

spaCy also has an outstanding tokenization feature. It relies on an accurately trained model that is constantly updated. To install its small English model, en_core_web_sm, we can run the following command:

python -m spacy download en_core_web_sm
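
This assumes the spaCy package itself is already installed; if it is not, it can be installed first with the following command:

pip install spacy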

Then, we'll load the en_core_web_sm model and parse the sentence using this model:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> tokens2 = nlp(sent2)
>>> print([token.text for token in tokens2])
['I', 'have', 'been', 'to', 'U.K.', 'and', 'U.S.A.']

We can also segment text based on sentence. For example, on the same input text, using the sent_tokenize function from NLTK, we have the following commands:

>>> from nltk.tokenize import sent_tokenize
>>> print(sent_tokenize(sent))
['I am reading a book.', 'It is Python Machine Learning By Example,\n2nd edition.']

Two sentence-based tokens are returned, as there are two sentences in the input text, regardless of the newline after the comma.
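spaCy can segment sentences as well, through the doc.sents iterator of a parsed document. As a minimal sketch reusing the nlp model loaded earlier (this snippet is not part of the original example), we can write:

>>> doc = nlp(sent)
>>> print([sentence.text for sentence in doc.sents])

This should return sentence-level segments comparable to the sent_tokenize output above.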
