
Tokenizing sentences using regular expressions

Regular expressions can be used if you want complete control over how to tokenize text. As regular expressions can get complicated very quickly, I only recommend using them if the word tokenizers covered in the previous recipe are unacceptable.

Getting ready

First, you need to decide how you want to tokenize a piece of text, as this will determine how you construct your regular expression. The choices are:

  • Match on the tokens
  • Match on the separators or gaps

We'll start with an example of the first approach: matching alphanumeric characters plus single quotes so that we don't split up contractions.

How to do it...

We'll create an instance of RegexpTokenizer, giving it a regular expression string to use for matching tokens:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r"[\w']+")
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction']

There's also a simple helper function you can use if you don't want to instantiate the class, as shown in the following code:

>>> from nltk.tokenize import regexp_tokenize
>>> regexp_tokenize("Can't is a contraction.", r"[\w']+")
["Can't", 'is', 'a', 'contraction']

Now we finally have something that can treat contractions as whole words, instead of splitting them into separate tokens.

How it works...

The RegexpTokenizer class works by compiling your pattern, then calling re.findall() on your text. You could do all this yourself using the re module, but RegexpTokenizer implements the TokenizerI interface, just like all the word tokenizers from the previous recipe. This means it can be used by other parts of the NLTK package, such as corpus readers, which we'll cover in detail in Chapter 3, Creating Custom Corpora. Many corpus readers need a way to tokenize the text they're reading, and can take optional keyword arguments specifying an instance of a TokenizerI subclass. This way, you can provide your own tokenizer instance if the default tokenizer is unsuitable.
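
If you're curious, here's a rough sketch of the equivalent calls using the re module directly (illustrative only; the real class adds the TokenizerI methods on top of this):

>>> import re
>>> # Compile the same token pattern and find all matches, as RegexpTokenizer does
>>> pattern = re.compile(r"[\w']+")
>>> pattern.findall("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction']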

There's more...

RegexpTokenizer can also work by matching the gaps, as opposed to the tokens. Instead of using re.findall(), the RegexpTokenizer class will use re.split(). This is how the BlanklineTokenizer class in nltk.tokenize is implemented.
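
For example, BlanklineTokenizer splits text wherever blank lines occur: the gap pattern matches the blank lines, and the paragraphs around them become the tokens. A quick illustration:

>>> from nltk.tokenize import BlanklineTokenizer
>>> # The blank line between the two paragraphs is treated as the gap
>>> BlanklineTokenizer().tokenize("First paragraph.\n\nSecond paragraph.")
['First paragraph.', 'Second paragraph.']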

Simple whitespace tokenizer

The following is a simple example of using RegexpTokenizer to tokenize on whitespace:

>>> tokenizer = RegexpTokenizer(r'\s+', gaps=True)
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction.']

Notice that punctuation still remains in the tokens. The gaps=True parameter means that the pattern is used to identify gaps to tokenize on. If we used gaps=False, then the pattern would be used to identify tokens.
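
To see the difference, the same whitespace pattern with gaps=False would match the whitespace runs themselves rather than the words between them, which is rarely what you want (a quick illustration):

>>> tokenizer = RegexpTokenizer(r'\s+', gaps=False)
>>> # With gaps=False, the pattern identifies the tokens, so we get the spaces back
>>> tokenizer.tokenize("Can't is a contraction.")
[' ', ' ', ' ']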

See also

For simpler word tokenization, see the previous recipe.
