Removing repeating characters

In everyday language, people are often not strictly grammatical. They will write things such as "I looooooove it" to emphasize the word "love". However, a computer does not know that "looooooove" is a variation of "love" unless it is told. This recipe presents a method for removing these repeated characters so that we end up with a proper English word.

Getting ready

As in the previous recipe, we will be making use of the re module, and more specifically, backreferences. A backreference is a way to refer to a previously matched group in a regular expression. This will allow us to match and remove repeating characters.
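
To make the idea of a backreference concrete, here is a minimal sketch that is separate from the recipe's code. The pattern and the variable names are chosen only for illustration: \1 matches whatever text group 1 captured, so the pattern finds a character that is immediately followed by itself.

```python
import re

# (\w) captures one word character; \1 requires that same character again.
doubled = re.compile(r'(\w)\1')

match = doubled.search('hello')
first_pair = match.group(0)   # the doubled letters that were found: 'll'
captured = match.group(1)     # the single character captured by group 1: 'l'

# 'world' has no adjacent duplicate characters, so search() returns None.
no_match = doubled.search('world')
```

The recipe's pattern uses \2 rather than \1 because the repeated character is captured by its second group.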

How to do it...

We will create a class that has the same form as the RegexpReplacer class from the previous recipe. It will have a replace() method that takes a single word and returns a more correct version of that word, with the dubious repeating characters removed. This code can be found in replacers.py in the book's code bundle and is meant to be imported:

import re

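class RepeatReplacer(object):
  def __init__(self):
    # \2 is a backreference: it matches the character captured by group 2.
    self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
    # Keep all three groups, dropping the duplicate matched by \2.
    self.repl = r'\1\2\3'

  def replace(self, word):
    repl_word = self.repeat_regexp.sub(self.repl, word)

    # Each pass removes one repeated character; recurse until nothing changes.
    if repl_word != word:
      return self.replace(repl_word)
    else:
      return repl_word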

And now some example use cases:

>>> from replacers import RepeatReplacer
>>> replacer = RepeatReplacer()
>>> replacer.replace('looooove')
'love'
>>> replacer.replace('oooooh')
'oh'
>>> replacer.replace('goose')
'gose'

How it works...

The RepeatReplacer class starts by compiling a regular expression to match against and defining a replacement string that uses backreferences. The repeat_regexp pattern matches three groups:

  • 0 or more starting characters (\w*)
  • A single character (\w) that is followed by another instance of that character (\2)
  • 0 or more ending characters (\w*)

The replacement string keeps all three matched groups and discards the character matched by the backreference to the second group. So, the word looooove gets split into (looo)(o)o(ve) and then recombined as loooove, discarding the last o. Because the replace() method calls itself whenever the word changed, this repeats until only one o remains, at which point repeat_regexp no longer matches the string and no more characters are removed.
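
The step-by-step shrinking described above can be observed with a small sketch. The trace() helper below is hypothetical and not part of replacers.py. It applies the same pattern and replacement string in a loop and records every intermediate form.

```python
import re

# Same pattern and replacement string as in RepeatReplacer.
repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
repl = r'\1\2\3'

def trace(word):
    """Return every intermediate form, from the input to the final word."""
    steps = [word]
    while True:
        shorter = repeat_regexp.sub(repl, word)
        if shorter == word:
            # The pattern no longer matches, so nothing more is removed.
            return steps
        steps.append(shorter)
        word = shorter

steps = trace('looooove')
for step in steps:
    print(step)
```

Each printed line is one character shorter than the one before it, and the last line is love.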

There's more...

In the preceding examples, you can see that the RepeatReplacer class is a bit too aggressive and ends up changing goose into gose. To correct this issue, we can augment the replace() method with a WordNet lookup. If WordNet recognizes the word, then we can stop replacing characters. Here is the WordNet-augmented version:

import re
from nltk.corpus import wordnet

class RepeatReplacer(object):
  def __init__(self):
    self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
    self.repl = r'\1\2\3'

  def replace(self, word):
    # Stop as soon as WordNet recognizes the current form of the word.
    if wordnet.synsets(word):
      return word
    repl_word = self.repeat_regexp.sub(self.repl, word)

    if repl_word != word:
      return self.replace(repl_word)
    else:
      return repl_word

Now, goose will be found in WordNet, and no character replacement will take place. Also, oooooh will become ooh instead of oh because ooh is actually a word in WordNet, defined as an expression of admiration or pleasure.
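
If the WordNet corpus is not available, the same early-exit idea can be tried with a plain set of known words. This is only a sketch under that assumption: the VocabRepeatReplacer class and its known_words argument are hypothetical and are not part of replacers.py or NLTK, and the tiny word list stands in for a real dictionary.

```python
import re

class VocabRepeatReplacer(object):
    """Hypothetical variant: a set of known words stands in for WordNet."""

    def __init__(self, known_words):
        self.known_words = set(known_words)
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'

    def replace(self, word):
        # Stop as soon as the current form is a known word.
        if word in self.known_words:
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)

        if repl_word != word:
            return self.replace(repl_word)
        return repl_word

# Illustrative vocabulary only; a real one would be far larger.
replacer = VocabRepeatReplacer(['goose', 'ooh', 'love'])
goose = replacer.replace('goose')     # already known, returned unchanged
ooh = replacer.replace('oooooh')      # shrinks until it reaches 'ooh'
love = replacer.replace('looooove')   # shrinks until it reaches 'love'
```

As with the WordNet version, the result depends entirely on which words the lookup recognizes, so a word missing from the vocabulary is still reduced until it has no repeated characters.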

See also

Read the next recipe to learn how to correct misspellings. For more information on WordNet, refer to the WordNet recipes in Chapter 1, Tokenizing Text and WordNet Basics. We will also be using WordNet for antonym replacement later in this chapter.
