Spelling correction with Enchant

Replacing repeating characters is actually an extreme form of spelling correction. In this recipe, we will take on the less extreme case of correcting minor spelling issues using Enchant, a spelling correction API.

Getting ready

You will need to install Enchant and a dictionary for it to use. Enchant is an offshoot of the AbiWord open source word processor, and more information on it can be found at http://www.abisource.com/projects/enchant/.

For dictionaries, Aspell is a good open source spellchecker and dictionary that can be found at http://aspell.net/.

Finally, you will need the PyEnchant library, which can be found at the following link: http://pythonhosted.org/pyenchant/

You should be able to install it with pip, such as by typing pip install pyenchant. Older setups used the easy_install command that came with Python setuptools, such as sudo easy_install pyenchant on Linux or Unix. On a Mac machine, PyEnchant may be difficult to install. If you have difficulties, consult http://pythonhosted.org/pyenchant/download.html.

How to do it...

We will create a new class called SpellingReplacer in replacers.py, and this time, the replace() method will check Enchant to see whether the word is valid. If not, we will look up the suggested alternatives and return the best match using nltk.metrics.edit_distance():

import enchant
from nltk.metrics import edit_distance

class SpellingReplacer(object):
  def __init__(self, dict_name='en', max_dist=2):
    self.spell_dict = enchant.Dict(dict_name)
    self.max_dist = max_dist

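  def replace(self, word):
    # A word the dictionary already knows needs no correction.
    if self.spell_dict.check(word):
      return word
    suggestions = self.spell_dict.suggest(word)

    # Accept only the top suggestion, and only if it is close enough.
    if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
      return suggestions[0]
    else:
      return word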

The preceding class can be used to correct English spellings, as follows:

>>> from replacers import SpellingReplacer
>>> replacer = SpellingReplacer()
>>> replacer.replace('cookbok')
'cookbook'
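
In practice, a replacer is usually applied to a whole list of tokens rather than to a single word. The following is a minimal sketch of that idea, not part of the recipe itself: correct_tokens() and LookupReplacer are hypothetical names introduced for illustration, and LookupReplacer is only a stand-in with the same replace() method as SpellingReplacer so that the example runs without Enchant installed. Non-alphabetic tokens such as punctuation are passed through untouched, on the assumption that they should not be sent to the spellchecker.

```python
def correct_tokens(tokens, replacer):
    """Apply replacer.replace() to alphabetic tokens; leave the rest as-is."""
    return [replacer.replace(tok) if tok.isalpha() else tok for tok in tokens]


class LookupReplacer:
    """Hypothetical stand-in for SpellingReplacer, backed by a plain dict."""

    def __init__(self, fixes):
        self.fixes = fixes

    def replace(self, word):
        # Return the known fix if there is one, otherwise the word unchanged.
        return self.fixes.get(word, word)


demo_replacer = LookupReplacer({'cookbok': 'cookbook'})
corrected = correct_tokens(['a', 'cookbok', ',', 'really', '!'], demo_replacer)
print(corrected)
```

With Enchant available, passing SpellingReplacer() instead of the stand-in should work the same way, since only the replace() method is used.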

How it works...

The SpellingReplacer class starts by creating a reference to an Enchant dictionary. Then, in the replace() method, it first checks whether the given word is present in the dictionary. If it is, no spelling correction is necessary and the word is returned unchanged. If the word is not found, it looks up a list of suggestions and returns the first one, as long as its edit distance from the original word is less than or equal to max_dist; otherwise, the original word is returned.

The edit distance is the minimum number of single-character changes (insertions, deletions, or substitutions) needed to transform the given word into the suggested word. The max_dist value therefore acts as a constraint on the Enchant suggest function, ensuring that no unlikely replacement words are returned. Here is an example showing all the suggestions for languege, a misspelling of language:

>>> import enchant
>>> d = enchant.Dict('en')
>>> d.suggest('languege')
['language', 'languages', 'languor', "language's"]

Only the first suggestion, language, is a single character change away from the misspelling. The other suggestions are further off; languor, for example, has an edit distance of three. You can try this yourself with the following code:

>>> from nltk.metrics import edit_distance
>>> edit_distance('languege', 'language')
1
>>> edit_distance('languege', 'languor')
3
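
To make the idea of edit distance concrete, here is a minimal pure-Python sketch of the Levenshtein distance, which is what nltk.metrics.edit_distance() computes with its default arguments. The function name levenshtein() is an assumption made for this illustration; it is not part of NLTK or Enchant. The sketch then scores every suggestion from the list shown earlier against the misspelling.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters into '' takes i deletions
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute or match
            ))
        previous = current
    return previous[-1]


suggestions = ['language', 'languages', 'languor', "language's"]
distances = {s: levenshtein('languege', s) for s in suggestions}
print(distances)
# {'language': 1, 'languages': 2, 'languor': 3, "language's": 3}
```

Because SpellingReplacer only ever considers the first suggestion, the larger distances of the later entries never come into play; max_dist simply guards against a poor first suggestion.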

There's more...

You can use language dictionaries other than en, such as en_GB, assuming the dictionary has already been installed. To check which other languages are available, use enchant.list_languages():

>>> enchant.list_languages()
['en', 'en_CA', 'en_GB', 'en_US']

Tip

If you try to use a dictionary that doesn't exist, you will get enchant.DictNotFoundError. You can first check whether the dictionary exists using enchant.dict_exists(), which will return True if the named dictionary exists, or False otherwise.
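
One way to apply this check is to choose a dictionary name before constructing the replacer. The following is a rough sketch under stated assumptions: pick_dict_name() is a hypothetical helper, and the existence check is passed in as a function so that the example can run without Enchant. With PyEnchant installed, you would pass enchant.dict_exists as that function.

```python
def pick_dict_name(preferred, dict_exists, fallback='en'):
    """Return preferred if dict_exists(preferred) is true, else fallback.

    dict_exists is any callable that behaves like enchant.dict_exists.
    Raises LookupError if neither dictionary is available.
    """
    for name in (preferred, fallback):
        if dict_exists(name):
            return name
    raise LookupError('no dictionary found for %r or %r' % (preferred, fallback))


# Hypothetical set of installed dictionaries, standing in for Enchant.
installed = {'en', 'en_US'}
chosen = pick_dict_name('en_GB', installed.__contains__)
print(chosen)  # en

# With PyEnchant installed, the call would look roughly like this:
#   import enchant
#   replacer = SpellingReplacer(pick_dict_name('en_GB', enchant.dict_exists))
```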

The en_GB dictionary

Always ensure that you use the correct dictionary for whichever language you are performing spelling correction on. The en_US dictionary can give you different results than en_GB, such as for the word theater, which is the American English spelling, whereas the British English spelling is theatre:

>>> import enchant
>>> dUS = enchant.Dict('en_US')
>>> dUS.check('theater')
True
>>> dGB = enchant.Dict('en_GB')
>>> dGB.check('theater')
False
>>> from replacers import SpellingReplacer
>>> us_replacer = SpellingReplacer('en_US')
>>> us_replacer.replace('theater')
'theater'
>>> gb_replacer = SpellingReplacer('en_GB')
>>> gb_replacer.replace('theater')
'theatre'

Personal word lists

Enchant also supports personal word lists. These can be combined with an existing dictionary, allowing you to augment the dictionary with your own words. So, let's say you had a file named mywords.txt that had nltk on one line. You could then create a dictionary augmented with your personal word list as follows:

>>> d = enchant.Dict('en_US')
>>> d.check('nltk')
False
>>> d = enchant.DictWithPWL('en_US', 'mywords.txt')
>>> d.check('nltk')
True
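
If you do not yet have a word list file, it can be generated from Python. The sketch below assumes the personal word list is a plain text file with one word per line, as described above; write_word_list() is a hypothetical helper name, and the example writes into a temporary directory only so that it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def write_word_list(path, words):
    """Write the given words to path, one per line, and return them.

    Blank entries are dropped and duplicates are removed.
    """
    unique = sorted({w.strip() for w in words if w.strip()})
    Path(path).write_text('\n'.join(unique) + '\n', encoding='utf-8')
    return unique


with tempfile.TemporaryDirectory() as tmp:
    pwl_path = Path(tmp) / 'mywords.txt'
    saved = write_word_list(pwl_path, ['nltk', 'pyenchant', 'nltk', ' '])
    contents = pwl_path.read_text(encoding='utf-8')

print(saved)  # ['nltk', 'pyenchant']
```

In real use, you would write to a permanent location such as mywords.txt and pass that path to enchant.DictWithPWL().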

To use an augmented dictionary with our SpellingReplacer class, we can create a subclass in replacers.py that takes an existing spelling dictionary:

class CustomSpellingReplacer(SpellingReplacer):
  def __init__(self, spell_dict, max_dist=2):
    self.spell_dict = spell_dict
    self.max_dist = max_dist

This CustomSpellingReplacer class will not replace any words that you put into mywords.txt:

>>> from replacers import CustomSpellingReplacer
>>> d = enchant.DictWithPWL('en_US', 'mywords.txt')
>>> replacer = CustomSpellingReplacer(d)
>>> replacer.replace('nltk')
'nltk'

See also

The previous recipe covered an extreme form of spelling correction by replacing repeating characters. You can also perform spelling correction by simple word replacement as discussed in the next recipe.
