- Python 3 Text Processing with NLTK 3 Cookbook
- Jacob Perkins
- 2021-09-03 09:45:37
Spelling correction with Enchant
Replacing repeating characters is actually an extreme form of spelling correction. In this recipe, we will take on the less extreme case of correcting minor spelling issues using Enchant—a spelling correction API.
Getting ready
You will need to install Enchant and a dictionary for it to use. Enchant is an offshoot of the AbiWord open source word processor, and more information on it can be found at http://www.abisource.com/projects/enchant/.
For dictionaries, Aspell is a good open source spellchecker and dictionary that can be found at http://aspell.net/.
Finally, you will need the PyEnchant library, which can be found at the following link: http://pythonhosted.org/pyenchant/
You should be able to install it with the easy_install command that comes with Python setuptools, for example by typing sudo easy_install pyenchant on Linux or Unix. On a Mac, PyEnchant may be difficult to install. If you have difficulties, consult http://pythonhosted.org/pyenchant/download.html.
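Before moving on, it can help to confirm that Python can actually find the PyEnchant module. The following is a minimal sketch that uses only the standard library; the helper name is_installed is our own and is not part of PyEnchant:

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# 'enchant' is the module name that PyEnchant provides.
if is_installed('enchant'):
    print('PyEnchant is available')
else:
    print('PyEnchant was not found; see the installation notes above')
```

This only checks that the module can be located. It does not verify that the underlying Enchant library or any dictionaries are present.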
How to do it...
We will create a new class called SpellingReplacer in replacers.py, and this time, the replace() method will check Enchant to see whether the word is valid. If not, we will look up the suggested alternatives and return the first suggestion, provided that nltk.metrics.edit_distance() puts it within max_dist of the original word:
import enchant
from nltk.metrics import edit_distance

class SpellingReplacer(object):
    def __init__(self, dict_name='en', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

    def replace(self, word):
        # A word the dictionary already accepts needs no correction.
        if self.spell_dict.check(word):
            return word

        suggestions = self.spell_dict.suggest(word)

        # Accept only the top suggestion, and only if it is close enough.
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word
The preceding class can be used to correct English spellings, as follows:
>>> from replacers import SpellingReplacer
>>> replacer = SpellingReplacer()
>>> replacer.replace('cookbok')
'cookbook'
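The replace() method handles one word at a time, so correcting a tokenized sentence means calling it once per token. Here is a minimal sketch of that loop. The correct_tokens() helper and the LookupReplacer class are our own inventions for illustration; LookupReplacer is a stand-in with the same replace() interface so that the example runs without Enchant installed:

```python
def correct_tokens(tokens, replacer):
    """Apply replacer.replace() to every token, preserving order."""
    return [replacer.replace(token) for token in tokens]


class LookupReplacer(object):
    """Hypothetical stand-in for SpellingReplacer, backed by a fixed table."""

    def __init__(self, corrections):
        self.corrections = corrections

    def replace(self, word):
        # Unknown words are returned unchanged, as SpellingReplacer does.
        return self.corrections.get(word, word)


stand_in = LookupReplacer({'cookbok': 'cookbook'})
print(correct_tokens(['a', 'python', 'cookbok'], stand_in))
```

With PyEnchant installed, you could pass a SpellingReplacer instance in place of the stand-in, since correct_tokens() only relies on the replace() method.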
How it works...
The SpellingReplacer class starts by creating a reference to an Enchant dictionary. Then, in the replace() method, it first checks whether the given word is present in the dictionary. If it is, no spelling correction is necessary and the word is returned. If the word is not found, it looks up a list of suggestions and returns the first suggestion, as long as its edit distance is less than or equal to max_dist; otherwise, the original word is returned unchanged. The edit distance is the number of single-character insertions, deletions, or substitutions necessary to transform the given word into the suggested word. The max_dist value then acts as a constraint on the Enchant suggest function to ensure that no unlikely replacement words are returned. Here is an example showing all the suggestions for languege, a misspelling of language:
>>> import enchant
>>> d = enchant.Dict('en')
>>> d.suggest('languege')
['language', 'languages', 'languor', "language's"]
The correct suggestion, language, is only one edit away from the misspelling, while the other suggestions are farther away; languor, for example, is three edits away. You can try this yourself with the following code:
>>> from nltk.metrics import edit_distance
>>> edit_distance('languege', 'language')
1
>>> edit_distance('languege', 'languor')
3
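To see what the edit distance is counting, here is a minimal pure-Python sketch of the Levenshtein distance, which we assume matches the default behaviour of nltk.metrics.edit_distance() (unit cost for each insertion, deletion, and substitution). The function name levenshtein is our own, and the sketch is meant as an illustration rather than a replacement for the NLTK function:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    # previous[j] is the distance between the prefix of a processed so far
    # and the first j characters of b; start with the empty prefix of a.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters into '' takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


# Rank the suggestions shown above by their distance from the misspelling.
suggestions = ['language', 'languages', 'languor', "language's"]
for word in sorted(suggestions, key=lambda s: levenshtein('languege', s)):
    print(word, levenshtein('languege', word))
```

Under these assumptions, language comes out at a distance of 1, languages at 2, and both languor and language's at 3.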
There's more...
You can use language dictionaries other than en, such as en_GB, assuming the dictionary has already been installed. To check which other languages are available, use enchant.list_languages(); the result depends on which dictionaries are installed on your system:
>>> import enchant
>>> enchant.list_languages()
['en', 'en_CA', 'en_GB', 'en_US']
Tip
If you try to use a dictionary that doesn't exist, you will get enchant.DictNotFoundError. You can first check whether the dictionary exists using enchant.dict_exists(), which will return True if the named dictionary exists, or False otherwise.
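One way to use this check is to walk through a list of preferred dictionaries and take the first one that is installed. The following is a minimal sketch; the first_available() helper is our own, and a plain set of language tags stands in for enchant.dict_exists() so that the example runs without Enchant:

```python
def first_available(tags, exists):
    """Return the first tag for which exists(tag) is true, or None."""
    for tag in tags:
        if exists(tag):
            return tag
    return None


# Stand-in for the installed dictionaries (an assumption for this sketch).
installed = {'en', 'en_US'}
print(first_available(['en_GB', 'en_US', 'en'], installed.__contains__))

# With PyEnchant installed, pass the real check instead:
#   import enchant
#   tag = first_available(['en_GB', 'en_US', 'en'], enchant.dict_exists)
```

Returning None when nothing matches leaves it to the caller to decide whether to raise an error or skip spelling correction.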
The en_GB dictionary
Always ensure that you use the correct dictionary for whichever language you are performing spelling correction on. The en_US dictionary can give different results from en_GB, as with the word theater. This is the American English spelling, whereas the British English spelling is theatre:
>>> import enchant
>>> dUS = enchant.Dict('en_US')
>>> dUS.check('theater')
True
>>> dGB = enchant.Dict('en_GB')
>>> dGB.check('theater')
False
>>> from replacers import SpellingReplacer
>>> us_replacer = SpellingReplacer('en_US')
>>> us_replacer.replace('theater')
'theater'
>>> gb_replacer = SpellingReplacer('en_GB')
>>> gb_replacer.replace('theater')
'theatre'
Personal word lists
Enchant also supports personal word lists. These can be combined with an existing dictionary, allowing you to augment the dictionary with your own words. So, let's say you had a file named mywords.txt that had nltk on one line. You could then create a dictionary augmented with your personal word list as follows:
>>> import enchant
>>> d = enchant.Dict('en_US')
>>> d.check('nltk')
False
>>> d = enchant.DictWithPWL('en_US', 'mywords.txt')
>>> d.check('nltk')
True
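If you would rather build the word list from Python than edit it by hand, the file only needs one word per line. Here is a minimal sketch; the write_word_list() helper is our own, and a temporary directory is used so that the example does not touch your real files:

```python
import os
import tempfile


def write_word_list(path, words):
    """Write one word per line, the layout described for mywords.txt."""
    with open(path, 'w', encoding='utf-8') as word_file:
        for word in words:
            word_file.write(word + '\n')


with tempfile.TemporaryDirectory() as tmp_dir:
    pwl_path = os.path.join(tmp_dir, 'mywords.txt')
    write_word_list(pwl_path, ['nltk'])

    # Read the file back to confirm its contents.
    with open(pwl_path, encoding='utf-8') as word_file:
        print(word_file.read().splitlines())

    # With PyEnchant installed, the file could then be used as shown above:
    #   d = enchant.DictWithPWL('en_US', pwl_path)
```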
To use an augmented dictionary with our SpellingReplacer class, we can create a subclass in replacers.py that takes an existing spelling dictionary:
class CustomSpellingReplacer(SpellingReplacer):
    def __init__(self, spell_dict, max_dist=2):
        self.spell_dict = spell_dict
        self.max_dist = max_dist
This CustomSpellingReplacer class will not replace any words that you put into mywords.txt:
>>> import enchant
>>> from replacers import CustomSpellingReplacer
>>> d = enchant.DictWithPWL('en_US', 'mywords.txt')
>>> replacer = CustomSpellingReplacer(d)
>>> replacer.replace('nltk')
'nltk'
See also
The previous recipe covered an extreme form of spelling correction by replacing repeating characters. You can also perform spelling correction by simple word replacement as discussed in the next recipe.