- Python 3 Text Processing with NLTK 3 Cookbook
- Jacob Perkins
Replacing words matching regular expressions
Now, we are going to get into the process of replacing words. If stemming and lemmatization are a kind of linguistic compression, then word replacement can be thought of as error correction or text normalization.
In this recipe, we will replace words based on regular expressions, with a focus on expanding contractions. Remember when we were tokenizing words in Chapter 1, Tokenizing Text and WordNet Basics, and it was clear that most tokenizers had trouble with contractions? This recipe aims to fix this by replacing contractions with their expanded forms, for example, by replacing "can't" with "cannot" or "would've" with "would have".
Getting ready
Understanding how this recipe works will require a basic knowledge of regular expressions and the re module. The key things to know are matching patterns and the re.sub() function.
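As a quick refresher (a minimal illustration, not code from the book), re.sub() replaces every non-overlapping match of a pattern, and \g<1> in the replacement string refers back to the first capture group in the pattern:

```python
import re

# Capture the word characters before 've, then reuse them in the
# replacement via the \g<1> group reference.
result = re.sub(r"(\w+)'ve", r"\g<1> have", "should've")
print(result)  # should have
```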
How to do it...
First, we need to define a number of replacement patterns. This will be a list of tuples, where the first element is the pattern to match and the second element is the replacement.
Next, we will create a RegexpReplacer class that will compile the patterns and provide a replace() method to substitute all the found patterns with their replacements.
The following code can be found in the replacers.py module in the book's code bundle and is meant to be imported, not typed into the console:
import re

replacement_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'ain\'t', 'is not'),
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
    (r'(\w+)\'d', r'\g<1> would')
]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s)
        return s
How it works...
Here is a simple usage example:
>>> from replacers import RegexpReplacer
>>> replacer = RegexpReplacer()
>>> replacer.replace("can't is a contraction")
'cannot is a contraction'
>>> replacer.replace("I should've done that thing I didn't do")
'I should have done that thing I did not do'
The RegexpReplacer.replace() method works by replacing every instance of a replacement pattern with its corresponding substitution. In replacement_patterns, we have defined tuples such as r'(\w+)\'ve' and r'\g<1> have'. The first element matches one or more word characters followed by 've. By grouping the word characters before 've in parentheses, a match group is captured and can be used in the substitution pattern with the \g<1> reference. So, we keep everything before 've, then replace 've with the word have. This is how should've becomes should have.
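One detail worth noting (my observation, not spelled out in the recipe): the patterns are applied in list order, so the specific r'won\'t' rule runs before the generic r'(\w+)n\'t' rule, which would otherwise turn "won't" into the nonsensical "wo not". A small self-contained check, using a hypothetical two-pattern subset of the list:

```python
import re

# A minimal subset of replacement_patterns, applied in order:
# the specific rule for "won't" must come before the generic n't rule.
patterns = [
    (re.compile(r'won\'t'), 'will not'),
    (re.compile(r'(\w+)n\'t'), r'\g<1> not'),
]

def replace(text):
    for pattern, repl in patterns:
        text = re.sub(pattern, repl, text)
    return text

print(replace("won't"))  # will not  (specific rule wins)
print(replace("don't"))  # do not    (generic rule applies)
```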
There's more...
This replacement technique can work with any kind of regular expression, not just contractions. So, you can replace any occurrence of & with and, or eliminate all occurrences of - by replacing it with an empty string. The RegexpReplacer class can take any list of replacement patterns for whatever purpose.
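For instance (an illustrative pattern list of my own, not from the book's code bundle), passing a custom list to the same class shape normalizes ampersands and strips hyphens:

```python
import re

# Custom replacement patterns: expand '&' to 'and', delete hyphens.
custom_patterns = [
    (r'&', 'and'),
    (r'-', ''),
]

# Mirrors the RegexpReplacer class from replacers.py.
class RegexpReplacer(object):
    def __init__(self, patterns):
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        for pattern, repl in self.patterns:
            text = re.sub(pattern, repl, text)
        return text

replacer = RegexpReplacer(custom_patterns)
print(replacer.replace("rock & roll"))  # rock and roll
print(replacer.replace("e-mail"))      # email
```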
Replacement before tokenization
Let's try using the RegexpReplacer class as a preliminary step before tokenization:
>>> from nltk.tokenize import word_tokenize
>>> from replacers import RegexpReplacer
>>> replacer = RegexpReplacer()
>>> word_tokenize("can't is a contraction")
['ca', "n't", 'is', 'a', 'contraction']
>>> word_tokenize(replacer.replace("can't is a contraction"))
['can', 'not', 'is', 'a', 'contraction']
Much better! By expanding the contractions before tokenizing, the tokenizer produces cleaner results. Cleaning up text before processing is a common pattern in natural language processing.
See also
For more information on tokenization, see the first three recipes in Chapter 1, Tokenizing Text and WordNet Basics. For more replacement techniques, continue reading the rest of this chapter.