- Haskell Data Analysis Cookbook
- Nishant Shukla
- 2021-12-08 12:43:35
Validating records by matching regular expressions
A regular expression is a language for matching patterns in a string. Our Haskell code can process a regular expression to examine a text and tell us whether or not it matches the rules described by the expression. Regular expression matching can be used to validate or identify a pattern in the text.
In this recipe, we will read a corpus of English text to find candidate full names in a sea of words. Full names usually consist of two adjacent words that each start with a capital letter. We use this heuristic to extract possible names from an article.
Getting ready
Create an input.txt file with some text. In this example, we use a snippet from a New York Times article on dinosaurs (http://www.nytimes.com/2013/12/17/science/earth/outsider-challenges-papers-on-growth-of-dinosaurs.html):
Other co-authors of Dr. Erickson's include Mark Norell, chairman of paleontology at the American Museum of Natural History; Philip Currie, a professor of dinosaur paleobiology at the University of Alberta; and Peter Makovicky, associate curator of paleontology at the Field Museum in Chicago.
How to do it...
Create a new file, which we will call Main.hs, and perform the following steps:
- Import the regular expression library:
    import Text.Regex.Posix ((=~))
- Match a string against a regular expression to detect words that look like names:
    looksLikeName :: String -> Bool
    looksLikeName str = str =~ "^[A-Z][a-z]{1,30}$" :: Bool
- Create functions that remove unnecessary punctuation and special symbols. We will use the same functions defined in the previous recipe entitled Ignoring punctuation and specific characters:
    punctuations = [ '!', '"', '#', '$', '%'
                   , '(', ')', '.', ',', '?']

    removePunctuation = filter (`notElem` punctuations)

    specialSymbols = ['/', '-']

    replaceSpecialSymbols = map $ (\c -> if c `elem` specialSymbols then ' ' else c)
- Pair adjacent words together and form a list of possible full names:
    createTuples (x:y:xs) = (x ++ " " ++ y) : createTuples (y:xs)
    createTuples _ = []
- Retrieve the input and find possible names from a corpus of text:
    main :: IO ()
    main = do
      input <- readFile "input.txt"
      let cleanInput = (removePunctuation.replaceSpecialSymbols) input
      let wordPairs = createTuples $ words cleanInput
      let possibleNames = filter (all looksLikeName . words) wordPairs
      print possibleNames
- The resulting output after running the code is as follows:
    $ runhaskell Main.hs
    ["Dr Erickson","Mark Norell","American Museum","Natural History","History Philip","Philip Currie","Peter Makovicky","Field Museum"]
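To see why overlapping candidates such as "Natural History" and "History Philip" both show up, it can help to run the pairing step on its own. The following is a minimal sketch that uses only the Prelude. createTuples is copied from the recipe (with a type signature added); createTuples' and the sample word list are illustrative assumptions, not part of the original recipe.

```haskell
-- Pair every word with its right-hand neighbour (same definition as the recipe).
createTuples :: [String] -> [String]
createTuples (x:y:xs) = (x ++ " " ++ y) : createTuples (y:xs)
createTuples _        = []

-- A hypothetical alternative written with zipWith, shown only for comparison.
createTuples' :: [String] -> [String]
createTuples' ws = zipWith (\a b -> a ++ " " ++ b) ws (drop 1 ws)

main :: IO ()
main = do
  let sample = words "Natural History Philip Currie"  -- made-up sample input
  print (createTuples sample)
  -- Both formulations should agree on the sample.
  print (createTuples' sample == createTuples sample)
```

A list of n words yields n - 1 pairs, so every interior word appears in two candidates. That overlap is what produces pairs that straddle two separate names.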
How it works...
The =~ function takes a string and a regular expression and returns a result whose type depends on the context; here, the :: Bool annotation asks only whether the string matches. In this recipe, the ^[A-Z][a-z]{1,30}$ regular expression matches words that consist of one capital letter followed by 1 to 30 lowercase letters, that is, words between 2 and 31 letters long.
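The way the requested result type steers =~ can be tried out in a few lines. This is a small sketch that assumes the regex-posix package is installed; namePattern is a name introduced here for the recipe's pattern, and the sample strings are made up for illustration.

```haskell
import Text.Regex.Posix ((=~))

-- The pattern from the recipe, bound to a (hypothetical) name for reuse.
namePattern :: String
namePattern = "^[A-Z][a-z]{1,30}$"

main :: IO ()
main = do
  -- Asking for a Bool answers "does the string match?"
  print ("Norell" =~ namePattern :: Bool)  -- True
  print ("norell" =~ namePattern :: Bool)  -- False: no leading capital
  print ("A" =~ namePattern :: Bool)       -- False: needs at least one lowercase letter
  -- Asking for a String returns the first matched substring instead.
  print ("Mark Norell" =~ "[A-Z][a-z]+" :: String)
  -- Asking for a triple returns the text before, the match, and the text after.
  print ("Dr Erickson" =~ "[A-Z][a-z]{2,}" :: (String, String, String))
```

Because the anchors ^ and $ tie the pattern to the whole string, the recipe applies looksLikeName to individual words rather than to a full sentence.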
To judge the usefulness of the algorithm presented in this recipe, we introduce two metrics of relevance: precision and recall. Precision is the percentage of retrieved items that are relevant. Recall is the percentage of relevant items that are retrieved.
Out of a total of 45 words in the input.txt file, four correct names are produced and a total of eight candidates are retrieved. This gives a precision of 4/8 = 50 percent and a recall of 4/4 = 100 percent, which is not bad at all for a simple regular expression trick.
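The two metrics can also be computed directly from the candidate list. The sketch below uses only base libraries. The retrieved list is the output shown earlier; which four candidates count as correct names is an assumption made here for illustration, since the text gives only the count. The helper names percent, precision, and recall are likewise hypothetical.

```haskell
import Data.List (intersect)

-- Ratio as a percentage; returns 0 when the denominator is 0.
percent :: Int -> Int -> Double
percent _ 0 = 0
percent n d = 100 * fromIntegral n / fromIntegral d

-- Percentage of retrieved items that are relevant.
precision :: [String] -> [String] -> Double
precision relevant retrieved =
  percent (length (retrieved `intersect` relevant)) (length retrieved)

-- Percentage of relevant items that are retrieved.
recall :: [String] -> [String] -> Double
recall relevant retrieved =
  percent (length (relevant `intersect` retrieved)) (length relevant)

-- The eight candidates printed by the recipe.
retrieved :: [String]
retrieved =
  [ "Dr Erickson", "Mark Norell", "American Museum", "Natural History"
  , "History Philip", "Philip Currie", "Peter Makovicky", "Field Museum" ]

-- ASSUMPTION: these are taken to be the four correct names.
relevant :: [String]
relevant = ["Dr Erickson", "Mark Norell", "Philip Currie", "Peter Makovicky"]

main :: IO ()
main = do
  print (precision relevant retrieved)  -- 50.0
  print (recall relevant retrieved)     -- 100.0
```

Tightening the heuristic, for example by rejecting pairs that span a punctuation mark, would be one way to trade some recall for higher precision.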
See also
Instead of running regular expressions on a string, we can pass the string through a lexical analyzer. The next recipe, entitled Lexing and parsing an e-mail address, covers this in detail.