- Haskell Data Analysis Cookbook
- Nishant Shukla
- 2021-12-08 12:43:35
Ignoring punctuation and specific characters
In natural language processing, uninformative words, called stop words, are often filtered out for easier handling. Likewise, when computing word frequencies or extracting sentiment data from a corpus, punctuation or special characters might need to be ignored. This recipe demonstrates how to remove these specific characters from the body of a text.
How to do it...
There are no imports necessary. Create a new file, which we will call Main.hs, and perform the following steps:
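- Implement main and define a string called quote. A backslash at the end of a line paired with a backslash at the start of the next line forms a string gap, which lets the literal span several lines. The gap itself adds no whitespace, so each line keeps a trailing space before the closing backslash:

    main :: IO ()
    main = do
      let quote = "Deep Blue plays very good chess-so what? \
                  \Does that tell you something about how we play chess? \
                  \No. Does it tell you about how Kasparov envisions, \
                  \understands a chessboard? (Douglas Hofstadter)"
      putStrLn $ (removePunctuation . replaceSpecialSymbols) quote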
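- Remove all punctuation marks, and replace all special symbols with a space:

    punctuations :: [Char]
    punctuations = ['!', '"', '#', '$', '%', '(', ')', '.', ',', '?']

    -- Drop every character that appears in punctuations.
    removePunctuation :: String -> String
    removePunctuation = filter (`notElem` punctuations)

    specialSymbols :: [Char]
    specialSymbols = ['/', '-']

    -- Turn each special symbol into a space; keep all other characters.
    replaceSpecialSymbols :: String -> String
    replaceSpecialSymbols = map (\c -> if c `elem` specialSymbols then ' ' else c)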
- By running the code, we find that all punctuation and special characters are removed, which makes the text easier to process:

    $ runhaskell Main.hs
    Deep Blue plays very good chess so what Does that tell you something about how we play chess No Does it tell you about how Kasparov envisions understands a chessboard Douglas Hofstadter
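The introduction mentions word frequencies as one reason to strip punctuation. As a rough sketch of that use case, the module below reuses the recipe's character lists and counts words in the cleaned text. The names clean and wordFrequencies are hypothetical helpers, not part of the recipe, and lower-casing the words is an added assumption. Data.Map.Strict comes from the containers package that ships with GHC.

```haskell
import qualified Data.Map.Strict as Map
import Data.Char (toLower)

-- Same character lists as in the recipe.
punctuations :: [Char]
punctuations = ['!', '"', '#', '$', '%', '(', ')', '.', ',', '?']

specialSymbols :: [Char]
specialSymbols = ['/', '-']

-- Apply both recipe steps: special symbols become spaces first,
-- then punctuation is dropped.
clean :: String -> String
clean = filter (`notElem` punctuations)
      . map (\c -> if c `elem` specialSymbols then ' ' else c)

-- Count how often each lower-cased word occurs in the cleaned text.
-- (Hypothetical helper; lower-casing is an assumption of this sketch.)
wordFrequencies :: String -> Map.Map String Int
wordFrequencies text =
  Map.fromListWith (+) [(map toLower w, 1) | w <- words (clean text)]
```

Loaded in GHCi, Map.toList (wordFrequencies "Chess? Chess-so what!") evaluates to [("chess",2),("so",1),("what",1)].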
There's more...
For more powerful control, we can install MissingH, a utility library that is very helpful for dealing with strings:

    $ cabal install MissingH
It provides a replace function, exported from Data.String.Utils, that takes three arguments and produces a result as follows:

    Prelude> import Data.String.Utils
    Prelude Data.String.Utils> replace "hello" "goodbye" "hello world!"
    "goodbye world!"
It replaces all occurrences of the first string with the second string in the third argument. We can also compose multiple replace functions:

    Prelude Data.String.Utils> ((replace "," "") . (replace "!" "")) "hello, world!"
    "hello world"
By folding the composition (.) function over a list of these replace functions, we can generalize the replace function to an arbitrary list of tokens:

    Prelude Data.String.Utils> (foldr (.) id $ map (flip replace "") [",", "!"]) "hello, world!"
    "hello world"
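To see why the fold works without installing anything, the sketch below swaps MissingH's replace for a small hand-written stand-in built on base only. The names replaceSub and removeAll are hypothetical, and replaceSub is only assumed to behave like replace for the non-empty search strings used here.

```haskell
import Data.List (isPrefixOf)

-- Stand-in for MissingH's replace, written with base only so the sketch
-- runs without extra packages. An empty search string leaves the input as-is.
replaceSub :: String -> String -> String -> String
replaceSub old new = go
  where
    go [] = []
    go str@(c:cs)
      | null old             = str
      | old `isPrefixOf` str = new ++ go (drop (length old) str)
      | otherwise            = c : go cs

-- foldr (.) id [f, g, h] is f . g . h . id, so every replacement in the
-- list is applied to the input, rightmost first.
removeAll :: [String] -> String -> String
removeAll tokens = foldr (.) id (map (\t -> replaceSub t "") tokens)
```

Loaded in GHCi, removeAll [",", "!"] "hello, world!" evaluates to "hello world", matching the session above. The id at the end of the fold means an empty token list returns the input unchanged.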
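The list of punctuation marks can now be arbitrarily long. We can modify our recipe to use our new and more generalized functions, importing replace from MissingH's Data.String.Utils at the top of Main.hs:

    import Data.String.Utils (replace)

    removePunctuation :: String -> String
    removePunctuation = foldr (.) id $ map (flip replace "")
      ["!", "\"", "#", "$", "%", "(", ")", ".", ",", "?"]

    replaceSpecialSymbols :: String -> String
    replaceSpecialSymbols = foldr (.) id $ map (flip replace " ") ["/", "-"]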
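The folded version walks the text once per token. When every token is a single character, as in this recipe, one traversal with a lookup table is enough. This is an alternative sketch rather than the recipe's method: rules and cleanOnePass are hypothetical names, and the sketch uses only the Prelude.

```haskell
-- Hypothetical rule table: each character maps to its replacement text.
-- Punctuation maps to the empty string, special symbols to a single space.
rules :: [(Char, String)]
rules = [(c, "") | c <- "!\"#$%().,?"] ++ [(c, " ") | c <- "/-"]

-- Rewrite every character in one traversal; characters without a rule are kept.
cleanOnePass :: String -> String
cleanOnePass = concatMap (\c -> maybe [c] id (lookup c rules))
```

Loaded in GHCi, cleanOnePass "chess-so what? (ok)" evaluates to "chess so what ok". The trade-off is that this table handles only single characters, whereas the replace-based fold also accepts multi-character tokens.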