- Haskell Data Analysis Cookbook
- Nishant Shukla
- 685字
- 2021-12-08 12:43:39
Searching a string using the Boyer-Moore-Horspool algorithm
When searching for a pattern in a string, we refer to the pattern as the needle and the whole corpus as the haystack. The Horspool string search algorithm implemented in this recipe performs well for almost all pattern lengths and alphabet sizes, but is ideal for large alphabet sizes and large needle sizes. Empirical benchmarks can be found by navigating to the following URL:
http://orion.lcg.ufrj.br/Dr.Dobbs/books/book5/chap10.htm
By preprocessing the query, the algorithm is able to efficiently skip redundant comparisons. In this recipe, we will implement a simplified version called Horspool's algorithm, which achieves the same average best case as the Boyer-Moore algorithm, benefits from having a smaller overhead cost, but may in very rare circumstances suffer the same worst-case running time as the naive search when the algorithm performs too many matches. The Boyer-Moore algorithms should only be used if the extra prepossessing time and space required are acceptable.
How to do it...
- We will be using a couple
Data.Map
functions as follows:import Data.Map (fromList, (!), findWithDefault)
- For convenience, define tuples representing character indices as follows:
indexMap xs = fromList $ zip [0..] xs revIndexMap xs = fromList $ zip (reverse xs) [0..]
- Define the search algorithm to use the recursive
bmh'
function as follows:bmh :: Ord a => [a] -> [a] -> Maybe Int bmh pat xs = bmh' (length pat - 1) (reverse pat) xs pat
- Recursively find the pattern in the current index until the index moves past the length of the string, as shown in the following code snippet:
bmh' :: Ord a => Int -> [a] -> [a] -> [a] -> Maybe Int bmh' n [] xs pat = Just (n + 1) bmh' n (p:ps) xs pat | n >= length xs = Nothing | p == (indexMap xs) ! n = bmh' (n - 1) ps xs pat | otherwise = bmh' (n + findWithDefault (length pat) (sMap ! n) pMap) (reverse pat) xs pat where sMap = indexMap xs pMap = revIndexMap pat
- Test out the function as follows:
main :: IO () main = print $ bmh "Wor" "Hello World"
- The following printed output displays the first index of the matching substring:
Just 6
How it works...
This algorithm compares the desired pattern to a moving window through the text. The efficiency comes from how quickly the moving window shifts left to right through this text. In the Horspool algorithm, the query is compared to the current window character by character from right to left, and the window shifts by the size of the query in the best case.
Another version of the Horspool algorithm designed by Remco Niemeijer can be found at http://bonsaicode.wordpress.com/2009/08/29/programming-praxis-string-search-boyer-moore.
There's more...
The Boyer-Moore algorithm ensures a faster worst case, but also endures slightly more initial overhead. Refer to the following commands to use the Boyer-Moore algorithm from the Data.ByteString.Search
package:
$ cabal install stringsearch
Import the following libraries:
import Data.ByteString.Search import qualified Data.ByteString.Char8 as C
Feed two ByteString
types to the indices
function to run the search as follows:
main = print $ indices (C.pack "abc") (C.pack "bdeabcdabc")
This will print out the following indices:
[3,7]
By benchmarking the performance of this library, we can see that longer search needles really improve runtime. We modify the code to search through a huge corpus of words from a file called big.txt
to find multiple needles. Here, we use the deepseq
function to force evaluation, so Haskell's lazy nature won't ignore it, as shown in the following code:
shortNeedles = ["abc", "cba"] longNeedles = ["very big words", "some long string"] main = do corpus <- BS.readFile "big.txt" map (\x -> (not.null) (indices x corpus)) shortNeedles 'deepseq' return ()
We can compile this code with special runtime system (RTS) control for easy profiling as follows:
$ ghc -O2 Main.hs –rtsopts $ ./Main +RTS -sstder
We use the text from norvig.com/big.txt
as our corpus. Searching for 25 long needles takes just about 0.06 seconds; however, searching for 25 short needles takes a sluggish 0.19 seconds.
See also
For another efficient string searching algorithm, refer to the Searching a string using the Rabin-Karp algorithm recipe.
- Vue 3移動(dòng)Web開(kāi)發(fā)與性能調(diào)優(yōu)實(shí)戰(zhàn)
- 案例式C語(yǔ)言程序設(shè)計(jì)
- JavaScript全程指南
- Docker技術(shù)入門(mén)與實(shí)戰(zhàn)(第3版)
- Cross-platform Desktop Application Development:Electron,Node,NW.js,and React
- AutoCAD VBA參數(shù)化繪圖程序開(kāi)發(fā)與實(shí)戰(zhàn)編碼
- 大學(xué)計(jì)算機(jī)基礎(chǔ)實(shí)驗(yàn)指導(dǎo)
- Yii Project Blueprints
- 速學(xué)Python:程序設(shè)計(jì)從入門(mén)到進(jìn)階
- Visual Foxpro 9.0數(shù)據(jù)庫(kù)程序設(shè)計(jì)教程
- 21天學(xué)通C++(第5版)
- 機(jī)器學(xué)習(xí)微積分一本通(Python版)
- Mastering Elasticsearch(Second Edition)
- Getting Started with Nano Server
- Clojure for Java Developers