- Haskell Data Analysis Cookbook
- Nishant Shukla
- 2021-12-08 12:43:36
Deduplication of nonconflicting data items
Duplication is a common problem when collecting large amounts of data. In this recipe, we will combine similar records in a way that ensures no information is lost.
Getting ready
Create an input.csv file with repeated data:

How to do it...
Create a new file, which we will call Main.hs, and perform the following steps:
- We will be using the Text.CSV, Data.Map, and Control.Applicative modules:

      import Text.CSV (parseCSV, Record)
      import Data.Map (fromListWith)
      import Control.Applicative ((<|>))
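- Define the Item data type corresponding to the CSV input. The color is kept as a plain String, where an empty string stands for a missing value, and the cost is a Maybe Float:

      data Item = Item
        { name  :: String
        , color :: String
        , cost  :: Maybe Float
        } deriving Show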
- Read the records from the CSV file and put them in a map by calling our doWork function:

      main :: IO ()
      main = do
        let fileName = "input.csv"
        input <- readFile fileName
        let csv = parseCSV fileName input
        either handleError doWork csv
- If we're unable to parse the CSV, print an error message; otherwise, call the doWork function, which creates a map from an association list with a collision strategy defined by combine:

      handleError = print

      doWork :: [Record] -> IO ()
      doWork csv = print $ fromListWith combine $ map parseToTuple csv
- Use the <|> function from Control.Applicative to merge the nonconflicting fields:

      combine :: Item -> Item -> Item
      combine item1 item2 = Item
        { name  = name item1
        , color = color item1 <|> color item2
        , cost  = cost item1 <|> cost item2
        }
- Define the helper functions to create an association list from a CSV record:

      parseToTuple :: [String] -> (String, Item)
      parseToTuple record = (name item, item)
        where item = parseItem record

      parseItem :: Record -> Item
      parseItem record = Item
        { name  = record !! 0
        , color = record !! 1
        , cost  = case reads (record !! 2) :: [(Float, String)] of
                    [(c, "")] -> Just c
                    _         -> Nothing
        }
- Executing the code shows a map filled with combined results:

      $ runhaskell Main.hs
      fromList
        [ ("glasses", Item {name = "glasses", color = "black", cost = Just 60.0})
        , ("jacket", Item {name = "jacket", color = "brown", cost = Just 89.99})
        , ("shirt", Item {name = "shirt", color = "red", cost = Just 15.0})
        ]
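The parsing helper in the steps above indexes each record with !!, which raises a runtime error on any row that has fewer than three fields. A minimal sketch of a total alternative follows. The name parseItemSafe and the sample rows are hypothetical and not part of the recipe, and the Item type is repeated, with color as a plain String as in the output shown above, so that the block stands alone. Note that readMaybe tolerates surrounding whitespace, which differs slightly from the reads-based check in the recipe.

```haskell
import Data.Maybe (mapMaybe)
import Text.Read (readMaybe)

-- Same shape as the recipe's Item, repeated so this sketch is self-contained.
data Item = Item
  { name  :: String
  , color :: String
  , cost  :: Maybe Float
  } deriving (Show, Eq)

-- Hypothetical total variant of parseItem: pattern match on the record
-- instead of indexing with (!!), and reject rows without exactly three fields.
parseItemSafe :: [String] -> Maybe Item
parseItemSafe [n, c, p] = Just (Item n c (readMaybe p))
parseItemSafe _         = Nothing

main :: IO ()
main = mapM_ print (mapMaybe parseItemSafe rows)
  where
    -- Made-up rows for illustration; the last one mimics a blank line.
    rows = [["glasses", "black", "60"], ["shirt", "", "abc"], [""]]
```

Rows that do not match the expected shape are dropped by mapMaybe here; whether to drop them, log them, or abort is a design choice that depends on the data.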
How it works...
The Map data type offers a convenient function, fromListWith :: Ord k => (a -> a -> a) -> [(k, a)] -> Map k a, to easily combine data in the map. When a key occurs more than once in the association list, fromListWith uses the supplied function to combine the old and new items and stores the result under that key.
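To see fromListWith in isolation, here is a small sketch with made-up keys and values. It also illustrates that the combining function is called with the new value first and the old value second, which matters when the function is not commutative.

```haskell
import qualified Data.Map as Map

-- Merge duplicate keys by summing their values.
counts :: Map.Map String Int
counts = Map.fromListWith (+) [("shirt", 1), ("jacket", 2), ("shirt", 3)]

-- fromListWith applies the function as f newValue oldValue, so later
-- entries end up on the left when concatenating.
orders :: Map.Map String String
orders = Map.fromListWith (++) [("k", "a"), ("k", "b"), ("k", "c")]

main :: IO ()
main = do
  print counts   -- fromList [("jacket",2),("shirt",4)]
  print orders   -- fromList [("k","cba")]
```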
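The true hero in this recipe is the <|> function from Control.Applicative. For Maybe values, <|> returns the first argument that is not Nothing. For lists, including String, <|> is concatenation, so it returns the other argument unchanged when one side is empty, which is the situation with nonconflicting data. Since both String and Maybe are instances of the Alternative typeclass, we can reuse the <|> function for more manageable code. Here are a couple of examples of it in use: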
    $ ghci
    Prelude> import Control.Applicative
    Prelude Control.Applicative> (Nothing) <|> (Just 1)
    Just 1
    Prelude Control.Applicative> (Just 'a') <|> (Just 'b')
    Just 'a'
    Prelude Control.Applicative> "" <|> "hello"
    "hello"
    Prelude Control.Applicative> "" <|> ""
    ""
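One caveat is worth a sketch: on two non-empty strings, <|> concatenates rather than picking one, whereas on Maybe it keeps the first Just. If two duplicate records could both carry a color, a helper along the following lines may be safer. The name firstNonEmpty is hypothetical and not part of the recipe.

```haskell
import Control.Applicative ((<|>))

-- Hypothetical helper: keep the first string that is not empty.
firstNonEmpty :: String -> String -> String
firstNonEmpty a b = if null a then b else a

main :: IO ()
main = do
  print (Just "black" <|> Just "brown")   -- Just "black"
  print ("black" <|> "brown")             -- "blackbrown"
  print (firstNonEmpty "black" "brown")   -- "black"
  print (firstNonEmpty "" "brown")        -- "brown"
```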
There's more...
If you're dealing with larger amounts of data, it may be wise to use Data.HashMap.Map instead, because many of its operations run in O(min(n, W)) time for n items, where W is the number of bits in an integer (32 or 64).
For even better performance, Data.HashTable.HashTable provides expected O(1) lookups but adds complexity because it lives in the IO monad.
See also
If the corpus contains inconsistent information about duplicated data, see the next recipe on Deduplication of conflicting data items.