Title: Haskell Data Analysis Cookbook
Author: Nishant Shukla
Updated: 2021-12-08 12:43:36

Deduplication of nonconflicting data items

Duplication is a common problem when collecting large amounts of data. In this recipe, we will combine similar records in a way that ensures no information is lost.

Getting ready

Create an input.csv file with repeated data:
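
The original listing of the file is not reproduced here. As a stand-in, the following sketch writes a hypothetical input.csv. The rows (name, color, cost) are assumptions chosen only to be consistent with the output shown at the end of this recipe, and the name sampleCsv is made up for illustration.

```haskell
import Data.List (intercalate)

-- Hypothetical sample data: these rows are an assumption, not the
-- original listing. Each row is name,color,cost, and some fields are
-- left empty so that duplicate rows complement each other.
-- The rows are joined without a trailing newline, because Text.CSV's
-- parseCSV reports a trailing newline as an extra empty record.
sampleCsv :: String
sampleCsv = intercalate "\n"
  [ "glasses,black,"
  , "jacket,,89.99"
  , "shirt,red,15"
  , "jacket,brown,"
  , "glasses,,60"
  ]

-- Write the sample rows to input.csv in the current directory.
main :: IO ()
main = writeFile "input.csv" sampleCsv
```

Any file works as long as rows that share a name never disagree on a field that both of them fill in.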

How to do it...

Create a new file, which we will call Main.hs, and perform the following steps:

We will be using Text.CSV to parse the file, Data.Map to hold the combined records, and Control.Applicative for the <|> operator:

import Text.CSV (parseCSV, Record)
import Data.Map (fromListWith)
import Control.Applicative ((<|>))

Define the Item data type corresponding to the CSV input:

data Item = Item   { name :: String
                   , color :: String
                   , cost :: Maybe Float
                   } deriving Show

Read each record from the CSV file and put the records in a map by calling our doWork function:

main :: IO ()
main = do
  let fileName = "input.csv"
  input <- readFile fileName
  let csv = parseCSV fileName input
  either handleError doWork csv

If we're unable to parse the CSV, print an error message; otherwise, call the doWork function, which creates a map from an association list with a collision strategy defined by combine:
```
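handleError :: Show a => a -> IO ()
handleError = print

doWork :: [Record] -> IO ()
doWork csv = print $
             fromListWith combine $
             map parseToTuple $
             -- Skip rows without exactly three fields, such as the empty
             -- record parseCSV yields when the file ends with a newline.
             filter ((== 3) . length) csv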
```

Use the <|> function from Control.Applicative to merge the nonconflicting fields:

combine :: Item -> Item -> Item

combine item1 item2 = 
    Item { name = name item1
         , color = color item1 <|> color item2
         , cost = cost item1 <|> cost item2 }
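
To see the merge in isolation, here is a minimal sketch that applies combine to two partial records. It assumes color is a plain String that is empty when missing, as parseItem produces it; the values jacketA and jacketB are hypothetical.

```haskell
import Control.Applicative ((<|>))

-- Same shape as the recipe's Item; color is assumed to be a plain String.
-- Eq is derived here only so that results can be compared.
data Item = Item
  { name  :: String
  , color :: String
  , cost  :: Maybe Float
  } deriving (Show, Eq)

-- Fill each field from whichever of the two items has it.
combine :: Item -> Item -> Item
combine item1 item2 =
  Item { name  = name item1
       , color = color item1 <|> color item2
       , cost  = cost item1 <|> cost item2 }

-- Two hypothetical partial records for the same product.
jacketA, jacketB :: Item
jacketA = Item { name = "jacket", color = "",      cost = Just 89.99 }
jacketB = Item { name = "jacket", color = "brown", cost = Nothing }

main :: IO ()
main = print (combine jacketA jacketB)
-- prints: Item {name = "jacket", color = "brown", cost = Just 89.99}
```

Because the two records never fill in the same field, the argument order does not change the result here.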

Define the helper functions to create an association list from a CSV record:

parseToTuple :: [String] -> (String, Item)
parseToTuple record = (name item, item) 
    where item = parseItem record


parseItem :: Record -> Item
parseItem record =
    Item { name = record !! 0
         , color = record !! 1
         , cost = case reads (record !! 2) :: [(Float, String)] of
                    [(c, "")] -> Just c
                    _         -> Nothing }
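
Note that parseItem indexes with !!, so it fails at runtime on a row with fewer than three fields. One possible alternative, sketched here under the assumption that malformed rows should simply be rejected, uses pattern matching and readMaybe. The name parseItemSafe is hypothetical and not part of the recipe, and [String] stands in for Record so that the sketch needs no CSV package.

```haskell
import Text.Read (readMaybe)

-- Same shape as the recipe's Item; color is assumed to be a plain String.
data Item = Item
  { name  :: String
  , color :: String
  , cost  :: Maybe Float
  } deriving (Show, Eq)

-- Hypothetical total variant of parseItem: it returns Nothing for any
-- row that does not have exactly three fields instead of crashing.
parseItemSafe :: [String] -> Maybe Item
parseItemSafe [n, col, c] =
  Just (Item { name = n, color = col, cost = readMaybe c })
parseItemSafe _ = Nothing

main :: IO ()
main = mapM_ (print . parseItemSafe)
  [ ["shirt", "red", "15"]   -- well-formed row
  , ["jacket", "brown", ""]  -- missing cost becomes Nothing
  , [""]                     -- malformed row is rejected
  ]
```

With this variant, the association list could be built from the successfully parsed rows only, for example with mapMaybe from Data.Maybe.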

Executing the code shows a map filled with combined results:

$ runhaskell Main.hs

fromList 
[ ("glasses",
 Item {name = "glasses", color = "black", cost = Just 60.0})
, ("jacket",
 Item {name = "jacket", color = "brown", cost = Just 89.99})
, ("shirt",
 Item {name = "shirt", color = "red", cost = Just 15.0})
]

How it works...

The Map data type offers a convenient function, fromListWith :: Ord k => (a -> a -> a) -> [(k, a)] -> Map k a, to easily combine data in the map. Whenever a key already exists, the function we pass in combines the fields of the old and new items, and the result is stored under that key.
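
As a small illustration of this behavior, the following sketch runs fromListWith on a hypothetical list of pairs in which one key appears twice. The names pairs and merged are made up for the example. It also shows that the combining function receives the newer value as its first argument.

```haskell
import qualified Data.Map as Map

-- Hypothetical (key, value) pairs; the key "jacket" appears twice.
pairs :: [(String, [String])]
pairs =
  [ ("jacket", ["first"])
  , ("shirt",  ["only"])
  , ("jacket", ["second"])
  ]

-- On a key collision, fromListWith calls (++) newValue oldValue.
merged :: Map.Map String [String]
merged = Map.fromListWith (++) pairs

main :: IO ()
main = print merged
-- prints: fromList [("jacket",["second","first"]),("shirt",["only"])]
```

In the recipe, this means combine is called with the later record as item1, which makes no difference as long as the records do not conflict.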

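The true hero in this recipe is the <|> function from Control.Applicative, which belongs to the Alternative typeclass. For Maybe, it returns the first argument that is not Nothing. For lists, and therefore String, it is concatenation, so it yields the non-empty argument whenever at most one of the two is non-empty, which is exactly the nonconflicting case this recipe handles. Since both String and Maybe are instances of Alternative, we can reuse <|> for both fields and keep the code manageable. Here are a couple of examples of it in use: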

$ ghci

Prelude> import Control.Applicative

Prelude Control.Applicative> (Nothing) <|> (Just 1)
Just 1

Prelude Control.Applicative> (Just 'a') <|> (Just 'b')
Just 'a'

Prelude Control.Applicative> "" <|> "hello"
"hello"

Prelude Control.Applicative> "" <|> ""
""

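One caveat: when both strings are non-empty, <|> on String concatenates them instead of picking one. If the data might contain conflicting colors, a small helper could be used in place of <|>. The following is only a sketch, and firstNonEmpty is a hypothetical name that does not appear in the recipe.

```haskell
import Control.Applicative ((<|>))

-- Hypothetical helper: keep the first string unless it is empty.
firstNonEmpty :: String -> String -> String
firstNonEmpty "" y = y
firstNonEmpty x  _ = x

main :: IO ()
main = do
  print ("red" <|> "blue")            -- prints "redblue": list <|> is (++)
  print (firstNonEmpty "red" "blue")  -- prints "red"
  print (firstNonEmpty "" "blue")     -- prints "blue"
```

For nonconflicting data, both approaches give the same result, so the recipe's use of <|> is sufficient.
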
There's more...

If you're dealing with a larger number of items, it may be wise to use Data.HashMap.Map instead, because the running time of its operations on a map of n items is O(min(n, W)), where W is the number of bits in an integer (32 or 64).

For even better performance, Data.HashTable.HashTable provides O(1) lookups, but adds complexity by living in the IO monad.
