Deduplication of nonconflicting data items

Duplication is a common problem when collecting large amounts of data. In this recipe, we will combine similar records in a way that ensures no information is lost.

Getting ready

Create an input.csv file with repeated data. Each row holds a name, a color, and a cost, and any field other than the name may be left empty.

How to do it...

Create a new file, which we will call Main.hs, and perform the following steps:

  1. We will be using the Text.CSV, Data.Map, and Control.Applicative modules:
    import Text.CSV (parseCSV, Record)
    import Data.Map (fromListWith)
    import Control.Applicative ((<|>))
  2. Define the Item data type corresponding to the CSV input:
    data Item = Item   { name :: String
                       , color :: String
                       , cost :: Maybe Float
                       } deriving Show
  3. Get each record from CSV and put them in a map by calling our doWork function:
    main :: IO ()
    main = do
      let fileName = "input.csv"
      input <- readFile fileName
      let csv = parseCSV fileName input
      either handleError doWork csv
  4. If we're unable to parse CSV, print an error message; otherwise, define the doWork function that creates a map from an association list with a collision strategy defined by combine:
    handleError = print
    
    doWork :: [Record] -> IO ()
    doWork csv = print $ 
                 fromListWith combine $ 
                 map parseToTuple csv
  5. Use the <|> function from Control.Applicative to merge the nonconflicting fields:
    -- fromListWith passes the newly seen item first and the stored item second.
    combine :: Item -> Item -> Item
    combine item1 item2 = 
        Item { name = name item1
             , color = color item1 <|> color item2
             , cost = cost item1 <|> cost item2 }
  6. Define the helper functions to create an association list from a CSV record:
    parseToTuple :: [String] -> (String, Item)
    parseToTuple record = (name item, item) 
        where item = parseItem record
    
    
    parseItem :: Record -> Item
    parseItem record = 
        Item { name = record !! 0
             , color = record !! 1
             , cost = case reads (record !! 2) :: [(Float, String)] of
                        [(c, "")] -> Just c
                        _         -> Nothing }
  7. Executing the code shows a map filled with combined results:
    $ runhaskell Main.hs
    
    fromList 
    [ ("glasses",
     Item {name = "glasses", color = "black", cost = Just 60.0})
    , ("jacket",
     Item {name = "jacket", color = "brown", cost = Just 89.99})
    , ("shirt",
     Item {name = "shirt", color = "red", cost = Just 15.0})
    ]
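
The merge logic from the steps above can also be tried without a CSV file. The following is a minimal sketch, not the recipe's own code: the rows are hypothetical in-memory data, color is assumed to be a plain String (as the printed output suggests), and parseItem is rewritten with pattern matching and readMaybe so that malformed rows are skipped instead of crashing on !!.

```haskell
import Control.Applicative ((<|>))
import qualified Data.Map as Map
import Text.Read (readMaybe)

data Item = Item { name  :: String
                 , color :: String        -- assumption: plain String, "" when missing
                 , cost  :: Maybe Float
                 } deriving (Show, Eq)

-- Merge two items that share a name, field by field.
combine :: Item -> Item -> Item
combine a b = Item { name  = name a
                   , color = color a <|> color b
                   , cost  = cost a <|> cost b }

-- Hypothetical rows standing in for the parsed CSV records.
rows :: [[String]]
rows = [ ["glasses", "black", ""]
       , ["glasses", "",      "60"]
       , ["shirt",   "red",   "15"]
       , ["broken row"] ]

-- Accept only rows with exactly three fields.
parseItem :: [String] -> Maybe Item
parseItem [n, c, p] = Just (Item n c (readMaybe p))
parseItem _         = Nothing

merged :: Map.Map String Item
merged = Map.fromListWith combine
           [ (name i, i) | Just i <- map parseItem rows ]

main :: IO ()
main = print merged
```

Here the two "glasses" rows each supply one of the missing fields, so the merged item ends up with both a color and a cost.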
    

How it works...

Data.Map offers the convenient function fromListWith :: Ord k => (a -> a -> a) -> [(k, a)] -> Map k a, which builds a map from an association list. Whenever a key appears more than once, fromListWith calls the supplied function with the new value and the value already stored, and keeps the result under that key. This is how the fields of the old and new items get combined.
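
To see the order in which fromListWith hands values to the combining function, here is a small sketch with made-up keys and values. It uses (++) as the combining function so the order stays visible in the result.

```haskell
import qualified Data.Map as Map

-- The key "a" occurs twice; the later value [3] is passed first,
-- the stored value [1] second, so the result is [3] ++ [1].
demo :: Map.Map String [Int]
demo = Map.fromListWith (++) [("a", [1]), ("b", [2]), ("a", [3])]

main :: IO ()
main = print demo   -- fromList [("a",[3,1]),("b",[2])]
```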

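The true hero in this recipe is the <|> function from Control.Applicative. For Maybe values, <|> returns the first argument that is not Nothing. For lists, and therefore for String, <|> is concatenation, so it yields the nonempty argument whenever the other one is empty. Since both types are instances of the Alternative typeclass, we can reuse the same <|> operator for every field, which keeps the code manageable. Note that this relies on the data being nonconflicting: if both records carry a nonempty String field, the two strings are joined rather than deduplicated. Here are a couple of examples of it in use: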

$ ghci

Prelude> import Control.Applicative

Prelude Control.Applicative> (Nothing) <|> (Just 1)
Just 1

Prelude Control.Applicative> (Just 'a') <|> (Just 'b')
Just 'a'

Prelude Control.Applicative> "" <|> "hello"
"hello"

Prelude Control.Applicative> "" <|> ""
""

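The difference between Maybe and String under <|> is easy to miss, so here is a short sketch of it. The firstNonEmpty helper is hypothetical and not part of the recipe; it shows one possible way to keep the first nonempty string when both sides might be filled in.

```haskell
import Control.Applicative ((<|>))

-- Hypothetical helper: keep the first string unless it is empty.
firstNonEmpty :: String -> String -> String
firstNonEmpty "" y = y
firstNonEmpty x  _ = x

main :: IO ()
main = do
  print (Just 1 <|> Just 2 :: Maybe Int)   -- Just 1: Maybe keeps the first Just
  putStrLn ("red" <|> "red")               -- redred: lists are concatenated
  putStrLn (firstNonEmpty "red" "red")     -- red
  putStrLn (firstNonEmpty "" "blue")       -- blue
```

Swapping firstNonEmpty in for <|> on String fields would be a design change to the recipe, so treat it as an option rather than a fix.
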
There's more...

If you're dealing with larger amounts of data, it may be wise to use Data.HashMap.Map instead, because the running time of its operations on a map of n items is O(min(n, W)), where W is the number of bits in an integer (32 or 64).

For even better performance, Data.HashTable.HashTable provides O(1) lookups but adds complexity by living in the IO monad.

See also

If the corpus contains inconsistent information about duplicated data, see the next recipe on Deduplication of conflicting data items.
