
Deduplication of nonconflicting data items

Duplication is a common problem when collecting large amounts of data. In this recipe, we will combine similar records in a way that ensures no information is lost.

Getting ready

Create an input.csv file with repeated data, where each row holds a name, a color, and a cost, and the color or cost may be left empty.

How to do it...

Create a new file, which we will call Main.hs, and perform the following steps:

  1. We will be using the Text.CSV, Data.Map, and Control.Applicative modules:
    import Text.CSV (parseCSV, Record)
    import Data.Map (fromListWith)
    import Control.Applicative ((<|>))
  2. Define the Item data type corresponding to the CSV input:
    data Item = Item   { name :: String
                       , color :: String
                       , cost :: Maybe Float
                       } deriving Show
  3. Get each record from CSV and put them in a map by calling our doWork function:
    main :: IO ()
    main = do
      let fileName = "input.csv"
      input <- readFile fileName
      let csv = parseCSV fileName input
      either handleError doWork csv
  4. If we're unable to parse CSV, print an error message; otherwise, define the doWork function that creates a map from an association list with a collision strategy defined by combine:
    handleError = print
    
    doWork :: [Record] -> IO ()
    doWork csv = print $ 
                 fromListWith combine $ 
                 map parseToTuple csv
  5. Use the <|> function from Control.Applicative to merge the nonconflicting fields:
    combine :: Item -> Item -> Item
    
    combine item1 item2 = 
        Item { name = name item1
             , color = color item1 <|> color item2
             , cost = cost item1 <|> cost item2 }
  6. Define the helper functions to create an association list from a CSV record:
    parseToTuple :: [String] -> (String, Item)
    parseToTuple record = (name item, item) 
        where item = parseItem record
    
    
    parseItem :: Record -> Item
    parseItem record = 
        Item { name = record !! 0
             , color = record !! 1
             , cost = case reads (record !! 2) :: [(Float, String)] of
                        [(c, "")] -> Just c
                        _ -> Nothing }
  7. Executing the code shows a map filled with combined results:
    $ runhaskell Main.hs
    
    fromList 
    [ ("glasses",
     Item {name = "glasses", color = "black", cost = Just 60.0})
    , ("jacket",
     Item {name = "jacket", color = "brown", cost = Just 89.99})
    , ("shirt",
     Item {name = "shirt", color = "red", cost = Just 15.0})
    ]
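
The parseItem helper in step 6 indexes each record with !!, so any row with fewer than three fields (for example, an empty trailing record, which some CSV parsers emit after a final newline) would raise an exception. The following is a minimal sketch of a total alternative, not part of the original recipe. The names parseItemSafe, parseCost, and toTuples are hypothetical, Record is redefined locally so the sketch compiles without the csv package, and color is assumed to be a plain String, as the output in step 7 suggests.

```haskell
import Data.Maybe (mapMaybe)

-- Same shape as Text.CSV's Record alias, repeated here so the sketch
-- compiles on its own.
type Record = [String]

data Item = Item { name  :: String
                 , color :: String
                 , cost  :: Maybe Float
                 } deriving (Show, Eq)

-- Hypothetical total variant of parseItem: rows that do not have exactly
-- three fields yield Nothing instead of crashing on (!!).
parseItemSafe :: Record -> Maybe Item
parseItemSafe [n, c, p] = Just Item { name = n, color = c, cost = parseCost p }
parseItemSafe _         = Nothing

-- Accept a cost only if the whole field parses as a number.
parseCost :: String -> Maybe Float
parseCost s = case reads s of
  [(x, "")] -> Just x
  _         -> Nothing

-- Build the association list, silently dropping malformed rows.
toTuples :: [Record] -> [(String, Item)]
toTuples = mapMaybe (fmap (\i -> (name i, i)) . parseItemSafe)
```

With this variant, doWork could pass toTuples csv to fromListWith in place of map parseToTuple csv. Whether malformed rows should be dropped silently or reported is a design choice that depends on the data.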
    

How it works...

The Map data type offers a convenient function fromListWith :: Ord k => (a -> a -> a) -> [(k, a)] -> Map k a to easily combine data in the map. Whenever a key occurs more than once in the association list, fromListWith calls the supplied function with the new and the existing values and stores the result under that key. In our case, this combines the fields of the old and new items.
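
To make the collision behavior concrete, here is a small sketch using list concatenation as the combining function. The keys and values are made up for illustration. It shows that the combining function receives the newer value first and the value already in the map second, which means that in combine, item1 is the record seen later in the file.

```haskell
import qualified Data.Map as Map

-- fromListWith f applies f newValue oldValue when a key repeats.
merged :: Map.Map String [String]
merged = Map.fromListWith (++)
  [ ("shirt", ["first"])
  , ("hat",   ["only"])
  , ("shirt", ["second"])
  ]
-- Map.toList merged == [("hat",["only"]),("shirt",["second","first"])]
```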

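The true hero in this recipe is the <|> function from Control.Applicative, which belongs to the Alternative typeclass. For Maybe, it returns the first argument that is not Nothing. For lists, and therefore String, it concatenates its arguments, so it yields the non-empty argument whenever the other one is empty. Since both String and Maybe are instances of Alternative, we can reuse the <|> function for both fields and keep the code manageable. Note that if two records for the same key both carry a non-empty color, the strings are concatenated, so the recipe relies on at most one of them being filled in. Here are a couple of examples of it in use: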

$ ghci

Prelude> import Control.Applicative

Prelude Control.Applicative> (Nothing) <|> (Just 1)
Just 1

Prelude Control.Applicative> (Just 'a') <|> (Just 'b')
Just 'a'

Prelude Control.Applicative> "" <|> "hello"
"hello"

Prelude Control.Applicative> "" <|> ""
""

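The session above only shows strings where at least one side is empty. The sketch below illustrates what happens otherwise, together with a hypothetical helper, firstNonEmpty, that is not part of the original recipe. It could replace <|> for the color field if duplicate rows may repeat the same value.

```haskell
import Control.Applicative ((<|>))

-- For Maybe, (<|>) keeps the first Just.
maybeExample :: Maybe Int
maybeExample = Just 1 <|> Just 2        -- Just 1

-- For lists (and therefore String), (<|>) is concatenation.
stringExample :: String
stringExample = "red" <|> "red"         -- "redred"

-- Hypothetical helper: keep the first non-empty string instead of
-- concatenating both.
firstNonEmpty :: String -> String -> String
firstNonEmpty a b
  | null a    = b
  | otherwise = a
```
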
There's more...

If you're dealing with larger amounts of data, it may be wise to use Data.HashMap.Map instead, because the running time of its operations on a map of n items is O(min(n, W)), where W is the number of bits in an integer (32 or 64).

For even better performance, Data.HashTable.HashTable provides O(1) performance for lookups, but adds complexity because it lives in the IO monad.

See also

If the corpus contains inconsistent information about duplicated data, see the next recipe on Deduplication of conflicting data items.
