
Deduplication of conflicting data items

Unfortunately, information about an item may be inconsistent throughout the corpus. Collision strategies are often domain-dependent, but one common way to handle such conflicts is to simply store every variation of the data. In this recipe, we will read a CSV file that contains information about musical artists and store all of the information about their songs and genres in sets.

Getting ready

Create a CSV input file, input.csv, that lists musical artists. The first column is the name of the artist or band, the second column is the song name, and the third is the genre. Some musicians should appear on several rows with different songs or genres.

How to do it...

Create a new file, which we will call Main.hs, and perform the following steps:

  1. We will be using the CSV, Map, and Set modules:
    import Text.CSV (parseCSV, Record)
    import Data.Map (fromListWith)
    import qualified Data.Set as S
  2. Define the Artist data type corresponding to the CSV input. For fields that may contain conflicting data, store the values in a set. In this case, song- and genre-related data are stored in sets of strings:
    data Artist = Artist { name :: String
                         , song :: S.Set String
                         , genre :: S.Set String
                         } deriving Show
  3. Extract data from CSV and insert it in a map:
    main :: IO ()
    main = do
      let fileName = "input.csv"
      input <- readFile fileName
      let csv = parseCSV fileName input
      either handleError doWork csv
  4. Print out any error that might occur:
    handleError = print
  5. If no error occurs, then combine the data from the CSV and print it out:
    doWork :: [Record] -> IO ()
    doWork csv = print $ 
                 fromListWith combine $ 
                 map parseToTuple csv
  6. Define combine, the collision strategy applied when two records in the association list share a key. It keeps the name and merges the song and genre sets:
    combine :: Artist -> Artist -> Artist
    combine artist1 artist2 = 
        Artist { name = name artist1
               , song = S.union (song artist1) (song artist2)
               , genre = S.union (genre artist1) (genre artist2) }
  7. Define the helper functions that turn each CSV record into a key-value pair for the association list:
    parseToTuple :: [String] -> (String, Artist)
    parseToTuple record = (name item, item) 
      where item = parseItem record
    
    parseItem :: Record -> Artist
    parseItem record = 
      Artist { name = nameStr
             , song = if null songStr 
                      then S.empty 
                      else S.singleton songStr
             , genre = if null genreStr 
                       then S.empty 
                       else S.singleton genreStr
             }
      where nameStr  = record !! 0
            songStr  = record !! 1
            genreStr = record !! 2
  8. The program prints a map containing the combined information (laid out here over several lines for readability):
    fromList [ 
    ("Daft Punk", Artist 
      {  name = "Daft Punk", 
        song = fromList ["Around the World","Get Lucky"], 
        genre = fromList ["French house"]}),
    ("Junior Boys", Artist 
      {  name = "Junior Boys", 
        song = fromList ["Bits & Pieces"], 
        genre = fromList ["Synthpop"]}),
    ("Justice", Artist 
      {  name = "Justice", 
        song = fromList ["Genesis"], 
        genre = fromList ["Electro","Electronic rock"]}),
    ("Madeon", Artist 
      {  name = "Madeon", 
        song = fromList ["Icarus"], 
        genre = fromList ["French house"]})]
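
Note that parseItem indexes each record with !!, which raises an exception on any record that has fewer than three fields. CSV parsers can produce such records for blank or trailing lines, so a pattern-matching variant may be safer. The following is a minimal sketch rather than the recipe's own code: parseItemSafe, toSet, buildMap, and sampleRecords are hypothetical names, and the sample rows are invented to be consistent with the output shown above. It works on already-parsed records, so it only needs the containers library.

```haskell
import qualified Data.Map as M
import qualified Data.Set as S
import Data.Maybe (mapMaybe)

data Artist = Artist { name  :: String
                     , song  :: S.Set String
                     , genre :: S.Set String
                     } deriving (Show, Eq)

-- An empty field contributes nothing; otherwise a one-element set.
toSet :: String -> S.Set String
toSet str
  | null str  = S.empty
  | otherwise = S.singleton str

-- Hypothetical total variant of parseItem: records that are too short
-- or have an empty name yield Nothing instead of crashing.
parseItemSafe :: [String] -> Maybe Artist
parseItemSafe (n : s : g : _)
  | not (null n) = Just (Artist n (toSet s) (toSet g))
parseItemSafe _ = Nothing

-- Same collision strategy as in the recipe.
combine :: Artist -> Artist -> Artist
combine a1 a2 =
  Artist { name  = name a1
         , song  = S.union (song a1) (song a2)
         , genre = S.union (genre a1) (genre a2) }

-- Skip malformed records, then merge the rest by artist name.
buildMap :: [[String]] -> M.Map String Artist
buildMap records = M.fromListWith combine
  [ (name artist, artist) | artist <- mapMaybe parseItemSafe records ]

-- Invented rows for illustration; the last one mimics a blank line.
sampleRecords :: [[String]]
sampleRecords =
  [ ["Daft Punk", "Get Lucky", "French house"]
  , ["Daft Punk", "Around the World", "French house"]
  , ["Justice", "Genesis", "Electro"]
  , ["Justice", "Genesis", "Electronic rock"]
  , [""]
  ]
```

Loading this in GHCi and evaluating buildMap sampleRecords gives a map with two keys, and the malformed last record is dropped silently. Whether to drop, log, or reject such records is a domain decision.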

How it works...

The Map data type offers a convenient function, fromListWith :: Ord k => (a -> a -> a) -> [(k, a)] -> Map k a, for combining values while a Map is being built. Whenever a key occurs more than once in the association list, fromListWith applies the supplied function to the new and the existing value and stores the result under that key.
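
The behavior of fromListWith is easiest to see on a small association list. The sketch below is independent of the recipe; wordCounts and concatenated are illustrative names. The second definition shows that the combining function receives the newer value as its first argument and the accumulated value as its second, which matters for non-commutative functions such as (++).

```haskell
import qualified Data.Map as M

-- Values that share a key are summed: "a" appears twice.
wordCounts :: M.Map String Int
wordCounts = M.fromListWith (+) [("a", 1), ("b", 1), ("a", 1)]

-- With (++), later values end up in front of earlier ones,
-- because the function is called as: f newValue oldValue.
concatenated :: M.Map Int String
concatenated = M.fromListWith (++) [(5, "a"), (5, "b"), (3, "c"), (5, "d")]
-- concatenated == M.fromList [(3, "c"), (5, "dba")]
```

In the recipe, combine merges both sets with S.union, which is commutative, and both records carry the same name, so the argument order does not affect the result there.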

We store songs and genres in sets so that S.union can merge them efficiently and discard duplicate entries.
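
As a small illustration of why sets suit this job, the following sketch merges two genre sets that overlap in one element. The names genresA, genresB, and merged are made up for the example.

```haskell
import qualified Data.Set as S

genresA :: S.Set String
genresA = S.fromList ["Electro", "Electronic rock"]

genresB :: S.Set String
genresB = S.fromList ["Electro", "French house"]

-- The shared element "Electro" appears only once in the union,
-- and elements are kept in ascending order.
merged :: S.Set String
merged = S.union genresA genresB
-- S.toList merged == ["Electro", "Electronic rock", "French house"]
```

Because a set keeps its elements in ascending order, the printed output of the recipe lists songs and genres alphabetically rather than in the order they appear in the CSV file.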

There's more...

When dealing with a large number of items, it may be wise to use Data.HashMap.Map instead, because the running time of a lookup or insertion in a map of n items is O(min(n, W)), where W is the number of bits in an integer (32 or 64).

For even better performance, Data.HashTable.HashTable provides O(1) lookups on average, but it adds complexity because it operates in the IO monad.

See also

If the corpus contains nonconflicting information about duplicated data, see the previous section on Deduplication of nonconflicting data items.
