
Parquet

Apache Parquet is a columnar storage format specifically designed for the Hadoop ecosystem. Traditional row-based storage formats are optimized to work with one record at a time, meaning they can be slow for certain types of workload. Instead, Parquet serializes and stores data by column, thus allowing for optimization of storage, compression, predicate processing, and bulk sequential access across large datasets - exactly the type of workload suited to Spark!
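To make the columnar advantage concrete, the following sketch (the path and column name are hypothetical, not taken from the book's repository) reads a Parquet dataset, selects a single column, and applies a filter; Spark can then skip the unread columns entirely and push the predicate down to the Parquet reader:

import org.apache.spark.sql.functions.col

// Hypothetical example: column pruning and predicate pushdown with Parquet.
// Only the selected column is read from disk, and the filter can be pushed
// down to Parquet's row-group and page statistics where possible.
val gkgDF = spark.read.parquet("/path/to/parquet/output")

val bbcMentions = gkgDF
  .select("SourceCommonName")
  .filter(col("SourceCommonName") === "bbc.co.uk")

bbcMentions.explain()  // the physical plan shows the pushed filters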

As Parquet implements per-column data compression, it is particularly well suited to CSV data, especially fields of low cardinality, and file sizes can see huge reductions when compared to Avro.

+--------------------------+--------------+ 
|                 File Type|  Size (bytes)| 
+--------------------------+--------------+ 
|20160101020000.gkg.csv    |      20326266| 
|20160101020000.gkg.avro   |      13557119| 
|20160101020000.gkg.parquet|       6567110| 
|20160101020000.gkg.csv.bz2|       4028862| 
+--------------------------+--------------+ 
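The sizes above are simply the on-disk lengths of the same GKG snapshot in each format; a minimal sketch along these lines (the directory path is a placeholder) reproduces such a comparison:

import java.io.File

// Sketch: list the on-disk size of the same GKG snapshot in each format.
// The file names are placeholders matching the table above.
val files = Seq(
  "20160101020000.gkg.csv",
  "20160101020000.gkg.avro",
  "20160101020000.gkg.parquet",
  "20160101020000.gkg.csv.bz2"
)

files.foreach { name =>
  val f = new File(s"/path/to/data/$name")
  println(f"$name%-30s ${f.length}%12d")
}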

Parquet also integrates with Avro natively. Parquet takes an Avro in-memory representation of the data and maps it to its own internal data types. It then serializes the data to disk using the Parquet columnar file format.

We have seen how to apply Avro to the model; now we can take the next step and use this Avro model to persist data to disk via the Parquet format. Again, we will show the current method and then some lower-level code for demonstrative purposes. First, the recommended method:

val gdeltAvroDF = spark 
    .read
    .format("com.databricks.spark.avro")
    .load("/path/to/avro/output")

gdeltAvroDF.write.parquet("/path/to/parquet/output")
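As an optional sanity check (not part of the book's listing), the Parquet output can be read back and compared against the Avro source to confirm the round trip:

// Optional check: the Parquet copy should have the same schema and the
// same number of records as the Avro DataFrame it was written from.
val gdeltParquetDF = spark.read.parquet("/path/to/parquet/output")

gdeltParquetDF.printSchema()
assert(gdeltParquetDF.count() == gdeltAvroDF.count())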

Now for the detail behind how Avro and Parquet relate to each other:

import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, IndexedRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

val inputFile = new File("/path/to/avro/output")
val outputFile = new Path("/path/to/parquet/output")

// The schema comes from the Avro-generated Specification class
val schema = Specification.getClassSchema
val reader = new GenericDatumReader[IndexedRecord](schema)
val avroFileReader = DataFileReader.openReader(inputFile, reader)

// AvroParquetWriter maps Avro records onto Parquet's columnar format
val parquetWriter =
    new AvroParquetWriter[IndexedRecord](outputFile, schema)

while (avroFileReader.hasNext) {
    parquetWriter.write(avroFileReader.next())
}

avroFileReader.close()
parquetWriter.close()

As before, the lower-level code is quite verbose, although it does give some insight into the various steps required. You can find the full code in our repository.

We now have a great model to store and retrieve our GKG data that uses Avro and Parquet and can easily be implemented using DataFrames.
