
Parquet

Apache Parquet is a columnar storage format specifically designed for the Hadoop ecosystem. Traditional row-based storage formats are optimized to work with one record at a time, meaning they can be slow for certain types of workload. Instead, Parquet serializes and stores data by column, thus allowing for optimization of storage, compression, predicate processing, and bulk sequential access across large datasets - exactly the type of workload suited to Spark!
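To make the columnar advantage concrete, the following sketch (the path and column name are hypothetical, not taken from the book's repository) reads a Parquet dataset, selects a single column, and applies a filter; Spark can then skip the unread columns entirely and push the predicate down to the Parquet reader:

import org.apache.spark.sql.functions.col

// Hypothetical example: column pruning and predicate pushdown with Parquet.
// Only the selected column is read from disk, and the filter can be pushed
// down to Parquet's row-group and page statistics where possible.
val gkgDF = spark.read.parquet("/path/to/parquet/output")

val bbcMentions = gkgDF
  .select("SourceCommonName")
  .filter(col("SourceCommonName") === "bbc.co.uk")

bbcMentions.explain()  // the physical plan shows the pushed filters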

As Parquet implements per-column data compression, it is particularly well suited to CSV data, especially fields of low cardinality, and file sizes can see huge reductions when compared to Avro.

+--------------------------+--------------+ 
|                 File Type|  Size (bytes)| 
+--------------------------+--------------+ 
|20160101020000.gkg.csv    |      20326266| 
|20160101020000.gkg.avro   |      13557119| 
|20160101020000.gkg.parquet|       6567110| 
|20160101020000.gkg.csv.bz2|       4028862| 
+--------------------------+--------------+ 
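The sizes above are simply the on-disk lengths of the same GKG snapshot in each format; a minimal sketch along these lines (the directory path is a placeholder) reproduces such a comparison:

import java.io.File

// Sketch: list the on-disk size of the same GKG snapshot in each format.
// The file names are placeholders matching the table above.
val files = Seq(
  "20160101020000.gkg.csv",
  "20160101020000.gkg.avro",
  "20160101020000.gkg.parquet",
  "20160101020000.gkg.csv.bz2"
)

files.foreach { name =>
  val f = new File(s"/path/to/data/$name")
  println(f"$name%-30s ${f.length}%12d")
}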

Parquet also integrates with Avro natively. Parquet takes an Avro in-memory representation of the data and maps it to its own internal data types. It then serializes the data to disk using the Parquet columnar file format.

We have seen how to apply Avro to the model; now we can take the next step and use this Avro model to persist data to disk via the Parquet format. Again, we will show the current method and then some lower-level code for demonstrative purposes. First, the recommended method:

val gdeltAvroDF = spark 
    .read
    .format("com.databricks.spark.avro")
    .load("/path/to/avro/output")

gdeltAvroDF.write.parquet("/path/to/parquet/output")
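As an optional sanity check (not part of the book's listing), the Parquet output can be read back and compared against the Avro source to confirm the round trip:

// Optional check: the Parquet copy should have the same schema and the
// same number of records as the Avro DataFrame it was written from.
val gdeltParquetDF = spark.read.parquet("/path/to/parquet/output")

gdeltParquetDF.printSchema()
assert(gdeltParquetDF.count() == gdeltAvroDF.count())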

Now for the detail behind how Avro and Parquet relate to each other:

import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, IndexedRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

val inputFile = new File("/path/to/avro/output")
val outputFile = new Path("/path/to/parquet/output")

// The schema comes from the Avro-generated Specification class
val schema = Specification.getClassSchema
val reader = new GenericDatumReader[IndexedRecord](schema)
val avroFileReader = DataFileReader.openReader(inputFile, reader)

// AvroParquetWriter maps Avro records onto Parquet's columnar format
val parquetWriter =
    new AvroParquetWriter[IndexedRecord](outputFile, schema)

while (avroFileReader.hasNext) {
    parquetWriter.write(avroFileReader.next())
}

avroFileReader.close()
parquetWriter.close()

As before, the lower-level code is quite verbose, although it does give some insight into the various steps required. You can find the full code in our repository.

We now have a great model to store and retrieve our GKG data that uses Avro and Parquet and can easily be implemented using DataFrames.
