
Processing the Parquet files

Apache Parquet is another column-oriented data format used by many tools in the Hadoop ecosystem, such as Hive, Pig, and Impala. It improves performance through efficient compression, columnar storage, and encoding. Processing Parquet is very similar to the JSON Scala example: the DataFrame is created and then saved in Parquet format using the write method's parquet option:

df.write.parquet("hdfs://localhost:9000/tmp/test.parquet")

This results in an HDFS directory containing eight Parquet part files, one per partition of the DataFrame.
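As a minimal sketch of the round trip (the SparkSession setup and sample data are assumptions, not part of the original example), the Parquet output can be read straight back into a DataFrame. Parquet files embed their own schema, so no schema needs to be supplied on read:

```scala
// Sketch: write a DataFrame to Parquet on HDFS, then read it back.
// The HDFS path matches the write example above; the session setup
// and sample data are illustrative assumptions.
import org.apache.spark.sql.SparkSession

object ParquetRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-round-trip")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data; any DataFrame would do.
    val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

    // coalesce(1) reduces the output to a single part file;
    // omit it to keep one file per partition, as in the text.
    df.coalesce(1)
      .write
      .parquet("hdfs://localhost:9000/tmp/test.parquet")

    // Parquet is self-describing: the schema is recovered from the files.
    val readBack = spark.read.parquet("hdfs://localhost:9000/tmp/test.parquet")
    readBack.printSchema()
    readBack.show()

    spark.stop()
  }
}
```

Note that the number of output files is controlled by the DataFrame's partitioning, which is why repartition or coalesce is the usual way to tune it before writing.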

For more information about the available SparkContext and SparkSession methods, check the API documentation for the org.apache.spark.SparkContext and org.apache.spark.sql.SparkSession classes in the Apache Spark API reference at http://spark.apache.org/docs/latest/api/scala/index.html.

In the next section, we will examine Apache Spark DataFrames. They were introduced in Spark 1.3 and became first-class citizens in Apache Spark 1.5 and 1.6.
