
Processing the Parquet files

Apache Parquet is another columnar data format used by many tools in the Hadoop ecosystem, such as Hive, Pig, and Impala. It improves performance through efficient compression, a columnar layout, and encoding routines. The Parquet processing example is very similar to the previous JSON Scala code. The DataFrame is created and then saved in Parquet format using the write method with the parquet format:

df.write.parquet("hdfs://localhost:9000/tmp/test.parquet")

This results in an HDFS directory containing eight Parquet part files, one for each partition of the source DataFrame.
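To put the write call in context, the following is a minimal, self-contained sketch of the full round trip, assuming a local Spark master and the same HDFS address used above; the sample data and the ParquetExample object name are illustrative rather than part of the original example:

import org.apache.spark.sql.SparkSession

object ParquetExample {
  def main(args: Array[String]): Unit = {
    // Assumed local setup; the HDFS URL matches the one used above
    val spark = SparkSession.builder()
      .appName("parquet-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the DataFrame
    // created in the earlier JSON example
    val df = Seq(("alice", 30), ("bob", 45)).toDF("name", "age")

    // Write in Parquet format; one part file is produced per
    // partition, so coalesce(1) yields a single file rather than eight
    df.coalesce(1)
      .write
      .mode("overwrite") // replace the directory if it already exists
      .parquet("hdfs://localhost:9000/tmp/test.parquet")

    // Read the directory back; the schema is recovered from
    // the Parquet file metadata
    val readBack = spark.read.parquet("hdfs://localhost:9000/tmp/test.parquet")
    readBack.show()

    spark.stop()
  }
}

Note that spark.read.parquet takes the directory path and reads all the part files back as a single DataFrame, so the number of output files is transparent to downstream code.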

For more information about the available SparkContext and SparkSession methods, check the API documentation for the org.apache.spark.SparkContext and org.apache.spark.sql.SparkSession classes in the Apache Spark API reference at http://spark.apache.org/docs/latest/api/scala/index.html.

In the next section, we will examine Apache Spark DataFrames. They were introduced in Spark 1.3 and became first-class citizens in Apache Spark 1.5 and 1.6.
