
Processing the Parquet files

Apache Parquet is another columnar data format used by many tools in the Hadoop ecosystem, such as Hive, Pig, and Impala. It improves performance through efficient compression, a columnar storage layout, and encoding routines. The Parquet processing example is very similar to the earlier JSON Scala code: the DataFrame is created and then saved in Parquet format by calling the parquet method on its write interface:

df.write.parquet("hdfs://localhost:9000/tmp/test.parquet")
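For context, here is a minimal, self-contained sketch of the full round trip, assuming a local SparkSession and a small hypothetical DataFrame standing in for the one built from the earlier JSON example; the names ParquetExample, name, and age are illustrative only:

import org.apache.spark.sql.SparkSession

object ParquetExample {
  def main(args: Array[String]): Unit = {
    // Assumes a local Spark installation; on a cluster the master
    // would normally be supplied via spark-submit instead.
    val spark = SparkSession.builder()
      .appName("ParquetExample")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // A small hypothetical DataFrame in place of the JSON-derived one.
    val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

    // Write the DataFrame in Parquet format; Spark produces one
    // part file per partition inside the target directory.
    df.write.parquet("hdfs://localhost:9000/tmp/test.parquet")

    // Reading the directory back yields an equivalent DataFrame.
    val restored = spark.read.parquet("hdfs://localhost:9000/tmp/test.parquet")
    restored.show()

    spark.stop()
  }
}

Note that reading the directory back with spark.read.parquet reconstructs the schema from the Parquet file metadata, so no explicit schema definition is needed.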

This results in an HDFS directory containing one Parquet part file per partition of the DataFrame (eight in this run).
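The number of part files tracks the number of partitions at write time, so if fewer output files are wanted, the DataFrame can be repartitioned before writing. A brief sketch, reusing the df and path from above:

// Coalesce to a single partition so the write produces one part file;
// mode("overwrite") replaces the output of any previous run.
df.coalesce(1).write
  .mode("overwrite")
  .parquet("hdfs://localhost:9000/tmp/test.parquet")

For small outputs a single file is convenient; for large data, keeping multiple part files allows parallel reads downstream.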

For more information about the available SparkContext and SparkSession methods, see the API documentation for the org.apache.spark.SparkContext and org.apache.spark.sql.SparkSession classes in the Apache Spark API reference at http://spark.apache.org/docs/latest/api/scala/index.html.

In the next section, we will examine Apache Spark DataFrames. They were introduced in Spark 1.3 and became first-class citizens in Apache Spark 1.5 and 1.6.
