
Processing the Parquet files

Apache Parquet is another columnar data format used by many tools in the Hadoop ecosystem, such as Hive, Pig, and Impala. It improves performance through efficient compression, columnar storage, and encoding routines. The Parquet processing example is very similar to the JSON Scala code: the DataFrame is created and then saved in Parquet format using the write method followed by a call to parquet:

df.write.parquet("hdfs://localhost:9000/tmp/test.parquet")
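For readers who want a runnable, end-to-end version, the following is a minimal sketch. The SparkSession setup, the application name, and the input path hdfs://localhost:9000/tmp/test.json are illustrative assumptions; in the book's example, df is the DataFrame built from the JSON data in the previous section:

import org.apache.spark.sql.SparkSession

// Hypothetical local session; cluster settings will differ.
val spark = SparkSession.builder()
  .appName("parquet-example")
  .master("local[*]")
  .getOrCreate()

// Assumed input: the JSON data from the earlier example.
val df = spark.read.json("hdfs://localhost:9000/tmp/test.json")

// Writing in Parquet format creates a directory, not a single file;
// Spark emits one part file per partition plus metadata files.
df.write.parquet("hdfs://localhost:9000/tmp/test.parquet")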

This results in an HDFS directory, test.parquet, which contains eight Parquet files.
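The directory can be read back into a DataFrame by pointing the reader at the directory path rather than at an individual part file; a short sketch under the same assumptions as above:

// Read the Parquet directory back; the schema is recovered
// from the Parquet metadata, so no explicit schema is needed.
val parquetDF = spark.read.parquet("hdfs://localhost:9000/tmp/test.parquet")
parquetDF.printSchema()
parquetDF.show(5)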

For more information about the available SparkContext and SparkSession methods, check the API documentation for the org.apache.spark.SparkContext and org.apache.spark.sql.SparkSession classes in the Apache Spark API reference at http://spark.apache.org/docs/latest/api/scala/index.html.

In the next section, we will examine Apache Spark DataFrames. They were introduced in Spark 1.3 and became first-class citizens in Apache Spark 1.5 and 1.6.
