
Processing the Parquet files

Apache Parquet is another columnar data format used by many tools in the Hadoop ecosystem, such as Hive, Pig, and Impala. It improves performance through efficient compression, a columnar layout, and encoding routines. The Parquet processing example is very similar to the JSON Scala code: the DataFrame is created and then saved in Parquet format by calling the parquet method on its write interface:

df.write.parquet("hdfs://localhost:9000/tmp/test.parquet")

This results in an HDFS directory containing eight Parquet part files, one per DataFrame partition.
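The round trip can be sketched as follows. This is a minimal sketch, not the book's exact code: it assumes a SparkSession named spark is available (for example in spark-shell), that df is the DataFrame built earlier in the chapter, and that the HDFS path matches the example above.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local or cluster SparkSession; in spark-shell one is
// already provided as `spark`.
val spark = SparkSession.builder()
  .appName("parquet-example")
  .getOrCreate()

// `df` is assumed to be the DataFrame created earlier in the chapter.
// Write it as Parquet; Spark produces one part file per partition.
df.write.parquet("hdfs://localhost:9000/tmp/test.parquet")

// Read the data back. The schema is recovered automatically from the
// Parquet file metadata, so no schema needs to be supplied.
val parquetDf = spark.read.parquet("hdfs://localhost:9000/tmp/test.parquet")
parquetDf.printSchema()
```

Because Parquet stores the schema alongside the data, the read side needs only the path; this is one of the practical advantages over plain text formats such as CSV.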

For more information about possible SparkContext and SparkSession methods, check the API documentation of the classes called org.apache.spark.SparkContext and org.apache.spark.sql.SparkSession, using the Apache Spark API reference at http://spark.apache.org/docs/latest/api/scala/index.html.

In the next section, we will examine Apache Spark DataFrames. They were introduced in Spark 1.3 and became first-class citizens in Apache Spark 1.5 and 1.6.
