Processing the Parquet files
Apache Parquet is another columnar data format used by many tools in the Hadoop ecosystem, such as Hive, Pig, and Impala. It improves performance through efficient compression, a columnar layout, and encoding routines. The Parquet processing example is very similar to the JSON Scala code: the DataFrame is created and then saved in Parquet format by calling the parquet method on its write interface:
df.write.parquet("hdfs://localhost:9000/tmp/test.parquet")
This results in an HDFS directory containing eight Parquet files; the exact number of part files depends on how the DataFrame is partitioned.
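As a fuller sketch of the round trip, the following assumes a SparkSession named spark (as provided by spark-shell) and reuses the same HDFS path; the sample DataFrame is purely illustrative and stands in for the one built in the preceding JSON example:
// Assumed: a SparkSession named spark, as in spark-shell
import spark.implicits._
// A small illustrative DataFrame
val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
// Write the DataFrame as a Parquet directory; each partition becomes one part file
df.write.mode("overwrite").parquet("hdfs://localhost:9000/tmp/test.parquet")
// Read the data back; the schema is stored in the Parquet files themselves
val parquetDf = spark.read.parquet("hdfs://localhost:9000/tmp/test.parquet")
parquetDf.printSchema()
parquetDf.show()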
For more information about the available SparkContext and SparkSession methods, check the API documentation for the org.apache.spark.SparkContext and org.apache.spark.sql.SparkSession classes in the Apache Spark API reference at http://spark.apache.org/docs/latest/api/scala/index.html.
In the next section, we will examine Apache Spark DataFrames. They were introduced in Spark 1.3 and became first-class citizens in Apache Spark 1.5 and 1.6.