
Apache Spark SQL

In this chapter, we will examine Apache Spark SQL, DataFrames, and Datasets, all of which sit on top of Resilient Distributed Datasets (RDDs). DataFrames were introduced in Spark 1.3, essentially replacing SchemaRDDs, and are distributed collections of data organized into named columns, roughly equivalent to relational database tables. Datasets were introduced as an experimental API in Spark 1.6 and became a stable component in Spark 2.0.

We have tried to reduce the dependencies between individual chapters as much as possible so that you can work through them in whatever order you like. However, we do recommend that you read this chapter, because the other chapters depend on knowledge of DataFrames and Datasets.

This chapter will cover the following topics:

  • SparkSession
  • Importing and saving data
  • Processing the text files
  • Processing the JSON files
  • Processing the Parquet files
  • DataSource API
  • DataFrames
  • Datasets
  • Using SQL
  • User-defined functions
  • RDDs versus DataFrames versus Datasets

Before moving on to SQL, DataFrames, and Datasets, we will start with an overview of SparkSession.
