Apache Spark SQL

In this chapter, we will examine Apache Spark SQL, DataFrames, and Datasets, all of which sit on top of Resilient Distributed Datasets (RDDs). DataFrames were introduced in Spark 1.3, replacing SchemaRDDs; they are columnar data structures roughly equivalent to relational database tables. Datasets were introduced as an experimental feature in Spark 1.6 and became a standard part of the API in Spark 2.0.

We have tried to reduce the dependencies between individual chapters as much as possible so that you can work through them in any order you like. However, we do recommend that you read this chapter, because the other chapters depend on a knowledge of DataFrames and Datasets.

This chapter will cover the following topics:

  • SparkSession
  • Importing and saving data
  • Processing the text files
  • Processing the JSON files
  • Processing the Parquet files
  • DataSource API
  • DataFrames
  • Datasets
  • Using SQL
  • User-defined functions
  • RDDs versus DataFrames versus Datasets

Before moving on to SQL, DataFrames, and Datasets, we will start with an overview of SparkSession.
