- Mastering Apache Spark 2.x (Second Edition)
- Romeo Kienzler
- 2021-07-02 18:55:27
Apache Spark SQL
In this chapter, we will examine Apache Spark SQL, DataFrames, and Datasets, all of which are built on top of Resilient Distributed Datasets (RDDs). DataFrames were introduced in Spark 1.3, essentially replacing SchemaRDDs, and are columnar data storage structures roughly equivalent to relational database tables. Datasets were introduced as an experimental feature in Spark 1.6 and became an additional component in Spark 2.0.
We have tried to reduce the dependencies between individual chapters as much as possible so that you can work through them in any order you like. However, we do recommend that you read this chapter, because the other chapters depend on a knowledge of DataFrames and Datasets.
This chapter will cover the following topics:
- SparkSession
- Importing and saving data
- Processing the text files
- Processing the JSON files
- Processing the Parquet files
- DataSource API
- DataFrames
- Datasets
- Using SQL
- User-defined functions
- RDDs versus DataFrames versus Datasets
Before moving on to SQL, DataFrames, and Datasets, we will give an overview of SparkSession.