- Mastering Apache Spark 2.x (Second Edition)
- Romeo Kienzler
Summary
This chapter started by explaining the SparkSession object and file I/O methods. It then showed that Spark- and HDFS-based data can be manipulated both as DataFrames, using SQL-like methods, and as Datasets, the strongly typed variant of DataFrames, as well as with Spark SQL by registering temporary tables. It also showed that a schema can be inferred using the DataSource API or defined explicitly, using StructType for DataFrames or case classes for Datasets.
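The ideas above can be sketched in a few lines of Spark 2.x Scala. This is a minimal sketch for a Spark shell or standalone app; the `people.json` path and the `Person` fields are hypothetical placeholders, not from the chapter:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("SchemaSketch").getOrCreate()
import spark.implicits._

// Schema inferred automatically by the DataSource API
val inferred = spark.read.json("people.json")

// Schema defined explicitly with StructType
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  IntegerType, nullable = true)))
val df = spark.read.schema(schema).json("people.json")

// A strongly typed Dataset via a case class
case class Person(name: String, age: Int)
val ds = df.as[Person]

// Spark SQL over a registered temporary table
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 21").show()
```

Note that `createOrReplaceTempView` is the Spark 2.x replacement for the older `registerTempTable` call.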
Next, user-defined functions were introduced to show that the functionality of Spark SQL can be extended: you create new functions to suit your needs, register them as UDFs, and then call them from SQL to process data. This lays the foundation for most of the subsequent chapters, as the new DataFrame and Dataset APIs are the recommended way to work with Apache Spark, with RDDs used only as a fallback.
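The register-then-call pattern for a UDF can be sketched as follows. The `initials` function and the sample data are hypothetical; `spark.udf.register` is the standard Spark 2.x entry point for making a Scala function callable from SQL:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("UdfSketch").getOrCreate()
import spark.implicits._

// Register a plain Scala function as a SQL-callable UDF
spark.udf.register("initials", (name: String) =>
  name.split(" ").map(_.take(1).toUpperCase).mkString("."))

// Call the UDF from SQL against a temporary view
Seq("Romeo Kienzler", "Ada Lovelace").toDF("name")
  .createOrReplaceTempView("people")
spark.sql("SELECT name, initials(name) AS ini FROM people").show()
```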
In the coming chapters, we'll look at some of the internals of Apache Spark SQL to understand why these new APIs deliver such dramatic performance improvements over the RDD API. This knowledge is important for writing efficient SQL queries and data transformations on top of the DataFrame and Dataset relational APIs. It is therefore of utmost importance that we examine Catalyst, the Apache Spark optimizer, which takes your high-level program and transforms it into efficient calls on top of the RDD API, and, in later chapters, Tungsten, which is equally integral to the study of Apache Spark.