
Summary

This chapter started by explaining the SparkSession object and file I/O methods. It then showed how Spark- and HDFS-based data can be manipulated both as DataFrames, with their SQL-like methods, and as Datasets, the strongly typed version of DataFrames, as well as with Spark SQL by registering temporary tables. It also showed that the schema can either be inferred through the DataSource API or defined explicitly, using StructType on DataFrames or case classes on Datasets.
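
As a brief reminder of how these pieces fit together, the following minimal Scala sketch reads a JSON file with an inferred schema, applies an explicit StructType, converts the result into a strongly typed Dataset backed by a case class, and queries it through a temporary view. The file name client.json and the Client case class are illustrative assumptions rather than examples taken from this chapter.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Case class backing the strongly typed Dataset (illustrative assumption)
case class Client(id: Long, name: String)

val spark = SparkSession.builder().appName("chapter-summary-sketch").getOrCreate()
import spark.implicits._

// Schema inferred automatically by the DataSource API
val dfInferred = spark.read.json("client.json")

// Schema defined explicitly with StructType
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)
))
val dfExplicit = spark.read.schema(schema).json("client.json")

// Strongly typed Dataset on top of the DataFrame
val clients = dfExplicit.as[Client]

// Register a temporary view and query it with Spark SQL
clients.createOrReplaceTempView("clients")
spark.sql("SELECT name FROM clients WHERE id > 100").show()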

Next, user-defined functions were introduced to show that the functionality of Spark SQL can be extended by creating new functions to suit your needs, registering them as UDFs, and then calling them in SQL to process data. This lays the foundation for most of the subsequent chapters, since the DataFrame and Dataset APIs are the way to go in Apache Spark and RDDs are used only as a fallback.
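
To make that concrete, here is a small sketch of registering a Scala function as a UDF and calling it from SQL; the function name toUpperUdf and the clients view are illustrative assumptions, not names used in the chapter.

// Plain Scala function that will back the UDF
val toUpper = (s: String) => if (s == null) null else s.toUpperCase

// Register it under a name so it can be called from SQL
spark.udf.register("toUpperUdf", toUpper)

// Use the UDF inside a Spark SQL query against the temporary view
spark.sql("SELECT toUpperUdf(name) AS name_upper FROM clients").show()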

In the coming chapters, we'll discover why these new APIs are much faster than RDDs by taking a look at some of the internals of Apache Spark SQL, in order to understand why it provides such dramatic performance improvements over the RDD API. This knowledge is important for writing efficient SQL queries and data transformations on top of the DataFrame or Dataset relational APIs. It is therefore of utmost importance that we take a look at the Apache Spark optimizer called Catalyst, which takes your high-level program and transforms it into efficient calls on top of the RDD API, and, in later chapters, at Tungsten, which is integral to the study of Apache Spark.
