
What's new in Apache Spark V2?

A lot has changed in Apache Spark V2. This doesn't mean that the API has been broken. On the contrary, most Apache Spark V1.6 applications will run on Apache Spark V2 with few or no changes. Under the hood, however, a great deal is different.

The first and most interesting thing to mention is the new functionality of the Catalyst Optimizer, which we will cover in detail in Chapter 3, The Catalyst Optimizer. Catalyst creates a Logical Execution Plan (LEP) from a SQL query and optimizes this LEP to create multiple Physical Execution Plans (PEPs). Based on statistics, Catalyst chooses the best PEP to execute. This is very similar to cost-based optimizers in Relational Database Management Systems (RDBMSs). Catalyst makes heavy use of Project Tungsten, a component that we will cover in Chapter 4, Project Tungsten.
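The idea of choosing among several physical plans by estimated cost can be sketched in a few lines of plain Java. This is a toy illustration, not Catalyst's API: the `PhysicalPlan` and `ToyPlanner` classes, the `WORKERS` constant, and both cost formulas are hypothetical and chosen only to make the selection step visible.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a physical execution plan; not a Spark class.
final class PhysicalPlan {
    final String name;
    final double estimatedCost;

    PhysicalPlan(String name, double estimatedCost) {
        this.name = name;
        this.estimatedCost = estimatedCost;
    }
}

// Hypothetical planner that enumerates candidate plans for one join
// and picks the cheapest one according to made-up cost formulas.
final class ToyPlanner {
    // Assumed cluster size, used only by the made-up broadcast cost.
    static final int WORKERS = 100;

    private static double log2(long rows) {
        return Math.log(Math.max(1L, rows)) / Math.log(2.0);
    }

    // The "statistics" here are simply the row counts of the two join inputs.
    static List<PhysicalPlan> candidates(long leftRows, long rightRows) {
        long smaller = Math.min(leftRows, rightRows);
        long larger = Math.max(leftRows, rightRows);
        // Broadcast join: scan the larger side once, copy the smaller side to every worker.
        double broadcastCost = larger + (double) smaller * WORKERS;
        // Sort-merge join: sort both sides, modeled as n * log2(n) each.
        double sortMergeCost = leftRows * log2(leftRows) + rightRows * log2(rightRows);
        return Arrays.asList(
            new PhysicalPlan("broadcast-hash-join", broadcastCost),
            new PhysicalPlan("sort-merge-join", sortMergeCost));
    }

    // Cost-based selection: keep the plan with the lowest estimated cost.
    static PhysicalPlan cheapest(List<PhysicalPlan> plans) {
        return plans.stream()
            .min(Comparator.comparingDouble((PhysicalPlan p) -> p.estimatedCost))
            .orElseThrow(() -> new IllegalArgumentException("no candidate plans"));
    }
}
```

With these made-up formulas, `ToyPlanner.cheapest(ToyPlanner.candidates(1_000_000L, 100L))` selects the broadcast plan, while two inputs of one million rows each select the sort-merge plan. A real optimizer works from far richer statistics, but the selection step has the same shape.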

Although the Java Virtual Machine (JVM) is a masterpiece in its own right, it is a general-purpose bytecode execution engine. As a result, JVM object management and garbage collection (GC) carry considerable overhead. For example, storing a 4-byte string takes 48 bytes on the JVM. The GC optimizes based on estimated object lifetimes, but Apache Spark often knows these lifetimes better than the JVM does. Tungsten therefore bypasses the JVM GC for a subset of privately managed data structures, which also allows them to be laid out in an L1/L2/L3 cache-friendly way.
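To make the idea of privately managed, compact data structures concrete, here is a minimal sketch that stores strings as length-prefixed bytes in memory allocated outside the garbage-collected heap. It uses the standard `java.nio.ByteBuffer.allocateDirect` call as a stand-in; the `OffHeapStrings` class and its layout are hypothetical and are not Tungsten's actual implementation or row format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical toy store; not a Spark or Tungsten class.
final class OffHeapStrings {
    private final ByteBuffer buffer;

    OffHeapStrings(int capacityBytes) {
        // A direct buffer keeps its contents outside the garbage-collected Java heap.
        this.buffer = ByteBuffer.allocateDirect(capacityBytes);
    }

    // Appends a 4-byte length prefix followed by the UTF-8 bytes,
    // and returns the offset at which the entry starts.
    int append(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        int offset = buffer.position();
        buffer.putInt(bytes.length);
        buffer.put(bytes);
        return offset;
    }

    // Reads the entry that starts at the given offset back into a String.
    String read(int offset) {
        int length = buffer.getInt(offset);
        byte[] bytes = new byte[length];
        for (int i = 0; i < length; i++) {
            bytes[i] = buffer.get(offset + Integer.BYTES + i);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Total number of bytes written so far.
    int bytesUsed() {
        return buffer.position();
    }
}
```

In this layout, the string "abcd" occupies 8 bytes (a 4-byte length plus 4 bytes of data), and consecutive entries sit next to each other in memory. That contiguity is what makes sequential scans friendlier to CPU caches than chasing references between separately allocated objects.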

In addition, code generation removes the boxing of primitive types and polymorphic function dispatching. Finally, a new first-class citizen called Dataset unifies the RDD and DataFrame APIs. Datasets are statically typed, so many type errors are caught at compile time rather than at runtime. Because this relies on a compile-time type system, Datasets can be used only with Java and Scala. This means that Python and R users still have to stick to DataFrames, which are kept in Apache Spark V2 for backward compatibility reasons.
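The difference between statically typed and untyped access can be illustrated without Spark. The sketch below is an analogy in plain Java, not the Dataset or DataFrame API: the `Person` and `TypedVsUntyped` classes are hypothetical, with a `Map<String, Object>` standing in for a generic row and `Person` standing in for a typed record.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical typed record; not a Spark class.
final class Person {
    final String name;
    final int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

final class TypedVsUntyped {
    // Untyped access: the column name and the value type are checked only at run time.
    static int ageFromRow(Map<String, Object> row) {
        return (Integer) row.get("age"); // throws ClassCastException if the value is not an Integer
    }

    // Typed access: a misspelled field or a wrong type is rejected by the compiler.
    static int ageFromPerson(Person person) {
        return person.age;
    }

    // Reports whether untyped access to the row fails with a type error at run time.
    static boolean failsAtRuntime(Map<String, Object> row) {
        try {
            ageFromRow(row);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    // Builds a one-column row for experimentation.
    static Map<String, Object> rowWithAge(Object age) {
        Map<String, Object> row = new HashMap<>();
        row.put("age", age);
        return row;
    }
}
```

Calling `TypedVsUntyped.failsAtRuntime(TypedVsUntyped.rowWithAge("forty-two"))` returns `true`: the mistake surfaces only when the code runs. The equivalent mistake with `Person`, such as passing a string where an `int` is expected, does not compile at all.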
