
Apache Spark fundamentals

This section covers the fundamentals of Apache Spark. Make sure you are familiar with the concepts presented here before moving on to the next chapters, which explore the available APIs.

As mentioned in the introduction to this chapter, the Spark engine processes data in distributed memory across the nodes of a cluster. The following diagram shows the logical structure of how a typical Spark job processes information:

Figure 1.1

Spark executes a job in the following way:

Figure 1.2

The Master controls how data is partitioned and takes advantage of data locality, while keeping track of all the distributed data computation on the Slave machines. If a Slave machine becomes unavailable, the data on that machine is reconstructed on one or more of the other available machines. In standalone mode, the Master is a single point of failure. The Cluster mode using different managers section of this chapter covers the possible running modes and explains fault tolerance in Spark.
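The Master's bookkeeping role described above can be pictured with a small toy model. The following is only a sketch in plain Python, not Spark code and not how Spark is implemented: the `ToyMaster` class, the `partition` helper, and the worker names are all hypothetical, and the round-robin placement is an assumption chosen for simplicity. It shows a master that records which worker holds which partition and, when a worker is lost, reassigns that worker's partitions to the remaining ones so the job result stays the same.

```python
def partition(records, num_partitions):
    """Split records into num_partitions chunks using round-robin placement."""
    parts = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        parts[i % num_partitions].append(record)
    return parts


class ToyMaster:
    """Hypothetical master: tracks which worker owns which partition."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.assignment = {}  # partition id -> worker name

    def assign(self, num_partitions):
        # Spread partitions over the workers in round-robin order.
        for pid in range(num_partitions):
            self.assignment[pid] = self.workers[pid % len(self.workers)]

    def handle_failure(self, failed_worker):
        # Drop the failed worker and move its partitions to the survivors.
        self.workers.remove(failed_worker)
        if not self.workers:
            raise RuntimeError("no workers left to take over the partitions")
        lost = sorted(p for p, w in self.assignment.items() if w == failed_worker)
        for i, pid in enumerate(lost):
            self.assignment[pid] = self.workers[i % len(self.workers)]
        return lost

    def run(self, parts, func):
        # Each owner computes a partial result; the master combines them.
        return sum(func(parts[pid]) for pid in sorted(self.assignment))


records = list(range(1, 11))
parts = partition(records, 4)

master = ToyMaster(["worker-1", "worker-2", "worker-3"])
master.assign(len(parts))
total_before = master.run(parts, sum)

# Simulate losing a worker: its partitions are handed to the other workers.
lost_partitions = master.handle_failure("worker-1")
total_after = master.run(parts, sum)

print(total_before, total_after, lost_partitions)
```

In this toy, the lost partitions are simply recomputed from the original `records` list on whichever worker takes them over. The point is only that the job's result does not depend on which machine processes a given partition.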

Spark comes with five major components:

Figure 1.3

These components are as follows:

  • The core engine.
  • Spark SQL: A module for structured data processing.
  • Spark Streaming: An extension of the core Spark API that allows live data stream processing. Its strengths include scalability, high throughput, and fault tolerance.
  • MLlib: The Spark machine learning library.
  • GraphX: A library for graphs and graph-parallel computation.

Spark can access data stored in different systems, such as HDFS, Cassandra, MongoDB, and relational databases, as well as cloud storage services such as Amazon S3 and Azure Data Lake Storage.
