
Apache Spark fundamentals

This section covers the fundamentals of Apache Spark. It is important to become familiar with the concepts presented here before moving on to the next chapters, where we explore the available APIs.

As mentioned in the introduction to this chapter, the Spark engine processes data in distributed memory across the nodes of a cluster. The following diagram shows the logical structure of how a typical Spark job processes information:

Figure 1.1

Spark executes a job in the following way:

Figure 1.2
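As a concrete illustration of the flow shown in the preceding figures, the following minimal Scala sketch expresses a small job. The object name and input path are hypothetical; the point is that the transformations only describe the computation, and the final action is what triggers the distributed, in-memory execution across the cluster:

import org.apache.spark.sql.SparkSession

// Minimal sketch of a Spark job (the input path is a placeholder).
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCountSketch")
      .getOrCreate()

    // Transformations are lazy: they only build the execution plan.
    val counts = spark.sparkContext
      .textFile("hdfs:///data/input.txt")   // hypothetical path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // The action triggers the actual distributed computation.
    counts.take(10).foreach(println)

    spark.stop()
  }
}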

The Master controls how data is partitioned and takes advantage of data locality, while keeping track of all the distributed data computation on the Slave machines. If a Slave machine becomes unavailable, the data on that machine is reconstructed on other available machines. In standalone mode, the Master is a single point of failure. This chapter's Cluster mode using different managers section covers the possible running modes and explains fault tolerance in Spark.
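For instance, the following sketch connects to a standalone Master (the host and port are assumptions) and inspects how a dataset is split into partitions that are distributed across the Slave machines:

import org.apache.spark.sql.SparkSession

// Sketch: connect to a standalone Master (hypothetical host/port) and
// inspect partitioning. In standalone mode this Master is the single
// point of failure discussed above.
val spark = SparkSession.builder()
  .appName("PartitioningSketch")
  .master("spark://master-host:7077")   // hypothetical standalone Master URL
  .getOrCreate()

// Distribute a local collection across the cluster in 8 partitions.
val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 8)

println(s"Number of partitions: ${rdd.getNumPartitions}")   // 8
println(s"Sum computed across the cluster: ${rdd.sum()}")

spark.stop()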

Spark comes with five major components:

Figure 1.3

These components are as follows:

  • Spark Core: The core engine, on which all the other components are built.
  • Spark SQL: A module for structured data processing (see the sketch after this list).
  • Spark Streaming: This extends the core Spark API to allow scalable, high-throughput, fault-tolerant processing of live data streams.
  • MLlib: Spark's machine learning library.
  • GraphX: Spark's API for graphs and graph-parallel computation.
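To give a feel for how these modules layer on top of the core engine, here is a small sketch of the Spark SQL module. The table name, column names, and values are made up for illustration; the same data can be queried with SQL or with the DataFrame API:

import org.apache.spark.sql.SparkSession

// Sketch: Spark SQL on top of the core engine (sample data is made up).
val spark = SparkSession.builder()
  .appName("SparkSqlSketch")
  .getOrCreate()

import spark.implicits._

// A small DataFrame built from an in-memory collection.
val people = Seq(("Alice", 34), ("Bob", 45), ("Carol", 29))
  .toDF("name", "age")

// Query it with SQL...
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

// ...or with the equivalent DataFrame API.
people.filter($"age" > 30).select("name").show()

spark.stop()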

Spark can access data stored in a variety of systems, such as HDFS, Cassandra, MongoDB, and relational databases, as well as cloud storage services such as Amazon S3 and Azure Data Lake Storage.
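The access pattern is the same regardless of where the data lives; only the format and location change. The following sketch illustrates this for a few of these systems. All paths, hosts, and credentials are placeholders, and the relevant connector packages (for example, hadoop-aws or a JDBC driver) must be on the classpath:

import org.apache.spark.sql.SparkSession

// Sketch: reading from different storage systems (all locations are placeholders).
val spark = SparkSession.builder()
  .appName("DataSourcesSketch")
  .getOrCreate()

// HDFS (or any Hadoop-compatible file system).
val hdfsDf = spark.read.parquet("hdfs:///data/events.parquet")

// Amazon S3 (requires the hadoop-aws package and configured credentials).
val s3Df = spark.read.json("s3a://my-bucket/logs/")

// A relational database over JDBC (driver JAR must be on the classpath).
val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "public.orders")
  .option("user", "spark")
  .option("password", "secret")
  .load()

hdfsDf.printSchema()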
