
Apache Spark fundamentals

This section covers the Apache Spark fundamentals. Make sure you are familiar with the concepts presented here before moving on to the next chapters, where we'll explore the available APIs.

As mentioned in the introduction to this chapter, the Spark engine processes data in distributed memory across the nodes of a cluster. The following diagram shows the logical structure of how a typical Spark job processes information:

Figure 1.1

Spark executes a job in the following way:

Figure 1.2

The Master controls how data is partitioned and takes advantage of data locality, while keeping track of all the distributed data computation on the Slave machines. If a Slave machine becomes unavailable, the data on that machine is reconstructed on one or more of the other available machines. In standalone mode, the Master is a single point of failure. The Cluster mode using different managers section of this chapter covers the possible running modes and explains fault tolerance in Spark.
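The bookkeeping described above can be illustrated with a small, plain Python sketch. This is a toy model, not Spark code: the names `split_into_partitions`, `ToyMaster`, `assign`, and `handle_failure` are hypothetical and do not exist in any Spark API, and the round-robin placement is an assumption made only to keep the example short. It shows a master that records which worker holds which partition and hands the lost partitions to the remaining workers when one of them fails.

```python
def split_into_partitions(records, num_partitions):
    """Split records into num_partitions lists, round-robin (toy partitioner)."""
    partitions = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions


class ToyMaster:
    """Hypothetical stand-in for the Master: tracks partition placement only."""

    def __init__(self, workers):
        self.workers = list(workers)
        self.assignment = {}  # partition id -> worker name

    def assign(self, num_partitions):
        """Place each partition on a worker, round-robin."""
        for pid in range(num_partitions):
            self.assignment[pid] = self.workers[pid % len(self.workers)]

    def handle_failure(self, failed_worker):
        """Reassign the partitions of a failed worker to the remaining workers.

        Returns the ids of the partitions that had to be rebuilt elsewhere.
        """
        self.workers.remove(failed_worker)
        if not self.workers:
            raise RuntimeError("no workers left to rebuild the lost partitions")
        lost = [pid for pid, w in self.assignment.items() if w == failed_worker]
        for index, pid in enumerate(lost):
            self.assignment[pid] = self.workers[index % len(self.workers)]
        return lost


# Usage: four partitions on two workers, then one worker becomes unavailable.
partitions = split_into_partitions(list(range(10)), 4)
master = ToyMaster(["worker-a", "worker-b"])
master.assign(len(partitions))
rebuilt = master.handle_failure("worker-a")
```

In this sketch only the placement table changes on failure. In a real cluster, the lost data itself also has to be reconstructed on the machine that takes over, as described above.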

Spark comes with five major components:

Figure 1.3

These components are as follows:

  • The core engine.
  • Spark SQL: A module for structured data processing.
  • Spark Streaming: An extension of the core Spark API that allows live data stream processing. Its strengths include scalability, high throughput, and fault tolerance.
  • MLlib: The Spark machine learning library.
  • GraphX: A library for graphs and graph-parallel computation algorithms.

Spark can access data stored in different systems, such as HDFS, Cassandra, MongoDB, and relational databases, as well as cloud storage services such as Amazon S3 and Azure Data Lake Storage.
