
Spark components

Before moving any further, let's first understand the common terminology associated with Spark:

  • Driver: This is the main program that oversees the end-to-end execution of a Spark job or program. It negotiates resources with the cluster's resource manager in order to delegate and orchestrate the program into the smallest possible data-local parallel units.
  • Executors: In any Spark job, there can be one or more executors, that is, processes that execute the smaller tasks delegated by the driver. The executors process the data, preferably local to the node, and store the results in memory, on disk, or both.
  • Master: Apache Spark is implemented as a master-slave architecture, and hence master refers to the cluster node executing the driver program.
  • Slave: In a distributed cluster mode, slave refers to the nodes on which the executors run, and hence there can be (and usually is) more than one slave in the cluster.
  • Job: This is a collection of operations performed on a set of data. A typical word count job reads a text file from an arbitrary source, splits the text into words, and then aggregates the word counts (a minimal sketch of such a job follows this list).
  • DAG: Any Spark job in the Spark engine is represented by a DAG of operations. The DAG represents the logical execution of the Spark operations in sequential order. Re-computation of an RDD in case of a failure is possible because its lineage can be derived from the DAG.
  • Tasks: A job can be split into smaller units, called tasks, that are operated upon in isolation. Each task is executed by an executor on a partition of the data.
  • Stages: Spark jobs can be divided logically into stages, where each stage represents a set of tasks with the same shuffle dependencies, that is, tasks bounded by the points where data shuffling occurs. In a shuffle map stage, the tasks' results are the input for the next stage, whereas in a result stage, the tasks compute the action that started the evaluation of the Spark job, such as take(), foreach(), or collect().
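The following is a minimal word count sketch, run in the Spark shell, that ties these terms together; the HDFS input path is only an illustrative placeholder. Reading and mapping are narrow transformations, reduceByKey introduces a shuffle boundary (splitting the DAG into a shuffle map stage and a result stage), and collect() is the action that launches the job, with one task per partition in each stage:

// A minimal sketch for the Spark shell; sc is the SparkContext provided
// by the REPL, and the input path is a placeholder.
val lines = sc.textFile("hdfs:///tmp/input.txt")

// Narrow transformations: each partition is processed independently.
val words = lines.flatMap(line => line.split("\\s+"))
val pairs = words.map(word => (word, 1))

// reduceByKey is a wide transformation, so it introduces a shuffle
// boundary: the DAG is split into a shuffle map stage and a result stage.
val counts = pairs.reduceByKey(_ + _)

// collect() is the action that triggers the job; one task runs per
// partition in each stage.
counts.collect().foreach(println)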

The following diagram shows a logical representation of how the different components of a Spark application interact:

This is how a Spark job gets executed:

  • A Spark job can comprise a series of operations that are performed on a set of data. However big or small a Spark job may be, it requires a SparkContext to execute it. In the previous examples of working with the REPL, you would have noticed the use of a variable called sc, which is how a SparkContext is made accessible in a REPL environment.
  • The SparkContext creates an operator graph of the different transformations of the job, but once an action is called on such a transformation, the graph is submitted to the DAGScheduler. Depending on whether the resulting RDDs are produced by narrow transformations or wide transformations (those that require a shuffle operation), the DAGScheduler produces stages.
  • The DAGScheduler splits the DAG in such a way that each stage comprises tasks with the same shuffle dependency, bounded by common shuffle boundaries. A stage can either be a shuffle map stage, in which case its tasks' results are the input for another stage, or a result stage, in which case its tasks directly compute the action that initiated the job, for example, count().
  • Stages are then submitted to the TaskScheduler as TaskSets by the DAGScheduler. The TaskScheduler schedules the TaskSets via the cluster manager (YARN, Mesos, or Spark standalone) and monitors their execution. If any task fails, it is rerun, and finally the results are sent to the DAGScheduler. If the shuffle output files of a stage are lost, the DAGScheduler resubmits that stage to the TaskScheduler to be rerun.
  • Tasks are then scheduled on the designated executors (JVMs running on slave nodes), subject to resource and data locality constraints. Each executor can also have more than one task assigned to it (a standalone sketch of this flow follows this list).
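As a rough sketch of these steps outside the REPL, the standalone application below creates its own SparkContext and triggers the scheduling flow described above; the application name, master URL, and paths are illustrative assumptions rather than required values. Calling toDebugString prints the RDD lineage, which makes the shuffle boundary at which the DAGScheduler splits the job into stages visible:

// A minimal standalone sketch; app name, master URL, and paths are
// illustrative placeholders.
import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("word-count")
      .setMaster("local[2]")          // or a YARN/Mesos/standalone master URL

    val sc = new SparkContext(conf)   // the driver negotiates resources here

    val counts = sc.textFile("hdfs:///tmp/input.txt")
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)

    // Print the RDD lineage; the shuffle introduced by reduceByKey marks
    // the boundary where the DAGScheduler splits the job into stages.
    println(counts.toDebugString)

    // The action submits the DAG; stages become TaskSets handed to the
    // TaskScheduler, which runs the tasks on executors.
    counts.saveAsTextFile("hdfs:///tmp/output")

    sc.stop()
  }
}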

The following diagram provides a logical representation of the different phases of Spark job execution:

In this section, we became familiar with the different components of a Spark job. In the next section, we will learn about the capabilities of the Spark driver's UI.
