
What is an RDD?

An RDD (Resilient Distributed Dataset) is at the heart of every Spark application. Let's understand the meaning of each word in the name in more detail:

  • Resilient: The dictionary defines resilient as being able to recover quickly from difficult conditions. A Spark RDD has the ability to recreate itself if something goes wrong. You may be wondering why it needs to recreate itself. Remember how HDFS and other data stores achieve fault tolerance? These systems maintain replicas of the data on multiple machines so they can recover in case of failure. But, as discussed in Chapter 1, Introduction to Apache Spark, Spark is not a data store; it is an execution engine. It reads data from source systems, transforms it, and loads it into the target system. If something goes wrong during any of these steps, we lose the data. To provide fault tolerance during processing, an RDD is made resilient: it can recompute itself. Each RDD maintains information about its parent RDD and the operation by which it was created from that parent; this information is known as lineage. Lineage only works if your data is immutable. What do I mean by that? If you lose the current state of an object and you are sure its previous state will never change, then you can always go back to that past state, reapply the same operations, and recover the current state of the object. This is exactly what happens in the case of RDDs. If you find this difficult, don't worry! It will become clear when we look at how RDDs are created, and the sketch after this list shows lineage in action.
    Immutability also brings another advantage: optimization. If you know something will not change, you always have the opportunity to optimize it. If you pay close attention, you will see that all of these concepts are connected, as the following diagram illustrates:
  [Figure: RDD — how resilience, lineage, and immutability connect]
  • Distributed: As the next bullet point explains, a dataset is nothing but a collection of objects. An RDD distributes its dataset across a set of machines, and each of these machines is responsible for processing its partition of the data. If you come from a Hadoop MapReduce background, you can think of partitions as the input splits of the map phase.
  • Dataset: A dataset is just a collection of objects. These objects can be complex Scala, Java, or Python objects; numbers; strings; rows of a database; and more.
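To make lineage concrete, here is a minimal sketch using Spark's Scala API (the application name, master URL, and transformations are illustrative; in spark-shell, sc is already provided). Each transformation produces a new, immutable RDD that remembers its parent, and toDebugString prints that chain:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative local setup; in spark-shell, `sc` already exists.
val conf = new SparkConf().setAppName("lineage-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

val numbers = sc.parallelize(1 to 18)     // parent RDD
val doubled = numbers.map(_ * 2)          // remembers: map over numbers
val evens   = doubled.filter(_ % 4 == 0)  // remembers: filter over doubled

// toDebugString prints the lineage: each RDD and the parent it derives from.
// If a partition of `evens` is lost, Spark replays these steps to rebuild it.
println(evens.toDebugString)
```

Because numbers, doubled, and evens are immutable, replaying the recorded operations is guaranteed to reproduce exactly the data that was lost.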

Every Spark program boils down to an RDD. A Spark program written with Spark SQL, the DataFrame API, or the Dataset API is converted into RDD operations at execution time.
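One way to see this from the API (a small sketch; the session setup and sample data are made up for illustration) is to build a DataFrame and ask for the RDD that backs it:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("df-to-rdd-demo")  // illustrative name
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// A DataFrame built from a small local collection (made-up sample data).
val df = Seq(("alice", 34), ("bob", 45)).toDF("name", "age")

// The DataFrame is ultimately executed as RDD operations; `.rdd` exposes
// the underlying RDD[Row].
val rows = df.rdd
rows.collect().foreach(println)
```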

The following diagram illustrates an RDD of numbers (1 to 18) with nine partitions on a cluster of three nodes:

[Figure: An RDD of numbers 1 to 18 split into nine partitions across three nodes]
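The same layout can be reproduced in code (a small sketch, reusing the sc from the earlier lineage example or a spark-shell session). glom() gathers each partition's elements into an array so we can see how the 18 numbers were split; on a real cluster, these nine partitions would be spread across the three worker nodes:

```scala
// The RDD from the diagram: numbers 1 to 18 in nine partitions.
val rdd = sc.parallelize(1 to 18, numSlices = 9)

println(rdd.getNumPartitions)  // 9

// Print the contents of each partition (two numbers per partition here).
rdd.glom().collect().foreach(part => println(part.mkString(", ")))
```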