
RDD - the first citizen of Spark

The first paper on RDDs, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", described them as follows:

Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

Spark follows the functional programming paradigm, and one of the key concepts of functional programming is the immutable object. A Resilient Distributed Dataset is likewise an immutable dataset.

Formally, we can define an RDD as an immutable distributed collection of objects. It is the primary data type of Spark. It leverages cluster memory and is partitioned across the cluster.
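The definition above can be illustrated with a small toy model in plain Python. This is only a sketch of the two properties just named (immutability and partitioning), not Spark's API: the `ToyRDD` class and its `parallelize`, `map`, and `collect` methods are hypothetical names chosen to echo familiar RDD operations, and everything runs in a single process rather than on a cluster.

```python
class ToyRDD:
    """Toy stand-in for an RDD: an immutable collection split into partitions.

    Hypothetical class for illustration only; it is not part of Spark.
    """

    def __init__(self, partitions):
        # Store partitions as tuples of tuples so they cannot be changed in place.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, data, num_partitions):
        # Split the data into contiguous slices, one slice per partition.
        items = list(data)
        bounds = [len(items) * i // num_partitions for i in range(num_partitions + 1)]
        return cls(items[bounds[i]:bounds[i + 1]] for i in range(num_partitions))

    @property
    def partitions(self):
        return self._partitions

    def map(self, fn):
        # A transformation never modifies this object; it builds a new ToyRDD.
        return ToyRDD([fn(x) for x in part] for part in self._partitions)

    def collect(self):
        # Gather the elements of every partition into one list.
        return [x for part in self._partitions for x in part]


numbers = ToyRDD.parallelize([1, 2, 3, 4, 5, 6], num_partitions=3)
doubled = numbers.map(lambda x: x * 2)

print(numbers.partitions)  # ((1, 2), (3, 4), (5, 6))
print(doubled.collect())   # [2, 4, 6, 8, 10, 12]
print(numbers.collect())   # [1, 2, 3, 4, 5, 6], the original is unchanged
```

Because `map` returns a new object instead of changing the existing one, the original dataset stays available and unchanged, which is the sense in which an RDD is immutable.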

The following is the logical representation of an RDD:

RDDs can also consist of (key, value) pairs. The following is the logical representation of a pair RDD:

As mentioned, an RDD can be partitioned across the cluster. The following is the logical representation of a partitioned RDD in a cluster:
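To make the idea of partitioning (key, value) pairs more concrete, here is a minimal plain-Python sketch that assigns each pair to a partition by hashing its key. It is an assumption-laden illustration rather than Spark code: the helper names `partition_for` and `partition_pairs` are hypothetical, the sample data is made up, and `zlib.crc32` is used only because it gives the same result on every run.

```python
import zlib


def partition_for(key, num_partitions):
    # Deterministic hash of the key, reduced to a partition index.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_pairs(pairs, num_partitions):
    # Place each (key, value) pair in the partition chosen by its key.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


# Hypothetical sample data for illustration.
pairs = [("apple", 1), ("banana", 2), ("apple", 3), ("cherry", 4), ("banana", 5)]
partitions = partition_pairs(pairs, num_partitions=3)

for index, part in enumerate(partitions):
    print(index, part)
```

With this scheme, all pairs that share a key land in the same partition, so work that combines values per key can proceed within a partition.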
