RDD - the first citizen of Spark

The very first paper on RDDs, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", described them as follows:

"Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner."

Spark follows the functional programming paradigm, in which immutable objects are a key concept. Accordingly, a Resilient Distributed Dataset is also an immutable dataset.

Formally, we can define an RDD as an immutable distributed collection of objects. It is the primary data type of Spark. It leverages cluster memory and is partitioned across the cluster.
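To make the definition concrete, here is a minimal sketch in plain Python. It is not Spark's API: the class `ToyDataset` and its methods are hypothetical names used only to illustrate the two properties stated above, namely that the data is split into partitions and that a transformation produces a new dataset instead of modifying the existing one.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD: immutable and split into partitions."""

    def __init__(self, partitions):
        # Store partitions as tuples of tuples so they cannot be changed in place.
        self._partitions = tuple(tuple(p) for p in partitions)

    @property
    def partitions(self):
        return self._partitions

    def map(self, fn):
        # Apply fn to every element, partition by partition, and return
        # a NEW dataset; the original dataset is left untouched.
        return ToyDataset(tuple(fn(x) for x in p) for p in self._partitions)

    def collect(self):
        # Flatten all partitions into a single list, in partition order.
        return [x for p in self._partitions for x in p]


numbers = ToyDataset([[1, 2], [3, 4], [5]])
doubled = numbers.map(lambda x: x * 2)

print(numbers.collect())  # the original elements, unchanged
print(doubled.collect())  # a separate dataset holding the doubled values
```

In a real cluster each partition would live in the memory of a different worker node; here they are simply tuples inside one process.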

The following is the logical representation of an RDD:

RDDs can consist of (key, value) pairs as well. The following is the logical representation of a pair RDD:
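As a rough illustration of key-value data, the sketch below models each element as a `(key, value)` tuple in plain Python. The helper `reduce_by_key` is a hypothetical function written for this example; it only mimics the idea of a per-key reduction (Spark offers an operation of that kind, `reduceByKey`) and is not Spark code.

```python
# Each element of the dataset is a (key, value) tuple.
pairs = (("spark", 1), ("rdd", 1), ("spark", 1))


def reduce_by_key(records, combine):
    """Merge the values of records that share a key (hypothetical helper)."""
    merged = {}
    for key, value in records:
        if key in merged:
            merged[key] = combine(merged[key], value)
        else:
            merged[key] = value
    # Return a new immutable, key-sorted result; the input is not modified.
    return tuple(sorted(merged.items()))


counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)
```

The input tuple is never modified; the reduction yields a new collection, in keeping with the immutability described earlier.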

Also, as mentioned, an RDD can be partitioned across the cluster. The following is the logical representation of a partitioned RDD in a cluster:
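One common way to decide which partition a record belongs to is to hash its key. The sketch below shows that idea in plain Python under stated assumptions: `partition_for` and `partition_pairs` are hypothetical helpers, and `zlib.crc32` is used only because it gives a stable hash across runs. It is not how Spark computes partitions internally.

```python
import zlib


def partition_for(key, num_partitions):
    # Stable hash of the key, reduced to a partition index in [0, num_partitions).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_pairs(records, num_partitions):
    """Distribute (key, value) records over num_partitions buckets by key hash."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    # Freeze the result so each partition is immutable.
    return tuple(tuple(p) for p in partitions)


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
for index, partition in enumerate(partition_pairs(records, 3)):
    # In a cluster, each partition would be held by a different worker node.
    print(index, partition)
```

Because the partition index depends only on the key, all records with the same key end up in the same partition, which is what makes per-key operations possible without scanning every partition.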
