官术网_书友最值得收藏!

RDD - the first citizen of Spark

The very first paper on RDD Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing described it as follows:

Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. As Spark is written in a functional programming paradigm, one of the key concepts of functional programming is immutable objects. Resilient Distributed Dataset is also an immutable dataset.

Formally, we can define an RDD as an immutable distributed collection of objects. It is the primary data type of Spark. It leverages cluster memory and is partitioned across the cluster.

The following is the logical representation of RDD:

RDDs can consist of (key, value) pairs as well. The following is the logical representation of pair of RDDs:

Also, as mentioned, RDD can be partitioned across the cluster. So the following is the logical representation of partitioned RDDs in a cluster:

主站蜘蛛池模板: 芦溪县| 揭阳市| 新昌县| 新昌县| 虹口区| 哈尔滨市| 高邑县| 通河县| 平昌县| 湖南省| 繁昌县| 庆安县| 朝阳区| 安塞县| 平远县| 阜阳市| 承德市| 宁强县| 龙泉市| 读书| 漳平市| 凌海市| 湾仔区| 榆林市| 邢台市| 海南省| 康保县| 福鼎市| 贺州市| 合阳县| 安陆市| 东丰县| 益阳市| 甘泉县| 杭锦旗| 常熟市| 普宁市| 漳平市| 新巴尔虎左旗| 白水县| 琼海市|