- Apache Spark 2.x for Java Developers
- Sourav Gulati, Sumit Kumar
- 2021-07-02 19:01:53
RDD - the first citizen of Spark
The very first paper on RDDs, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", described them as follows:
"Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner."
Spark follows the functional programming paradigm, and one of the key concepts of functional programming is the immutable object. A Resilient Distributed Dataset is likewise an immutable dataset: a transformation never changes an existing RDD, it produces a new one.
Formally, we can define an RDD as an immutable, distributed collection of objects. It is the primary data abstraction of Spark. Its data is partitioned across the nodes of the cluster and held in cluster memory.
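To make the definition concrete, here is a minimal sketch in plain Java that models an immutable, partitioned collection. `ToyRdd`, `parallelize`, `map`, `partitions`, and `collect` are hypothetical names defined only for this illustration; the class mimics the idea and is not Spark's API. In Spark itself the corresponding entry points would be `JavaSparkContext.parallelize` and `JavaRDD.map`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Toy stand-in for an RDD (an assumption for illustration, NOT Spark's API):
// an immutable list of partitions, each partition being an immutable list.
final class ToyRdd<T> {
    private final List<List<T>> partitions;

    private ToyRdd(List<List<T>> partitions) {
        this.partitions = partitions;
    }

    // Split the input into numPartitions contiguous chunks of near-equal size.
    static <T> ToyRdd<T> parallelize(List<T> data, int numPartitions) {
        List<List<T>> parts = new ArrayList<>();
        int n = data.size();
        for (int p = 0; p < numPartitions; p++) {
            int from = (int) ((long) p * n / numPartitions);
            int to = (int) ((long) (p + 1) * n / numPartitions);
            parts.add(Collections.unmodifiableList(new ArrayList<>(data.subList(from, to))));
        }
        return new ToyRdd<>(Collections.unmodifiableList(parts));
    }

    // A transformation never mutates this object; it builds a new ToyRdd
    // with the same number of partitions.
    <R> ToyRdd<R> map(Function<? super T, ? extends R> f) {
        List<List<R>> mapped = new ArrayList<>();
        for (List<T> part : partitions) {
            List<R> out = new ArrayList<>();
            for (T element : part) {
                out.add(f.apply(element));
            }
            mapped.add(Collections.unmodifiableList(out));
        }
        return new ToyRdd<>(Collections.unmodifiableList(mapped));
    }

    // Read-only view of the partitions.
    List<List<T>> partitions() {
        return partitions;
    }

    // Gather all elements, in partition order, into a single list.
    List<T> collect() {
        List<T> all = new ArrayList<>();
        for (List<T> part : partitions) {
            all.addAll(part);
        }
        return all;
    }

    public static void main(String[] args) {
        ToyRdd<Integer> numbers = ToyRdd.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 3);
        ToyRdd<Integer> doubled = numbers.map(x -> x * 2);

        System.out.println(numbers.partitions()); // [[1, 2], [3, 4], [5, 6]]
        System.out.println(doubled.partitions()); // [[2, 4], [6, 8], [10, 12]]
    }
}
```

Note that `numbers` is unchanged after `map` runs. In this sketch every partition lives in one JVM; in a real cluster each partition would sit on a different node.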
The following is the logical representation of an RDD:

RDDs can consist of (key, value) pairs as well. The following is the logical representation of a pair RDD:

Also, as mentioned, an RDD can be partitioned across the cluster. The following is the logical representation of a partitioned RDD in a cluster:

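As a rough illustration of how (key, value) pairs might be spread over partitions, the sketch below assigns each pair to a partition using the non-negative remainder of the key's hash code. `ToyPairPartitioner`, `pair`, `partitionFor`, and `partitionByKey` are hypothetical names introduced for this example only. Hash-based placement is one common scheme, assumed here for illustration; the text above does not say which scheme Spark applies in any given case.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Toy model of a partitioned pair dataset (an assumption for illustration,
// NOT Spark's API).
final class ToyPairPartitioner {

    // Convenience factory for an immutable (key, value) pair.
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    // Map a key to a partition index in [0, numPartitions).
    // floorMod keeps the result non-negative even for negative hash codes.
    static int partitionFor(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Place every pair into the partition chosen by its key, so that
    // all pairs sharing a key end up in the same partition.
    static <K, V> List<List<Map.Entry<K, V>>> partitionByKey(
            List<Map.Entry<K, V>> pairs, int numPartitions) {
        List<List<Map.Entry<K, V>>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (Map.Entry<K, V> p : pairs) {
            parts.get(partitionFor(p.getKey(), numPartitions)).add(p);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
                pair("a", 1), pair("b", 2), pair("a", 3), pair("c", 4));

        // "a", "b", "c" have hash codes 97, 98, 99, so with two partitions
        // "b" goes to partition 0 while "a" and "c" go to partition 1.
        System.out.println(partitionByKey(pairs, 2)); // [[b=2], [a=1, a=3, c=4]]
    }
}
```

Keeping all values for a key in one partition is what makes per-key operations cheap to run locally on a node.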