- Apache Spark 2.x for Java Developers
- Sourav Gulati Sumit Kumar
- 2021-07-02 19:01:53
RDD - the first citizen of Spark
The very first paper on RDDs, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", described them as follows:
"Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner."
Spark is written in a functional programming style, and one of the key concepts of functional programming is the immutable object. An RDD is likewise an immutable dataset: once created, its contents never change.
Formally, we can define an RDD as an immutable, distributed collection of objects. It is the primary data abstraction of Spark. Its data is split into partitions that are spread across the cluster and held in the memory of the cluster's nodes.
The following is the logical representation of an RDD:
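To make the idea of immutability concrete, here is a minimal sketch in plain Java. It is not Spark's API: `ImmutableDataset` is a hypothetical class invented for this illustration, and it runs on a single machine. It only shows the property described above, namely that a transformation such as `map` leaves the original collection untouched and returns a new one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Toy stand-in for an RDD (hypothetical, NOT Spark's API): models immutability only.
final class ImmutableDataset<T> {
    private final List<T> elements;

    ImmutableDataset(List<T> elements) {
        // Defensive copy, so later changes to the caller's list cannot leak in.
        this.elements = Collections.unmodifiableList(new ArrayList<T>(elements));
    }

    // A transformation never edits this dataset; it builds and returns a new one.
    <R> ImmutableDataset<R> map(Function<? super T, ? extends R> f) {
        List<R> out = new ArrayList<R>(elements.size());
        for (T element : elements) {
            out.add(f.apply(element));
        }
        return new ImmutableDataset<R>(out);
    }

    // Returns a read-only view of the elements.
    List<T> collect() {
        return elements;
    }

    public static void main(String[] args) {
        ImmutableDataset<Integer> numbers = new ImmutableDataset<Integer>(Arrays.asList(1, 2, 3));
        ImmutableDataset<Integer> doubled = numbers.map(x -> x * 2);
        System.out.println(numbers.collect()); // [1, 2, 3] (unchanged)
        System.out.println(doubled.collect()); // [2, 4, 6]
    }
}
```

In Spark itself, the corresponding calls would be along the lines of `JavaSparkContext.parallelize(...)` to create a `JavaRDD` and `JavaRDD.map(...)` to derive a new one.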

RDDs can consist of (key, value) pairs as well. The following is the logical representation of a pair RDD:

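A dataset of (key, value) pairs allows operations that work per key, such as combining all values that share a key. The following is a rough sketch of that idea in plain Java, under the assumption that a pair can be modelled with `Map.Entry`; the class `PairDatasetSketch` and its helpers `pair` and `reduceByKey` are hypothetical names chosen for this illustration and are not part of Spark.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Toy model of a (key, value) dataset (hypothetical, NOT Spark's API).
final class PairDatasetSketch {

    // Builds one immutable (key, value) pair.
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleImmutableEntry<K, V>(key, value);
    }

    // Combines all values that share a key, keeping keys in first-seen order.
    static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs, BinaryOperator<V> combine) {
        Map<K, V> result = new LinkedHashMap<K, V>();
        for (Map.Entry<K, V> p : pairs) {
            result.merge(p.getKey(), p.getValue(), combine);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs =
                Arrays.asList(pair("a", 1), pair("b", 2), pair("a", 3));
        System.out.println(reduceByKey(pairs, Integer::sum)); // {a=4, b=2}
    }
}
```

In Spark's Java API, pairs are represented as `scala.Tuple2` objects inside a `JavaPairRDD`, which offers a `reduceByKey` operation of its own.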
As mentioned, an RDD can be partitioned across the cluster. The following is the logical representation of a partitioned RDD in a cluster:

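Partitioning can also be sketched in a few lines of plain Java. The example below assumes one possible scheme, hash partitioning, in which an element goes to the partition numbered by its hash code modulo the number of partitions. The class `PartitioningSketch` and the method `hashPartition` are hypothetical names for this illustration, and the "partitions" here are just in-memory lists on one machine rather than blocks of data on different nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of splitting a dataset into partitions (hypothetical, NOT Spark's API).
final class PartitioningSketch {

    // Assigns each element to partition floorMod(hashCode, numPartitions).
    static <T> List<List<T>> hashPartition(List<T> elements, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        List<List<T>> partitions = new ArrayList<List<T>>(numPartitions);
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<T>());
        }
        for (T element : elements) {
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(element.hashCode(), numPartitions);
            partitions.get(index).add(element);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // An Integer's hash code is its own value, so the layout is easy to follow.
        System.out.println(hashPartition(Arrays.asList(1, 2, 3, 4, 5, 6), 3));
        // [[3, 6], [1, 4], [2, 5]]
    }
}
```

In a real cluster, each of these partitions would live on some worker node, and work on different partitions could proceed in parallel.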