- Apache Spark 2.x for Java Developers
- Sourav Gulati Sumit Kumar
- 156 words
- 2021-07-02 19:01:53
RDD - the first citizen of Spark
The very first paper on RDDs, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", described them as follows:
"Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner."
Spark follows a functional programming paradigm, and one of the key concepts of functional programming is the immutable object. A Resilient Distributed Dataset is likewise an immutable dataset.
Formally, we can define an RDD as an immutable distributed collection of objects. It is the primary data type of Spark. It leverages cluster memory and is partitioned across the cluster.
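To make the idea of immutability concrete, here is a minimal sketch in plain Java. It does not use Spark: `ToyRdd`, `ImmutableRddSketch`, and the `map` and `collect` methods shown are hypothetical stand-ins written for this illustration, and they run on a single machine. The point is only that a transformation builds a new collection and leaves the original untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

class ImmutableRddSketch {

    // Hypothetical toy type, NOT Spark's RDD class: an immutable collection
    // whose transformations return new collections.
    static final class ToyRdd<T> {
        private final List<T> elements;

        ToyRdd(List<T> source) {
            // Defensive copy, so later changes to 'source' cannot leak in.
            this.elements = Collections.unmodifiableList(new ArrayList<>(source));
        }

        // Builds a new ToyRdd; 'this' is left untouched.
        <R> ToyRdd<R> map(Function<? super T, ? extends R> f) {
            List<R> out = new ArrayList<>(elements.size());
            for (T e : elements) {
                out.add(f.apply(e));
            }
            return new ToyRdd<>(out);
        }

        // Read-only view of the contents.
        List<T> collect() {
            return elements;
        }
    }

    public static void main(String[] args) {
        ToyRdd<Integer> numbers = new ToyRdd<>(Arrays.asList(1, 2, 3));
        ToyRdd<Integer> doubled = numbers.map(x -> x * 2);

        System.out.println(numbers.collect()); // [1, 2, 3]
        System.out.println(doubled.collect()); // [2, 4, 6]
    }
}
```

Trying to modify the list returned by `collect()` throws `UnsupportedOperationException`, so the only way to get different data is to derive a new `ToyRdd`.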
The following is the logical representation of an RDD:

RDDs can also consist of (key, value) pairs. The following is the logical representation of a pair RDD:
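One way to picture a dataset of (key, value) pairs is as a read-only list of entries. The sketch below is plain Java, not Spark's API: `PairRddSketch`, `pair`, and `sumByKey` are hypothetical names chosen for this illustration. It also shows the kind of per-key aggregation that key-value data makes possible.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PairRddSketch {

    // Build one immutable (key, value) pair.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    // Sum the values that share a key; the input list is not modified.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return Collections.unmodifiableMap(totals);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = Collections.unmodifiableList(
                Arrays.asList(pair("spark", 1), pair("java", 1), pair("spark", 1)));

        System.out.println(pairs);           // [spark=1, java=1, spark=1]
        System.out.println(sumByKey(pairs)); // {java=1, spark=2}
    }
}
```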

Also, as mentioned, an RDD can be partitioned across the cluster. The following is the logical representation of partitioned RDDs in a cluster:
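As a rough picture of partitioning, the sketch below splits a collection into a fixed number of buckets, where each bucket stands in for the slice of data held by one node. The text does not say how elements are assigned to partitions, so the hash-based placement used here is an assumption made for illustration, and `PartitionSketch`, `partitionFor`, and `partition` are hypothetical names rather than Spark APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PartitionSketch {

    // Map an element to a partition index in [0, numPartitions).
    // Math.floorMod keeps the result non-negative even for negative hash codes.
    static int partitionFor(Object element, int numPartitions) {
        return Math.floorMod(element.hashCode(), numPartitions);
    }

    // Distribute the elements over numPartitions buckets by hash.
    static <T> List<List<T>> partition(List<T> elements, int numPartitions) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (T e : elements) {
            partitions.get(partitionFor(e, numPartitions)).add(e);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // An Integer's hash code is its own value, so placement is easy to follow.
        List<List<Integer>> parts = partition(Arrays.asList(1, 2, 3, 4, 5, 6), 3);
        System.out.println(parts); // [[3, 6], [1, 4], [2, 5]]
    }
}
```

In a real cluster each bucket would live in the memory of a different machine; here they are simply separate lists in one process.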
