- Apache Spark 2.x for Java Developers
- Sourav Gulati Sumit Kumar
- 2021-07-02 19:01:54
Benefits of RDD
The following are some benefits that the Spark RDD model provides over the Hadoop MapReduce model:
- Iterative processing: One of the biggest issues with MapReduce processing is the I/O (input/output) involved. It significantly slows down MapReduce when you run iterative operations, where you chain MapReduce jobs to perform multiple aggregations.
Consider running a MapReduce job that reads data from HDFS, performs some aggregation, and writes the output back to HDFS. The mappers read the data from HDFS and, on completion, write their output to the local filesystem. The reducers then pull that data, run the reduce process on it, and write the output to HDFS (not considering the spill mechanism of the mapper and reducer).
Now, let's say you want to perform another aggregation on the output data. You will execute another MapReduce job on that output, which will go through a similar I/O process. The following is the logical representation of how iterative operations run in MapReduce:

On the other hand, Spark will not perform such I/O in most cases for the job previously described. Data will be read from HDFS once, and Spark will then perform in-memory transformations on the RDD for every iteration. The output of every step (that is, another RDD) will be stored in the distributed cluster memory. The following is the logical representation of the same job in Spark:

Now, here is a catch. What if the intermediate results are larger than the distributed memory? In that case, Spark can spill the data that does not fit in memory to disk, depending on how the RDD is persisted.
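The difference in storage access between the two approaches can be sketched with a small toy model. This is plain Java, not Hadoop or Spark code: `CountingStorage`, `runChainedJobs`, `runInMemory`, and `exampleSteps` are hypothetical names introduced only for illustration, and the "storage" is just an in-memory list that counts how often it is read and written. It also assumes at least one step is supplied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Toy model only: plain Java, no Hadoop or Spark involved.
// All class and method names here are hypothetical.
class IterativeIoModel {

    // Stand-in for stable storage such as HDFS; it only counts accesses.
    static final class CountingStorage {
        private List<Integer> data;
        int reads = 0;
        int writes = 0;

        CountingStorage(List<Integer> initial) {
            this.data = new ArrayList<>(initial);
        }

        List<Integer> read() {
            reads++;
            return new ArrayList<>(data);
        }

        void write(List<Integer> output) {
            writes++;
            data = new ArrayList<>(output);
        }
    }

    // Three example aggregation steps: square, keep even values, sum.
    static List<UnaryOperator<List<Integer>>> exampleSteps() {
        List<UnaryOperator<List<Integer>>> steps = new ArrayList<>();
        steps.add(values -> values.stream().map(v -> v * v).collect(Collectors.toList()));
        steps.add(values -> values.stream().filter(v -> v % 2 == 0).collect(Collectors.toList()));
        steps.add(values -> Collections.singletonList(
                values.stream().mapToInt(Integer::intValue).sum()));
        return steps;
    }

    // Chained-job style: every step reads its input from storage
    // and writes its output back before the next step starts.
    static List<Integer> runChainedJobs(CountingStorage storage,
                                        List<UnaryOperator<List<Integer>>> steps) {
        List<Integer> output = null;
        for (UnaryOperator<List<Integer>> step : steps) {
            List<Integer> input = storage.read();
            output = step.apply(input);
            storage.write(output);
        }
        return output;
    }

    // In-memory style: read once, keep intermediate results in memory,
    // write only the final result.
    static List<Integer> runInMemory(CountingStorage storage,
                                     List<UnaryOperator<List<Integer>>> steps) {
        List<Integer> current = storage.read();
        for (UnaryOperator<List<Integer>> step : steps) {
            current = step.apply(current);
        }
        storage.write(current);
        return current;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);

        CountingStorage chained = new CountingStorage(input);
        List<Integer> chainedResult = runChainedJobs(chained, exampleSteps());
        System.out.println("chained:   result=" + chainedResult
                + " reads=" + chained.reads + " writes=" + chained.writes);

        CountingStorage inMemory = new CountingStorage(input);
        List<Integer> inMemoryResult = runInMemory(inMemory, exampleSteps());
        System.out.println("in-memory: result=" + inMemoryResult
                + " reads=" + inMemory.reads + " writes=" + inMemory.writes);
    }
}
```

With three steps, the chained version touches storage three times for reading and three times for writing, while the in-memory version reads once and writes once; both produce the same result. The model deliberately ignores the catch described above, where intermediate results do not fit in memory.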
- Interactive processing: Another benefit of Spark's data structure over Hadoop MapReduce can be seen when the user wants to run ad hoc queries on data placed on some stable storage.
Let's say you are trying to run some MapReduce jobs (or Hive queries) on the data to do some analysis. If you are running multiple queries on the same input data, MapReduce will read the data from storage, let's say HDFS, every time you run a query. A logical representation of that can be as follows:

On the other hand, Spark provides a mechanism to persist an RDD in memory (the different mechanisms for persisting an RDD are discussed later in Chapter 4, Understanding the Spark Programming Model). So, you can execute one job and keep the RDD in memory. Other analytics can then be executed on the same RDD without reading the data from HDFS again. The following is the logical representation of that:

When a Spark job encounters Spark Action 1, it executes the DAG and computes the RDD. The RDD is then persisted in memory and Spark Action 1 is performed on it. Afterwards, Spark Action 2 and Spark Action 3 are performed on the same RDD. This model therefore saves a lot of I/O from the stable storage in the case of interactive processing.
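The sequence described above (the first action computes and persists the data, later actions reuse it) can be imitated in a few lines of plain Java. This is a toy sketch rather than the Spark API: `LazyDataset`, its `persist` flag, and `readsForThreeActions` are hypothetical names, and the loader simply stands in for reading from stable storage. In Spark's Java API, the corresponding request is made with `cache()` or `persist(...)` on a `JavaRDD`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Toy model only: mimics lazy evaluation with optional in-memory
// persistence. It is not the Spark API; all names are hypothetical.
class PersistModel {

    static final class LazyDataset {
        private final Supplier<List<Integer>> loader; // stands in for a read from stable storage
        private final boolean persist;
        private List<Integer> cached;                 // filled by the first action when persist is true
        int storageReads = 0;

        LazyDataset(Supplier<List<Integer>> loader, boolean persist) {
            this.loader = loader;
            this.persist = persist;
        }

        // Nothing is loaded until an action asks for the data.
        private List<Integer> materialize() {
            if (cached != null) {
                return cached;
            }
            storageReads++;
            List<Integer> data = loader.get();
            if (persist) {
                cached = data;
            }
            return data;
        }

        // Three example "actions" over the same dataset.
        long count() {
            return materialize().size();
        }

        int sum() {
            return materialize().stream().mapToInt(Integer::intValue).sum();
        }

        int max() {
            return materialize().stream().mapToInt(Integer::intValue).max().getAsInt();
        }
    }

    // Runs three actions and reports how often storage was read.
    static int readsForThreeActions(boolean persist) {
        LazyDataset dataset = new LazyDataset(() -> Arrays.asList(5, 3, 8, 1), persist);
        dataset.count();
        dataset.sum();
        dataset.max();
        return dataset.storageReads;
    }

    public static void main(String[] args) {
        System.out.println("reads without persist: " + readsForThreeActions(false)); // 3
        System.out.println("reads with persist:    " + readsForThreeActions(true));  // 1
    }
}
```

Without persistence, each of the three actions triggers its own read from storage; with persistence, only the first action does, and the other two are served from memory.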