- Apache Spark 2.x for Java Developers
- Sourav Gulati, Sumit Kumar
Why Apache Spark?
MapReduce is on its way to becoming a legacy technology. "We've got Spark," says Doug Cutting, the man behind Apache Hadoop. MapReduce is an amazingly popular distributed framework, but it has drawn its fair share of criticism as well. Since its inception, MapReduce has been built to run jobs in batch mode. Although it supports streaming, it is not well suited for ad hoc queries, machine learning, and so on. Apache Spark is a distributed in-memory computing framework that tries to address some of the major concerns that surround MapReduce:
- Performance: A major bottleneck in MapReduce jobs is disk I/O, and it is especially visible during the shuffle and sort phase of MR, when data is written to disk. The guiding principle that Spark follows is simple: share memory across the cluster and keep everything in memory as long as possible. This greatly enhances the performance of Spark jobs, up to 100x compared to MR (as claimed by its developers); a minimal sketch of in-memory caching appears after this list.
- Fault tolerance: MR and Spark take different approaches to handling fault tolerance. The ApplicationMaster (AM) keeps track of mappers and reducers while executing MR jobs. When these containers stop responding or fail outright, the AM, after requesting resources from the ResourceManager (RM), launches a separate JVM to rerun such tasks. While this approach achieves fault tolerance, it is both time and resource consuming. Apache Spark's approach to handling fault tolerance is different: it uses the Resilient Distributed Dataset (RDD), a read-only, fault-tolerant parallel collection. An RDD maintains its lineage graph, so whenever one of its partitions is lost it recovers the lost data by recomputing it from the previous stage, making it more resilient.
- DAG: Chaining MapReduce jobs is a difficult task, and on top of that it carries the burden of writing intermediate results to HDFS before the next job starts executing. Spark is actually a Directed Acyclic Graph (DAG) engine, so chaining any number of jobs can easily be achieved. All intermediate results are shared in memory, avoiding multiple disk I/Os. These jobs are also lazily evaluated, so only those paths that are explicitly called for computation are processed. In Spark, such triggers are called actions (see the first sketch after this list).
- Data processing: Spark (aka Spark Core) is not an isolated distributed compute framework. A whole set of Spark modules has been built around Spark Core to make it more general purpose. The RDD forms the main abstraction in all these modules. With more recent development, the DataFrame and Dataset have also been introduced, which enrich the RDD by giving it a schema and type safety. Nevertheless, the RDD is ubiquitous across all of Spark's modules, which makes it simple to use and easy to cross-reference between modules. Whether it is streaming, querying, machine learning, or graph processing, the same data can be referenced as an RDD and used interchangeably; the second sketch after this list shows this interplay. This is a unique appeal of the RDD that is lacking in MR, where a different framework was required to handle machine learning jobs in Apache Mahout than was required for graph processing in Apache Giraph.
- Compatibility: Spark has not been developed with only the YARN cluster in mind. It runs equally well on Hadoop YARN, on Mesos, and even in standalone cluster mode. Similarly, Spark has not been built around HDFS and accepts a wide variety of filesystems. Apache Spark provides the compute capabilities while leaving the choice of cluster manager and filesystem to the use case being worked upon.
- Spark APIs: Spark's APIs have wide coverage as far as functionality and programming languages are concerned. They bear a strong resemblance to the Scala collections API, Scala being the language in which Apache Spark is mostly implemented. It is this richness of functional programming that lets Apache Spark avoid much of the boilerplate code that eclipsed MR jobs. Unlike MR, which dealt with low-level programming, Spark exposes APIs at a higher level of abstraction, with the scope to override any bare-metal code if ever required.
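Several of the points above (lazy transformations building a DAG, in-memory caching, lineage-based fault tolerance, and actions triggering execution) can be seen in a few lines of Java. The following is a minimal sketch, not code from the book; the class and variable names are illustrative, and `local[*]` is used only so the snippet runs locally, the same code would run on YARN, Mesos, or a standalone cluster with a different master:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class LazyWordCount {
    public static void main(String[] args) {
        // "local[*]" is illustrative; changing the master is all that is needed
        // to run the same job on YARN, Mesos, or a standalone cluster.
        SparkConf conf = new SparkConf().setAppName("LazyWordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.parallelize(Arrays.asList("spark is fast", "spark is lazy"));

        // Transformations only build the DAG; nothing is executed yet.
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        // Keep the intermediate result in memory so later jobs can reuse it
        // without recomputation or disk I/O.
        counts.cache();

        // toDebugString() prints the lineage graph used for fault tolerance.
        System.out.println(counts.toDebugString());

        // collect() is an action: it triggers execution of the whole DAG.
        counts.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));

        sc.close();
    }
}
```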
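The interplay between the RDD and the newer Dataset abstraction mentioned under Data processing can be sketched in a similar way. Again, this is an illustrative example under assumed names (the `Person` bean and its fields are hypothetical), not the book's own code:

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RddToDatasetSketch {

    // A simple bean; Spark derives the Dataset schema from its getters/setters.
    public static class Person implements Serializable {
        private String name;
        private int age;
        public Person() {}
        public Person(String name, int age) { this.name = name; this.age = age; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("RddToDatasetSketch")
                .master("local[*]")
                .getOrCreate();

        // Start from a plain RDD ...
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
        JavaRDD<Person> peopleRdd = jsc.parallelize(
                Arrays.asList(new Person("Ada", 36), new Person("Linus", 28)));

        // ... and give the same data a schema and type safety as a Dataset.
        Dataset<Person> people = spark.createDataset(peopleRdd.rdd(), Encoders.bean(Person.class));

        // The same data can now be queried with SQL-style operators.
        Dataset<Row> adults = people.filter("age > 30").select("name");
        adults.show();

        spark.stop();
    }
}
```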