Why Hadoop plus Spark?

Apache Spark works best when it is combined with Hadoop. To understand why, let's look at the features of Hadoop and Spark.

Hadoop features

Spark features

When both frameworks are combined, we get the power of enterprise-grade applications with in-memory performance, as shown in Figure 2.11:

Figure 2.11: Spark applications on the Hadoop platform

Frequently asked questions about Spark

The following are frequent questions that practitioners raise about Spark:

  • My dataset does not fit in memory. How can I use Spark?

    Spark's operators spill data to disk when it does not fit in memory, so Spark can run on data of any size. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level. By default, Spark recomputes the partitions that don't fit in memory. Changing the storage level to MEMORY_AND_DISK makes Spark spill those partitions to disk instead.

    Figure 2.12 compares performance when the data is fully cached with performance when it is read from disk:

    Figure 2.12: Spark performance: Fully cached versus disk

  • How does fault recovery work in Spark?

    Spark's built-in fault tolerance, which is based on RDD lineage, recovers from failures automatically. Figure 2.13 shows the performance impact of a failure in the sixth iteration of a k-means algorithm:

    Figure 2.13: Fault recovery performance
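
Both answers above rest on the same idea: a partition that is missing, whether it was never kept in memory or was lost in a failure, can be rebuilt from its lineage. The following is a minimal, single-process Python sketch of that idea. It is a toy model and not Spark's API: ToyRDD, lose_partition, and the computations counter are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's real API).

    Each dataset remembers its parent and the function that derives it,
    which is the "lineage" used to rebuild a missing partition.
    """

    def __init__(
        self,
        num_partitions: int,
        compute: Callable[[int, List[int]], List[int]],
        parent: Optional["ToyRDD"] = None,
    ) -> None:
        self.num_partitions = num_partitions
        self._compute = compute            # how to derive one partition
        self.parent = parent               # where the input rows come from
        self._cache: Dict[int, List[int]] = {}
        self.computations = 0              # how many partitions were computed

    @classmethod
    def from_list(cls, data: List[int], num_partitions: int) -> "ToyRDD":
        # Round-robin split of the source data into partitions.
        chunks = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(num_partitions, lambda idx, _rows: list(chunks[idx]))

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A narrow transformation: partition i depends only on parent partition i.
        return ToyRDD(
            self.num_partitions,
            lambda _idx, rows: [fn(r) for r in rows],
            parent=self,
        )

    def partition(self, idx: int) -> List[int]:
        # Serve from the cache if present, otherwise rebuild from lineage.
        if idx in self._cache:
            return self._cache[idx]
        parent_rows = self.parent.partition(idx) if self.parent else []
        self.computations += 1
        rows = self._compute(idx, parent_rows)
        self._cache[idx] = rows
        return rows

    def collect(self) -> List[int]:
        out: List[int] = []
        for idx in range(self.num_partitions):
            out.extend(self.partition(idx))
        return out

    def lose_partition(self, idx: int) -> None:
        # Simulate a failure (or eviction) that drops one cached partition.
        self._cache.pop(idx, None)


base = ToyRDD.from_list(list(range(8)), num_partitions=4)
squared = base.map(lambda x: x * x)

first = squared.collect()      # computes all 4 partitions of each dataset
squared.lose_partition(2)      # one partition of the derived dataset is lost
second = squared.collect()     # only the lost partition is recomputed
```

In this sketch the second collect returns the same rows as the first while computing just one additional partition, because the other three are still cached and the parent partition is still available. Real Spark applies the same principle across executors, with scheduling and shuffle details that this toy model leaves out.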
