官术网_书友最值得收藏!

Making the most of Hadoop and Spark

People generally get confused between Hadoop and Spark and how they are related. The intention of this section is to discuss the differences between Hadoop and Spark, and also how they can be used together.

Hadoop is mainly a combination of the following components:

  • Hive and Pig
  • MapReduce
  • YARN
  • HDFS

HDFS is the storage layer where underlying data can be stored. HDFS provides features such as the replication of the data, fault tolerance, high availability, and more. Hadoop is schema-on-read; for instance, you don’t have to specify the schema while writing the data to Hadoop, rather, you can use different schemas while reading the data. HDFS also provides different types of files formats, such as TextInputFormat, SequenceFile, NLInputFormat, and more. If you want to know more about these file formats, I would recommend reading Hadoop: The Definitive Guide by Tom White.

Hadoop’s MapReduce is a programming model used to process the data available on HDFS. It consists of four main phases: Map, Sort, Shuffle, and Reduce. One of the main differences between Hadoop and Spark is that Hadoop’s MapReduce model is tightly coupled with the file formats of the data. On the other hand, Spark provides an abstraction to process the data using RDD. RDD is like a general-purpose container of distributed data. That’s why Spark can integrate with a variety of data stores.

Another main difference between Hadoop and Spark is that Spark makes good use of memory. It can cache data in memory to avoid disk I/O. On the other hand, Hadoop’s MapReduce jobs generally involve multiple disks I/O. Typically, a Hadoop job consists of multiple Map and Reduce jobs. This is known as MapReduce chaining. A MapReduce chain may look something like this: Map -> Reduce -> Map -> Map -> Reduce.

All of the reduce jobs write their output to HDFS for reliability; therefore, each map task next to it will have to read it from HDFS. This involves multiple disk I/O operations and makes overall processing slower. There have been several initiatives such as Tez within Hadoop to optimize MapReduce processing. As discussed earlier, Spark creates a DAG of operations and automatically optimizes the disk reads.

Apart from the previous differences, Spark complements Hadoop by providing another way of processing the data. As discussed earlier in this chapter, it integrates well with Hadoop components such as Hive, YARN, and HDFS. The following diagram shows a typical Spark and Hadoop ecosystem looks like. Spark makes use of YARN for scheduling and running its task throughout the cluster:

Spark and Hadoop
主站蜘蛛池模板: 西乡县| 美姑县| 陇南市| 虞城县| 安龙县| 柯坪县| 晋江市| 门头沟区| 古蔺县| 邵阳县| 德安县| 皮山县| 洛阳市| 新泰市| 清河县| 新丰县| 延寿县| 商都县| 绥江县| 柳河县| 富顺县| 徐州市| 来宾市| 藁城市| 婺源县| 云南省| 武穴市| 彝良县| 通海县| 广西| 郧西县| 嘉黎县| 凉山| 五峰| 轮台县| 五寨县| 溧阳市| 桦川县| 恭城| 行唐县| 特克斯县|