
  • Mastering Hadoop
  • Sandeep Karanth

Chapter 2. Advanced MapReduce

MapReduce is a programming model for parallel and distributed processing of data. It consists of two steps: Map and Reduce. These steps are inspired by functional programming, a branch of computer science that treats mathematical functions as computational units. Properties of functions such as immutability and statelessness are attractive for parallel and distributed processing, as they provide a high degree of parallelism and fault tolerance at lower cost and with less semantic complexity.

In this chapter, we will look at advanced optimizations when running MapReduce jobs on Hadoop clusters. Every MapReduce job has input data and one Map task per split of this data. The Map task calls a map function repeatedly on every record, represented as a key-value pair. The map is a function that transforms data from one domain to another. The intermediate output records of each Map task are shuffled and sorted before being transferred downstream to the Reduce tasks. Intermediate records with the same key go to the same Reduce task. The Reduce task calls the reduce function once per key, passing the key and all of its associated values. The outputs are then collected and stored.
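The map → shuffle/sort → reduce flow described above can be sketched in plain Java, without the Hadoop APIs, using the classic word-count computation. This is only an illustrative simulation of the data flow; the class and method names are hypothetical, and a real Hadoop job would instead subclass `Mapper` and `Reducer`:

```java
import java.util.*;

public class WordCountLocal {
    // Map: called once per record; emits a (word, 1) pair for every word.
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : record.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Shuffle/sort: groups intermediate values by key; a TreeMap keeps
    // the keys sorted, mimicking the sorted input a Reduce task sees.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: called once per key with all of its values; sums the counts.
    static int reduce(List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String record : List.of("the quick brown fox", "the lazy dog")) {
            intermediate.addAll(map(record)); // one map() call per record
        }
        shuffle(intermediate).forEach((word, counts) ->
            System.out.println(word + "\t" + reduce(counts)));
    }
}
```

Because `map` sees one record at a time and holds no state across calls, the framework is free to run many Map tasks in parallel and to rerun a failed task on another node.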

The Map step has the greatest degree of parallelism. It is used to implement operations such as filtering, sorting, and transformations on data. The Reduce step is used to implement summarization operations on data. Hadoop also provides features such as DistributedCache as a side channel to distribute data and Counters to collect job-related global statistics. We will be looking at their utility in processing MapReduce jobs.
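To preview the Counters feature mentioned above: each task maintains its own counts locally, and the framework aggregates them into job-level totals. The sketch below simulates that aggregation step in plain Java; the counter names and the `mergeCounters` helper are hypothetical, not part of the Hadoop API (in a real job, a task would call `context.getCounter(...)` and Hadoop would do the merging):

```java
import java.util.*;

public class CounterDemo {
    // Merge per-task counter maps into job-level totals, mimicking how
    // Hadoop aggregates each task's Counters into global job statistics.
    static Map<String, Long> mergeCounters(List<Map<String, Long>> perTask) {
        Map<String, Long> global = new HashMap<>();
        for (Map<String, Long> task : perTask) {
            task.forEach((name, count) -> global.merge(name, count, Long::sum));
        }
        return global;
    }

    public static void main(String[] args) {
        // Hypothetical counters reported by two Map tasks.
        Map<String, Long> task1 = Map.of("MALFORMED_RECORDS", 2L, "RECORDS_READ", 100L);
        Map<String, Long> task2 = Map.of("MALFORMED_RECORDS", 1L, "RECORDS_READ", 80L);
        Map<String, Long> global = mergeCounters(List.of(task1, task2));
        System.out.println(global); // job-level totals across all tasks
    }
}
```

This is why Counters are suited to global statistics such as malformed-record counts: the per-task values are cheap to keep and trivially additive across tasks.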

The advanced features and optimizations will be explained with the help of code examples. Hadoop 2.2.0 will be used throughout this chapter. It is assumed that you have access to a Java development environment and a Hadoop cluster, either in your organization, in the cloud, or as a standalone/pseudo-distributed installation on your personal computer. You need to know how to compile Java programs and run Hadoop jobs to try out the examples.

In this chapter, we will look at the following topics:

  • The different phases of a MapReduce job and the optimizations that can be applied at each phase. The input, Map, Shuffle/Sort, Reduce, and the output phases will be covered in depth with relevant examples.
  • The application of useful Hadoop features such as DistributedCache and Counters.
  • The types of data joins that can be achieved in a MapReduce job and the patterns to achieve them.