
Summary

In this chapter, we looked at optimizations at different stages of the Hadoop MapReduce pipeline. Through the join example, we also saw a few other advanced features available to MapReduce jobs. Some key takeaways from this chapter are as follows:

  • Avoid launching too many I/O-bound Map tasks. The number of Map tasks is dictated by the inputs, specifically by the number of input splits.
  • Map tasks are the primary contributors to job speedup because they run in parallel.
  • Combiners not only make data transfers between Map tasks and Reduce tasks more efficient, but also reduce disk I/O on the Map side.
  • The default setting is a single Reduce task.
  • Custom partitioners can be used for load balancing among Reducers.
  • DistributedCache is useful for distributing small side files. Avoid placing too many files, or files that are too large, in the cache.
  • Custom counters should be used to track global job-level statistics, but defining too many counters should be avoided.
  • Compression should be used more often. Different compression techniques have different tradeoffs, and the right technique is application-dependent.
  • Hadoop has many tunable configuration knobs to optimize job execution.
  • Premature optimizations should be avoided. Built-in counters are your friends.
  • Higher-level abstractions such as Pig or Hive are recommended over hand-written, low-level Hadoop MapReduce jobs.
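
To make the takeaway about custom partitioners more concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The `hashPartition` method uses the same formula as Hadoop's default `HashPartitioner`. The class name, the `skewAwarePartition` rule, and the idea of reserving the last partition for one known hot key are assumptions made for illustration only, not a method taken from this chapter. In a real job, this logic would live in the `getPartition` method of a `Partitioner` subclass.

```java
import java.util.Arrays;
import java.util.List;

public class SkewAwarePartitionSketch {

    // Same formula as Hadoop's default HashPartitioner: mask off the sign
    // bit, then take the remainder by the number of partitions.
    static int hashPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Hypothetical load-balancing rule: the last partition is reserved for
    // one known hot key; all other keys are hashed over the remaining ones.
    static int skewAwarePartition(String key, String hotKey, int numPartitions) {
        if (numPartitions < 2) {
            return 0; // with a single Reduce task there is nothing to balance
        }
        if (key.equals(hotKey)) {
            return numPartitions - 1;
        }
        return hashPartition(key, numPartitions - 1);
    }

    // Counts how many records each partition (Reducer) would receive.
    static int[] recordsPerPartition(List<String> keys, String hotKey, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String key : keys) {
            counts[skewAwarePartition(key, hotKey, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample keys; "the" plays the role of the hot key.
        List<String> keys = Arrays.asList("the", "the", "the", "hadoop", "pig", "hive");
        System.out.println(Arrays.toString(recordsPerPartition(keys, "the", 3)));
    }
}
```

Isolating a hot key like this only helps when the skew is known ahead of time. The built-in counters for Reduce input records are one way to check whether the Reducers are actually unbalanced before adding such a rule.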

In the next chapter, we will look at Pig, a framework for scripting MapReduce jobs on Hadoop. Pig provides higher-level relational operators that a user can employ to perform data transformations, eliminating the need to write low-level MapReduce code in Java.
