- Mastering Hadoop
- Sandeep Karanth
- 2021-08-06 19:53:01
Summary
In this chapter, we looked at optimizations at different stages of the Hadoop MapReduce pipeline. Through the join example, we also saw a few other advanced features available to MapReduce jobs. Some key takeaways from this chapter are as follows:
- Avoid running too many I/O-bound Map tasks. The number of Map tasks is dictated by the job's inputs.
- Map tasks are the primary contributors to job speedup because they run in parallel.
- Combiners not only reduce the data transferred between Map tasks and Reduce tasks, but also reduce disk I/O on the Map side.
- The default setting is a single Reduce task.
- Custom partitioners can be used for load balancing among Reducers.
- DistributedCache is useful for distributing small side files. Avoid placing too many files, or files that are too large, in the cache.
- Custom counters should be used to track global job-level statistics, but they should be used sparingly.
- Compression should be used more often. Different compression techniques have different tradeoffs and the right technique is application-dependent.
- Hadoop has many tunable configuration knobs to optimize job execution.
- Premature optimizations should be avoided. Built-in counters are your friends.
- Higher-level abstractions such as Pig or Hive are recommended over hand-written, low-level Hadoop jobs.
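The takeaways on Combiners and custom partitioners can be illustrated with a small simulation. The sketch below is plain Java and deliberately uses no Hadoop classes; the class `ShuffleSketch`, its methods, and the "hot key" rule are hypothetical names and assumptions made for illustration, not code from this chapter. `hashPartition` mirrors the arithmetic of Hadoop's default hash partitioning (mask the sign bit, then take the remainder by the number of Reduce tasks), while `customPartition` shows one possible load-balancing rule under the assumption that a single heavily skewed key is known in advance.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of map-side combining and partitioning.
// No Hadoop classes are used; every name here is hypothetical.
final class ShuffleSketch {

    // Map-side combine for a word-count style job: each input key stands for
    // one (key, 1) map output record, and records with the same key are
    // summed locally so fewer records need to be spilled and shuffled.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String key : mapOutputKeys) {
            combined.merge(key, 1, Integer::sum);
        }
        return combined;
    }

    // Hash partitioning: clear the sign bit so the result is non-negative,
    // then take the remainder by the number of Reduce tasks.
    static int hashPartition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Assumed load-balancing rule: a known hot key gets reducer 0 to itself,
    // and all other keys are hashed across the remaining reducers.
    static int customPartition(String key, String hotKey, int numReducers) {
        if (numReducers == 1 || key.equals(hotKey)) {
            return 0;
        }
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numReducers - 1);
    }

    public static void main(String[] args) {
        List<String> keys = List.of("the", "cat", "the", "hat", "the");
        Map<String, Integer> combined = combine(keys);

        System.out.println("records before combine: " + keys.size());
        System.out.println("records after combine:  " + combined.size());

        for (Map.Entry<String, Integer> entry : combined.entrySet()) {
            int reducer = customPartition(entry.getKey(), "the", 3);
            System.out.println(entry.getKey() + "=" + entry.getValue()
                    + " -> reducer " + reducer);
        }
    }
}
```

In this run, five map output records shrink to three before the shuffle, and the hot key `the` is isolated on reducer 0. In a real job, the same ideas would be expressed through Hadoop's own Combiner and Partitioner extension points rather than static helper methods.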
In the next chapter, we will look at Pig, a framework to script MapReduce jobs on Hadoop. Pig provides higher-level relational operators that a user can employ to do data transformations, eliminating the need to write low-level MapReduce Java code.