MapReduce output

The output depends on the number of Reduce tasks present in the job, since each Reduce task writes its own output file. Some guidelines to optimize outputs are as follows:

  • Compress outputs to save storage space. Compression also increases effective HDFS write throughput, as fewer bytes are written to disk (see the sketch after this list).
  • Avoid writing out-of-band side files as outputs in the Reduce task. If statistical data needs to be collected, Counters are a better fit; statistics collected in side files would require an additional aggregation step.
  • Depending on the consumers of the job's output files, a splittable compression format may be appropriate, as it allows downstream jobs to divide each file among multiple Map tasks.
  • Writing large HDFS files with larger block sizes helps subsequent consumers of the data reduce the number of Map tasks they need. This is particularly useful when MapReduce jobs are cascaded, that is, when the outputs of one job become the inputs of the next. Writing large files with large block sizes eliminates the need for specialized processing of Map inputs in the subsequent jobs.
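
The following is a minimal driver sketch against the Hadoop 2.x MapReduce API that ties the compression, Counter, and block-size points together. The class names, the JobStats counter group, the bzip2 codec choice, the 256 MB block size, and the per-job dfs.blocksize override are illustrative assumptions, not prescriptions:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressedOutputJob {

        // Word-count style Mapper, used only to give the job something to do.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reducer that records a statistic with a Counter instead of a side file.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                                  Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                // Counters are aggregated by the framework; no extra
                // aggregation job is needed, unlike with side files.
                context.getCounter("JobStats", "KEYS_REDUCED").increment(1);
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Larger blocks for the output files, so a downstream job in a
            // cascade needs fewer Map tasks (assumes the cluster permits
            // per-job overrides of dfs.blocksize).
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

            Job job = Job.getInstance(conf, "compressed-output");
            job.setJarByClass(CompressedOutputJob.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Compress the final output; bzip2 is splittable, so downstream
            // Map tasks can still split the compressed files.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }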

Speculative execution of tasks

Stragglers are slow-running tasks that eventually complete successfully. A straggler Map task can prevent Reduce tasks from starting, thus delaying the completion of the job. Stragglers may be present because of hardware performance degradation or software misconfiguration.

Hadoop cannot automatically correct a straggler task, but it can identify tasks that are running slower than normal. As a backup, it can spawn an equivalent task and use the result of whichever attempt finishes first; the remaining attempts are then asked to terminate. This is termed speculative execution.

By default, Hadoop enables speculative execution. It can be turned off for Map tasks by setting mapreduce.map.speculative to false and for Reduce tasks by setting mapreduce.reduce.speculative to false.
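
As a sketch of how this looks per job (assuming the Hadoop 2.x Java API; the class name NoSpeculationExample is a placeholder), speculation can be turned off either through the properties above or through the equivalent convenience setters on Job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class NoSpeculationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Property-based form:
            conf.setBoolean("mapreduce.map.speculative", false);
            conf.setBoolean("mapreduce.reduce.speculative", false);

            Job job = Job.getInstance(conf, "no-speculation");
            // Equivalent convenience setters on the Job object:
            job.setMapSpeculativeExecution(false);
            job.setReduceSpeculativeExecution(false);
        }
    }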

The mapreduce.job.speculative.speculativecap property takes a value between 0 and 1, indicating the fraction of running tasks that can be speculatively executed at any time. The default value of this property is 0.1. The mapreduce.job.speculative.slowtaskthreshold and mapreduce.job.speculative.slownodethreshold properties are two other configurable parameters, both of which default to 1. They indicate how much slower than the average a task must be running before it is considered for speculation, measured in standard deviations of the task progress rates.
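
These knobs can be tuned per job through the same Configuration object as in the sketch above; the values shown here are illustrative assumptions, not recommendations:

    Configuration conf = new Configuration();
    // Allow at most 5% of running tasks to be speculated at once (default 0.1).
    conf.setDouble("mapreduce.job.speculative.speculativecap", 0.05);
    // Require tasks and nodes to be 1.5 standard deviations slower than the
    // average progress rate before speculating (default 1.0 for both).
    conf.setDouble("mapreduce.job.speculative.slowtaskthreshold", 1.5);
    conf.setDouble("mapreduce.job.speculative.slownodethreshold", 1.5);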
