
Time for action – fixing WordCount to work with a combiner

Let's make the necessary modifications to WordCount to correctly use a combiner.

Copy WordCount2.java to a new file called WordCount3.java and change the reduce method as follows:

public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException
{
    int total = 0;
    for (IntWritable val : values)
    {
        total += val.get();
    }
    context.write(key, new IntWritable(total));
}

Remember to also change the class name to WordCount3 and then compile, create the JAR file, and run the job as before.
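To see why this summing reduce works correctly as a combiner, consider a minimal sketch in plain Java (no Hadoop dependencies; the class and method names are illustrative, not from the book's code). Summation is associative, so pre-summing each map task's output and then summing the partial sums gives the same total as summing all the raw values at once:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in for the reduce body above: total up a list of counts.
class SumCombineSketch {

    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Counts for one word, as emitted by three hypothetical map tasks.
        List<Integer> mapA = Arrays.asList(1, 1, 1);
        List<Integer> mapB = Arrays.asList(1, 1);
        List<Integer> mapC = Arrays.asList(1);

        // With a combiner: each map task pre-sums its own output...
        int withCombiner = sum(Arrays.asList(sum(mapA), sum(mapB), sum(mapC)));

        // Without a combiner: the reducer sees every raw value.
        int withoutCombiner = sum(Arrays.asList(1, 1, 1, 1, 1, 1));

        // Both paths produce the same count, which is why the same
        // class can serve as both combiner and reducer here.
        System.out.println(withCombiner + " == " + withoutCombiner);
    }
}
```

Because the two paths agree, Hadoop is free to run the combiner zero, one, or many times without changing the final result.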

What just happened?

The output is now as expected. Any map-side invocations of the combiner complete successfully, and the reducer correctly produces the overall output value.

Tip

Would this have worked if the original reducer were used as the combiner and the new reduce implementation as the reducer? The answer is no, though our test example would not have demonstrated it. Because the combiner may be invoked multiple times on the map output data, the same errors would have arisen with a sufficiently large dataset; they simply did not occur here because of the small input size. Fundamentally, the original reducer was incorrect, but this was not immediately obvious; watch out for such subtle logic flaws. This sort of issue can be very hard to debug, as the code will reliably work on a development box with a subset of the dataset and then fail on the much larger operational cluster. Carefully craft your combiner classes and never rely on testing that processes only a small sample of the data.

Reuse is your friend

In the previous section, we took the existing job class file and made changes to it. This is a small example of a very common Hadoop development workflow: use an existing job file as the starting point for a new one. Even if the actual mapper and reducer logic is very different, it's often a timesaver to start from an existing working job, as this helps you remember all the required elements of the mapper, reducer, and driver implementations.

Pop quiz – MapReduce mechanics

Q1. What do you always have to specify for a MapReduce job?

  1. The classes for the mapper and reducer.
  2. The classes for the mapper, reducer, and combiner.
  3. The classes for the mapper, reducer, partitioner, and combiner.
  4. None; all classes have default implementations.

Q2. How many times will a combiner be executed?

  1. At least once.
  2. Zero or one times.
  3. Zero, one, or many times.
  4. It's configurable.

Q3. You have a mapper that for each key produces an integer value and the following set of reduce operations:

  • Reducer A: outputs the sum of the set of integer values.
  • Reducer B: outputs the maximum of the set of values.
  • Reducer C: outputs the mean of the set of values.
  • Reducer D: outputs the difference between the largest and smallest values in the set.

Which of these reduce operations could safely be used as a combiner?

  1. All of them.
  2. A and B.
  3. A, B, and D.
  4. C and D.
  5. None of them.