官术网_书友最值得收藏!

MapReduce job counters

Counters are entities that can collect statistics at a job level. They can help in quality control, performance monitoring, and problem identification in Hadoop MapReduce jobs. Since they are global in nature, unlike logs, they need not be aggregated to be analyzed. Counters are grouped into logical groups using the CounterGroup class. There are sets of built-in counters for each MapReduce job.

The following example illustrates the creation of simple custom counters to categorize lines into lines having zero words, lines with less than or equal to five words, and lines with more than five words. The program when run on the grant proposal subset files gives the following output:

14/04/13 23:27:00 INFO mapreduce.Job: Counters: 23
    File System Counters
        FILE: Number of bytes read=446021466
        FILE: Number of bytes written=114627807
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=535015319
        HDFS: Number of bytes written=52267476
        HDFS: Number of read operations=391608
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=195363
    Map-Reduce Framework
        Map input records=27862
        Map output records=27862
        Input split bytes=56007
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=66
        Total committed heap usage (bytes)=62037426176
    MasteringHadoop.MasteringHadoopCounters$WORDS_IN_LINE_COUNTER
        LESS_THAN_FIVE_WORDS=8449
        MORE_THAN_FIVE_WORDS=19413
        ZERO_WORDS=6766
    File Input Format Counters 
        Bytes Read=1817707
    File Output Format Counters 
        Bytes Written=189102

The first step in creating a counter is to define a Java enum with the names of the counters. The enum type name is the counter group as shown in the following snippet:

    public static enum WORDS_IN_LINE_COUNTER{
            ZERO_WORDS,
            LESS_THAN_FIVE_WORDS,
            MORE_THAN_FIVE_WORDS
        };

When a condition is encountered to increment the counter, it can be retrieved by passing the name of the counter to the getCounter() call in the context object of the task. Counters support an increment() method call to globally increment the value of the counter.

Tip

An application should not use more than 15 to 20 custom counters.

The getCounter() method in the context has a couple of other overloads. It can be used to create a dynamic counter by specifying a group and counter name at runtime.

The Mapper class, as given in the following code snippet, illustrates incrementing the WORDS_IN_LINE_COUNTER group counters based on the number of words in each sentence of a grant proposal:

public static class MasteringHadoopCountersMap extends Mapper<LongWritable, Text, LongWritable, IntWritable> {
private IntWritable countOfWords = new IntWritable(0); 
            @Override
            protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

                StringTokenizer tokenizer = new StringTokenizer(value.toString());
                int words = tokenizer.countTokens();
if(words == 0) context.getCounter(WORDS_IN_LINE_COUNTER.ZERO_WORDS).increment(1);
                if(words > 0 && words <= 5)
context.getCounter(WORDS_IN_LINE_COUNTER.LESS_THAN_FIVE_WORDS).increment(1);
                else
context.getCounter(WORDS_IN_LINE_COUNTER.MORE_THAN_FIVE_WORDS).increment(1);
countOfWords.set(words);
                context.write(key, countOfWords);
            }
        }

Counters are global variables in a distributed setting and have to be used prudently. The higher the number of counters, the more are the overheads on the framework that keeps track of them. Counters should not be used to aggregate very fine-grained statistics of an application.

主站蜘蛛池模板: 响水县| 垦利县| 韩城市| 定日县| 五原县| 龙山县| 融水| 新郑市| 乡宁县| 察隅县| 苏尼特左旗| 琼中| 辽阳市| 长垣县| 横山县| 巩留县| 冀州市| 固始县| 泽普县| 天津市| 遵义市| 铜鼓县| 海伦市| 布尔津县| 广南县| 金门县| 澜沧| 贡嘎县| 枣阳市| 武城县| 佳木斯市| 景泰县| 蕉岭县| 阿拉善右旗| 万山特区| 沽源县| 博兴县| 双城市| 河源市| 扎囊县| 外汇|