
Writing a MapReduce program in Java to analyze web log data

In this recipe, we are going to take a look at how to write a MapReduce program to analyze web logs. Web logs are records generated by web servers for the requests they receive. There are various web servers, such as Apache, Nginx, Tomcat, and so on, and each logs data in its own specific format. In this recipe, we are going to use data from the Apache web server, which is in the combined access log format.

Note

To read more on combined access logs, refer to

http://httpd.apache.org/docs/1.3/logs.html#combined.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster as well as an IDE such as Eclipse.

How to do it...

We can write MapReduce programs to analyze various aspects of web log data. In this recipe, we are going to write a MapReduce program that reads a web log file and outputs each page URL along with its view count. Here is some sample web log data we'll consider as input for our program:

106.208.17.105 - - [12/Nov/2015:21:20:32 -0800] "GET /tutorials/mapreduce/advanced-map-reduce-examples-1.html HTTP/1.1" 200 0 "https://www.google.co.in/" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
60.250.32.153 - - [12/Nov/2015:21:42:14 -0800] "GET /tutorials/elasticsearch/install-elasticsearch-kibana-logstash-on-windows.html HTTP/1.1" 304 0 - "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36" 
49.49.250.23 - - [12/Nov/2015:21:40:56 -0800] "GET /tutorials/hadoop/images/internals-of-hdfs-file-read-operations/HDFS_Read_Write.png HTTP/1.1" 200 0 "http://hadooptutorials.co.in/tutorials/spark/install-apache-spark-on-ubuntu.html" "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; LCTE; rv:11.0) like Gecko"
60.250.32.153 - - [12/Nov/2015:21:36:01 -0800] "GET /tutorials/elasticsearch/install-elasticsearch-kibana-logstash-on-windows.html HTTP/1.1" 200 0 - "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
91.200.12.136 - - [12/Nov/2015:21:30:14 -0800] "GET /tutorials/hadoop/hadoop-fundamentals.html HTTP/1.1" 200 0 "http://hadooptutorials.co.in/tutorials/hadoop/hadoop-fundamentals.html" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.99 Safari/537.36"

These combined Apache access logs are in a specific format. Here is the sequence and meaning of each component in an access log (a short parsing sketch follows the list):

  • %h: This is the remote host (that is, the client's IP address)
  • %l: This is the identity of the client as determined by identd (it is usually not used because it is unreliable)
  • %u: This is the username determined by HTTP authentication
  • %t: This is the time at which the server finished processing the request
  • %r: This is the request line from the client ("GET / HTTP/1.0")
  • %>s: This is the status code sent from the server to the client (200, 404, and so on)
  • %b: This is the size of the response given to the client (in bytes)
  • Referrer: This is the page the client reports having been referred from (that is, the page that links to this URL)
  • User agent: This is the browser identification string

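To see how these components line up with the regular expression we are about to use in the mapper, here is a minimal, standalone sketch that parses the last sample line shown above and prints the extracted fields. It uses the same pattern string as the mapper; the class name LogLineParser is only for illustration and is not part of the recipe:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
    // Same pattern string as APACHE_ACCESS_LOGS_PATTERN in the mapper below
    public static final String APACHE_ACCESS_LOGS_PATTERN = "^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+) (.+?) \"([^\"]+|(.+?))\"";

    public static void main(String[] args) {
        // The last sample line from the web log data shown above
        String line = "91.200.12.136 - - [12/Nov/2015:21:30:14 -0800] "
                + "\"GET /tutorials/hadoop/hadoop-fundamentals.html HTTP/1.1\" 200 0 "
                + "\"http://hadooptutorials.co.in/tutorials/hadoop/hadoop-fundamentals.html\" "
                + "\"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
                + "Chrome/45.0.2454.99 Safari/537.36\"";

        Matcher matcher = Pattern.compile(APACHE_ACCESS_LOGS_PATTERN).matcher(line);
        if (matcher.matches()) {
            System.out.println("%h  (remote host) : " + matcher.group(1));
            System.out.println("%t  (time)        : " + matcher.group(4));
            System.out.println("    (method)      : " + matcher.group(5));
            System.out.println("    (page URL)    : " + matcher.group(6));
            System.out.println("%>s (status)      : " + matcher.group(8));
            System.out.println("%b  (size)        : " + matcher.group(9));
            System.out.println("    (referrer)    : " + matcher.group(10));
            System.out.println("    (user agent)  : " + matcher.group(11));
        } else {
            System.out.println("The line did not match the combined log pattern");
        }
    }
}

Group 6 is the requested page URL, which is the only field our page view counter needs.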
Now, let's start writing the program so that we can get the page view count of each unique URL present in our web logs.

First, we will write a mapper class in which we will read each line and parse it to extract the page URL. Here, we will use Java's Pattern and Matcher utilities to extract this information:

public static class PageViewMapper extends Mapper<Object, Text, Text, IntWritable> {
        public static String APACHE_ACCESS_LOGS_PATTERN = "^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+) (.+?) \"([^\"]+|(.+?))\"";

        public static Pattern pattern = Pattern.compile(APACHE_ACCESS_LOGS_PATTERN);

        private static final IntWritable one = new IntWritable(1);
        private Text url = new Text();

        public void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            Matcher matcher = pattern.matcher(value.toString());
            if (matcher.matches()) {
                // Group 6 as we want only Page URL
                url.set(matcher.group(6));
                System.out.println(url.toString());
                context.write(this.url, one);
            }

        }
    }

In the preceding mapper class, we read key-value pairs from the text file. By default, the key is the byte offset at which the line starts in the file, and the value is the line itself. Next, we match the line against the Apache access log regex pattern so that we can extract the exact information we need. For a page view counter, we only need the URL. The mapper outputs the URL as the key and 1 as the value so that we can count the URLs in the reducer.
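If you want to check the mapper logic on a single log line before running a full job, you can write a small unit test. Here is a minimal sketch; it assumes that Apache MRUnit and JUnit have been added as test dependencies (they are not part of this recipe's setup) and that PageViewMapper is a static nested class of the PageViewCounter driver class shown later. The test class name is only illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class PageViewMapperTest {

    @Test
    public void emitsPageUrlWithCountOne() throws Exception {
        // The first sample line from the web log data shown earlier
        String line = "106.208.17.105 - - [12/Nov/2015:21:20:32 -0800] "
                + "\"GET /tutorials/mapreduce/advanced-map-reduce-examples-1.html HTTP/1.1\" 200 0 "
                + "\"https://www.google.co.in/\" "
                + "\"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
                + "Chrome/46.0.2490.86 Safari/537.36\"";

        // For this line, the mapper should emit the page URL with a count of 1
        MapDriver.newMapDriver(new PageViewCounter.PageViewMapper())
                .withInput(new LongWritable(0), new Text(line))
                .withOutput(new Text("/tutorials/mapreduce/advanced-map-reduce-examples-1.html"), new IntWritable(1))
                .runTest();
    }
}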

Here is the reducer class that sums up the output values of the mapper class:

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                Reducer<Text, IntWritable, Text, IntWritable>.Context context)
                        throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            this.result.set(sum);
            context.write(key, this.result);
        }
    }

Now, we just need a driver class to configure and run the job; the mapper and reducer shown above are defined as static nested classes of this driver class:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageViewCounter {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err.println("Usage: PageViewCounter <in><out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "Page View Counter");
        job.setJarByClass(PageViewCounter.class);
        job.setMapperClass(PageViewMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

As the operation we are performing is an aggregation, we can also use a combiner here to reduce the amount of data shuffled to the reducers. In this case, the same reducer class is reused as the combiner, which is why the driver sets IntSumReducer as both the combiner and the reducer.

To compile your program properly, you need to add two external JARs: hadoop-common-2.7.jar, which can be found in the /usr/local/hadoop/share/hadoop/common folder, and hadoop-mapreduce-client-core-2.7.jar, which can be found in the /usr/local/hadoop/share/hadoop/mapreduce folder.

Make sure you add these two JARs in your build path so that your program can be compiled easily.

How it works...

The page view counter program helps us find the most popular pages, the least accessed pages, and so on. Such information helps us make decisions about the ranking of pages, the frequency of visits, and the relevance of a page. When the program is executed, each line of the HDFS block is read individually and sent to the mapper. The mapper matches the input line against the log pattern, extracts the page URL, and emits key-value pairs of the form (URL, 1). These pairs are then shuffled and partitioned so that all pairs for the same URL go to a single reducer. Each reducer adds up all the values for a key and emits the sum. This way, we get results in the form of a URL and the number of times it was accessed.
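To see the summing step in isolation, you can test the reducer in the same way as the mapper. The sketch below again assumes MRUnit and JUnit on the test classpath and IntSumReducer nested inside PageViewCounter; the URL and counts are illustrative, not computed from the sample data:

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class IntSumReducerTest {

    @Test
    public void sumsAllOnesForTheSameUrl() throws Exception {
        // After the shuffle, a URL that was viewed twice arrives at the reducer as (url, [1, 1])
        String url = "/tutorials/elasticsearch/install-elasticsearch-kibana-logstash-on-windows.html";

        ReduceDriver.newReduceDriver(new PageViewCounter.IntSumReducer())
                .withInput(new Text(url), Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text(url), new IntWritable(2))
                .runTest();
    }
}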
