
Executing the Map Reduce program in a Hadoop cluster

In the previous recipe, we looked at how to write a MapReduce program for a page view counter. In this recipe, we will explore how to execute it in a Hadoop cluster.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster as well as an IDE such as Eclipse.

How to do it

To execute the program, we first need to create a JAR file of it. JAR stands for Java Archive, a file format that bundles the compiled class files. To create a JAR file in Eclipse, we need to perform the following steps:

  1. Right-click on the project where you've written your MapReduce program. Then, click on Export.
  2. Select Java->JAR file and click on the Next button. Browse to the path where you wish to export the JAR file, and provide a proper name for it. Click on Finish to complete the creation of the JAR file.
  3. Now, copy this file to the Hadoop cluster. If you have your Hadoop cluster running in the AWS EC2 instance, you can use the following command to copy the JAR file:
    scp -i mykey.pem logAnalyzer.jar ubuntu@ec2-52-27-157-247.us-west-2.compute.amazonaws.com:/home/ubuntu
    
  4. If you don't already have your input log files in HDFS, use the following commands:
    hadoop fs -mkdir /logs
    hadoop fs -put web.log /logs
    
  5. Now, it's time to execute the MapReduce program. Use the following command to start the execution:
    hadoop jar logAnalyzer.jar com.demo.PageViewCounter /logs /pageview_output
    
  6. This will start the MapReduce execution on your cluster. If everything goes well, you should be able to see the output in the pageview_output folder in HDFS. Here, logAnalyzer.jar is the JAR file we created through Eclipse, logs is the folder containing our input data, and pageview_output is the folder that will first be created and into which the results will then be saved. It is also important to provide the fully qualified name of the driver class along with its package name.

How it works...

Once the job is submitted, an application client and an ApplicationMaster are first created for it in the Hadoop cluster. Mapper tasks are then initiated on the nodes where the input data blocks are present. Once a mapper finishes, its output is locally reduced by a combiner, and once the combiner finishes, the data is shuffled across the nodes in the cluster. The reduce phase cannot start until all the mappers have finished. The output from the reducers is also written to HDFS, in the folder specified on the command line.
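The data flow described above (map, combine, shuffle, reduce) can be illustrated with a toy, single-process sketch. This is not the Hadoop API; the class and method names are invented for illustration, and in the real job each phase runs distributed across cluster nodes:

```java
import java.util.*;

// A toy, single-process illustration of the MapReduce data flow:
// map -> combine -> shuffle -> reduce. The real job distributes these
// phases across cluster nodes; the names here are invented.
public class DataFlowSketch {

    // Each inner list stands in for an HDFS block processed by one mapper.
    static Map<String, Integer> countViews(List<List<String>> blocks) {
        // Shuffle target: page -> partial counts collected from every mapper.
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (List<String> block : blocks) {
            // Map + combine: emit (page, 1) for each view, and sum the 1s
            // locally before anything leaves the node.
            Map<String, Integer> combined = new HashMap<>();
            for (String page : block) {
                combined.merge(page, 1, Integer::sum);
            }
            // Shuffle: route each key's partial count to its reducer.
            combined.forEach((page, n) ->
                shuffled.computeIfAbsent(page, k -> new ArrayList<>()).add(n));
        }
        // Reduce: the sums run only after all the mappers have finished.
        Map<String, Integer> totals = new TreeMap<>();
        shuffled.forEach((page, counts) ->
            totals.put(page, counts.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    public static void main(String[] args) {
        List<List<String>> blocks = Arrays.asList(
            Arrays.asList("/home", "/about", "/home"),
            Arrays.asList("/home", "/contact"));
        System.out.println(countViews(blocks));  // {/about=1, /contact=1, /home=3}
    }
}
```

Note how the combiner shrinks the first block's three records down to two partial counts before the shuffle; on a real cluster, that is network traffic saved.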

Note

The output folder you specify should not already exist in HDFS. If the folder is already present, the program will exit with an error.

When all the tasks are finished for the application, you can take a look at the output in HDFS. The following are the commands to do this:

hadoop fs -ls /pageview_output
hadoop fs -cat /pageview_output/part-r-00000

This way, you can write similar programs for the following:

  • Top referral sites (hint: use a referrer capture group in the matcher)
  • The number of client errors (with an HTTP status of 4XX)
  • The number of server errors (with an HTTP status of 5XX)
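As a starting point for these variations, here is a hedged sketch of the matching logic: a regex for the Apache combined log format with capture groups for the HTTP status and the referral site. The exact format of your web.log, and the class and method names, are assumptions for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A sketch of parsing one Apache combined-log-format line; the format of
// your web.log is an assumption, as are the class and method names.
public class LogLineParser {

    // Capture groups: 1 = client IP, 2 = request line, 3 = HTTP status,
    // 4 = response bytes, 5 = referral site.
    static final Pattern COMBINED = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"[^\"]*\"$");

    // Classify one line as "4XX", "5XX", or "OK"; null if it doesn't parse.
    static String classify(String line) {
        Matcher m = COMBINED.matcher(line);
        if (!m.matches()) return null;
        char family = m.group(3).charAt(0);
        if (family == '4') return "4XX";
        if (family == '5') return "5XX";
        return "OK";
    }

    // Extract the referral site (group 5) from one line, or null.
    static String referrer(String line) {
        Matcher m = COMBINED.matcher(line);
        return m.matches() ? m.group(5) : null;
    }

    public static void main(String[] args) {
        String line = "10.0.0.1 - - [21/Jul/2015:10:00:00 -0700] "
            + "\"GET /missing HTTP/1.1\" 404 512 \"http://example.com/\" \"Mozilla/5.0\"";
        System.out.println(classify(line) + " from " + referrer(line));
    }
}
```

In a mapper, you would emit the classification or the referrer as the key with a count of 1, and reuse the same summing reducer as in the page view counter.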