
Executing the MapReduce program in a Hadoop cluster

In the previous recipe, we took a look at how to write a MapReduce program for a page view counter. In this recipe, we will explore how to execute it in a Hadoop cluster.

Getting ready

To perform this recipe, you should already have a running Hadoop cluster as well as an IDE such as Eclipse.

How to do it...

To execute the program, we first need to package it as a JAR file. JAR stands for Java Archive, a file format that bundles compiled class files. To create a JAR file in Eclipse, we need to perform the following steps:

  1. Right-click on the project where you've written your MapReduce program. Then, click on Export.
  2. Select Java->JAR file and click on the Next button. Browse to the path where you wish to export the JAR file, give the JAR file a proper name, and click on Finish to complete the creation of the JAR file.
  3. Now, copy this file to the Hadoop cluster. If your Hadoop cluster is running on an AWS EC2 instance, you can use the following command to copy the JAR file:
    scp -i mykey.pem logAnalyzer.jar ubuntu@ec2-52-27-157-247.us-west-2.compute.amazonaws.com:/home/ubuntu
    
  4. If you don't already have your input log files in HDFS, use the following commands to create a folder in HDFS and upload the log file to it:
    hadoop fs -mkdir /logs
    hadoop fs -put web.log /logs
    
  5. Now, it's time to execute the MapReduce program. Use the following command to start the execution:
    hadoop jar logAnalyzer.jar com.demo.PageViewCounter /logs /pageview_output
    
  6. This will start the MapReduce execution on your cluster. If everything goes well, you should be able to see the output in the /pageview_output folder in HDFS. Here, logAnalyzer.jar is the JAR file we created through Eclipse, /logs is the folder containing our input data, and /pageview_output is the folder that the job will first create and then save its results into. It is also important to provide the fully qualified name of the class along with its package name, as the driver sketch below illustrates.
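The fully qualified name on the command line refers to the driver class whose main method configures and submits the job. As a reminder, here is a minimal sketch of what that entry point typically looks like; PageViewMapper and PageViewReducer are placeholder names standing in for the classes written in the previous recipe:

package com.demo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageViewCounter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "page view counter");
        job.setJarByClass(PageViewCounter.class);
        // PageViewMapper and PageViewReducer stand in for the classes
        // written in the previous recipe
        job.setMapperClass(PageViewMapper.class);
        job.setCombinerClass(PageViewReducer.class);
        job.setReducerClass(PageViewReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] is the input folder (/logs); args[1] is the output
        // folder (/pageview_output), which must not exist yet
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}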

How it works...

Once the job is submitted, it first creates an Application Client and an Application Master in the Hadoop cluster. Mapper tasks are then launched on the nodes of the cluster where the input data blocks reside. Once the Mapper phase is complete, the data is locally reduced by a combiner. When the combiners finish, the data is shuffled across the nodes in the cluster. Reducers cannot start until all the mappers have finished. The output from the reducers is written to HDFS in the specified folder.
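Because the combiner runs on each mapper's local output before the shuffle, Hadoop may apply it zero or more times, so it must be safe to repeat. Summing partial counts satisfies this, which is why the reducer itself is commonly registered as the combiner, as in the driver sketch above. A minimal sketch of such a summing reducer follows; the class name is assumed, since the actual code belongs to the previous recipe:

package com.demo;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Because addition is associative and commutative, this same class can be
// registered as both the combiner and the reducer.
public class PageViewReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();   // add up partial counts from mappers/combiners
        }
        result.set(sum);
        context.write(key, result);   // page -> total view count
    }
}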

Note

The output folder specified should not already exist in HDFS. If the folder is already present, the program will fail with an error.
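If you rerun the job, either delete the old output folder first or guard against its existence in the driver. The following is a sketch using the standard FileSystem API, assuming the conf and args variables from the driver sketch shown earlier:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Inside the driver's main(), before the job is submitted; conf and args
// are the Configuration and command-line arguments from the driver sketch
FileSystem fs = FileSystem.get(conf);
Path outputDir = new Path(args[1]);
if (fs.exists(outputDir)) {
    fs.delete(outputDir, true);   // recursively remove the old output
}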

When all the tasks for the application are finished, you can take a look at the output in HDFS. The following are the commands to do this:

hadoop fs -ls /pageview_output
hadoop fs -cat /pageview_output/part-r-00000

This way, you can write similar programs for the following:

  • The most frequent referrer sites (hint: use the referrer group from the matcher)
  • The number of client errors (with an HTTP status of 4XX), as sketched after this list
  • The number of server errors (with an HTTP status of 5XX)
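For instance, counting client errors only changes what the mapper extracts from each log line; the summing reducer stays the same. The following is a minimal sketch assuming a common-log-format input, where the HTTP status code follows the quoted request string. The regex and class name are illustrative rather than taken from the recipe, so adapt them to the actual layout of web.log:

package com.demo;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative only: the pattern assumes the status code appears right
// after the quoted request string, as in the Apache common log format.
public class ClientErrorMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final Pattern STATUS = Pattern.compile("\" (\\d{3}) ");
    private static final IntWritable ONE = new IntWritable(1);
    private final Text errorKey = new Text("4XX");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Matcher m = STATUS.matcher(value.toString());
        if (m.find() && m.group(1).startsWith("4")) {
            context.write(errorKey, ONE);   // one count per client-error line
        }
    }
}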