- Hadoop Real-World Solutions Cookbook (Second Edition)
- Tanmay Deshpande
Executing the MapReduce program in a Hadoop cluster
In the previous recipe, we took a look at how to write a MapReduce program for a page view counter. In this recipe, we will explore how to execute it on a Hadoop cluster.
Getting ready
To perform this recipe, you should already have a running Hadoop cluster as well as an IDE such as Eclipse.
How to do it...
To execute the program, we first need to create a JAR file of it. JAR stands for Java Archive, a file format that contains compiled class files. To create a JAR file in Eclipse, we need to perform the following steps:
- Right-click on the project where you've written your MapReduce program. Then, click on Export.
- Select Java->JAR file and click on the Next button. Browse to the path where you wish to export the JAR file, provide a proper name for the JAR file, and click on Finish to complete the creation of the JAR file.
- Now, copy this file to the Hadoop cluster. If you have your Hadoop cluster running on an AWS EC2 instance, you can use the following command to copy the JAR file:
scp -i mykey.pem logAnalyzer.jar ubuntu@ec2-52-27-157-247.us-west-2.compute.amazonaws.com:/home/ubuntu
- If you don't already have your input log files in HDFS, use the following commands:
hadoop fs -mkdir /logs
hadoop fs -put web.log /logs
- Now, it's time to execute the MapReduce program. Use the following command to start the execution:
hadoop jar logAnalyzer.jar com.demo.PageViewCounter /logs /pageview_output
- This will start the MapReduce execution on your cluster. If everything goes well, you should be able to see the output in the pageview_output folder in HDFS. Here, logAnalyzer is the name of the JAR file we created through Eclipse, logs is the folder containing our input data, and pageview_output is the folder that will first be created and then have the results saved into it. It is also important to provide the fully qualified name of the class along with its package name.
How it works...
Once the job is submitted, an application client is created and an Application Master is launched for the job in the Hadoop cluster. Mapper tasks are initiated on the nodes where the input data blocks are present. Once the mapper phase is complete, the map output is aggregated locally by a combiner, if one is configured. After the combiner finishes, the intermediate data is shuffled across the nodes in the cluster; reducers cannot start until all the mappers have finished. The output from the reducers is written to the specified folder in HDFS.
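To make this flow concrete, here is a minimal sketch of what the com.demo.PageViewCounter driver could look like; the PageViewMapper and PageViewReducer class names, and the reuse of the reducer as a combiner, are assumptions for illustration rather than the recipe's exact code:

```java
package com.demo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageViewCounter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "page view counter");
        job.setJarByClass(PageViewCounter.class);     // tells Hadoop which JAR to ship to the cluster
        job.setMapperClass(PageViewMapper.class);     // assumed mapper class name
        job.setCombinerClass(PageViewReducer.class);  // local aggregation before the shuffle
        job.setReducerClass(PageViewReducer.class);   // assumed reducer class name
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // for example, /logs
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // for example, /pageview_output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that reusing the reducer as a combiner only works because counting is associative and commutative; for other aggregation logic, a separate combiner class may be needed.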
Note
The output folder specified must not already exist in HDFS. If the folder is already present, the program will give you an error.
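If you want to rerun the job without deleting the folder by hand, one common pattern (our addition, not part of the recipe) is to remove the output folder from the driver before submitting the job, using the standard FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Call this from the driver's main() before job.waitForCompletion().
static void clearOutputDir(Configuration conf, String dir) throws java.io.IOException {
    FileSystem fs = FileSystem.get(conf);
    Path outputPath = new Path(dir);
    if (fs.exists(outputPath)) {
        // true = recursive delete; this silently discards any previous results
        fs.delete(outputPath, true);
    }
}
```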
When all the tasks of the application are finished, you can take a look at the output in HDFS. The following are the commands to do this:
hadoop fs -ls /pageview_output
hadoop fs -cat /pageview_output/part-r-00000
This way, you can write similar programs for the following (a starter sketch for the error counters follows this list):
- The most common referral sites (hint: use the referrer group from the matcher)
- The number of client errors (HTTP status 4XX)
- The number of server errors (HTTP status 5XX)
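As a starting point for the two error-counting variants, the following is a sketch of a mapper that buckets requests by HTTP status class. The StatusClassMapper name, the LOG_PATTERN regex, and its group index are assumptions based on the Common Log Format, so adjust them to the actual layout of web.log:

```java
package com.demo;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StatusClassMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Assumed Common Log Format: the status code is the 3-digit field
    // immediately after the quoted request string.
    private static final Pattern LOG_PATTERN = Pattern.compile("\" (\\d{3}) ");
    private static final IntWritable ONE = new IntWritable(1);
    private final Text statusClass = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Matcher matcher = LOG_PATTERN.matcher(value.toString());
        if (matcher.find()) {
            String status = matcher.group(1);
            // Emit "4XX" for client errors, "5XX" for server errors, and so on
            statusClass.set(status.charAt(0) + "XX");
            context.write(statusClass, ONE);
        }
    }
}
```

Paired with the same sum-style reducer as the page view counter, this produces one count per status class; the referral-site variant only needs the regex group changed to capture the referrer field instead.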