
Running a simple MapReduce program

In this recipe, we will look at how to make sense of the data stored on HDFS and extract useful information from it, such as the number of occurrences of a string or pattern, estimations, and various benchmarks. For this purpose, we can use MapReduce, a computation framework that helps us answer many questions we might have about the data.

With Hadoop, we can process huge amounts of data. However, to get an understanding of how it works, we will start with simple programs such as pi estimation and word count.

The ResourceManager is the master for Yet Another Resource Negotiator (YARN). The NameNode stores the file metadata, and the actual blocks/data reside on the slave nodes, called DataNodes. All jobs are submitted to the ResourceManager, which then assigns tasks to its slaves, called NodeManagers.
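
To see this layout on a running cluster, you can ask each master to report its slaves; for example (assuming HDFS and YARN are already up, as configured in the previous chapters):

hdfs dfsadmin -report
yarn node -list

The first command lists the DataNodes registered with the NameNode, and the second lists the NodeManagers registered with the ResourceManager.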

When a job is submitted to the ResourceManager (RM), it checks which queue the job was submitted to and whether the user has permission to submit to it. It then asks the Application Master Launcher (AM Launcher) to launch an Application Master (AM) container on a node. From that point on, the AM is responsible for running the map and reduce containers, using resources granted by the RM.
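
From the client side, you can inspect the configured queues and the applications the RM is currently tracking with the standard tools (nothing beyond a running cluster is assumed):

mapred queue -list
yarn application -list -appStates RUNNING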

The AM takes care of failures and relaunches the containers if needed. Although there are many more concepts, such as resource grants, the ApplicationsManager, and retry logic, this being a recipe book we will keep the theory to a minimum. The following diagram shows the relationship between the AM, RM, and NodeManager.

[Diagram: relationship between the AM, RM, and NodeManager]
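
As a concrete example of the retry logic, the number of times the RM will relaunch a failed AM is controlled by the yarn.resourcemanager.am.max-attempts property in yarn-site.xml; the snippet below shows it at its default value of 2:

<property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>2</value>
</property>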

Getting ready

To step through the recipes in this chapter, make sure you have a running cluster with HDFS and YARN set up correctly, as discussed in the previous chapters. This can be a single-node cluster or a multinode cluster, as long as the cluster is configured correctly.
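
A quick sanity check is to run jps on the nodes and confirm that the daemons are up; on a single-node cluster, NameNode, DataNode, ResourceManager, and NodeManager should all appear together, while on a multinode cluster they will be spread across the master and slave nodes:

jps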

How to do it...

  1. Connect to an edge node in the cluster; this is the preferred way, although you can connect to any node in the cluster. If you are connecting to a DataNode, please go through Chapter 2, Maintain Hadoop Cluster HDFS, to understand the disadvantages of doing so.
  2. Switch to the user hadoop (the commands below assume this user was created during cluster setup):
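    su - hadoop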
  3. Navigate to the directory where Hadoop is installed. Hadoop is bundled with many examples that you can run to test the cluster and get started.
  4. Under /opt/cluster/hadoop/share/hadoop/mapreduce/, there are a lot of example JARs, which you can list as shown here:
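    ls /opt/cluster/hadoop/share/hadoop/mapreduce/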
  5. To see all the programs a JAR can run, execute the JAR without any arguments:
    yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar
  6. To run a pi estimation job, execute hadoop-mapreduce-examples-2.7.2.jar with the pi class, passing the number of maps and the number of samples per map as arguments, as shown here:
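    yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 16 1000

    Here, 16 maps and 1000 samples per map are illustrative values; increasing the number of samples gives a closer estimate of pi.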

Similarly, there are many other examples that can be run using the example JAR. To run a wordcount example, create a simple test file, copy it to HDFS, and execute the program as follows:

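For example (the file name and contents here are arbitrary, but the HDFS path must match the one passed to the job):

echo "hello hadoop hello mapreduce" > test_file
hdfs dfs -put test_file /input_file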
yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input_file /out_directory
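
Once the job finishes, the counts can be read back from HDFS; assuming the default single reducer, the output is written to part-r-00000:

hdfs dfs -cat /out_directory/part-r-00000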
The JAR hadoop-mapreduce-examples-2.7.2.jar is a bundled JAR containing various classes that can be invoked, such as wordcount, pi, and many more. Each class we call takes some input and writes some output. We can also write our own classes/JARs, or invoke just a single class from the bundled JAR.
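
For instance, the bundled grep class takes an input path, an output path, and a regular expression (the paths and the pattern here are illustrative):

yarn jar /opt/cluster/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep /input_file /grep_out 'hello.*'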