
Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce

Amazon Elastic MapReduce (EMR) provides on-demand managed Hadoop clusters in the Amazon Web Services (AWS) cloud to perform your Hadoop MapReduce computations. EMR uses Amazon Elastic Compute Cloud (EC2) instances as the compute resources. EMR supports reading input data from Amazon Simple Storage Service (S3) and storing the output data in Amazon S3 as well. EMR takes care of provisioning the cloud instances, configuring the Hadoop cluster, and executing our MapReduce computational flows.
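
For example, the input data for an EMR computation is typically staged in an S3 bucket before the cluster is launched. The following minimal sketch, which is not part of the original recipe, shows such an upload using the Python boto3 library; it assumes boto3 is installed, AWS credentials are configured, and it reuses the example input bucket name from the steps below:

    import boto3

    s3 = boto3.client("s3")

    # Bucket names are shared across all S3 users, so this exact name may
    # already be taken; substitute your own bucket name if needed.
    # (Outside us-east-1, also pass CreateBucketConfiguration with your region.)
    s3.create_bucket(Bucket="wc-input-data")

    # Upload one or more local text files as the WordCount input.
    s3.upload_file("input.txt", "wc-input-data", "input/input.txt")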

In this recipe, we are going to execute the WordCount MapReduce sample (from the Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode recipe of Chapter 1, Getting Started with Hadoop v2) on Amazon EC2 using the Amazon Elastic MapReduce service.

Getting ready

Build the hcb-c1-samples.jar file by running the Gradle build in the chapter1 folder of the sample code repository.

How to do it...

The following are the steps for executing the WordCount MapReduce application on Amazon Elastic MapReduce:

  1. Sign up for an AWS account by visiting http://aws.amazon.com.
  2. Open the Amazon S3 monitoring console at https://console.aws.amazon.com/s3 and sign in.
  3. Create an S3 bucket to upload the input data by clicking on Create Bucket. Provide a unique name for your bucket. Let's assume the name of the bucket is wc-input-data. You can find more information on creating an S3 bucket at http://docs.amazonwebservices.com/AmazonS3/latest/gsg/CreatingABucket.html. Several third-party desktop clients are also available for Amazon S3; you can use one of those clients to manage your data in S3 as well.
  4. Upload your input data to the bucket we just created by selecting the bucket and clicking on Upload. The input data for the WordCount sample should be one or more text files.
  5. Create an S3 bucket to upload the JAR file needed for our MapReduce computation. Let's assume the name of this bucket is sample-jars. Upload hcb-c1-samples.jar to the newly created bucket.
  6. Create an S3 bucket to store the output data of the computation. Let's assume the name of this bucket is wc-output-data. Create another S3 bucket to store the logs of the computation. Let's assume the name of this bucket is hcb-c2-logs.

    Note

    Note that all S3 users share a single S3 bucket namespace. Hence, the example bucket names given in this recipe might not be available to you. In such scenarios, you should give your own custom names to the buckets and substitute those names in the subsequent steps of this recipe.

  7. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create Cluster button to create a new EMR cluster. Provide a name for your cluster.
  8. In the Log folder S3 location textbox, enter the path of the S3 bucket you created earlier to store the logs. Select the Enabled radio button for Debugging.
  9. Select the Hadoop distribution and version in the Software Configuration section. Select AMI version 3.0.3 or above with the Amazon Hadoop distribution to deploy a Hadoop v2 cluster. Leave the default selected applications (Hive, Pig, and Hue) in the Applications to be installed section.
  10. Select the EC2 instance types, instance counts, and the availability zone in the Hardware Configuration section. The default options use two EC2 m1.large instances for the Hadoop slave nodes and one EC2 m1.large instance for the Hadoop master node.
  11. Leave the default options in the Security and Access and Bootstrap Actions sections.
  12. Select the Custom Jar option under the Add Step dropdown of the Steps section. Click on Configure and add to configure the JAR file for our computation. Specify the S3 location of hcb-c1-samples.jar in the Jar S3 location textbox, using the s3n://bucket_name/jar_name format. In the Arguments textbox, type chapter1.WordCount followed by the S3 location of the input data you uploaded in step 4 and an output path inside the output bucket you created in step 6. The output path must not already exist; use a directory inside the output bucket (for example, wc-output-data/out1) as the output path. Specify both locations in the s3n://bucket_name/path format.
  13. Click on Create Cluster to launch the EMR Hadoop cluster and run the WordCount application. (The same launch can also be scripted; see the sketch after this list.)

    Note

    Amazon will charge you for the compute and storage resources you use when clicking on Create Cluster in step 13. Refer to the Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe to find out how you can save money by using Amazon EC2 Spot Instances.

    Note that AWS bills you by the hour, and any partial hour of usage is billed as a full hour. Each launch and shutdown of an instance is therefore billed as at least one hour, even if the instance runs for only a few minutes. Be aware of these expenses when frequently re-launching clusters for testing purposes.

  14. Monitor the progress of your MapReduce cluster deployment and the computation in the Cluster List | Cluster Details page of the Elastic MapReduce console. Expand the Steps section of the page to see the status of the individual steps of the cluster setup and the application execution. Select a step and click on View logs to view the logs and to debug the computation. Since EMR uploads the logfiles periodically, you might have to wait and refresh to access the logfiles. Check the output of the computation in the output data bucket using the AWS S3 console.
  15. Terminate your cluster to avoid getting billed for instances that are left running. However, you may leave the cluster running to try out the other recipes in this chapter.
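
As mentioned in step 13, the same job flow can be launched programmatically instead of through the console. The following is a rough, illustrative sketch using the Python boto3 library; it is not part of the original recipe, and it assumes boto3 is installed, AWS credentials are configured, the default EMR IAM roles (EMR_DefaultRole and EMR_EC2_DefaultRole) already exist in your account, and the example bucket names from the earlier steps are used:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Launch a Hadoop v2 cluster (one master and two slave m1.large instances,
    # mirroring the defaults in the Hardware Configuration section) and submit
    # the WordCount JAR as a custom JAR step.
    response = emr.run_job_flow(
        Name="hcb-c2-wordcount",
        LogUri="s3n://hcb-c2-logs/",
        AmiVersion="3.0.3",  # AMI 3.x provides a Hadoop v2 cluster
        Instances={
            "MasterInstanceType": "m1.large",
            "SlaveInstanceType": "m1.large",
            "InstanceCount": 3,
            # Shut the cluster down automatically once all steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "wordcount",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "s3n://sample-jars/hcb-c1-samples.jar",
                "Args": [
                    "chapter1.WordCount",
                    "s3n://wc-input-data/input",
                    "s3n://wc-output-data/out1",  # must not already exist
                ],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        VisibleToAllUsers=True,
    )
    cluster_id = response["JobFlowId"]

    # Poll the cluster state (the console's Cluster Details page shows the
    # same information), and terminate explicitly if you keep the cluster alive.
    state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
    print(cluster_id, state)
    # emr.terminate_job_flows(JobFlowIds=[cluster_id])

With KeepJobFlowAliveWhenNoSteps set to False, the cluster terminates on its own after the step completes, which matches the billing advice in step 13; set it to True if you want to keep the cluster alive to try out the other recipes in this chapter.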

See also

  • The Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode recipe from Chapter 1, Getting Started with Hadoop v2
  • The Running the WordCount program in a distributed cluster environment recipe from Chapter 1, Getting Started with Hadoop v2