- Hadoop MapReduce v2 Cookbook (Second Edition)
- Thilina Gunarathne
Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce
Amazon Elastic MapReduce (EMR) provides on-demand managed Hadoop clusters in the Amazon Web Services (AWS) cloud to perform your Hadoop MapReduce computations. EMR uses Amazon Elastic Compute Cloud (EC2) instances as the compute resources. EMR supports reading input data from Amazon Simple Storage Service (S3) and storing the output data in Amazon S3 as well. EMR takes care of provisioning the cloud instances, configuring the Hadoop cluster, and executing our MapReduce computational flows.
In this recipe, we are going to execute the WordCount MapReduce sample (from the Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode recipe of Chapter 1, Getting Started with Hadoop v2) on Amazon EC2 using the Amazon Elastic MapReduce service.
Getting ready
Build the hcb-c1-samples.jar file by running the Gradle build in the chapter1 folder of the sample code repository.
How to do it...
The following are the steps for executing the WordCount MapReduce application on Amazon Elastic MapReduce:
- Sign up for an AWS account by visiting http://aws.amazon.com.
- Open the Amazon S3 monitoring console at https://console.aws.amazon.com/s3 and sign in.
- Create an S3 bucket to upload the input data by clicking on Create Bucket. Provide a unique name for your bucket. Let's assume the name of the bucket is wc-input-data. You can find more information on creating an S3 bucket at http://docs.amazonwebservices.com/AmazonS3/latest/gsg/CreatingABucket.html. Several third-party desktop clients for Amazon S3 also exist; you can use one of those clients to manage your data in S3 as well.
- Upload your input data to the bucket we just created by selecting the bucket and clicking on Upload. The input data for the WordCount sample should be one or more text files.
- Create an S3 bucket to upload the JAR file needed for our MapReduce computation. Let's assume the name of the bucket is sample-jars. Upload hcb-c1-samples.jar to the newly created bucket.
- Create an S3 bucket to store the output data of the computation. Let's assume the name of this bucket is wc-output-data. Create another S3 bucket to store the logs of the computation. Let's assume the name of this bucket is hcb-c2-logs.
- Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create Cluster button to create a new EMR cluster. Provide a name for your cluster.
- In the Log folder S3 location textbox, enter the path of the S3 bucket you created earlier to store the logs. Select the Enabled radio button for Debugging.
- Select the Hadoop distribution and version in the Software Configuration section. Select AMI version 3.0.3 or above with the Amazon Hadoop distribution to deploy a Hadoop v2 cluster. Leave the default selected applications (Hive, Pig, and Hue) in the Applications to be installed section.
- Select the EC2 instance types, instance counts, and the availability zone in the Hardware Configuration section. The default options use two EC2 m1.large instances for the Hadoop slave nodes and one EC2 m1.large instance for the Hadoop Master node.
- Leave the default options in the Security and Access and Bootstrap Actions sections.
- Select the Custom Jar option under the Add Step dropdown of the Steps section. Click on Configure and add to configure the JAR file for our computation. Specify the S3 location of hcb-c1-samples.jar in the Jar S3 location textbox. You should specify the location of the JAR in the format s3n://bucket_name/jar_name. In the Arguments textbox, type chapter1.WordCount followed by the bucket location where you uploaded the input data in step 4 and the output data bucket you created in step 6. The output path should not already exist; we use a directory (for example, wc-output-data/out1) inside the output bucket you created in step 6 as the output path. You should specify the locations using the format s3n://bucket_name/path (see the driver sketch below for how these arguments reach the computation).
- Click on Create Cluster to launch the EMR Hadoop cluster and run the WordCount application.
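For clarity, the following is a minimal sketch of what a chapter1.WordCount driver compatible with this step looks like, assuming the standard Hadoop v2 WordCount structure introduced in Chapter 1; the actual classes in hcb-c1-samples.jar may differ in detail. EMR passes everything you type after the class name in the Arguments textbox to the main() method as args, so args[0] and args[1] here receive the s3n:// input and output locations:

```java
package chapter1;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits (word, 1) for every token in each input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Sums the counts for each word; also used as the combiner.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // On EMR, these arrive from the Arguments textbox, for example
        // args[0] = s3n://wc-input-data and args[1] = s3n://wc-output-data/out1.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because the output path must not already exist, rerunning the step with the same out1 directory will fail; pick a fresh directory (for example, out2) for each run.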
Note
Amazon will charge you for the compute and storage resources you use when clicking on Create Cluster in step 13. Refer to the Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe to find out how you can save money by using Amazon EC2 Spot Instances.
Note that AWS bills by the hour, and any partial usage is billed as a full hour. Each launch and stop of an instance is billed as at least one hour, even if it runs for only minutes. Be aware of these expenses when frequently relaunching clusters for testing purposes.
- Monitor the progress of your MapReduce cluster deployment and the computation in the Cluster List | Cluster Details page of the Elastic MapReduce console. Expand the Steps section of the page to see the status of the individual steps of the cluster setup and the application execution. Select a step and click on View logs to view the logs and to debug the computation. Since EMR uploads the logfiles periodically, you might have to wait and refresh to access them. Check the output of the computation in the output data bucket using the AWS S3 console, or read it programmatically as in the sketch after these steps.
- Terminate your cluster to avoid getting billed for instances that are left running. However, you may also leave the cluster running to try out the other recipes in this chapter.
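If you prefer to inspect the results programmatically rather than through the S3 console, the following is a minimal sketch that reads the job output using the Hadoop FileSystem API from a machine with a Hadoop 2.x installation. The bucket and directory names are the ones assumed in this recipe, and the credential values are placeholders you must replace with your own AWS keys:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWordCountOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hadoop 2.x s3n credential properties; replace the placeholder values.
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        // Output directory used in this recipe's Custom Jar step.
        Path outputDir = new Path("s3n://wc-output-data/out1");
        FileSystem fs = FileSystem.get(URI.create("s3n://wc-output-data"), conf);

        // Each reducer writes a part-r-NNNNN file under the output directory.
        for (FileStatus status : fs.listStatus(outputDir)) {
            if (!status.getPath().getName().startsWith("part-")) {
                continue; // skip _SUCCESS and other marker files
            }
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(status.getPath())))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // word<TAB>count pairs
                }
            }
        }
    }
}
```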