- Hadoop MapReduce v2 Cookbook(Second Edition)
- Thilina Gunarathne
Creating an Amazon EMR job flow using the AWS Command Line Interface
The AWS Command Line Interface (CLI) is a tool that allows us to manage AWS services from the command line. In this recipe, we use the AWS CLI to manage Amazon EMR services.
This recipe creates an EMR job flow using the AWS CLI to execute the WordCount sample from the Running Hadoop MapReduce computations using Amazon Elastic MapReduce recipe of this chapter.
Getting ready
The following are the prerequisites to get started with this recipe:
- Python 2.6.3 or higher
- pip—Python package management system
How to do it...
The following steps show you how to create an EMR job flow using the EMR command-line interface:
- Install the AWS CLI on your machine using the pip installer:
$ sudo pip install awscli
Note
Refer to http://docs.aws.amazon.com/cli/latest/userguide/installing.html for more information on installing the AWS CLI. This guide provides instructions on installing the AWS CLI without sudo, as well as instructions on installing it using alternate methods.
- Create an access key ID and a secret access key by logging in to the AWS IAM console (https://console.aws.amazon.com/iam). Download and save the key file in a safe location.
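Before moving on, it can be worth confirming that the installation from the first step succeeded. A quick sketch, assuming only a working pip install of awscli:

```shell
# Confirm the AWS CLI binary is on the PATH and report its version.
aws --version

# Show the installed package metadata via pip.
pip show awscli
```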
- Use the aws configure utility to configure your AWS account with the AWS CLI. Provide the access key ID and the secret access key you obtained in the previous step. This information is stored in the .aws/config and .aws/credentials files in your home directory:
$ aws configure
AWS Access Key ID [None]: AKIA….
AWS Secret Access Key [None]: GC…
Default region name [None]: us-east-1
Default output format [None]:
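If you prefer to script this configuration rather than answer the interactive prompts, the same values can be written with the aws configure set subcommand. The key values below are placeholders, not real credentials:

```shell
# Write each setting non-interactively; equivalent to the prompts above.
# The access key and secret here are placeholder values.
aws configure set aws_access_key_id AKIAEXAMPLEKEY
aws configure set aws_secret_access_key ExampleSecretAccessKeyValue
aws configure set region us-east-1
aws configure set output json

# Inspect the resulting configuration.
aws configure list
```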
- Create a bucket to upload the input data by clicking on Create Bucket in the Amazon S3 monitoring console (https://console.aws.amazon.com/s3). Provide a unique name for your bucket. Upload your input data to the newly created bucket by selecting the bucket and clicking on Upload. The input data for the WordCount sample should be one or more text files.
- Create an S3 bucket to upload the JAR file needed for our MapReduce computation. Upload hcb-c1-samples.jar to the newly created bucket.
- Create an S3 bucket to store the output data of the computation.
- Create another S3 bucket to store the logs of the computation. Let's assume the name of this bucket is hcb-c2-logs.
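The bucket creation and uploads in the preceding steps can also be done from the AWS CLI itself using the aws s3 commands. In this sketch, every bucket name other than hcb-c2-logs is a placeholder; S3 bucket names must be globally unique, so substitute your own:

```shell
# Create buckets for the input data, the JAR file, the output, and the logs.
# Replace the placeholder names with globally unique bucket names.
aws s3 mb s3://my-wc-input
aws s3 mb s3://my-wc-jars
aws s3 mb s3://my-wc-output
aws s3 mb s3://hcb-c2-logs

# Upload the input text file(s) and the sample JAR.
aws s3 cp input.txt s3://my-wc-input/
aws s3 cp hcb-c1-samples.jar s3://my-wc-jars/
```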
. - Create an EMR cluster by executing the following command. This command will output the cluster ID of the created EMR cluster:
$ aws emr create-cluster --ami-version 3.1.0 \
--log-uri s3://hcb-c2-logs \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,\
InstanceType=m3.xlarge \
InstanceGroupType=CORE,InstanceCount=2,\
InstanceType=m3.xlarge
{
    "ClusterId": "j-2X9TDN6T041ZZ"
}
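When scripting, the cluster ID can be captured directly instead of being copied from the JSON by hand, using the CLI's built-in --query (JMESPath) and --output options. A sketch:

```shell
# Create the cluster and capture only the ClusterId field as plain text.
CLUSTER_ID=$(aws emr create-cluster --ami-version 3.1.0 \
  --log-uri s3://hcb-c2-logs \
  --instance-groups \
  InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
  InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
  --query 'ClusterId' --output text)
echo "Created cluster: ${CLUSTER_ID}"
```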
- You can use the list-clusters command to check the status of the created EMR cluster:
$ aws emr list-clusters
{
    "Clusters": [
        {
            "Status": {
                "Timeline": {
                    "ReadyDateTime": 1421128629.1830001,
                    "CreationDateTime": 1421128354.4130001
                },
                "State": "WAITING",
                "StateChangeReason": {
                    "Message": "Waiting after step completed"
                }
            },
            "NormalizedInstanceHours": 24,
            "Id": "j-2X9TDN6T041ZZ",
            "Name": "Development Cluster"
        }
    ]
}
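Rather than scanning the full list-clusters output by eye, a small polling loop can wait until the cluster is ready to accept steps. This sketch uses the describe-cluster subcommand with a --query expression; substitute your own cluster ID:

```shell
# Poll the cluster state every 30 seconds until it is ready for job steps.
CLUSTER_ID=j-2X9TDN6T041ZZ   # substitute your cluster ID
while true; do
  STATE=$(aws emr describe-cluster --cluster-id "${CLUSTER_ID}" \
    --query 'Cluster.Status.State' --output text)
  echo "Cluster state: ${STATE}"
  case "${STATE}" in
    WAITING|RUNNING) break ;;
    TERMINATED|TERMINATED_WITH_ERRORS) echo "Cluster failed" >&2; exit 1 ;;
  esac
  sleep 30
done
```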
- Add a job step to this EMR cluster by executing the following command. Replace the paths of the JAR file, the input data location, and the output data location with the locations you used in steps 5, 6, and 7. Replace cluster-id with the cluster ID of your newly created EMR cluster:
$ aws emr add-steps \
--cluster-id j-2X9TDN6T041ZZ \
--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,\
Jar=s3n://[S3 jar file bucket]/hcb-c1-samples.jar,\
Args=chapter1.WordCount,\
s3n://[S3 input data path]/*,\
s3n://[S3 output data path]/wc-out
{
    "StepIds": [
        "s-1SEEPDZ99H3Y2"
    ]
}
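As with the cluster ID, the step ID can be captured into a shell variable for later status checks. In this sketch the my-wc-jars, my-wc-input, and my-wc-output bucket names are placeholders for the buckets you created earlier:

```shell
# Submit the WordCount step and capture the first StepId from the response.
STEP_ID=$(aws emr add-steps \
  --cluster-id j-2X9TDN6T041ZZ \
  --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,\
Jar=s3n://my-wc-jars/hcb-c1-samples.jar,\
Args=chapter1.WordCount,s3n://my-wc-input/*,s3n://my-wc-output/wc-out \
  --query 'StepIds[0]' --output text)
echo "Submitted step: ${STEP_ID}"
```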
- Check the status of the submitted job step using the describe-step command as follows. You can also check the status and debug your job flow using the Amazon EMR console (https://console.aws.amazon.com/elasticmapreduce):
$ aws emr describe-step \
--cluster-id j-2X9TDN6T041ZZ \
--step-id s-1SEEPDZ99H3Y2
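The describe-step call can likewise be wrapped in a loop that waits for the step to finish, checking for the terminal states COMPLETED, FAILED, or CANCELLED. A sketch, using the same placeholder IDs as above:

```shell
# Poll the step state every 30 seconds until it reaches a terminal state.
CLUSTER_ID=j-2X9TDN6T041ZZ   # substitute your IDs
STEP_ID=s-1SEEPDZ99H3Y2
while true; do
  STATE=$(aws emr describe-step \
    --cluster-id "${CLUSTER_ID}" --step-id "${STEP_ID}" \
    --query 'Step.Status.State' --output text)
  echo "Step state: ${STATE}"
  case "${STATE}" in
    COMPLETED|FAILED|CANCELLED) break ;;
  esac
  sleep 30
done
```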
- Once the job flow is completed, check the result of the computation in the output data location using the S3 console.
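The results can also be inspected and downloaded from the command line instead of the S3 console. The my-wc-output bucket name is a placeholder for your output bucket:

```shell
# List the reducer output files produced by the WordCount computation.
aws s3 ls s3://my-wc-output/wc-out/

# Download the whole output directory for local inspection.
aws s3 cp s3://my-wc-output/wc-out/ ./wc-out/ --recursive
```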
- Terminate the cluster using the terminate-clusters command:
$ aws emr terminate-clusters --cluster-ids j-2X9TDN6T041ZZ
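Termination is asynchronous; the cluster passes through a TERMINATING state before reaching TERMINATED. You can confirm the shutdown with describe-cluster, as in this sketch:

```shell
# Check that the cluster has actually shut down (expect TERMINATING,
# then TERMINATED once shutdown completes).
aws emr describe-cluster --cluster-id j-2X9TDN6T041ZZ \
  --query 'Cluster.Status.State' --output text
```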
There's more...
You can use EC2 Spot Instances with your EMR clusters to reduce the cost of your computations. Add a bid price to your request by adding the BidPrice parameter to the instance groups of your create-cluster command:
$ aws emr create-cluster --ami-version 3.1.0 \
--log-uri s3://hcb-c2-logs \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,\
InstanceType=m3.xlarge,BidPrice=0.10 \
InstanceGroupType=CORE,InstanceCount=2,\
InstanceType=m3.xlarge,BidPrice=0.10
Refer to the Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe in this chapter for more details on Amazon Spot Instances.
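Before choosing a bid price, it helps to look at the recent spot price history for the instance type. A sketch using the EC2 describe-spot-price-history subcommand:

```shell
# Show a few recent spot prices for m3.xlarge Linux instances in us-east-1.
aws ec2 describe-spot-price-history \
  --instance-types m3.xlarge \
  --product-descriptions "Linux/UNIX" \
  --region us-east-1 \
  --max-items 5
```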
See also
- The Running Hadoop MapReduce computations using Amazon Elastic MapReduce recipe of this chapter
- You can find the reference documentation for the EMR section of the AWS CLI at http://docs.aws.amazon.com/cli/latest/reference/emr/index.html