Creating an Amazon EMR job flow using the AWS Command Line Interface

The AWS Command Line Interface (CLI) is a tool that allows us to manage our AWS services from the command line. In this recipe, we use the AWS CLI to manage Amazon EMR services.

This recipe creates an EMR job flow using the AWS CLI to execute the WordCount sample from the Running Hadoop MapReduce computations using Amazon Elastic MapReduce recipe of this chapter.

Getting ready

The following are the prerequisites to get started with this recipe:

  • Python 2.6.3 or later
  • pip, the Python package management system

How to do it...

The following steps show you how to create an EMR job flow using the EMR command-line interface:

  1. Install the AWS CLI on your machine using the pip installer:
    $ sudo pip install awscli
    

    Note

    Refer to http://docs.aws.amazon.com/cli/latest/userguide/installing.html for more information on installing the AWS CLI. This guide provides instructions on installing AWS CLI without sudo as well as instructions on installing AWS CLI using alternate methods.

  2. Create an access key ID and a secret access key by logging in to the AWS IAM console (https://console.aws.amazon.com/iam). Download and save the key file in a safe location.
  3. Use the aws configure utility to configure your AWS account with the AWS CLI. Provide the access key ID and the secret access key you obtained in the previous step. This information is stored in the .aws/config and .aws/credentials files in your home directory.
    $ aws configure
    AWS Access Key ID [None]: AKIA….
    AWS Secret Access Key [None]: GC…
    Default region name [None]: us-east-1
    Default output format [None]: 
    

    Tip

    You can skip to step 7 if you have completed steps 2 to 6 of the Running Hadoop MapReduce computations using Amazon Elastic MapReduce recipe in this chapter.
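For reference, the aws configure command stores these values in plain-text files similar to the following (the key values are the placeholders from the session above):

```
# ~/.aws/credentials
[default]
aws_access_key_id = AKIA....
aws_secret_access_key = GC....

# ~/.aws/config
[default]
region = us-east-1
```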

  4. Create a bucket to upload the input data by clicking on Create Bucket in the Amazon S3 monitoring console (https://console.aws.amazon.com/s3). Provide a unique name for your bucket. Upload your input data to the newly-created bucket by selecting the bucket and clicking on Upload. The input data for the WordCount sample should be one or more text files.
  5. Create an S3 bucket to upload the JAR file needed for our MapReduce computation. Upload hcb-c1-samples.jar to the newly created bucket.
  6. Create an S3 bucket to store the output data of the computation. Create another S3 bucket to store the logs of the computation. Let's assume the name of this bucket is hcb-c2-logs.
  7. Create an EMR cluster by executing the following command. This command will output the cluster ID of the created EMR cluster:
    $ aws emr create-cluster --ami-version 3.1.0 \
    --log-uri s3://hcb-c2-logs \
    --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,\
    InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,\
    InstanceType=m3.xlarge
    {
     "ClusterId": "j-2X9TDN6T041ZZ"
    }
    
  8. You can use the list-clusters command to check the status of the created EMR cluster:
    $ aws emr list-clusters
    {
      "Clusters": [
        {
          "Status": {
            "Timeline": {
              "ReadyDateTime": 1421128629.1830001,
              "CreationDateTime": 1421128354.4130001
            },
            "State": "WAITING",
            "StateChangeReason": {
              "Message": "Waiting after step completed"
            }
          },
          "NormalizedInstanceHours": 24,
          "Id": "j-2X9TDN6T041ZZ",
          "Name": "Development Cluster"
        }
      ]
    }
    
  9. Add a job step to this EMR cluster by executing the following command. Replace the paths of the JAR file, input data location, and the output data location with the S3 locations you used in steps 4 to 6. Replace cluster-id with the cluster ID of your newly created EMR cluster.
    $ aws emr add-steps \
    --cluster-id j-2X9TDN6T041ZZ \
    --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,\
    Jar=s3n://[S3 jar file bucket]/hcb-c1-samples.jar,\
    Args=chapter1.WordCount,\
    s3n://[S3 input data path]/*,\
    s3n://[S3 output data path]/wc-out
    {
      "StepIds": [
        "s-1SEEPDZ99H3Y2"
      ]
    }
    
  10. Check the status of the submitted job step using the describe-step command as follows. You can also check the status and debug your job flow using the Amazon EMR console (https://console.aws.amazon.com/elasticmapreduce).
    $ aws emr describe-step \
    --cluster-id j-2X9TDN6T041ZZ \
    --step-id s-1SEEPDZ99H3Y2
    
  11. Once the job flow is completed, check the result of the computation in the output data location using the S3 console.
  12. Terminate the cluster using the terminate-clusters command:
    $ aws emr terminate-clusters --cluster-ids j-2X9TDN6T041ZZ
    

There's more...

You can use EC2 Spot Instances with your EMR clusters to reduce the cost of your computations. Add a bid price to your request by adding the BidPrice parameter to each instance group in your create-cluster command:

$ aws emr create-cluster --ami-version 3.1.0 \
--log-uri s3://hcb-c2-logs \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,\
InstanceType=m3.xlarge,BidPrice=0.10 \
InstanceGroupType=CORE,InstanceCount=2,\
InstanceType=m3.xlarge,BidPrice=0.10

Refer to the Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe in this chapter for more details on Amazon Spot Instances.
