
Creating an Amazon EMR job flow using the AWS Command Line Interface

The AWS Command Line Interface (CLI) is a tool that allows us to manage AWS services from the command line. In this recipe, we use the AWS CLI to manage Amazon EMR services.

This recipe creates an EMR job flow using the AWS CLI to execute the WordCount sample from the Running Hadoop MapReduce computations using Amazon Elastic MapReduce recipe of this chapter.

Getting ready

The following are the prerequisites to get started with this recipe:

  • Python 2.6.3 or higher
  • pip—Python package management system

How to do it...

The following steps show you how to create an EMR job flow using the EMR command-line interface:

  1. Install the AWS CLI on your machine using the pip installer:
    $ sudo pip install awscli
    

    Note

    Refer to http://docs.aws.amazon.com/cli/latest/userguide/installing.html for more information on installing the AWS CLI. This guide provides instructions on installing AWS CLI without sudo as well as instructions on installing AWS CLI using alternate methods.

  2. Create an access key ID and a secret access key by logging in to the AWS IAM console (https://console.aws.amazon.com/iam). Download and save the key file in a safe location.
  3. Use the aws configure utility to configure the AWS CLI with your AWS account. Provide the access key ID and the secret access key you obtained in the previous step. This information is stored in the .aws/config and .aws/credentials files in your home directory.
    $ aws configure
    AWS Access Key ID [None]: AKIA….
    AWS Secret Access Key [None]: GC…
    Default region name [None]: us-east-1
    Default output format [None]: 
    

    Tip

    You can skip to step 7 if you have completed steps 2 to 6 of the Running Hadoop MapReduce computations using Amazon Elastic MapReduce recipe in this chapter.

  4. Create a bucket to upload the input data by clicking on Create Bucket in the Amazon S3 monitoring console (https://console.aws.amazon.com/s3). Provide a unique name for your bucket. Upload your input data to the newly created bucket by selecting the bucket and clicking on Upload. The input data for the WordCount sample should be one or more text files.
  5. Create an S3 bucket to upload the JAR file needed for our MapReduce computation. Upload hcb-c1-samples.jar to the newly created bucket.
  6. Create an S3 bucket to store the output data of the computation. Create another S3 bucket to store the logs of the computation. Let's assume the name of this bucket is hcb-c2-logs.
  7. Create an EMR cluster by executing the following command. This command will output the cluster ID of the created EMR cluster:
    $ aws emr create-cluster --ami-version 3.1.0 \
    --log-uri s3://hcb-c2-logs \
    --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,\
    InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,\
    InstanceType=m3.xlarge
    {
        "ClusterId": "j-2X9TDN6T041ZZ"
    }
    
  8. You can use the list-clusters command to check the status of the created EMR cluster:
    $ aws emr list-clusters
    {
        "Clusters": [
            {
                "Status": {
                    "Timeline": {
                        "ReadyDateTime": 1421128629.1830001,
                        "CreationDateTime": 1421128354.4130001
                    },
                    "State": "WAITING",
                    "StateChangeReason": {
                        "Message": "Waiting after step completed"
                    }
                },
                "NormalizedInstanceHours": 24,
                "Id": "j-2X9TDN6T041ZZ",
                "Name": "Development Cluster"
            }
        ]
    }
    
  9. Add a job step to this EMR cluster by executing the following command. Replace the JAR file path, the input data location, and the output data location with the S3 locations you used in steps 4 to 6, and replace the cluster ID with the ID of your newly created EMR cluster.
    $ aws emr add-steps \
    --cluster-id j-2X9TDN6T041ZZ \
    --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,\
    Jar=s3n://[S3 jar file bucket]/hcb-c1-samples.jar,\
    Args=chapter1.WordCount,\
    s3n://[S3 input data path]/*,\
    s3n://[S3 output data path]/wc-out
    {
        "StepIds": [
            "s-1SEEPDZ99H3Y2"
        ]
    }
    
  10. Check the status of the submitted job step using the describe-step command as follows. You can also check the status and debug your job flow using the Amazon EMR console (https://console.aws.amazon.com/elasticmapreduce).
    $ aws emr describe-step \
    --cluster-id j-2X9TDN6T041ZZ \
    --step-id s-1SEEPDZ99H3Y2
    
  11. Once the job flow is completed, check the result of the computation in the output data location using the S3 console.
  12. Terminate the cluster using the terminate-clusters command:
    $ aws emr terminate-clusters --cluster-ids j-2X9TDN6T041ZZ
    
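The interactive steps above can be sketched as a single script. This is a sketch under stated assumptions: the bucket names hcb-c1-jars, hcb-c1-input, and hcb-c1-output are hypothetical placeholders for the buckets you created in steps 4 to 6, and the aws emr wait subcommand requires a reasonably recent AWS CLI version. The aws() shell function at the top is a local stand-in stub that echoes canned IDs, so the control flow can be dry-run without an AWS account; delete it to run against real AWS.

```shell
#!/bin/bash
# Sketch of steps 7 to 12 as one script. The aws() function below is a
# local stand-in stub (echoing canned IDs) so the control flow can be
# dry-run without an AWS account. Delete it to use the real AWS CLI.
aws() {
  case "$2" in
    create-cluster) echo "j-2X9TDN6T041ZZ" ;;
    add-steps)      echo "s-1SEEPDZ99H3Y2" ;;
    *)              : ;;
  esac
}

# --query takes a JMESPath expression; with --output text it prints the
# bare ID, so we can capture it instead of copying it from the JSON.
CLUSTER_ID=$(aws emr create-cluster --ami-version 3.1.0 \
  --log-uri s3://hcb-c2-logs \
  --instance-groups \
  InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
  InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
  --query 'ClusterId' --output text)
echo "Created cluster: $CLUSTER_ID"

# hcb-c1-jars, hcb-c1-input, and hcb-c1-output are placeholder bucket
# names; substitute the buckets you created in steps 4 to 6.
STEP_ID=$(aws emr add-steps --cluster-id "$CLUSTER_ID" \
  --steps "Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3n://hcb-c1-jars/hcb-c1-samples.jar,Args=chapter1.WordCount,s3n://hcb-c1-input/*,s3n://hcb-c1-output/wc-out" \
  --query 'StepIds[0]' --output text)
echo "Submitted step: $STEP_ID"

# Block until the step finishes, then shut the cluster down.
aws emr wait step-complete --cluster-id "$CLUSTER_ID" --step-id "$STEP_ID"
aws emr terminate-clusters --cluster-ids "$CLUSTER_ID"
```

Capturing the IDs with --query avoids the error-prone copy and paste of cluster and step IDs between commands.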

There's more...

You can use EC2 Spot Instances with your EMR clusters to reduce the cost of your computations. Add a bid price to your request by adding the BidPrice property to the instance group definitions of your create-cluster command:

$ aws emr create-cluster --ami-version 3.1.0 \
--log-uri s3://hcb-c2-logs \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,\
InstanceType=m3.xlarge,BidPrice=0.10 \
InstanceGroupType=CORE,InstanceCount=2,\
InstanceType=m3.xlarge,BidPrice=0.10

Refer to the Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe in this chapter for more details on Amazon Spot Instances.
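Before choosing a bid price, it can help to inspect recent spot prices for the instance type. The following sketch uses the standard aws ec2 describe-spot-price-history command; here too, the aws() function is a stand-in stub printing sample zone/price pairs so the snippet can be dry-run, and should be removed to query live data.

```shell
#!/bin/bash
# Stand-in stub printing sample zone/price pairs; delete this function
# to query live spot prices with a configured AWS CLI.
aws() {
  printf 'us-east-1a\t0.0321\nus-east-1b\t0.0335\n'
}

# Recent m3.xlarge Linux spot prices: availability zone and hourly price.
# All flags are standard EC2 CLI options; --query is a JMESPath expression.
aws ec2 describe-spot-price-history \
  --instance-types m3.xlarge \
  --product-descriptions "Linux/UNIX" \
  --max-items 5 \
  --query 'SpotPriceHistory[].[AvailabilityZone,SpotPrice]' \
  --output text
```

A bid above the recent market price reduces the chance of your instances being reclaimed mid-computation.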

See also

  • The Running Hadoop MapReduce computations using Amazon Elastic MapReduce recipe of this chapter
  • The Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe of this chapter