
Creating an Amazon EMR job flow using the AWS Command Line Interface

The AWS Command Line Interface (CLI) is a tool that allows us to manage AWS services from the command line. In this recipe, we use the AWS CLI to manage Amazon EMR services.

This recipe creates an EMR job flow using the AWS CLI to execute the WordCount sample from the Running Hadoop MapReduce computations using Amazon Elastic MapReduce recipe of this chapter.

Getting ready

The following are the prerequisites to get started with this recipe:

  • Python 2.6.3 or higher
  • pip, the Python package management system

How to do it...

The following steps show you how to create an EMR job flow using the EMR command-line interface:

  1. Install the AWS CLI on your machine using the pip installer:
    $ sudo pip install awscli
    

    Note

    Refer to http://docs.aws.amazon.com/cli/latest/userguide/installing.html for more information on installing the AWS CLI. This guide provides instructions on installing AWS CLI without sudo as well as instructions on installing AWS CLI using alternate methods.
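After the installation completes, you can confirm that the CLI is available on your path by checking its version:

```shell
# Print the installed AWS CLI version to confirm the installation succeeded.
aws --version
```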

  2. Create an access key ID and a secret access key by logging in to the AWS IAM console (https://console.aws.amazon.com/iam). Download and save the key file in a safe location.
  3. Use the aws configure utility to configure your AWS account for the AWS CLI. Provide the access key ID and the secret access key you obtained in the previous step. This information is stored in the .aws/config and .aws/credentials files in your home directory.
    $ aws configure
    AWS Access Key ID [None]: AKIA….
    AWS Secret Access Key [None]: GC…
    Default region name [None]: us-east-1
    Default output format [None]: 
    

    Tip

    You can skip to step 7 if you have completed steps 2 to 6 of the Running Hadoop MapReduce computations using Amazon Elastic MapReduce recipe in this chapter.
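As a quick sanity check (not required by the recipe), the configure list subcommand shows each resolved configuration value, with keys partially masked, and where it was read from:

```shell
# Show the resolved AWS CLI configuration: access key, secret key,
# region, and the source of each value (config file, environment, etc.).
aws configure list
```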

  4. Create a bucket to upload the input data by clicking on Create Bucket in the Amazon S3 monitoring console (https://console.aws.amazon.com/s3). Provide a unique name for your bucket. Upload your input data to the newly created bucket by selecting the bucket and clicking on Upload. The input data for the WordCount sample should be one or more text files.
  5. Create an S3 bucket to upload the JAR file needed for our MapReduce computation. Upload hcb-c1-samples.jar to the newly created bucket.
  6. Create an S3 bucket to store the output data of the computation. Create another S3 bucket to store the logs of the computation. Let's assume the name of this bucket is hcb-c2-logs.
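Since the AWS CLI is already configured, steps 4 to 6 can also be performed from the command line instead of the S3 console. The bucket names below (other than hcb-c2-logs, which this recipe uses) and the input filename are placeholders; S3 bucket names are globally unique, so substitute your own:

```shell
# Create the input, JAR, output, and log buckets. The names are examples
# only; replace them with globally unique names of your own.
aws s3 mb s3://hcb-c2-input
aws s3 mb s3://hcb-c2-jars
aws s3 mb s3://hcb-c2-out
aws s3 mb s3://hcb-c2-logs

# Upload the input text file(s) and the sample JAR.
aws s3 cp input.txt s3://hcb-c2-input/
aws s3 cp hcb-c1-samples.jar s3://hcb-c2-jars/
```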
  7. Create an EMR cluster by executing the following command. This command will output the cluster ID of the created EMR cluster:
    $ aws emr create-cluster --ami-version 3.1.0 \
    --log-uri s3://hcb-c2-logs \
    --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,\
    InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,\
    InstanceType=m3.xlarge
    {
      "ClusterId": "j-2X9TDN6T041ZZ"
    }
    
  8. You can use the list-clusters command to check the status of the created EMR cluster:
    $ aws emr list-clusters
    {
      "Clusters": [
        {
          "Status": {
            "Timeline": {
              "ReadyDateTime": 1421128629.1830001,
              "CreationDateTime": 1421128354.4130001
            },
            "State": "WAITING",
            "StateChangeReason": {
              "Message": "Waiting after step completed"
            }
          },
          "NormalizedInstanceHours": 24,
          "Id": "j-2X9TDN6T041ZZ",
          "Name": "Development Cluster"
        }
      ]
    }
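Rather than polling list-clusters manually, the CLI's emr wait subcommand can block until the cluster is ready to accept steps. The cluster ID below is the one returned in step 7; replace it with your own:

```shell
# Block until the cluster has started and is ready to accept job steps.
# Replace the cluster ID with the one returned by create-cluster.
aws emr wait cluster-running --cluster-id j-2X9TDN6T041ZZ
```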
    
  9. Add a job step to this EMR cluster by executing the following. Replace the paths of the JAR file, input data location, and output data location with the locations you used in steps 4 to 6. Replace the cluster-id value with the cluster ID of your newly created EMR cluster.
    $ aws emr add-steps \
    --cluster-id j-2X9TDN6T041ZZ \
    --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,\
    Jar=s3n://[S3 jar file bucket]/hcb-c1-samples.jar,\
    Args=chapter1.WordCount,\
    s3n://[S3 input data path]/*,\
    s3n://[S3 output data path]/wc-out
    {
      "StepIds": [
        "s-1SEEPDZ99H3Y2"
      ]
    }
    
  10. Check the status of the submitted job step using the describe-step command as follows. You can also check the status and debug your job flow using the Amazon EMR console (https://console.aws.amazon.com/elasticmapreduce).
    $ aws emr describe-step \
    --cluster-id j-2X9TDN6T041ZZ \
    --step-id s-1SEEPDZ99H3Y2
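Similarly, you can block until the submitted step finishes instead of repeatedly running describe-step; the IDs below are the cluster ID and step ID from the earlier steps:

```shell
# Wait until the submitted job step has completed (or failed).
aws emr wait step-complete \
    --cluster-id j-2X9TDN6T041ZZ \
    --step-id s-1SEEPDZ99H3Y2
```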
    
  11. Once the job flow is completed, check the result of the computation in the output data location using the S3 console.
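The results can also be fetched from the command line. Assuming the output location used in step 9 (substitute your own output bucket for the placeholder), the reducer output files can be copied locally and inspected:

```shell
# Download the WordCount output files (part-r-00000, etc.) to a local
# directory. Replace the placeholder with your output data location.
aws s3 cp s3://[S3 output data path]/wc-out ./wc-out --recursive
cat ./wc-out/part-*
```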
  12. Terminate the cluster using the terminate-clusters command:
    $ aws emr terminate-clusters --cluster-ids j-2X9TDN6T041ZZ
    

There's more...

You can use EC2 Spot Instances with your EMR clusters to reduce the cost of your computations. Specify a bid price by adding a BidPrice property to each instance group definition in your create-cluster command:

$ aws emr create-cluster --ami-version 3.1.0 \
--log-uri s3://hcb-c2-logs \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,\
InstanceType=m3.xlarge,BidPrice=0.10 \
InstanceGroupType=CORE,InstanceCount=2,\
InstanceType=m3.xlarge,BidPrice=0.10

Refer to the Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe in this chapter for more details on Amazon Spot Instances.
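Before choosing a bid price, you can inspect recent Spot prices for the instance type using the EC2 describe-spot-price-history command (the filter values shown are illustrative):

```shell
# List a few recent Spot prices for m3.xlarge Linux/UNIX instances.
aws ec2 describe-spot-price-history \
    --instance-types m3.xlarge \
    --product-descriptions "Linux/UNIX" \
    --max-items 5
```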

See also
