Executing a Pig script using EMR

Amazon EMR supports executing Apache Pig scripts on the data stored in S3. Refer to the Pig-related recipes in Chapter 7, Hadoop Ecosystem II – Pig, HBase, Mahout, and Sqoop, for more details on using Apache Pig for data analysis.

In this recipe, we are going to execute a simple Pig script using Amazon EMR. This sample uses the Human Development Reports data (http://hdr.undp.org/en/statistics/data/) to print the names of countries with a gross national income (GNI) per capita greater than $2000, sorted by GNI.

How to do it...

The following steps show you how to use a Pig script with Amazon Elastic MapReduce to process a dataset stored on Amazon S3:

  1. Use the Amazon S3 console to create a bucket in S3 to upload the input data. Upload the resources/hdi-data.csv file from the source repository for this chapter to the newly created bucket. You can also use an existing bucket or a directory inside a bucket. We assume the S3 path for the uploaded file is hcb-c2-data/hdi-data.csv.
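    If you prefer the command line, a minimal sketch using the AWS CLI (assuming the CLI is installed and configured with your credentials, and using the hcb-c2-data bucket name assumed above) would be:
    $ aws s3 mb s3://hcb-c2-data
    $ aws s3 cp resources/hdi-data.csv s3://hcb-c2-data/hdi-data.csv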
  2. Review the Pig script available in the resources/countryFilter-EMR.pig file of the source repository for this chapter. This script uses the STORE command to save the results to the filesystem. In addition, we parameterize the LOAD command of the Pig script by adding $INPUT as the input file and the STORE command by adding $OUTPUT as the output directory. These two parameters will be substituted with the S3 input and output locations we specify in step 5.
    A = LOAD '$INPUT' USING PigStorage(',') AS
        (id:int, country:chararray, hdi:float, lifeex:int,
         mysch:int, eysch:int, gni:int);
    B = FILTER A BY gni > 2000;
    C = ORDER B BY gni;
    STORE C INTO '$OUTPUT';
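    Before launching a cluster, you can optionally test the parameterized script against the local copy of the data using Pig's local mode, assuming Pig is installed on your machine; the -param flag supplies the values for $INPUT and $OUTPUT:
    $ pig -x local -param INPUT=resources/hdi-data.csv \
        -param OUTPUT=/tmp/hdi-out resources/countryFilter-EMR.pig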
  3. Use the Amazon S3 console to create a bucket in S3 to upload the Pig script. Upload the resources/countryFilter-EMR.pig script to the newly created bucket. You can also use an existing bucket or a directory inside a bucket. We assume the S3 path for the uploaded script is hcb-c2-resources/countryFilter-EMR.pig.
  4. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create Cluster button to create a new EMR cluster. Provide a name for your cluster. Follow steps 8 to 11 of the Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce recipe to configure your cluster.

    Note

    You can reuse the EMR cluster you created in the Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce recipe to follow the steps of this recipe. To do that, use the Add Step option in the Cluster Details page of the running cluster to perform the actions mentioned in step 5.

  5. Select the Pig Program option from the Add Step dropdown in the Steps section. Click on Configure and add to configure the Pig script, input, and output data for our computation. In the Script S3 location textbox, specify the S3 location of the Pig script we uploaded in step 3, using the format s3://bucket_name/script_filename. Specify the S3 location of the uploaded input data file in the Input S3 Location textbox. In the Output S3 Location textbox, specify an S3 location to store the output, using the format s3://bucket_name/path. The output path must not already exist, so we use a non-existing directory (for example, hcb-c2-out/pig) inside the output bucket as the output path. Click on Add.
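    Alternatively, the same Pig step can be submitted using the AWS CLI. The following is a sketch only; the cluster ID (j-XXXXXXXXXXXXX), step name, and the output location inside the hcb-c2-data bucket are example values you should replace with your own:
    $ aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
        Type=PIG,Name=CountryFilter,ActionOnFailure=CONTINUE,Args=[-f,s3://hcb-c2-resources/countryFilter-EMR.pig,-p,INPUT=s3://hcb-c2-data/hdi-data.csv,-p,OUTPUT=s3://hcb-c2-data/hcb-c2-out/pig]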
  6. Click on Create Cluster to launch the EMR Hadoop cluster and to run the configured Pig script.

    Note

    Amazon will charge you for the compute and storage resources used when you click on Create Cluster in step 6. Refer to the Saving money using EC2 Spot Instances to execute EMR job flows recipe that we discussed earlier to find out how you can save money by using Amazon EC2 Spot Instances.

  7. Monitor the progress of your MapReduce cluster deployment and the computation in the Cluster List | Cluster Details page of the Elastic MapReduce console. Expand and refresh the Steps section of the page to see the status of the individual steps of the cluster setup and the application execution. Select a step and click on View logs to view the logs and to debug the computation. Check the output of the computation in the output data bucket using the AWS S3 console.
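    You can also list and download the result files from the command line; this sketch assumes the example output location used above, where the part-* files contain the filtered and sorted country records:
    $ aws s3 ls s3://hcb-c2-data/hcb-c2-out/pig/
    $ aws s3 cp s3://hcb-c2-data/hcb-c2-out/pig/ ./pig-output/ --recursive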

There's more...

Amazon EMR also allows us to use Apache Pig in interactive mode.

Starting a Pig interactive session

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create Cluster button to create a new EMR cluster. Provide a name for your cluster.
  2. You must select a key pair from the Amazon EC2 Key Pair dropdown in the Security and Access section. If you do not have a usable Amazon EC2 key pair with access to the private key, log on to the Amazon EC2 console and create a new key pair.
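    If you prefer the command line, a new key pair can also be created with the AWS CLI; the key name hcb-emr-key below is just an example:
    $ aws ec2 create-key-pair --key-name hcb-emr-key \
        --query 'KeyMaterial' --output text > hcb-emr-key.pem
    $ chmod 400 hcb-emr-key.pem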
  3. Click on Create Cluster without specifying any steps. Make sure No is selected in the Auto-Terminate option under the Steps section.
  4. Monitor the progress of your cluster deployment in the Cluster Details page under Cluster List of the Elastic MapReduce console. Retrieve the Master Public DNS name from the cluster details on this page.
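    Alternatively, the master public DNS name can be retrieved with the AWS CLI, assuming you have the cluster ID at hand:
    $ aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
        --query 'Cluster.MasterPublicDnsName' --output text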
  5. Use the master public DNS name and the private key file of the Amazon EC2 key pair you specified in step 2 to SSH into the master node of the cluster:
    $ ssh -i <path-to-the-key-file> hadoop@<master-public-DNS>
    
  6. Start the Pig interactive Grunt shell in the master node and issue your Pig commands:
    $ pig
    .........
    grunt>
    
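    As an illustrative sketch, the analysis from the earlier steps can also be run interactively in the Grunt shell; the S3 paths below are the example locations assumed throughout this recipe:
    grunt> A = LOAD 's3://hcb-c2-data/hdi-data.csv' USING PigStorage(',')
    >>     AS (id:int, country:chararray, hdi:float, lifeex:int,
    >>     mysch:int, eysch:int, gni:int);
    grunt> B = FILTER A BY gni > 2000;
    grunt> C = ORDER B BY gni;
    grunt> STORE C INTO 's3://hcb-c2-data/hcb-c2-out/pig-interactive';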