Executing a Pig script using EMR

Amazon EMR supports executing Apache Pig scripts on the data stored in S3. Refer to the Pig-related recipes in Chapter 7, Hadoop Ecosystem II – Pig, HBase, Mahout, and Sqoop, for more details on using Apache Pig for data analysis.

In this recipe, we are going to execute a simple Pig script using Amazon EMR. This sample uses the Human Development Reports data (http://hdr.undp.org/en/statistics/data/) to print the names of countries with a gross national income (GNI) per capita greater than $2000, sorted by GNI.

How to do it...

The following steps show you how to use a Pig script with Amazon Elastic MapReduce to process a dataset stored on Amazon S3:

  1. Use the Amazon S3 console to create a bucket in S3 to upload the input data. Upload the resources/hdi-data.csv file from the source repository for this chapter to the newly created bucket. You can also use an existing bucket or a directory inside a bucket. We assume the S3 path for the uploaded file is hcb-c2-data/hdi-data.csv.
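    If you prefer the command line, a minimal sketch using the AWS CLI (assuming the CLI is installed and configured with your credentials, and using the hcb-c2-data bucket name assumed above) would be:
    $ aws s3 mb s3://hcb-c2-data
    $ aws s3 cp resources/hdi-data.csv s3://hcb-c2-data/hdi-data.csv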
  2. Review the Pig script available in the resources/countryFilter-EMR.pig file of the source repository for this chapter. This script uses the STORE command to save the results to the filesystem. In addition, we parameterize the LOAD command of the Pig script by adding $INPUT as the input file and the STORE command by adding $OUTPUT as the output directory. These two parameters will be substituted with the S3 input and output locations we specify in step 5.
    A = LOAD '$INPUT' USING PigStorage(',') AS
        (id:int, country:chararray, hdi:float, lifeex:int,
         mysch:int, eysch:int, gni:int);
    B = FILTER A BY gni > 2000;
    C = ORDER B BY gni;
    STORE C INTO '$OUTPUT';
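    Before launching a cluster, you can optionally test the parameterized script against the local copy of the data using Pig's local mode, assuming Pig is installed on your machine; the -param flag supplies the values for $INPUT and $OUTPUT:
    $ pig -x local -param INPUT=resources/hdi-data.csv \
        -param OUTPUT=/tmp/hdi-out resources/countryFilter-EMR.pig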
  3. Use the Amazon S3 console to create a bucket in S3 to upload the Pig script. Upload the resources/countryFilter-EMR.pig script to the newly created bucket. You can also use an existing bucket or a directory inside a bucket. We assume the S3 path for the uploaded script is hcb-c2-resources/countryFilter-EMR.pig.
  4. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create Cluster button to create a new EMR cluster. Provide a name for your cluster. Follow steps 8 to 11 of the Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce recipe to configure your cluster.

    Note

    You can reuse the EMR cluster you created in the Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce recipe to follow the steps of this recipe. To do that, use the Add Step option in the Cluster Details page of the running cluster to perform the actions mentioned in step 5.

  5. Select the Pig Program option from the Add Step dropdown in the Steps section. Click on Configure and add to configure the Pig script, input, and output data for our computation. In the Script S3 location textbox, specify the S3 location of the Pig script we uploaded in step 3, using the format s3://bucket_name/script_filename. Specify the S3 location of the uploaded input data file in the Input S3 Location textbox. In the Output S3 Location textbox, specify an S3 location to store the output, using the format s3://bucket_name/path. The output path must not already exist, so we use a non-existing directory (for example, hcb-c2-out/pig) inside the output bucket as the output path. Click on Add.
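    Alternatively, the same Pig step can be submitted using the AWS CLI. The following is a sketch only; the cluster ID (j-XXXXXXXXXXXXX), step name, and the output location inside the hcb-c2-data bucket are example values you should replace with your own:
    $ aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
        Type=PIG,Name=CountryFilter,ActionOnFailure=CONTINUE,Args=[-f,s3://hcb-c2-resources/countryFilter-EMR.pig,-p,INPUT=s3://hcb-c2-data/hdi-data.csv,-p,OUTPUT=s3://hcb-c2-data/hcb-c2-out/pig]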
  6. Click on Create Cluster to launch the EMR Hadoop cluster and to run the configured Pig script.

    Note

    Amazon will charge you for the compute and storage resources used when you click on Create Cluster in step 6. Refer to the Saving money using EC2 Spot Instances to execute EMR job flows recipe that we discussed earlier to find out how you can save money by using Amazon EC2 Spot Instances.

  7. Monitor the progress of your MapReduce cluster deployment and the computation in the Cluster List | Cluster Details page of the Elastic MapReduce console. Expand and refresh the Steps section of the page to see the status of the individual steps of the cluster setup and the application execution. Select a step and click on View logs to view the logs and to debug the computation. Check the output of the computation in the output data bucket using the AWS S3 console.
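    You can also list and download the result files from the command line; this sketch assumes the example output location used above, where the part-* files contain the filtered and sorted country records:
    $ aws s3 ls s3://hcb-c2-data/hcb-c2-out/pig/
    $ aws s3 cp s3://hcb-c2-data/hcb-c2-out/pig/ ./pig-output/ --recursive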

There's more...

Amazon EMR also allows us to use Apache Pig in interactive mode.

Starting a Pig interactive session

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create Cluster button to create a new EMR cluster. Provide a name for your cluster.
  2. You must select a key pair from the Amazon EC2 Key Pair dropdown in the Security and Access section. If you do not have a usable Amazon EC2 key pair with access to the private key, log on to the Amazon EC2 console and create a new key pair.
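    If you prefer the command line, a new key pair can also be created with the AWS CLI; the key name hcb-emr-key below is just an example:
    $ aws ec2 create-key-pair --key-name hcb-emr-key \
        --query 'KeyMaterial' --output text > hcb-emr-key.pem
    $ chmod 400 hcb-emr-key.pem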
  3. Click on Create Cluster without specifying any steps. Make sure No is selected in the Auto-Terminate option under the Steps section.
  4. Monitor the progress of your cluster deployment in the Cluster Details page under Cluster List of the Elastic MapReduce console. Retrieve the Master Public DNS name from the cluster details on this page.
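    Alternatively, the master public DNS name can be retrieved with the AWS CLI, assuming you have the cluster ID at hand:
    $ aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
        --query 'Cluster.MasterPublicDnsName' --output text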
  5. Use the master public DNS name and the private key file of the Amazon EC2 key pair you specified in step 2 to SSH into the master node of the cluster:
    $ ssh -i <path-to-the-key-file> hadoop@<master-public-DNS>
    
  6. Start the Pig interactive Grunt shell in the master node and issue your Pig commands:
    $ pig
    .........
    grunt>
    
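    As an illustrative sketch, the analysis from the earlier steps can also be run interactively in the Grunt shell; the S3 paths below are the example locations assumed throughout this recipe:
    grunt> A = LOAD 's3://hcb-c2-data/hdi-data.csv' USING PigStorage(',')
    >>     AS (id:int, country:chararray, hdi:float, lifeex:int,
    >>     mysch:int, eysch:int, gni:int);
    grunt> B = FILTER A BY gni > 2000;
    grunt> C = ORDER B BY gni;
    grunt> STORE C INTO 's3://hcb-c2-data/hcb-c2-out/pig-interactive';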