- Hadoop MapReduce v2 Cookbook (Second Edition)
- Thilina Gunarathne
Executing a Pig script using EMR
Amazon EMR supports executing Apache Pig scripts on the data stored in S3. Refer to the Pig-related recipes in Chapter 7, Hadoop Ecosystem II – Pig, HBase, Mahout, and Sqoop, for more details on using Apache Pig for data analysis.
In this recipe, we are going to execute a simple Pig script using Amazon EMR. This sample uses the Human Development Reports data (http://hdr.undp.org/en/statistics/data/) to print the names of countries that have a gross national income (GNI) per capita greater than $2000, sorted by GNI.
How to do it...
The following steps show you how to use a Pig script with Amazon Elastic MapReduce to process a dataset stored on Amazon S3:
- Use the Amazon S3 console to create a bucket in S3 to upload the input data. Upload the resources/hdi-data.csv file in the source repository for this chapter to the newly created bucket. You can also use an existing bucket or a directory inside a bucket. We assume the S3 path for the uploaded file is hcb-c2-data/hdi-data.csv.
- Review the Pig script available in the resources/countryFilter-EMR.pig file of the source repository for this chapter. This script uses the STORE command to save the result in the filesystem. In addition, we parameterize the LOAD command of the Pig script by adding $INPUT as the input file and the STORE command by adding $OUTPUT as the output directory. These two parameters will be substituted by the S3 input and output locations we specify in step 5 (a local test of this parameter substitution is sketched after this list):

```
A = LOAD '$INPUT' using PigStorage(',') AS
    (id:int, country:chararray, hdi:float, lifeex:int,
     mysch:int, eysch:int, gni:int);
B = FILTER A BY gni > 2000;
C = ORDER B BY gni;
STORE C into '$OUTPUT';
```
- Use the Amazon S3 console to create a bucket in S3 to upload the Pig script. Upload the resources/countryFilter-EMR.pig script to the newly created bucket. You can also use an existing bucket or a directory inside a bucket. We assume the S3 path for the uploaded script is hcb-c2-resources/countryFilter-EMR.pig.
- Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create Cluster button to create a new EMR cluster. Provide a name for your cluster. Follow steps 8 to 11 of the Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce recipe to configure your cluster.
- Select the Pig Program option under the Add Step dropdown of the Steps section. Click on Configure and add to configure the Pig script, input, and output data for our computation. Specify the S3 location of the Pig script we uploaded in step 3 in the Script S3 location textbox, using the format s3://bucket_name/script_filename. Specify the S3 location of the uploaded input data file in the Input S3 Location textbox. In the Output S3 Location textbox, specify an S3 location to store the output. The output path must not already exist; we use a non-existing directory (for example, hcb-c2-out/pig) inside the output bucket as the output path. Specify both locations using the format s3://bucket_name/path. Click on Add. (An equivalent AWS CLI invocation is sketched after this list.)
- Click on Create Cluster to launch the EMR Hadoop cluster and run the configured Pig script.
- Monitor the progress of your MapReduce cluster deployment and the computation in the Cluster List | Cluster Details page of the Elastic MapReduce console. Expand and refresh the Steps section of the page to see the status of the individual steps of the cluster setup and the application execution. Select a step and click on View logs to view the logs and to debug the computation. Check the output of the computation in the output data bucket using the AWS S3 console.
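Before launching a cluster, you can sanity-check the parameterized script from step 2 locally. The following is a minimal sketch, assuming a local Pig installation and that you run the command from the chapter's source directory; the /tmp/pig-local-out output path is only an illustrative choice and must not already exist:

```
$ pig -x local \
    -param INPUT=resources/hdi-data.csv \
    -param OUTPUT=/tmp/pig-local-out \
    resources/countryFilter-EMR.pig
```

Pig's local mode runs the script against the local filesystem without a Hadoop cluster, so the $INPUT/$OUTPUT substitution can be verified before paying for EMR instances.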
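If you prefer scripting this recipe over using the console, a roughly equivalent cluster-plus-step submission can be made with the AWS CLI. This is a sketch assuming the bucket names used above; the release label, instance type, and instance count are placeholder choices that you should adjust for your account and region:

```
$ aws emr create-cluster \
    --name "countryFilter Pig cluster" \
    --release-label emr-5.36.0 \
    --applications Name=Pig \
    --use-default-roles \
    --instance-type m4.large \
    --instance-count 3 \
    --auto-terminate \
    --steps 'Type=PIG,Name="Pig program",ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,s3://hcb-c2-resources/countryFilter-EMR.pig,-p,INPUT=s3://hcb-c2-data/hdi-data.csv,-p,OUTPUT=s3://hcb-c2-out/pig]'
```

The --auto-terminate flag shuts the cluster down once the step completes, mirroring the run-and-finish workflow of the console-based recipe.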
There's more...
Amazon EMR allows us to use Apache Pig in the interactive mode as well.
- Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create Cluster button to create a new EMR cluster. Provide a name for your cluster.
- You must select a key pair from the Amazon EC2 Key Pair dropdown in the Security and Access section. If you do not have a usable Amazon EC2 key pair with access to the private key, log on to the Amazon EC2 console and create a new key pair.
- Click on Create Cluster without specifying any steps. Make sure No is selected in the Auto-Terminate option under the Steps section.
- Monitor the progress of your MapReduce cluster deployment and the computation in the Cluster Details page under Cluster List of the Elastic MapReduce console. Retrieve Master Public DNS from the cluster details in this page.
- Use the master public DNS name and the private key file of the Amazon EC2 key pair you specified in step 2 to SSH into the master node of the cluster:

```
$ ssh -i <path-to-the-key-file> hadoop@<master-public-DNS>
```
- Start the Pig interactive Grunt shell in the master node and issue your Pig commands (a sample interactive session is sketched after this list):

```
$ pig
.........
grunt>
```
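As an illustration, the filter from this recipe can be issued interactively, reading the input directly from S3. This is a sketch assuming the data uploaded earlier to s3://hcb-c2-data/hdi-data.csv; older EMR AMI versions may require the s3n:// scheme instead:

```
grunt> A = LOAD 's3://hcb-c2-data/hdi-data.csv' USING PigStorage(',')
>>     AS (id:int, country:chararray, hdi:float, lifeex:int,
>>         mysch:int, eysch:int, gni:int);
grunt> B = FILTER A BY gni > 2000;
grunt> C = ORDER B BY gni;
grunt> DUMP C;
```

DUMP prints the result to the console instead of storing it, which is convenient for exploratory sessions; use STORE with an s3:// path to persist the output.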