Executing a Hive script using EMR
Hive provides a SQL-like query layer for the data stored in HDFS utilizing Hadoop MapReduce underneath. Amazon EMR supports executing Hive queries on the data stored in S3. Refer to the Apache Hive recipes in Chapter 6, Hadoop Ecosystem – Apache Hive, for more information on using Hive for large-scale data analysis.
In this recipe, we are going to execute a Hive script to perform the computation we did in the Executing a Pig script using EMR recipe earlier. We will use the Human Development Reports data (http://hdr.undp.org/en/statistics/data/) to print the names of countries with a gross national income (GNI) per capita greater than $2000, sorted by GNI.
How to do it...
The following steps show how to use a Hive script with Amazon Elastic MapReduce to query a dataset stored on Amazon S3:
- Use the Amazon S3 console to create a bucket in S3 to upload the input data. Create a directory inside the bucket. Upload the resources/hdi-data.csv file from the source package of this chapter to the newly created directory inside the bucket. You can also use an existing bucket or a directory inside a bucket. We assume the S3 path for the uploaded file is hcb-c2-data/data/hdi-data.csv. (A scripted alternative to these console uploads is sketched after these steps.)
- Review the Hive script available in the resources/countryFilter-EMR.hql file of the source repository for this chapter. This script first creates a mapping of the input data to a Hive table. Then we create a Hive table to store the results of our query. Finally, we issue a query to select the list of countries with a GNI larger than $2000. We use the $INPUT and $OUTPUT variables to specify the location of the input data and the location to store the output table data.

```sql
CREATE EXTERNAL TABLE hdi(
    id INT, country STRING, hdi FLOAT, lifeex INT,
    mysch INT, eysch INT, gni INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '${INPUT}';

CREATE EXTERNAL TABLE output_countries(
    country STRING, gni INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '${OUTPUT}';

INSERT OVERWRITE TABLE output_countries
SELECT country, gni FROM hdi WHERE gni > 2000;
```
- Use the Amazon S3 console to create a bucket in S3 to upload the Hive script. Upload the resources/countryFilter-EMR.hql script to the newly created bucket. You can also use an existing bucket or a directory inside a bucket. We assume the S3 path for the uploaded script is hcb-resources/countryFilter-EMR.hql.
- Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create Cluster button to create a new EMR cluster. Provide a name for your cluster. Follow steps 8 to 11 of the Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce recipe to configure your cluster.
- Select the Hive Program option under the Add Step dropdown of the Steps section. Click on Configure and add to configure the Hive script and the input and output data for our computation. Specify the S3 location of the Hive script we uploaded in step 3 in the Script S3 location textbox, using the format s3://bucket_name/script_filename. Specify the S3 location of the uploaded input data directory in the Input S3 Location textbox. In the Output S3 Location textbox, specify an S3 location to store the output. The output path should not already exist; we use a nonexistent directory (for example, hcb-c2-out/hive) inside the output bucket as the output path. Specify both locations using the format s3://bucket_name/path. Click on Add. (A programmatic way of adding the same Hive step is sketched after these steps.)
- Click on Create Cluster to launch the EMR Hadoop cluster and to run the configured Hive script.
Note
Amazon will charge you for the compute and storage resources you use when you click on Create Cluster in step 6. Refer to the Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe that we discussed earlier to find out how you can save money by using Amazon EC2 Spot Instances.
- Monitor the progress of your MapReduce cluster deployment and the computation in the Cluster Details page under Cluster List of the Elastic MapReduce console. Expand and refresh the Steps section of the page to see the status of the individual steps of the cluster setup and the application execution. Select a step and click on View logs to view the logs and to debug the computation. Check the output of the computation in the output data bucket using the AWS S3 console.
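The console uploads in steps 1 and 3 can also be scripted. The following is a minimal sketch using the boto3 Python library, which is not part of this recipe; it assumes the bucket names and paths used above and AWS credentials already configured in your environment:

```python
import boto3

# Minimal sketch (assumption: boto3 with configured credentials) of the
# uploads performed through the S3 console in steps 1 and 3.
s3 = boto3.client('s3')

# Bucket names are globally unique; these are the names assumed in this recipe.
for bucket in ('hcb-c2-data', 'hcb-resources'):
    s3.create_bucket(Bucket=bucket)  # outside us-east-1, also pass CreateBucketConfiguration

# Upload the input data (step 1) and the Hive script (step 3).
s3.upload_file('resources/hdi-data.csv', 'hcb-c2-data', 'data/hdi-data.csv')
s3.upload_file('resources/countryFilter-EMR.hql', 'hcb-resources', 'countryFilter-EMR.hql')
```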
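Steps 5 to 7 can be approximated programmatically as well. The sketch below, again an assumption rather than part of the recipe, adds a Hive step to an already-running cluster using boto3; the cluster ID is a placeholder, and the command-runner.jar invocation applies to the emr-4.x and later release labels. The -d options define the INPUT and OUTPUT variables that the script references as ${INPUT} and ${OUTPUT}:

```python
import time
import boto3

emr = boto3.client('emr')
cluster_id = 'j-XXXXXXXXXXXXX'  # placeholder: the ID of your running cluster

# Add a Hive step equivalent to the console configuration in step 5.
response = emr.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[{
        'Name': 'countryFilter',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['hive-script', '--run-hive-script', '--args',
                     '-f', 's3://hcb-resources/countryFilter-EMR.hql',
                     '-d', 'INPUT=s3://hcb-c2-data/data',
                     '-d', 'OUTPUT=s3://hcb-c2-out/hive'],
        },
    }])

# Poll the step status (the programmatic counterpart of step 7).
step_id = response['StepIds'][0]
while True:
    state = emr.describe_step(ClusterId=cluster_id,
                              StepId=step_id)['Step']['Status']['State']
    print(state)
    if state in ('COMPLETED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(30)
```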
There's more...
Amazon EMR also allows us to use the Hive shell in interactive mode.
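To reach the interactive shell, connect to the cluster's master node over SSH and run the hive command. As a small boto3 sketch (an assumption, reusing the placeholder cluster ID from the earlier sketch), the master node's public DNS name can be looked up as follows:

```python
import boto3

# Look up the master node's public DNS name so that you can SSH in
# (for example, ssh -i your-key.pem hadoop@<MasterPublicDnsName>)
# and start the interactive shell with the `hive` command.
emr = boto3.client('emr')
cluster = emr.describe_cluster(ClusterId='j-XXXXXXXXXXXXX')
print(cluster['Cluster']['MasterPublicDnsName'])
```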
See also
The Simple SQL-style data querying using Apache Hive recipe of Chapter 6, Hadoop Ecosystem – Apache Hive.