Executing a Hive script using EMR
Hive provides a SQL-like query layer for the data stored in HDFS utilizing Hadoop MapReduce underneath. Amazon EMR supports executing Hive queries on the data stored in S3. Refer to the Apache Hive recipes in Chapter 6, Hadoop Ecosystem – Apache Hive, for more information on using Hive for large-scale data analysis.
In this recipe, we are going to execute a Hive script to perform the computation we did in the Executing a Pig script using EMR recipe earlier. We will use the Human Development Reports data (http://hdr.undp.org/en/statistics/data/) to print the names of countries that have a gross national income per capita (GNI) greater than $2000.
How to do it...
The following steps show how to use a Hive script with Amazon Elastic MapReduce to query a dataset stored on Amazon S3:
- Use the Amazon S3 console to create a bucket in S3 to upload the input data. Create a directory inside the bucket. Upload the resources/hdi-data.csv file from the source package of this chapter to the newly created directory inside the bucket. You can also use an existing bucket or a directory inside a bucket. We assume the S3 path of the uploaded file is hcb-c2-data/data/hdi-data.csv.
- Review the Hive script available in the resources/countryFilter-EMR.hql file of the source repository for this chapter. This script first creates a Hive table mapped onto the input data. It then creates a Hive table to store the results of our query. Finally, it issues a query to select the list of countries with a GNI larger than $2000. We use the ${INPUT} and ${OUTPUT} variables to specify the location of the input data and the location to store the output table data. (A sketch of how these variables can be supplied outside EMR appears after this list.)

```sql
CREATE EXTERNAL TABLE hdi(
    id INT,
    country STRING,
    hdi FLOAT,
    lifeex INT,
    mysch INT,
    eysch INT,
    gni INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '${INPUT}';

CREATE EXTERNAL TABLE output_countries(
    country STRING,
    gni INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '${OUTPUT}';

INSERT OVERWRITE TABLE output_countries
SELECT country, gni
FROM hdi
WHERE gni > 2000;
```
- Use the Amazon S3 console to create a bucket in S3 to upload the Hive script. Upload the resources/countryFilter-EMR.hql script to the newly created bucket. You can also use an existing bucket or a directory inside a bucket. We assume the S3 path of the uploaded script is hcb-resources/countryFilter-EMR.hql.
- Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create Cluster button to create a new EMR cluster. Provide a name for your cluster. Follow steps 8 to 11 of the Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce recipe to configure your cluster.
- Select the Hive Program option under the Add Step dropdown of the Steps section. Click on Configure and add to configure the Hive script and the input and output data for our computation. In the Script S3 location textbox, specify the S3 location of the Hive script we uploaded in step 3, using the format s3://bucket_name/script_filename. Specify the S3 location of the uploaded input data directory in the Input S3 Location textbox. In the Output S3 Location textbox, specify an S3 location to store the output. The output path must not already exist; we use a nonexistent directory (for example, hcb-c2-out/hive) inside the output bucket as the output path. Specify both locations in the format s3://bucket_name/path. Click on Add.
- Click on Create Cluster to launch the EMR Hadoop cluster and to run the configured Hive script. (An equivalent AWS CLI invocation is sketched after this list.)
Note
Amazon will charge you for the compute and storage resources you use from the moment you click on Create Cluster in the previous step. Refer to the Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe that we discussed earlier to find out how you can save money by using Amazon EC2 Spot Instances.
- Monitor the progress of your MapReduce cluster deployment and the computation in the Cluster Details page under Cluster List of the Elastic MapReduce console. Expand and refresh the Steps section of the page to see the status of the individual steps of the cluster setup and the application execution. Select a step and click on View logs to view the logs and to debug the computation. Check the output of the computation in the output data bucket using the AWS S3 console.
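The ${INPUT} and ${OUTPUT} variables in the script from step 2 are ordinary Hive substitution variables, so the same script can be smoke-tested against a local Hive installation before launching a paid EMR cluster. The following is a minimal sketch, assuming a working local Hive setup; the /tmp paths are hypothetical stand-ins for the S3 locations used on EMR.

```bash
# Hypothetical local input directory holding a copy of hdi-data.csv.
mkdir -p /tmp/hdi-in
cp resources/hdi-data.csv /tmp/hdi-in/

# -d (--define) supplies values for ${INPUT} and ${OUTPUT}; the output
# directory must not already contain conflicting data. Depending on your
# configuration, Hive may resolve these paths against HDFS rather than
# the local filesystem.
hive -d INPUT=/tmp/hdi-in -d OUTPUT=/tmp/hdi-out -f resources/countryFilter-EMR.hql
```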
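If you prefer scripting the whole workflow over clicking through the console, steps 4 to 6 can also be expressed with the AWS CLI. This is a rough sketch rather than the recipe's console-based method: the bucket names follow the assumptions above, while the release label, instance type, and instance count are placeholder choices you would adapt.

```bash
# Sketch: launch a small EMR cluster with Hive installed, run the uploaded
# script as a Hive step, and terminate the cluster once the step completes.
aws emr create-cluster \
    --name "hive-country-filter" \
    --release-label emr-5.36.0 \
    --applications Name=Hive \
    --use-default-roles \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --auto-terminate \
    --steps 'Type=HIVE,Name=CountryFilter,ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,s3://hcb-resources/countryFilter-EMR.hql,-d,INPUT=s3://hcb-c2-data/data,-d,OUTPUT=s3://hcb-c2-out/hive]'
```

Passing -d INPUT and -d OUTPUT in the step arguments mirrors what the console's Input S3 Location and Output S3 Location textboxes do for the script's ${INPUT} and ${OUTPUT} variables.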
There's more...
Amazon EMR also allows us to use the Hive shell in interactive mode.
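One way to reach the interactive shell is to SSH into the cluster's master node and start hive there. A minimal sketch, assuming a running cluster and the EC2 key pair selected at launch; the cluster ID, key file name, and DNS name below are placeholders.

```bash
# Look up the master node's public DNS name (j-XXXXXXXXXXXXX is a placeholder).
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
    --query 'Cluster.MasterPublicDnsName' --output text

# Log in as the 'hadoop' user with the key pair chosen at cluster creation,
# then start the Hive shell on the master node.
ssh -i ~/my-ec2-keypair.pem hadoop@<master-public-dns>
hive
```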
See also
The Simple SQL-style data querying using Apache Hive recipe of Chapter 6, Hadoop Ecosystem – Apache Hive.