
Executing a Hive script using EMR

Hive provides a SQL-like query layer for data stored in HDFS, utilizing Hadoop MapReduce underneath. Amazon EMR supports executing Hive queries on data stored in S3. Refer to the Apache Hive recipes in Chapter 6, Hadoop Ecosystem – Apache Hive, for more information on using Hive for large-scale data analysis.

In this recipe, we are going to execute a Hive script to perform the same computation we did in the Executing a Pig script using EMR recipe earlier. We will use the Human Development Reports data (http://hdr.undp.org/en/statistics/data/) to print the names of countries that have a gross national income (GNI) per capita greater than $2000, sorted by GNI.

How to do it...

The following steps show how to use a Hive script with Amazon Elastic MapReduce to query a dataset stored on Amazon S3:

  1. Use the Amazon S3 console to create a bucket in S3 to upload the input data. Create a directory inside the bucket, and upload the resources/hdi-data.csv file from the source package of this chapter to that directory. You can also use an existing bucket or a directory inside an existing bucket. We assume the S3 path for the uploaded file is hcb-c2-data/data/hdi-data.csv.
  2. Review the Hive script available in the resources/countryFilter-EMR.hql file of the source repository for this chapter. This script first creates a Hive table mapped onto the input data. Then it creates a Hive table to store the results of our query. Finally, it issues a query that selects the countries with a GNI per capita larger than $2000, sorted by GNI. We use the ${INPUT} and ${OUTPUT} variables to specify the location of the input data and the location to store the output table data.
    CREATE EXTERNAL TABLE 
    hdi(
        id INT, 
        country STRING, 
        hdi FLOAT, 
        lifeex INT, 
        mysch INT, 
        eysch INT, 
        gni INT) 
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ',' 
    STORED AS TEXTFILE
    LOCATION '${INPUT}';
    
    CREATE EXTERNAL TABLE 
    output_countries(
        country STRING, 
        gni INT) 
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ',' 
    STORED AS TEXTFILE 
    LOCATION '${OUTPUT}';
    
    INSERT OVERWRITE TABLE 
    output_countries
      SELECT 
        country, gni 
      FROM 
        hdi 
      WHERE 
        gni > 2000
      ORDER BY 
        gni;
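
    Note

    Amazon EMR substitutes the ${INPUT} and ${OUTPUT} variables with the Input and Output S3 locations you provide when configuring the step, passing them to the script as Hive variables (the equivalent of Hive's -d command-line definitions). As a minimal sketch, assuming a local Hive installation and placeholder local test paths, you can try out the same script outside EMR by supplying the variables yourself:

    $ hive -f countryFilter-EMR.hql \
        -d INPUT=/tmp/hdi-input \
        -d OUTPUT=/tmp/hdi-output
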
  3. Use the Amazon S3 console to create a bucket in S3 to upload the Hive script. Upload the resources/countryFilter-EMR.hql script to the newly created bucket. You can also use an existing bucket or a directory inside an existing bucket. We assume the S3 path for the uploaded script is hcb-resources/countryFilter-EMR.hql.
  4. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create Cluster button to create a new EMR cluster. Provide a name for your cluster. Follow steps 8 to 11 of the Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce recipe to configure your cluster.

    Note

    You can reuse an EMR cluster you created for one of the earlier recipes to follow the steps of this recipe. To do that, use the Add Step option in the Cluster Details page of the running cluster to perform the actions mentioned in step 5.

  5. Select the Hive Program option under the Add Step dropdown of the Steps section. Click on Configure and add to configure the Hive script and the input and output data for our computation. Specify the S3 location of the Hive script we uploaded in step 3 in the Script S3 location textbox, using the format s3://bucket_name/script_filename. Specify the S3 location of the uploaded input data directory in the Input S3 Location textbox. In the Output S3 Location textbox, specify an S3 location to store the output. The output path should not already exist; we use a nonexistent directory (for example, hcb-c2-out/hive) inside the output bucket as the output path. You should specify the locations using the format s3://bucket_name/path. Click on Add. (The AWS CLI sketch after these steps shows a command-line alternative for steps 1, 3, and 5.)
  6. Click on Create Cluster to launch the EMR Hadoop cluster and to run the configured Hive script.

    Note

    Amazon will charge you for the compute and storage resources you use after clicking on Create Cluster in step 6. Refer to the Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe that we discussed earlier to find out how you can save money by using Amazon EC2 Spot Instances.

  7. Monitor the progress of your MapReduce cluster deployment and the computation in the Cluster Details page under Cluster List of the Elastic MapReduce console. Expand and refresh the Steps section of the page to see the status of the individual steps of the cluster setup and the application execution. Select a step and click on View logs to view the logs and to debug the computation. Check the output of the computation in the output data bucket using the AWS S3 console.
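
If you prefer the command line, the bucket creation, file uploads, and Hive step submission can also be performed using the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed and configured, using the bucket names from the earlier steps, and with j-XXXXXXXXXXXXX as a placeholder for your running cluster's ID:

$ aws s3 mb s3://hcb-c2-data
$ aws s3 cp resources/hdi-data.csv s3://hcb-c2-data/data/hdi-data.csv
$ aws s3 mb s3://hcb-resources
$ aws s3 cp resources/countryFilter-EMR.hql s3://hcb-resources/
$ aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
Type=HIVE,Name=CountryFilter,ActionOnFailure=CONTINUE,\
Args=[-f,s3://hcb-resources/countryFilter-EMR.hql,\
-d,INPUT=s3://hcb-c2-data/data,-d,OUTPUT=s3://hcb-c2-out/hive]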

There's more...

Amazon EMR also allows us to use the Hive shell in interactive mode.

Starting a Hive interactive session

Follow steps 1 to 5 of the Starting a Pig interactive session section of the previous Executing a Pig script using EMR recipe to create a cluster and to log in to it using SSH.
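
The following is a minimal sketch of the SSH login, assuming your EC2 key pair file and the master node's public DNS name (both placeholders below) are substituted with your own values:

$ ssh -i your-keypair.pem hadoop@<master-public-dns-name>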

Start the Hive shell in the master node and issue your Hive queries:

$ hive
hive>
.........
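
For example, you can run the same computation interactively by pointing an external table directly at the S3 input location. The following is a sketch that assumes the hcb-c2-data bucket and data directory from the earlier steps:

hive> CREATE EXTERNAL TABLE hdi(id INT, country STRING, hdi FLOAT,
    lifeex INT, mysch INT, eysch INT, gni INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE LOCATION 's3://hcb-c2-data/data';
hive> SELECT country, gni FROM hdi WHERE gni > 2000 ORDER BY gni;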

See also

The Simple SQL-style data querying using Apache Hive recipe of Chapter 6, Hadoop Ecosystem – Apache Hive.
