
Executing a Hive script using EMR

Hive provides a SQL-like query layer, utilizing Hadoop MapReduce underneath, for the data stored in HDFS. Amazon EMR supports executing Hive queries on the data stored in S3. Refer to the Apache Hive recipes in Chapter 6, Hadoop Ecosystem – Apache Hive, for more information on using Hive for large-scale data analysis.

In this recipe, we are going to execute a Hive script to perform the same computation we did in the Executing a Pig script using EMR recipe earlier. We will use the Human Development Reports data (http://hdr.undp.org/en/statistics/data/) to print the names of countries that have a gross national income (GNI) per capita greater than $2000.

How to do it...

The following steps show how to use a Hive script with Amazon Elastic MapReduce to query a dataset stored on Amazon S3:

  1. Use the Amazon S3 console to create a bucket in S3 to upload the input data. Create a directory inside the bucket. Upload the resources/hdi-data.csv file in the source package of this chapter to the newly created directory inside the bucket. You can also use an existing bucket or a directory inside a bucket. We assume the S3 path for the uploaded file is hcb-c2-data/data/hdi-data.csv. (If you prefer the command line, see the AWS CLI sketch after these steps.)
  2. Review the Hive script available in the resources/countryFilter-EMR.hql file of the source repository for this chapter. This script first creates a mapping of the input data to a Hive table. It then creates a second Hive table to store the results of our query. Finally, it issues a query to select the list of countries with a GNI larger than $2000. The ${INPUT} and ${OUTPUT} variables specify the location of the input data and the location to store the output table data; Amazon EMR substitutes them with the S3 locations we configure in step 5.
    -- Map the input CSV data in S3 to an external Hive table
    CREATE EXTERNAL TABLE hdi(
        id INT,
        country STRING,
        hdi FLOAT,
        lifeex INT,
        mysch INT,
        eysch INT,
        gni INT)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '${INPUT}';
    
    -- External table to hold the query results
    CREATE EXTERNAL TABLE output_countries(
        country STRING,
        gni INT)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '${OUTPUT}';
    
    -- Select the countries with a GNI per capita greater than $2000
    INSERT OVERWRITE TABLE output_countries
      SELECT country, gni
      FROM hdi
      WHERE gni > 2000;
  3. Use the Amazon S3 console to create a bucket in S3 to upload the Hive script. Upload the resources/countryFilter-EMR.hql script to the newly created bucket. You can also use an existing bucket or a directory inside a bucket. We assume the S3 path for the uploaded file is hcb-resources/countryFilter-EMR.hql.
  4. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create Cluster button to create a new EMR cluster. Provide a name for your cluster. Follow steps 8 to 11 of the Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce recipe to configure your cluster.

    Note

    You can reuse an EMR cluster you created for one of the earlier recipes to follow the steps of this recipe. To do that, use the Add Step option in the Cluster Details page of the running cluster to perform the actions mentioned in step 5.

  5. Select the Hive Program option under the Add Step dropdown of the Steps section. Click on Configure and add to configure the Hive script and the input and output data for our computation. In the Script S3 location textbox, specify the S3 location of the Hive script we uploaded in step 3, using the format s3://bucket_name/script_filename. Specify the S3 location of the uploaded input data directory in the Input S3 Location textbox. In the Output S3 Location textbox, specify an S3 location to store the output. The output path should not already exist; we use a nonexistent directory (for example, hcb-c2-out/hive) inside the output bucket as the output path. Specify the input and output locations using the format s3://bucket_name/path. Click on Add. (The equivalent AWS CLI invocation is included in the sketch after these steps.)
  6. Click on Create Cluster to launch the EMR Hadoop cluster and to run the configured Hive script.

    Note

    Amazon will charge you for the compute and storage resources you use when you click on Create Cluster in step 6. Refer to the Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe that we discussed earlier to find out how you can save money by using Amazon EC2 Spot Instances.

  7. Monitor the progress of your MapReduce cluster deployment and the computation in the Cluster Details page under Cluster List of the Elastic MapReduce console. Expand and refresh the Steps section of the page to see the status of the individual steps of the cluster setup and the application execution. Select a step and click on View logs to view the logs and to debug the computation. Check the output of the computation in the output data bucket using the AWS S3 console.
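
If you prefer to script these actions rather than use the console, the following is a minimal AWS CLI sketch of steps 1, 3, and 5. It assumes the AWS CLI is installed and configured, reuses the bucket names assumed above (bucket names must be globally unique, so substitute your own), and uses j-XXXXXXXXXXXXX as a placeholder for the ID of a running EMR cluster. EMR passes the -d definitions to the script as the ${INPUT} and ${OUTPUT} variables:

$ # Steps 1 and 3: create the buckets and upload the data and the script
$ aws s3 mb s3://hcb-c2-data
$ aws s3 cp resources/hdi-data.csv s3://hcb-c2-data/data/hdi-data.csv
$ aws s3 mb s3://hcb-resources
$ aws s3 cp resources/countryFilter-EMR.hql s3://hcb-resources/countryFilter-EMR.hql
$ # Step 5: add the Hive step to the running cluster
$ aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
'Type=HIVE,Name=Hive program,ActionOnFailure=CONTINUE,Args=[-f,s3://hcb-resources/countryFilter-EMR.hql,-d,INPUT=s3://hcb-c2-data/data,-d,OUTPUT=s3://hcb-c2-out/hive]'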

There's more...

Amazon EMR also allows us to use the Hive shell in interactive mode.

Starting a Hive interactive session

Follow steps 1 to 5 of the Starting a Pig interactive session section of the previous Executing a Pig script using EMR recipe to create a cluster and to log in to it using SSH.

Start the Hive shell in the master node and issue your Hive queries:

$ hive
hive>
.........
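
From the interactive shell, you can run the same kind of queries as the script. The following is a minimal sketch, assuming the hcb-c2-data bucket and input data path from step 1 of this recipe (the table and column names mirror the script above):

hive> CREATE EXTERNAL TABLE hdi(id INT, country STRING, hdi FLOAT, lifeex INT, mysch INT, eysch INT, gni INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION 's3://hcb-c2-data/data';
hive> SELECT country, gni FROM hdi WHERE gni > 2000 LIMIT 10;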

See also

The Simple SQL-style data querying using Apache Hive recipe of Chapter 6, Hadoop Ecosystem – Apache Hive.
