
Executing a Hive script using EMR

Hive provides a SQL-like query layer for data stored in HDFS, utilizing Hadoop MapReduce underneath. Amazon EMR supports executing Hive queries on data stored in S3. Refer to the Apache Hive recipes in Chapter 6, Hadoop Ecosystem – Apache Hive, for more information on using Hive for large-scale data analysis.

In this recipe, we are going to execute a Hive script to perform the same computation we did in the Executing a Pig script using EMR recipe earlier. We will use the Human Development Reports data (http://hdr.undp.org/en/statistics/data/) to print the names of countries that have a gross national income (GNI) per capita greater than $2000, sorted by GNI.

How to do it...

The following steps show how to use a Hive script with Amazon Elastic MapReduce to query a dataset stored on Amazon S3:

  1. Use the Amazon S3 console to create a bucket in S3 to upload the input data. Create a directory inside the bucket, and upload the resources/hdi-data.csv file from the source package of this chapter to that directory. You can also use an existing bucket or a directory inside an existing bucket. We assume the S3 path for the uploaded file is hcb-c2-data/data/hdi-data.csv.
  2. Review the Hive script available in the resources/countryFilter-EMR.hql file of the source repository for this chapter. This script first creates a Hive table mapped onto the input data. Then it creates a Hive table to store the results of our query. Finally, it issues a query that selects the countries with a GNI per capita larger than $2000, sorted by GNI. We use the ${INPUT} and ${OUTPUT} variables to specify the location of the input data and the location to store the output table data.
    CREATE EXTERNAL TABLE 
    hdi(
        id INT, 
        country STRING, 
        hdi FLOAT, 
        lifeex INT, 
        mysch INT, 
        eysch INT, 
        gni INT) 
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ',' 
    STORED AS TEXTFILE
    LOCATION '${INPUT}';
    
    CREATE EXTERNAL TABLE 
    output_countries(
        country STRING, 
        gni INT) 
    ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY ',' 
    STORED AS TEXTFILE 
    LOCATION '${OUTPUT}';
    
    INSERT OVERWRITE TABLE 
    output_countries
      SELECT 
        country, gni 
      FROM 
        hdi 
      WHERE 
        gni > 2000
      ORDER BY 
        gni;
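
    Note

    Amazon EMR substitutes the ${INPUT} and ${OUTPUT} variables with the Input and Output S3 locations you provide when configuring the step, passing them to the script as Hive variables (the equivalent of Hive's -d command-line definitions). As a minimal sketch, assuming a local Hive installation and placeholder local test paths, you can try out the same script outside EMR by supplying the variables yourself:

    $ hive -f countryFilter-EMR.hql \
        -d INPUT=/tmp/hdi-input \
        -d OUTPUT=/tmp/hdi-output
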
  3. Use the Amazon S3 console to create a bucket in S3 to upload the Hive script. Upload the resources/countryFilter-EMR.hql script to the newly created bucket. You can also use an existing bucket or a directory inside an existing bucket. We assume the S3 path for the uploaded script is hcb-resources/countryFilter-EMR.hql.
  4. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create Cluster button to create a new EMR cluster. Provide a name for your cluster. Follow steps 8 to 11 of the Running Hadoop MapReduce v2 computations using Amazon Elastic MapReduce recipe to configure your cluster.

    Note

    You can reuse an EMR cluster you created for one of the earlier recipes to follow the steps of this recipe. To do that, use the Add Step option in the Cluster Details page of the running cluster to perform the actions mentioned in step 5.

  5. Select the Hive Program option under the Add Step dropdown of the Steps section. Click on Configure and add to configure the Hive script and the input and output data for our computation. Specify the S3 location of the Hive script we uploaded in step 3 in the Script S3 location textbox, using the format s3://bucket_name/script_filename. Specify the S3 location of the uploaded input data directory in the Input S3 Location textbox. In the Output S3 Location textbox, specify an S3 location to store the output. The output path should not already exist; we use a nonexistent directory (for example, hcb-c2-out/hive) inside the output bucket as the output path. You should specify the locations using the format s3://bucket_name/path. Click on Add. (The AWS CLI sketch after these steps shows a command-line alternative for steps 1, 3, and 5.)
  6. Click on Create Cluster to launch the EMR Hadoop cluster and to run the configured Hive script.

    Note

    Amazon will charge you for the compute and storage resources you use after clicking on Create Cluster in step 6. Refer to the Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe that we discussed earlier to find out how you can save money by using Amazon EC2 Spot Instances.

  7. Monitor the progress of your MapReduce cluster deployment and the computation in the Cluster Details page under Cluster List of the Elastic MapReduce console. Expand and refresh the Steps section of the page to see the status of the individual steps of the cluster setup and the application execution. Select a step and click on View logs to view the logs and to debug the computation. Check the output of the computation in the output data bucket using the AWS S3 console.
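
If you prefer the command line, the bucket creation, file uploads, and Hive step submission can also be performed using the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed and configured, using the bucket names from the earlier steps, and with j-XXXXXXXXXXXXX as a placeholder for your running cluster's ID:

$ aws s3 mb s3://hcb-c2-data
$ aws s3 cp resources/hdi-data.csv s3://hcb-c2-data/data/hdi-data.csv
$ aws s3 mb s3://hcb-resources
$ aws s3 cp resources/countryFilter-EMR.hql s3://hcb-resources/
$ aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
Type=HIVE,Name=CountryFilter,ActionOnFailure=CONTINUE,\
Args=[-f,s3://hcb-resources/countryFilter-EMR.hql,\
-d,INPUT=s3://hcb-c2-data/data,-d,OUTPUT=s3://hcb-c2-out/hive]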

There's more...

Amazon EMR also allows us to use the Hive shell in interactive mode.

Starting a Hive interactive session

Follow steps 1 to 5 of the Starting a Pig interactive session section of the previous Executing a Pig script using EMR recipe to create a cluster and to log in to it using SSH.
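
The following is a minimal sketch of the SSH login, assuming your EC2 key pair file and the master node's public DNS name (both placeholders below) are substituted with your own values:

$ ssh -i your-keypair.pem hadoop@<master-public-dns-name>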

Start the Hive shell in the master node and issue your Hive queries:

$ hive
hive>
.........
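
For example, you can run the same computation interactively by pointing an external table directly at the S3 input location. The following is a sketch that assumes the hcb-c2-data bucket and data directory from the earlier steps:

hive> CREATE EXTERNAL TABLE hdi(id INT, country STRING, hdi FLOAT,
    lifeex INT, mysch INT, eysch INT, gni INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE LOCATION 's3://hcb-c2-data/data';
hive> SELECT country, gni FROM hdi WHERE gni > 2000 ORDER BY gni;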

See also

The Simple SQL-style data querying using Apache Hive recipe of Chapter 6, Hadoop Ecosystem – Apache Hive.
