
Deploying an Apache HBase cluster on Amazon EC2 using EMR

We can use Amazon Elastic MapReduce (EMR) to start an Apache HBase cluster on the Amazon infrastructure, giving us a column-oriented data store for large quantities of data. The data stored in an Amazon EMR HBase cluster can also serve as the input to and output of EMR MapReduce computations. We can incrementally back up the data stored in an EMR HBase cluster to Amazon S3 for persistence, and we can start a new EMR HBase cluster by restoring the data from a previous S3 backup.

In this recipe, we start an Apache HBase cluster on Amazon EC2 using Amazon EMR, perform several simple operations on the newly created HBase cluster, and back up the HBase data to Amazon S3 before shutting down the cluster. We then start a new HBase cluster, restoring the HBase data backup from the original cluster.

Getting ready

You should have the AWS CLI installed and configured to manually back up HBase data. Refer to the Creating an Amazon EMR job flow using the AWS Command Line Interface recipe in this chapter for more information on installing and configuring the AWS CLI.
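As a quick sanity check before continuing, you can verify from the command line that the CLI is installed (a sketch; the EMR listing call shown in the comments additionally requires configured credentials):

```shell
# Check whether the AWS CLI is installed; record the result so you can
# fail fast with a clear message before attempting any backup commands.
AWS_CLI_PRESENT=$(command -v aws >/dev/null 2>&1 && echo yes || echo no)
echo "AWS CLI installed: ${AWS_CLI_PRESENT}"

# With credentials configured (via 'aws configure'), listing active
# EMR clusters confirms that API access works:
#   aws emr list-clusters --active
```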

How to do it...

The following steps show how to deploy an Apache HBase cluster on Amazon EC2 using Amazon EMR:

  1. Create an S3 bucket to store the HBase backups. We assume the S3 bucket for the HBase data backups is hcb-c2-data.
  2. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce. Click on the Create Cluster button to create a new EMR cluster. Provide a name for your cluster.
  3. Provide a path in Log folder S3 location and select an AMI version with Hadoop v2 (for example, AMI version 3.1.0 with Hadoop 2.4.0).
  4. Select HBase from the Additional Applications drop-down box under the Applications to be installed section. Click on Configure and add.
  5. Make sure the Restore from backup radio button is not selected. Select the Schedule regular backups and Consistent backup options. Specify a Backup frequency for the automatic scheduled incremental data backups and provide a path inside the S3 bucket we created in step 1 as the backup location. Click on Continue.
  6. Configure the EC2 instances under the Hardware Configuration section.
  7. Select a key pair in the Amazon EC2 Key Pair drop-down box. Make sure you have the private key for the selected EC2 key pair downloaded on your computer.

    Note

    If you do not have a usable key pair, go to the EC2 console (https://console.aws.amazon.com/ec2) to create a key pair. To create a key pair, log in to the EC2 dashboard, select a region, and click on Key Pairs under the Network and Security menu. Click on the Create Key Pair button in the Key Pairs window and provide a name for the new key pair. Download and save the private key file (in the PEM format) in a safe location.

  8. Click on the Create Cluster button to deploy the EMR HBase cluster.

    Note

    Amazon will charge you for the compute and storage resources you use from the moment you click on Create Cluster in the preceding step. Refer to the Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe that we discussed earlier to find out how you can save money by using Amazon EC2 Spot Instances.
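The console steps above can also be scripted. The following sketch shows roughly equivalent AWS CLI commands; the cluster name, instance types and counts, and the key pair name (MyKeyPair) are placeholder assumptions, and the --ami-version flag applies to the AMI-based (3.x) releases used in this recipe. The commands are composed but left commented out so you can review them, and the charges they incur, before running:

```shell
# Sketch: create the backup bucket and an HBase-enabled EMR cluster
# from the CLI. MyKeyPair, the instance type, and the instance count
# are placeholders -- adjust them for your account.
BUCKET=hcb-c2-data

CREATE_BUCKET_CMD="aws s3 mb s3://${BUCKET}"
CREATE_CLUSTER_CMD="aws emr create-cluster \
 --name hbase-recipe-cluster \
 --ami-version 3.1.0 \
 --applications Name=HBase \
 --ec2-attributes KeyName=MyKeyPair \
 --instance-type m1.large --instance-count 3 \
 --log-uri s3://${BUCKET}/logs"

# Review, then uncomment to run (this starts billable EC2 instances):
# eval "${CREATE_BUCKET_CMD}"
# eval "${CREATE_CLUSTER_CMD}"
echo "${CREATE_CLUSTER_CMD}"
```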

The following steps will show you how to connect to the master node of the deployed HBase cluster to start the HBase shell:

  1. Go to the Amazon EMR console (https://console.aws.amazon.com/elasticmapreduce). Select the Cluster details for the HBase cluster to view more information about the cluster. Retrieve Master Public DNS Name from the information pane.
  2. Use the master public DNS name and the private key (PEM file) of the EC2 key pair selected when creating the cluster to connect to the master node of the HBase cluster:
    $ ssh -i ec2.pem hadoop@ec2-184-72-138-2.compute-1.amazonaws.com
    
  3. Start the HBase shell using the hbase shell command. Create a table named 'test' in your HBase installation and insert a sample entry into the table using the put command. Use the scan command to view the contents of the table.
    $ hbase shell
    .........
    
    hbase(main):001:0> create 'test','cf'
    0 row(s) in 2.5800 seconds
    
    hbase(main):002:0> put 'test','row1','cf:a','value1'
    0 row(s) in 0.1570 seconds
    
    hbase(main):003:0> scan 'test'
    ROW COLUMN+CELL
     row1 column=cf:a, timestamp=1347261400477, value=value1 
    1 row(s) in 0.0440 seconds
    
    hbase(main):004:0> quit
    

    The following step will back up the data stored in an Amazon EMR HBase cluster.

  4. Execute the following command using the AWS CLI to schedule a periodic backup of the data stored in the EMR HBase cluster. Retrieve the cluster ID (for example, j-FDMXCBZP9P85) from the EMR console and replace <cluster_id> with it. Change the backup directory path (s3://hcb-c2-data/hbase-backup) according to your backup S3 bucket. Wait for several minutes for the backup to be performed.
    $ aws emr schedule-hbase-backup --cluster-id <cluster_id> \
     --type full --dir s3://hcb-c2-data/hbase-backup \
     --interval 1 --unit hours 
    
  5. Go to the Cluster Details page in the EMR console and click on Terminate.

    Now, we will start a new Amazon EMR HBase cluster by restoring data from a backup:

  6. Create a new job flow by clicking on the Create Cluster button in the EMR console. Provide a name for your cluster. Provide a path in Log folder S3 location and select an AMI version with Hadoop v2 (for example, AMI version 3.1.0 with Hadoop 2.4.0).
  7. Select HBase from the Additional Applications drop-down box under the Applications to be installed section. Click on Configure and add.
  8. Configure the EMR HBase cluster to restore data from the previous data backup. Select the Restore from backup option and provide the backup directory path used when scheduling the backup (s3://hcb-c2-data/hbase-backup) in the Backup location textbox. You can leave the Backup version textbox empty, and EMR will restore the latest backup. Click on Continue.
  9. Repeat steps 6, 7, and 8 of the cluster creation to configure the EC2 instances, select a key pair, and deploy the cluster.
  10. Start the HBase shell by logging in to the master node of the new HBase cluster, as before. Use the list command to list the tables in HBase and the scan 'test' command to view the contents of the 'test' table.
    $ hbase shell
    .........
    
    hbase(main):001:0> list
    TABLE
    test
    1 row(s) in 1.4870 seconds
    
    hbase(main):002:0> scan 'test'
    ROW COLUMN+CELL
     row1 column=cf:a, timestamp=1347318118294, value=value1 
    1 row(s) in 0.2030 seconds
    
  11. Terminate your cluster using the EMR console by going to the Cluster Details page and clicking on the Terminate button.
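The backup and restore pair can also be driven entirely from the AWS CLI. The sketch below assumes the legacy EMR HBase backup subcommands (schedule-hbase-backup, restore-from-hbase-backup) available for the AMI-based clusters used in this recipe; the cluster ID is a placeholder, and the command is echoed rather than executed:

```shell
# Sketch: restore the latest HBase backup into a running EMR cluster
# from the CLI, instead of the console's Restore from backup option.
# The cluster ID below is a placeholder; substitute your own.
CLUSTER_ID="j-XXXXXXXXXXXXX"
BACKUP_DIR="s3://hcb-c2-data/hbase-backup"

RESTORE_CMD="aws emr restore-from-hbase-backup \
 --cluster-id ${CLUSTER_ID} --dir ${BACKUP_DIR}"

# Review, then uncomment to run against a live cluster:
# eval "${RESTORE_CMD}"
echo "${RESTORE_CMD}"
```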

See also

The HBase-related recipes in Chapter 7, Hadoop Ecosystem II – Pig, HBase, Mahout, and Sqoop.
