- Hadoop MapReduce v2 Cookbook (Second Edition)
- Thilina Gunarathne
Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
Apache Whirr provides a cloud-vendor-neutral set of libraries to provision services on cloud resources. Apache Whirr supports provisioning, installing, and configuring Hadoop clusters in several cloud environments. In addition to Hadoop, Apache Whirr also supports provisioning Apache Cassandra, Apache ZooKeeper, Apache HBase, Voldemort (key-value storage), and Apache Hama clusters in cloud environments.
Note
The installation programs of several commercial Hadoop distributions, such as Hortonworks HDP and Cloudera CDH, now support installation and configuration of those distributions on Amazon EC2 instances. These commercial-distribution-based installations would provide you with a more feature-rich Hadoop cluster on the cloud than using Apache Whirr.
In this recipe, we are going to start a Hadoop cluster on Amazon EC2 using Apache Whirr and run the WordCount MapReduce sample (the Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode recipe from Chapter 1, Getting Started with Hadoop v2) program on that cluster.
How to do it...
The following are the steps to deploy a Hadoop cluster on Amazon EC2 using Apache Whirr and to execute the WordCount MapReduce sample on the deployed cluster:
- Download and unzip the Apache Whirr binary distribution from http://whirr.apache.org/. You may be able to install Whirr through your Hadoop distribution as well.
- Run the following command from the extracted directory to verify your Whirr installation:
$ whirr version
Apache Whirr 0.8.2
jclouds 1.5.8
- Export your AWS access keys to the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables:
$ export AWS_ACCESS_KEY_ID=AKIA…
$ export AWS_SECRET_ACCESS_KEY=…
- Generate an RSA key pair using the following command. Note that this key pair is not the same as your AWS key pair.
$ ssh-keygen -t rsa -P ''
- Locate the file named recipes/hadoop-yarn-ec2.properties in your Apache Whirr installation and copy it to your working directory. Change the whirr.hadoop.version property to match a current Hadoop version (for example, 2.5.2) available on the Apache Hadoop downloads page.
- If you provided a custom name for your key pair in the previous step, change the whirr.private-key-file and whirr.public-key-file property values in the hadoop-yarn-ec2.properties file to the paths of the private key and the public key you generated.
- Execute the following command, pointing to your hadoop-yarn-ec2.properties file, to launch your Hadoop cluster on EC2. After successful cluster creation, this command outputs an SSH command that we can use to log in to the EC2 Hadoop cluster.
$ bin/whirr launch-cluster --config hadoop-yarn-ec2.properties
- Traffic from the outside to the provisioned EC2 Hadoop cluster is routed through the master node. Whirr generates a script that starts this proxy, placed in a subdirectory named after your Hadoop cluster inside the ~/.whirr directory. Run it in a new terminal. It will take a few minutes for Whirr to start the cluster and to generate this script.
$ cd ~/.whirr/Hadoop-yarn/
$ ./hadoop-proxy.sh
- You can open the Hadoop web-based monitoring console on your local machine by configuring this proxy in your web browser.
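The proxy started above is an SSH SOCKS proxy (it listens on local port 6666 in the Whirr versions I have used, and Whirr typically drops a ready-made .pac file next to hadoop-proxy.sh in the cluster directory). If you prefer to write the browser proxy auto-config yourself, a minimal PAC file along these lines routes only cluster hostnames through the proxy; the host patterns below are illustrative assumptions, not values generated by Whirr:

```javascript
// Hypothetical PAC file: send EC2 hostnames through the Whirr SOCKS
// proxy and everything else directly. Adjust the patterns and port to
// match your cluster and hadoop-proxy.sh output.
function FindProxyForURL(url, host) {
  if (shExpMatch(host, "*.ec2.internal") ||
      shExpMatch(host, "*.compute.amazonaws.com")) {
    return "SOCKS localhost:6666";
  }
  return "DIRECT";
}
```

Point your browser's automatic proxy configuration at this file (or at the .pac file Whirr generated) before opening the monitoring console.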
- Whirr generates a hadoop-site.xml file for your cluster in the ~/.whirr/<your cluster name> directory. You can use it to issue Hadoop commands from your local machine to your Hadoop cluster on EC2. Export the path of the directory containing the generated hadoop-site.xml to an environment variable named HADOOP_CONF_DIR, and copy the hadoop-site.xml file in that directory to another file named core-site.xml. To execute the Hadoop commands, you should have the Hadoop v2 binaries installed on your machine.
$ cp ~/.whirr/hadoop-yarn/hadoop-site.xml ~/.whirr/hadoop-yarn/core-site.xml
$ export HADOOP_CONF_DIR=~/.whirr/hadoop-yarn/
$ hdfs dfs -ls /
- Create a directory named wc-input-data in HDFS and upload a text dataset to that directory. Depending on the version of Whirr, you may have to create your home directory first.
$ hdfs dfs -mkdir /user/<user_name>
$ hdfs dfs -mkdir wc-input-data
$ hdfs dfs -put sample.txt wc-input-data
- In this step, we run the Hadoop WordCount sample in the Hadoop cluster we started on Amazon EC2:
$ hadoop jar hcb-c1-samples.jar chapter1.WordCount \
wc-input-data wc-out
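The WordCount driver itself comes from the Chapter 1 sample. As a reminder of what it computes, here is a plain-Java sketch of the map-and-reduce logic collapsed into a single in-memory pass; the class and method names are mine, not the book's, and the real sample uses the Hadoop Mapper/Reducer APIs rather than a local map:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordCountSketch {
    // What each map() call does: tokenize a line and emit (word, 1).
    // What reduce() does: sum the emitted counts per word.
    // Here both phases run in one in-memory pass for illustration only.
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue; // skip the empty token produced by leading whitespace
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("to be or not to be"));
    }
}
```

On the cluster, the same computation is distributed: each mapper processes a split of wc-input-data, and the framework groups the (word, 1) pairs by key before the reducers sum them into part-r-* files.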
- View the results of the WordCount computation by executing the following commands:
$ hadoop fs -ls wc-out
Found 3 items
-rw-r--r--   3 thilina supergroup      0 2012-09-05 15:40 /user/thilina/wc-out/_SUCCESS
drwxrwxrwx   - thilina supergroup      0 2012-09-05 15:39 /user/thilina/wc-out/_logs
-rw-r--r--   3 thilina supergroup  19908 2012-09-05 15:40 /user/thilina/wc-out/part-r-00000
$ hadoop fs -cat wc-out/part-* | more
- Issue the following command to shut down the Hadoop cluster. Make sure to download any important data first, as all data will be permanently lost when the cluster is shut down.
$ bin/whirr destroy-cluster --config hadoop-yarn-ec2.properties
How it works...
The following are descriptions of the properties we used in the hadoop-yarn-ec2.properties file.
whirr.cluster-name=Hadoop-yarn
The preceding property provides a name for the cluster. The instances of the cluster will be tagged using this name.
whirr.instance-templates=1 hadoop-namenode+yarn-resource-manager+mapreduce-historyserver, 1 hadoop-datanode+yarn-nodemanager
This property specifies the number of instances to be used for each set of roles and the type of roles for the instances.
whirr.provider=aws-ec2
We use the Whirr Amazon EC2 provider to provision our cluster.
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
These two properties point to the paths of the private key and the public key you provide for the cluster.
whirr.hadoop.version=2.5.2
We specify a custom Hadoop version using the preceding property.
whirr.aws-ec2-spot-price=0.15
This property specifies a bid price for Amazon EC2 Spot Instances. Specifying it triggers Whirr to use Spot Instances for the cluster. If the bid price is not met, Apache Whirr's Spot Instance requests time out after 20 minutes. Refer to the Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe for more details.
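Putting the individual properties together, the relevant portion of the hadoop-yarn-ec2.properties file used in this recipe looks like this. The values are the ones discussed above; adjust the key paths and the spot price to your own setup:

```properties
# Name used to tag the cluster's EC2 instances
whirr.cluster-name=Hadoop-yarn

# One master (NameNode + ResourceManager + history server), one worker
whirr.instance-templates=1 hadoop-namenode+yarn-resource-manager+mapreduce-historyserver, 1 hadoop-datanode+yarn-nodemanager

# Provision on Amazon EC2
whirr.provider=aws-ec2

# Key pair generated with ssh-keygen (not your AWS key pair)
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub

# Hadoop version to install
whirr.hadoop.version=2.5.2

# Optional: use Spot Instances with this bid price (USD/hour)
whirr.aws-ec2-spot-price=0.15
```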
More details on Whirr configuration can be found at http://whirr.apache.org/docs/0.8.1/configuration-guide.html.
See also
The Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe.