- Hadoop MapReduce v2 Cookbook (Second Edition)
- Thilina Gunarathne
Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
Apache Whirr provides a cloud-vendor-neutral set of libraries to provision services on cloud resources. Apache Whirr supports provisioning, installing, and configuring Hadoop clusters in several cloud environments. In addition to Hadoop, Apache Whirr also supports provisioning Apache Cassandra, Apache ZooKeeper, Apache HBase, Voldemort (key-value storage), and Apache Hama clusters in cloud environments.
Note
The installation programs of several commercial Hadoop distributions, such as Hortonworks HDP and Cloudera CDH, now support installation and configuration of those distributions on Amazon EC2 instances. These commercial-distribution-based installations would provide you with a more feature-rich Hadoop cluster on the cloud than using Apache Whirr.
In this recipe, we are going to start a Hadoop cluster on Amazon EC2 using Apache Whirr and run the WordCount MapReduce sample (the Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode recipe from Chapter 1, Getting Started with Hadoop v2) program on that cluster.
How to do it...
The following are the steps to deploy a Hadoop cluster on Amazon EC2 using Apache Whirr and to execute the WordCount MapReduce sample on the deployed cluster:
- Download and unzip the Apache Whirr binary distribution from http://whirr.apache.org/. You may be able to install Whirr through your Hadoop distribution as well.
- Run the following command from the extracted directory to verify your Whirr installation:
$ whirr version
Apache Whirr 0.8.2
jclouds 1.5.8
- Export your AWS access keys to the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables:
$ export AWS_ACCESS_KEY_ID=AKIA…
$ export AWS_SECRET_ACCESS_KEY=…
- Generate an RSA key pair using the following command. Note that this key pair is not the same as your AWS key pair.
$ ssh-keygen -t rsa -P ''
- Locate the file named recipes/hadoop-yarn-ec2.properties in your Apache Whirr installation and copy it to your working directory. Change the whirr.hadoop.version property to match a current Hadoop version (for example, 2.5.2) available on the Apache Hadoop downloads page.
- If you provided a custom name for your key pair in the previous step, change the whirr.private-key-file and whirr.public-key-file property values in the hadoop-yarn-ec2.properties file to the paths of the private key and the public key you generated.
- Execute the following command, pointing to your hadoop-yarn-ec2.properties file, to launch your Hadoop cluster on EC2. After successful cluster creation, this command outputs an SSH command that we can use to log in to the EC2 Hadoop cluster.
$ bin/whirr launch-cluster --config hadoop-yarn-ec2.properties
- Traffic from the outside to the provisioned EC2 Hadoop cluster is routed through the master node. Whirr generates a script that starts this proxy, placed in a subdirectory named after your Hadoop cluster inside the ~/.whirr directory. Run it in a new terminal. It will take a few minutes for Whirr to start the cluster and to generate this script.
$ cd ~/.whirr/Hadoop-yarn/
$ ./hadoop-proxy.sh
- You can open the Hadoop web-based monitoring console on your local machine by configuring this proxy in your web browser.
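The proxy started above is an SSH SOCKS proxy (it listens on local port 6666 in the Whirr versions I have used, and Whirr typically drops a ready-made .pac file next to hadoop-proxy.sh in the cluster directory). If you prefer to write the browser proxy auto-config yourself, a minimal PAC file along these lines routes only cluster hostnames through the proxy; the host patterns below are illustrative assumptions, not values generated by Whirr:

```javascript
// Hypothetical PAC file: send EC2 hostnames through the Whirr SOCKS
// proxy and everything else directly. Adjust the patterns and port to
// match your cluster and hadoop-proxy.sh output.
function FindProxyForURL(url, host) {
  if (shExpMatch(host, "*.ec2.internal") ||
      shExpMatch(host, "*.compute.amazonaws.com")) {
    return "SOCKS localhost:6666";
  }
  return "DIRECT";
}
```

Point your browser's automatic proxy configuration at this file (or at the .pac file Whirr generated) before opening the monitoring console.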
- Whirr generates a hadoop-site.xml file for your cluster in the ~/.whirr/<your cluster name> directory. You can use it to issue Hadoop commands from your local machine to your Hadoop cluster on EC2. Export the path of the directory containing the generated hadoop-site.xml to an environment variable named HADOOP_CONF_DIR, and copy the hadoop-site.xml file in that directory to another file named core-site.xml. To execute the Hadoop commands, you should have the Hadoop v2 binaries installed on your machine.
$ cp ~/.whirr/hadoop-yarn/hadoop-site.xml ~/.whirr/hadoop-yarn/core-site.xml
$ export HADOOP_CONF_DIR=~/.whirr/hadoop-yarn/
$ hdfs dfs -ls /
- Create a directory named wc-input-data in HDFS and upload a text dataset to that directory. Depending on the version of Whirr, you may have to create your home directory first.
$ hdfs dfs -mkdir /user/<user_name>
$ hdfs dfs -mkdir wc-input-data
$ hdfs dfs -put sample.txt wc-input-data
- In this step, we run the Hadoop WordCount sample in the Hadoop cluster we started on Amazon EC2:
$ hadoop jar hcb-c1-samples.jar chapter1.WordCount \
wc-input-data wc-out
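The WordCount driver itself comes from the Chapter 1 sample. As a reminder of what it computes, here is a plain-Java sketch of the map-and-reduce logic collapsed into a single in-memory pass; the class and method names are mine, not the book's, and the real sample uses the Hadoop Mapper/Reducer APIs rather than a local map:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordCountSketch {
    // What each map() call does: tokenize a line and emit (word, 1).
    // What reduce() does: sum the emitted counts per word.
    // Here both phases run in one in-memory pass for illustration only.
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue; // skip the empty token produced by leading whitespace
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("to be or not to be"));
    }
}
```

On the cluster, the same computation is distributed: each mapper processes a split of wc-input-data, and the framework groups the (word, 1) pairs by key before the reducers sum them into part-r-* files.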
- View the results of the WordCount computation by executing the following commands:
$ hadoop fs -ls wc-out
Found 3 items
-rw-r--r--   3 thilina supergroup      0 2012-09-05 15:40 /user/thilina/wc-out/_SUCCESS
drwxrwxrwx   - thilina supergroup      0 2012-09-05 15:39 /user/thilina/wc-out/_logs
-rw-r--r--   3 thilina supergroup  19908 2012-09-05 15:40 /user/thilina/wc-out/part-r-00000
$ hadoop fs -cat wc-out/part-* | more
- Issue the following command to shut down the Hadoop cluster. Make sure to download any important data first, as all data will be permanently lost when the cluster is shut down.
$ bin/whirr destroy-cluster --config hadoop-yarn-ec2.properties
How it works...
The following are descriptions of the properties we used in the hadoop-yarn-ec2.properties file.
whirr.cluster-name=Hadoop-yarn
The preceding property provides a name for the cluster. The instances of the cluster will be tagged using this name.
whirr.instance-templates=1 hadoop-namenode+yarn-resource-manager+mapreduce-historyserver, 1 hadoop-datanode+yarn-nodemanager
This property specifies the number of instances to be used for each set of roles and the type of roles for the instances.
whirr.provider=aws-ec2
We use the Whirr Amazon EC2 provider to provision our cluster.
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
These two properties point to the paths of the private key and the public key you provide for the cluster.
whirr.hadoop.version=2.5.2
We specify a custom Hadoop version using the preceding property.
whirr.aws-ec2-spot-price=0.15
This property specifies a bid price for Amazon EC2 Spot Instances. Specifying it triggers Whirr to use Spot Instances for the cluster. If the bid price is not met, Apache Whirr's Spot Instance requests time out after 20 minutes. Refer to the Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe for more details.
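Putting the individual properties together, the relevant portion of the hadoop-yarn-ec2.properties file used in this recipe looks like this. The values are the ones discussed above; adjust the key paths and the spot price to your own setup:

```properties
# Name used to tag the cluster's EC2 instances
whirr.cluster-name=Hadoop-yarn

# One master (NameNode + ResourceManager + history server), one worker
whirr.instance-templates=1 hadoop-namenode+yarn-resource-manager+mapreduce-historyserver, 1 hadoop-datanode+yarn-nodemanager

# Provision on Amazon EC2
whirr.provider=aws-ec2

# Key pair generated with ssh-keygen (not your AWS key pair)
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub

# Hadoop version to install
whirr.hadoop.version=2.5.2

# Optional: use Spot Instances with this bid price (USD/hour)
whirr.aws-ec2-spot-price=0.15
```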
More details on Whirr configuration can be found at http://whirr.apache.org/docs/0.8.1/configuration-guide.html.
See also
The Saving money using Amazon EC2 Spot Instances to execute EMR job flows recipe.