官术网_书友最值得收藏!

Introduction

In this chapter, we will explore several mechanisms to deploy and execute Hadoop MapReduce v2 and other Hadoop-related computations on cloud environments.

Cloud computing environments such as Amazon EC2 and Microsoft Azure provide on-demand compute and storage resources as a service over the Web. These cloud computing environments enable us to perform occasional large-scale Hadoop computations without an upfront capital investment and require us to pay only for the actual usage. Another advantage of using cloud environments is the ability to increase the throughput of the Hadoop computations by horizontally scaling the number of computing resources with minimal additional cost. For an example, the cost for using 10 cloud instances for 100 hours equals the cost of using 100 cloud instances for 10 hours. In addition to storage, compute, and hosted MapReduce services, these cloud environments provide many other distributed computing services as well, which you may find useful when implementing your overall application architecture.

While the cloud environments provide many advantages over their traditional counterparts, they also come with several unique reliability and performance challenges due to the virtualized, multi-tenant nature of the infrastructure. With respect to the data-intensive Hadoop computations, one of the major challenges would be the transfer of large datasets in and out of the cloud environments. We also need to make sure to use a persistent storage medium to store any data that you need to preserve. Any data that is stored in the ephemeral instance storage of the cloud instances would be lost at the termination of those instances.

We will mainly be using the Amazon AWS cloud for the recipes in this chapter due to the maturity of the Linux instance support and the maturity of hosted Hadoop services compared to the other commercial cloud offerings such as Microsoft Azure cloud.

This chapter guides you on using Amazon Elastic MapReduce (EMR), which is the hosted Hadoop infrastructure, to execute traditional MapReduce computations as well as Pig and Hive computations on the Amazon EC2 infrastructure. This chapter also presents how to provision an HBase cluster using Amazon EMR and how to back up and restore the data of an EMR HBase cluster. We will also use Apache Whirr, a cloud neutral library for deploying services on cloud environments, to provision Apache Hadoop and Apache HBase clusters on cloud environments.

Tip

Sample code

The example code files for this book are available in the https://github.com/thilg/hcb-v2 repository. The chapter2 folder of the code repository contains the sample source code files for this chapter.

主站蜘蛛池模板: 炎陵县| 宣汉县| 当雄县| 邻水| 金寨县| 江永县| 津市市| 鱼台县| 阜平县| 隆德县| 常山县| 讷河市| 施甸县| 桦川县| 专栏| 胶南市| 香格里拉县| 永新县| 剑阁县| 乐业县| 景洪市| 台东县| 五原县| 大石桥市| 于都县| 宁乡县| 逊克县| 汾阳市| 海伦市| 临海市| 大庆市| 探索| 八宿县| 重庆市| 梅州市| 许昌市| 静安区| 江安县| 临清市| 无极县| 双江|