- Hadoop MapReduce v2 Cookbook(Second Edition)
- Thilina Gunarathne
Introduction
In this chapter, we will explore several mechanisms to deploy and execute Hadoop MapReduce v2 and other Hadoop-related computations on cloud environments.
Cloud computing environments such as Amazon EC2 and Microsoft Azure provide on-demand compute and storage resources as a service over the Web. These environments enable us to perform occasional large-scale Hadoop computations without an upfront capital investment, paying only for the actual usage. Another advantage of using cloud environments is the ability to increase the throughput of Hadoop computations by horizontally scaling the number of compute resources with minimal additional cost; for example, the cost of using 10 cloud instances for 100 hours equals the cost of using 100 cloud instances for 10 hours. In addition to storage, compute, and hosted MapReduce services, these cloud environments provide many other distributed computing services that you may find useful when implementing your overall application architecture.
While cloud environments provide many advantages over their traditional counterparts, they also come with several unique reliability and performance challenges due to the virtualized, multi-tenant nature of the infrastructure. For data-intensive Hadoop computations, one of the major challenges is transferring large datasets in and out of the cloud environment. We also need to make sure that any data we need to preserve is stored on a persistent storage medium such as Amazon S3, because any data stored in the ephemeral instance storage of cloud instances is lost when those instances are terminated.
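As a minimal sketch of this point, the following AWS CLI commands copy a job's output from an instance's local disk to Amazon S3 before the instances are terminated. The bucket name and paths are hypothetical placeholders; the actual locations depend on your own job configuration.

    # Copy the job output from the instance's ephemeral disk to persistent S3 storage
    # (bucket name and paths are hypothetical placeholders).
    $ aws s3 sync /home/hadoop/job-output s3://my-results-bucket/wordcount-output/

    # Verify that the files arrived in S3 before terminating the instances.
    $ aws s3 ls s3://my-results-bucket/wordcount-output/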
We will mainly use the Amazon AWS cloud for the recipes in this chapter, because its Linux instance support and hosted Hadoop services are more mature than those of other commercial cloud offerings such as Microsoft Azure.
This chapter guides you through using Amazon Elastic MapReduce (EMR), Amazon's hosted Hadoop infrastructure, to execute traditional MapReduce computations as well as Pig and Hive computations on the Amazon EC2 infrastructure. It also presents how to provision an HBase cluster using Amazon EMR and how to back up and restore the data of an EMR HBase cluster. We will also use Apache Whirr, a cloud-neutral library for deploying services on cloud environments, to provision Apache Hadoop and Apache HBase clusters. As a first taste of the EMR recipes, a minimal cluster-creation command is sketched below.
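The sketch below creates a small EMR cluster with Hadoop, Hive, and Pig using the AWS CLI. The cluster name, key pair, release label, and instance sizing are hypothetical placeholders, and the available options depend on your AWS CLI version and region; the recipes later in this chapter walk through these steps in detail.

    # Launch a small EMR cluster with Hadoop, Hive, and Pig installed
    # (cluster name, release label, key pair, and instance settings are placeholders).
    $ aws emr create-cluster \
        --name "mrv2-cookbook-test" \
        --release-label emr-4.1.0 \
        --applications Name=Hadoop Name=Hive Name=Pig \
        --instance-type m3.xlarge \
        --instance-count 3 \
        --use-default-roles \
        --ec2-attributes KeyName=my-keypair

    # Check the cluster status using the cluster ID returned by the previous command.
    $ aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX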
Tip
Sample code
The example code files for this book are available in the https://github.com/thilg/hcb-v2 repository. The chapter2
folder of the code repository contains the sample source code files for this chapter.