- Programming MapReduce with Scalding
- Antonios Chalkiopoulos
MapReduce
Massively parallel processing of large datasets is a complex process. MapReduce simplifies it by providing a design pattern that requires algorithms to be expressed as map and reduce phases. Map can be used to perform simple transformations on data, and reduce is used to group data together and perform aggregations.
By chaining together a number of map and reduce phases, sophisticated algorithms can be expressed. The shared-nothing architecture of MapReduce prohibits communication between map tasks of the same phase, or between reduce tasks of the same phase; any communication that is required happens at the end of each phase.
The simplicity of this model allows Hadoop to translate each phase into tens or even hundreds of tasks executed in parallel, depending on the amount of data that needs to be processed, thus achieving scalable performance.
Internally, the map and reduce tasks follow a simple data representation: everything is a key or a value. A map task receives key-value pairs and applies basic transformations, emitting new key-value pairs. The data is then partitioned, and different partitions are transmitted to different reduce tasks. A reduce task also receives key-value pairs, groups them by key, and applies basic transformations to those groups.
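The key-value contract described above can be sketched in plain Java. Note that this is an illustrative sketch only: Hadoop's real `Mapper` and `Reducer` base classes have different signatures (they receive `Context` objects rather than returning collections), and the interface and field names below are invented for the example.

```java
import java.util.List;
import java.util.Map;

// Minimal sketch of the MapReduce data contract: a mapper turns one input
// pair into zero or more output pairs, and a reducer collapses all values
// that share a key into a new pair. (Hypothetical interfaces, not Hadoop's.)
public class KeyValueContract {

    interface Mapper<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    interface Reducer<K2, V2, K3, V3> {
        Map.Entry<K3, V3> reduce(K2 key, List<V2> values);
    }

    // A mapper that emits (first word of the line, 1) for each input line.
    static final Mapper<Long, String, String, Integer> levelMapper =
        (offset, line) -> List.of(Map.entry(line.split(" ")[0], 1));

    // A reducer that sums all the values grouped under one key.
    static final Reducer<String, Integer, String, Integer> sumReducer =
        (key, values) -> Map.entry(key, values.stream().mapToInt(Integer::intValue).sum());

    public static void main(String[] args) {
        System.out.println(levelMapper.map(0L, "INFO MyApp - Entering application."));
        System.out.println(sumReducer.reduce("INFO", List.of(1, 1)));
    }
}
```

The generic parameters make the contract explicit: the mapper's output key/value types must match the reducer's input key/value types, which is exactly what a Hadoop job configuration declares.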
A MapReduce example
To illustrate how MapReduce works, let's look at an example of a log file of total size 1 GB with the following format:
INFO MyApp - Entering application.
WARNING com.foo.Bar - Timeout accessing DB - Retrying
ERROR com.foo.Bar - Did it again!
INFO MyApp - Exiting application
Tip
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Once this file is stored in HDFS, it is split into eight 128 MB blocks and distributed across multiple Hadoop nodes. In order to build a MapReduce job to count the number of INFO, WARNING, and ERROR log lines in the file, we need to think in terms of map and reduce phases.
In the map phase, we can read local blocks of the file and map each line to a key and a value. We can use the log level as the key and the number 1 as the value. After the map phase completes, the data is partitioned based on the key and transmitted to the reduce tasks.
MapReduce guarantees that the input to every reducer is sorted by key. Shuffle is the process of sorting the output of the map tasks and copying it to the reducers to be used as input. By setting the value to 1 in the map phase, we can easily calculate the total in the reduce phase: reducers receive input sorted by key, aggregate the counters, and store the results.
In the following diagram, every green block represents an INFO message, every yellow block a WARNING message, and every red block an ERROR message:

Implementing the preceding MapReduce algorithm in Java requires the following three classes:
- A Map class to map lines into <key,value> pairs; for example, <"INFO",1>
- A Reduce class to aggregate counters
- A Job configuration class to define the input and output types for all <key,value> pairs and the input and output files
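The complete algorithm can be sketched as an in-memory simulation in plain Java. This is a sketch of the logic only, not the Hadoop implementation: a real job would place the same map and reduce logic into subclasses of `org.apache.hadoop.mapreduce.Mapper` and `Reducer` and wire them together with a `Job` configuration class, and the method names below are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory simulation of the log-level counting job described above.
public class LogLevelCount {

    // Map phase: emit ("INFO", 1), ("WARNING", 1), or ("ERROR", 1) per line.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> lines) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : lines)
            out.add(Map.entry(line.split(" ")[0], 1));  // log level is the first token
        return out;
    }

    // Shuffle: group values by key, sorted by key, as MapReduce guarantees.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : pairs)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        return grouped;
    }

    // Reduce phase: sum the 1s emitted for each key.
    static Map<String, Integer> reducePhase(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((k, v) -> counts.put(k, v.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    static Map<String, Integer> runJob(List<String> lines) {
        return reducePhase(shuffle(mapPhase(lines)));
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "INFO MyApp - Entering application.",
            "WARNING com.foo.Bar - Timeout accessing DB - Retrying",
            "ERROR com.foo.Bar - Did it again!",
            "INFO MyApp - Exiting application");
        System.out.println(runJob(lines));  // {ERROR=1, INFO=2, WARNING=1}
    }
}
```

On a real cluster, `mapPhase` would run as many parallel tasks (one per block), the shuffle would move data across the network, and `reducePhase` would run as one or more reduce tasks; the logic of each step, however, is exactly what this sketch shows.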