- Hadoop MapReduce v2 Cookbook (Second Edition)
- Thilina Gunarathne
Writing a WordCount MapReduce application, bundling it, and running it using the Hadoop local mode
This recipe explains how to implement a simple MapReduce program to count the number of occurrences of words in a dataset. WordCount is famous as the HelloWorld equivalent for Hadoop MapReduce.
To run a MapReduce job, users should supply a `map` function, a `reduce` function, input data, and a location to store the output data. When executed, Hadoop carries out the following steps:
- Hadoop uses the supplied InputFormat to break the input data into key-value pairs and invokes the `map` function for each key-value pair, providing the key-value pair as the input. When executed, the `map` function can output zero or more key-value pairs.
- Hadoop transmits the key-value pairs emitted from the Mappers to the Reducers (this step is called Shuffle). Hadoop then sorts these key-value pairs by the key and groups together the values belonging to the same key.
- For each distinct key, Hadoop invokes the `reduce` function once, passing that particular key and the list of values for that key as the input.
- The `reduce` function may output zero or more key-value pairs, and Hadoop writes them to the output data location as the final result.
Getting ready
Select the source code for the first chapter from the source code repository for this book. Export the `$HADOOP_HOME` environment variable pointing to the root of the extracted Hadoop distribution.
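If you want to verify the setup before running anything, the following small check can help; it is a convenience sketch, not part of the book's sources, and the class name `CheckHadoopHome` is made up here for illustration.

```java
import java.io.File;

// Optional sanity check: verify that HADOOP_HOME is set and points at an
// extracted Hadoop distribution (the distribution root contains bin/hadoop).
public class CheckHadoopHome {
    public static void main(String[] args) {
        String home = System.getenv("HADOOP_HOME");
        if (home == null || !new File(home, "bin/hadoop").exists()) {
            System.err.println("HADOOP_HOME is not set or does not point to a Hadoop distribution");
        } else {
            System.out.println("Using the Hadoop distribution at " + home);
        }
    }
}
```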
How to do it...
Now let's write our first Hadoop MapReduce program:
- The WordCount sample uses MapReduce to count the number of word occurrences within a set of input documents. The sample code is available in the `chapter1/Wordcount.java` file of the source folder of this chapter. The code has three parts: the Mapper, the Reducer, and the main program.
- The Mapper extends the `org.apache.hadoop.mapreduce.Mapper` class. The Hadoop InputFormat provides each line in the input files as an input key-value pair to the `map` function. The `map` function breaks each line into substrings using whitespace characters as the separator, and for each token (word) emits `(word, 1)` as the output. (A complete, assembled version of the class is sketched after this list.)

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          // Split the input text value to words
          StringTokenizer itr = new StringTokenizer(value.toString());
          // Iterate all the words in the input text value
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, new IntWritable(1));
          }
        }
- Each `reduce` function invocation receives a key and all the values of that key as the input. The `reduce` function outputs the key and the number of occurrences of the key as the output.

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          // Sum all the occurrences of the word (key)
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
- The `main` driver program configures the MapReduce job and submits it to the Hadoop YARN cluster:

        Configuration conf = new Configuration();
        ……
        // Create a new job
        Job job = Job.getInstance(conf, "word count");
        // Use the WordCount.class file to point to the job jar
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Setting the input and output locations
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        // Submit the job and wait for its completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
- Compile the sample using the Gradle build, as mentioned in the introduction of this chapter, by issuing the `gradle build` command from the `chapter1` folder of the sample code repository. Alternatively, you can also use the provided Apache Ant build file by issuing the `ant compile` command.
- Run the WordCount sample using the following command. In this command, `chapter1.WordCount` is the name of the `main` class, `wc-input` is the input data directory, and `wc-output` is the output path. The `wc-input` directory of the source repository contains a sample text file. Alternatively, you can copy any text file to the `wc-input` directory.

        $ $HADOOP_HOME/bin/hadoop jar \
            hcb-c1-samples.jar \
            chapter1.WordCount wc-input wc-output
- The output directory (`wc-output`) will have a file named `part-r-XXXXX`, which will have the count of each word in the document. Congratulations! You have successfully run your first MapReduce program.

        $ cat wc-output/part*
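The fragments above reference a few names that are declared elsewhere in the class, such as the `word` and `result` fields and the `otherArgs` array. For reference, here is a minimal, compilable assembly of the whole program following the standard Apache Hadoop WordCount example; the book's actual `chapter1/WordCount.java` may differ in detail.

```java
package chapter1;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    // The 'word' field referenced in the map function shown earlier
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, new IntWritable(1));
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    // The 'result' field referenced in the reduce function shown earlier
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // 'otherArgs' used in the driver fragment: the input and output paths
    // left over after Hadoop's generic options are parsed out.
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```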
How it works...
In the preceding sample, MapReduce ran in the local mode without starting any servers, using the local filesystem as the storage system for inputs, outputs, and working data. The following diagram shows what happened in the WordCount program under the covers:

The WordCount MapReduce workflow works as follows:
- Hadoop reads the input, breaks it using newline characters as the separator, and then runs the `map` function, passing each line as an argument with the line number as the key and the line contents as the value.
- The `map` function tokenizes the line, and for each token (word), emits a key-value pair `(word, 1)`.
- Hadoop collects all the `(word, 1)` pairs, sorts them by the word, groups all the values emitted against each unique key, and invokes the `reduce` function once for each unique key, passing the key and the values for that key as arguments.
- The `reduce` function counts the number of occurrences of each word using the values and emits the count as a key-value pair.
- Hadoop writes the final output to the output directory.
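To make these steps concrete, the following self-contained sketch replays the same map, shuffle/sort, and reduce phases in plain Java for two sample lines; it uses no Hadoop APIs, and the class name and sample input are purely illustrative.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the WordCount data flow described above.
public class WordCountFlowDemo {
    public static void main(String[] args) {
        List<String> inputLines = Arrays.asList("to be or not", "to be"); // sample input

        // Map phase: emit a (word, 1) pair for every token of every line
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String line : inputLines) {
            for (String word : line.split("\\s+")) {
                mapOutput.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Shuffle phase: sort the pairs by key and group the values emitted for each word
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce phase: sum the grouped values for each word and emit (word, count)
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int count : entry.getValue()) {
                sum += count;
            }
            System.out.println(entry.getKey() + "\t" + sum); // for example, "be  2"
        }
    }
}
```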
There's more...
As an optional step, you can set up and run the WordCount application directly from your favorite Java Integrated Development Environment (IDE). Project files for the Eclipse IDE and the IntelliJ IDEA IDE can be generated by running the `gradle eclipse` and `gradle idea` commands, respectively, in the main folder of the code repository.
For other IDEs, you'll have to add the JAR files in the following directories to the classpath of the IDE project you create for the sample code:
- `{HADOOP_HOME}/share/hadoop/common`
- `{HADOOP_HOME}/share/hadoop/common/lib`
- `{HADOOP_HOME}/share/hadoop/mapreduce`
- `{HADOOP_HOME}/share/hadoop/yarn`
- `{HADOOP_HOME}/share/hadoop/hdfs`
Execute the `chapter1.WordCount` class by passing `wc-input` and `wc-output` as arguments. This will run the sample as before. Running MapReduce jobs from the IDE in this manner is very useful for debugging your MapReduce jobs.
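If configuring program arguments in the IDE is inconvenient, a tiny launcher such as the following (a hypothetical helper, not part of the book's sources) achieves the same result, assuming the sample's main method accepts the two paths as its arguments, as in the command-line invocation above.

```java
// Hypothetical IDE launcher: forwards the input and output paths to the
// sample's main method, so no run-configuration arguments are needed.
public class WordCountIdeLauncher {
    public static void main(String[] args) throws Exception {
        chapter1.WordCount.main(new String[] { "wc-input", "wc-output" });
    }
}
```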
See also
Although you ran the sample with Hadoop installed on your local machine, you can also run it on a distributed Hadoop cluster setup with HDFS as the distributed filesystem. The Running the WordCount program in a distributed cluster environment recipe of this chapter will discuss how to run this sample in a distributed setup.