- Mastering Hadoop
- Sandeep Karanth
MapReduce input
The Map step of a MapReduce job hinges on the nature of the input provided to the job. The Map step provides maximum parallelism gains, so crafting it smartly is important for job speedup. Input data is split into chunks, each called an InputSplit, and a separate Map task operates on each InputSplit. Two other classes, InputFormat and RecordReader, are also significant in handling inputs to Hadoop jobs.
The InputFormat class
The input data specification for a MapReduce Hadoop job is given via the InputFormat hierarchy of classes. The InputFormat class family has the following main functions:
- Validating the input data, for example, checking for the presence of the file in the given path
- Splitting the input data into logical chunks (InputSplit) and assigning each split to a Map task
- Instantiating a RecordReader object that can work on each InputSplit and produce records for the Map task as key-value pairs
The FileInputFormat subclass and associated classes are commonly used for jobs that take input from HDFS. The DBInputFormat subclass is a specialized class that can read data from a SQL database. CombineFileInputFormat, an abstract direct subclass of FileInputFormat, can combine multiple files into a single split.
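The three responsibilities above can be sketched with a toy example. The class below is a hypothetical, self-contained stand-in for the real Hadoop InputFormat API (the names and shapes are illustrative, not Hadoop's): it validates its input, then carves a notional file into (offset, length) chunks, one per would-be Map task.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the InputFormat contract -- NOT Hadoop's real API.
// Step 1: validate the input; step 2: split it into logical chunks.
// (Step 3, creating a RecordReader per chunk, is omitted for brevity.)
public class ToyInputFormat {
    // Splits a notional file of totalBytes into chunks of at most splitBytes,
    // returning each chunk as a {offset, length} pair.
    static List<long[]> getSplits(long totalBytes, long splitBytes) {
        if (totalBytes < 0 || splitBytes <= 0)
            throw new IllegalArgumentException("invalid input"); // validation step
        List<long[]> splits = new ArrayList<>();
        for (long off = 0; off < totalBytes; off += splitBytes)
            splits.add(new long[] { off, Math.min(splitBytes, totalBytes - off) });
        return splits;
    }
}
```

For a 300-byte input and a 128-byte split size, this yields three splits; the last one covers only the 44 remaining bytes, mirroring how the final split of a real file is usually shorter than the rest.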
The InputSplit class
The abstract InputSplit class and its associated concrete classes represent a byte view of the input data. An InputSplit is characterized by the following main attributes:
- The input filename
- The byte offset in the file where the split starts
- The length of the split in bytes
- The node locations where the split resides
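These four attributes can be captured in a small value class. The class below is a simplified, hypothetical stand-in loosely modeled on Hadoop's FileSplit, not the real class:

```java
import java.util.Arrays;

// Simplified stand-in for a file-backed InputSplit (illustrative only):
// it carries the four attributes listed above.
public class SimpleFileSplit {
    final String path;      // the input filename
    final long start;       // byte offset in the file where the split starts
    final long length;      // length of the split in bytes
    final String[] hosts;   // node locations where the split resides

    SimpleFileSplit(String path, long start, long length, String[] hosts) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.hosts = hosts;
    }

    @Override
    public String toString() {
        return path + ":" + start + "+" + length + " @ " + Arrays.toString(hosts);
    }
}
```

Note that a split is only a byte range plus location hints; it holds no actual data, which is why shipping a split description to a task is cheap.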
In HDFS, a single InputSplit is created per file if the file size is less than the HDFS block size. For example, if the HDFS block size is 128 MB, any file smaller than 128 MB resides in its own InputSplit. For files that span multiple blocks (the file size is greater than the HDFS block size), the split size is computed by the formula given below. The InputSplit size has an upper bound of the HDFS block size, unless the minimum split size is configured to be greater than the block size. Such cases are rare and can lead to locality problems.
Based on the locations of the splits and the availability of resources, the scheduler decides which node executes the Map task for each split. The split is then communicated to that node.
InputSplitSize = Maximum(minSplitSize, Minimum(blockSize, maxSplitSize))

minSplitSize: mapreduce.input.fileinputformat.split.minsize
blockSize: dfs.blocksize
maxSplitSize: mapreduce.input.fileinputformat.split.maxsize
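The formula above translates directly into code. The helper below is a minimal sketch of that max/min rule, taking the three configuration values as plain arguments (the class and method names are illustrative, not Hadoop's own):

```java
// Split size = max(minSplitSize, min(blockSize, maxSplitSize)),
// using the three configuration values named above.
public class SplitSize {
    static long computeSplitSize(long blockSize, long minSplitSize, long maxSplitSize) {
        return Math.max(minSplitSize, Math.min(maxSplitSize, blockSize));
    }
}
```

With a small minimum and a very large maximum, the split size simply equals the block size, which is the common case; raising minSplitSize above the block size forces larger splits that span blocks, producing the locality problems mentioned above, while lowering maxSplitSize below the block size yields more, smaller splits.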