MapReduce input

The Map step of a MapReduce job hinges on the nature of the input provided to the job. The Map step offers the greatest parallelism gains, so crafting it smartly is important for job speedup. Input data is split into chunks called InputSplits, and one Map task operates on each InputSplit. Two other classes, InputFormat and RecordReader, are also significant in handling inputs to Hadoop jobs.
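
For orientation, the following minimal driver sketch shows where these classes plug into a job; the class name and the input path are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputWiring {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-wiring-sketch");
        job.setJarByClass(InputWiring.class);
        // The InputFormat validates the input, computes the InputSplits, and
        // supplies the RecordReader that feeds key-value pairs to each Map task.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/input")); // hypothetical path
        // Mapper, reducer, and output settings are omitted for brevity.
    }
}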

The InputFormat class

The input data specification for a MapReduce Hadoop job is given via the InputFormat hierarchy of classes. The InputFormat class family has the following main functions (a minimal sketch follows the list):

  • Validating the input data, for example, checking that the files are present in the given path.
  • Splitting the input data into logical chunks (InputSplits) and assigning each split to a Map task.
  • Instantiating a RecordReader object that works on each InputSplit and produces records for the Map task as key-value pairs.
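
As an illustration of these responsibilities, here is a minimal, hypothetical InputFormat that reuses FileInputFormat's input validation and split computation, and only supplies the RecordReader:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical InputFormat: input validation and split computation are
// inherited from FileInputFormat; only the RecordReader is supplied here.
public class SimpleTextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // LineRecordReader turns each InputSplit into (byte offset, line) pairs.
        return new LineRecordReader();
    }
}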

The FileInputFormat subclass and its associated classes are commonly used for jobs that take input from HDFS. The DBInputFormat subclass is a specialized class that can be used to read data from a SQL database. CombineFileInputFormat is an abstract subclass of FileInputFormat that can combine multiple files into a single split.
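
For example, a driver along the following lines (the path and size are illustrative) uses CombineTextInputFormat, a concrete subclass of CombineFileInputFormat, to pack many small files into fewer splits:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files-sketch");
        // CombineTextInputFormat packs several small files into one split.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 128 MB (illustrative value).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        CombineTextInputFormat.addInputPath(job, new Path("/data/small-files")); // hypothetical path
        // Mapper, reducer, and output settings are omitted.
    }
}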

The InputSplit class

The abstract InputSplit class and its associated concrete classes represent a byte view of the input data. An InputSplit is characterized by the following main attributes (a short accessor sketch follows the list):

  • The input filename
  • The byte offset in the file where the split starts
  • The length of the split in bytes
  • The node locations where the split resides
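
For file-based inputs, the concrete FileSplit class exposes exactly these attributes. A minimal sketch, assuming a FileInputFormat-based job so that the cast is safe, reads them inside a Mapper:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch: inspecting the attributes of the current split inside a Mapper.
public class SplitInfoMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSplit split = (FileSplit) context.getInputSplit();
        System.out.println("file   : " + split.getPath());   // input filename
        System.out.println("start  : " + split.getStart());  // byte offset where the split starts
        System.out.println("length : " + split.getLength()); // split length in bytes
        // Host names of the nodes holding this split's data.
        for (String host : split.getLocations()) {
            System.out.println("host   : " + host);
        }
    }
}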

In HDFS, one InputSplit is created per file if the file is smaller than the HDFS block size. For example, if the HDFS block size is 128 MB, any file smaller than 128 MB resides in its own InputSplit. For files that span multiple blocks (that is, files larger than the HDFS block size), the formula below is used to calculate the split size. The split size is capped at the HDFS block size, unless the minimum split size is configured to be greater than the block size; such cases are rare and can lead to data locality problems.

InputSplitSize = Maximum(minSplitSize, Minimum(blocksize, maxSplitSize))

  • minSplitSize: mapreduce.input.fileinputformat.split.minsize
  • blocksize: dfs.blocksize
  • maxSplitSize: mapreduce.input.fileinputformat.split.maxsize

Based on the locations of the splits and the availability of resources, the scheduler decides which node should execute the Map task for each split. The split is then communicated to the node that executes the task.
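
To make the formula concrete, here is a small arithmetic sketch; the 128 MB block size is illustrative, and the minimum and maximum values correspond to the framework's effective defaults:

public class SplitSizeExample {
    public static void main(String[] args) {
        long minSplitSize = 1L;              // mapreduce.input.fileinputformat.split.minsize (effective default)
        long maxSplitSize = Long.MAX_VALUE;  // mapreduce.input.fileinputformat.split.maxsize (effective default)
        long blockSize = 128L * 1024 * 1024; // dfs.blocksize, 128 MB (illustrative)

        // InputSplitSize = Maximum(minSplitSize, Minimum(blocksize, maxSplitSize))
        long splitSize = Math.max(minSplitSize, Math.min(blockSize, maxSplitSize));
        System.out.println(splitSize); // 134217728: with the defaults, the split size equals the block size
    }
}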

Note

In previous releases of Hadoop, the minimum split size was set through the mapred.min.split.size property and the maximum split size through the mapred.max.split.size property. Both properties are now deprecated.
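
In current releases, the split size can instead be tuned through FileInputFormat's static helpers, which set the new property names; the sizes in this sketch are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeTuning {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-tuning-sketch");
        // Sets mapreduce.input.fileinputformat.split.minsize (illustrative: 64 MB).
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
        // Sets mapreduce.input.fileinputformat.split.maxsize (illustrative: 256 MB).
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}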