MapReduce input

The Map step of a MapReduce job hinges on the nature of the input provided to the job. The Map step offers the greatest parallelism gains, so crafting it smartly is important for job speedup. Input data is split into chunks, each represented by an InputSplit, and a Map task operates on each InputSplit. Two other classes, InputFormat and RecordReader, are also significant in handling inputs to Hadoop jobs.

The InputFormat class

The input data specification for a Hadoop MapReduce job is given via the InputFormat hierarchy of classes. The InputFormat family of classes has the following main functions:

  • Validating the input data, for example, checking for the presence of files in the given input path.
  • Splitting the input data into logical chunks (InputSplit) and assigning each split to a Map task.
  • Instantiating a RecordReader object for each InputSplit that delivers records to the Map task as key-value pairs.

The FileInputFormat subclass and associated classes are commonly used for jobs that take input from HDFS. The DBInputFormat subclass is a specialized class that can be used to read data from a SQL database. CombineFileInputFormat, a direct abstract subclass of FileInputFormat, can pack multiple files into a single split.
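
To illustrate, the following is a minimal driver sketch that wires a job to TextInputFormat, the default FileInputFormat subclass; the class name InputFormatDriver and the use of command-line arguments for the paths are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver; InputFormatDriver is a hypothetical class name.
public class InputFormatDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-example");
        job.setJarByClass(InputFormatDriver.class);

        // TextInputFormat splits files into lines and hands the Map task
        // <byte offset, line> key-value pairs via its LineRecordReader.
        job.setInputFormatClass(TextInputFormat.class);

        // FileInputFormat validates the input path and computes the InputSplits.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // With no Mapper or Reducer set, Hadoop's identity defaults are used.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}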

The InputSplit class

The abstract InputSplit class and its associated concrete classes represent a byte view of the input data. An InputSplit is characterized by the following main attributes:

  • The input filename
  • The byte offset in the file where the split starts
  • The length of the split in bytes
  • The node locations where the split resides
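
These attributes can be inspected from within a Map task; the following is a sketch assuming a FileInputFormat-based job (SplitAwareMapper is a hypothetical name):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical Mapper that logs the attributes of its InputSplit.
public class SplitAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // For FileInputFormat-based jobs, the split is a FileSplit.
        FileSplit split = (FileSplit) context.getInputSplit();
        System.out.println("file:   " + split.getPath());   // input filename
        System.out.println("start:  " + split.getStart());  // byte offset where the split begins
        System.out.println("length: " + split.getLength()); // split length in bytes
        // Node locations guide the scheduler; the list may be empty by the
        // time the task runs, as locations are not shipped with the split.
        System.out.println("hosts:  " + String.join(",", split.getLocations()));
    }
}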

In HDFS, one InputSplit is created per file if the file is smaller than the HDFS block size. For example, if the HDFS block size is 128 MB, any file smaller than 128 MB resides in its own InputSplit. Files larger than the block size are broken into multiple splits, whose size is computed using the formula shown below. The split size has an upper bound of the HDFS block size, unless the minimum split size is configured to be greater than the block size; such cases are rare and can lead to data locality problems.

Based on the locations of the splits and the availability of resources, the scheduler decides which node should execute the Map task for each split. The split is then communicated to the node that executes the task.

InputSplitSize = Maximum(minSplitSize, Minimum(blocksize, maxSplitSize))

minSplitSize: mapreduce.input.fileinputformat.split.minsize
blocksize: dfs.blocksize
maxSplitSize: mapreduce.input.fileinputformat.split.maxsize
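
A worked sketch of this computation, assuming the default configuration values (a minimum split size of 1 byte and an effectively unbounded maximum); note that the real FileInputFormat also applies a small slop factor before emitting the final split:

public class SplitSizeExample {
    // Mirrors the formula: Maximum(minSplitSize, Minimum(blocksize, maxSplitSize)).
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // dfs.blocksize: 128 MB
        long minSize = 1L;                   // split.minsize default
        long maxSize = Long.MAX_VALUE;       // split.maxsize default

        long size = splitSize(minSize, maxSize, blockSize); // 128 MB

        // A 300 MB file yields splits of 128 MB, 128 MB, and 44 MB.
        long fileLength = 300L * 1024 * 1024;
        for (long offset = 0; offset < fileLength; offset += size) {
            long length = Math.min(size, fileLength - offset);
            System.out.println("offset=" + offset + " length=" + length);
        }
    }
}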

Note

In previous releases of Hadoop, the minimum split size was given by the mapred.min.split.size property and the maximum split size by mapred.max.split.size. Both properties are now deprecated.
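
With the current API, the equivalent per-job settings can also be applied through FileInputFormat's helper methods; the sizes below are illustrative:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    public static void configure(Job job) {
        // Sets mapreduce.input.fileinputformat.split.minsize for this job.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
        // Sets mapreduce.input.fileinputformat.split.maxsize for this job.
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
    }
}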
