  • Mastering Hadoop
  • Sandeep Karanth
  • 226 words
  • 2021-08-06 19:53:00

The RecordReader class

Unlike InputSplit, which describes a byte-oriented chunk of the input, the RecordReader class presents a record view of the data to the Map task. RecordReader works within each InputSplit and generates records from the data in the form of key-value pairs. The InputSplit boundary is a guideline for RecordReader and is not enforced. At one extreme, a custom RecordReader class can be written to read an entire file (though this is not encouraged). Most often, a RecordReader has to read past the end of its own InputSplit into the subsequent one to present a complete record to the Map task. This happens when a record straddles the boundary between two InputSplits.

Bytes from a subsequent InputSplit are read through FSDataInputStream objects. This read does not itself respect data locality, but it generally gathers only a few bytes from the next split, so the performance overhead is not significant. When records are very large, however, the cross-node byte transfers can noticeably affect performance.
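The boundary handling described above can be illustrated with a small plain-Java simulation. This is only a sketch under stated assumptions, not Hadoop's implementation: the class `SplitBoundaryDemo` and its methods are hypothetical names, the "file" is an in-memory byte array, and records are assumed to be newline-terminated lines. The convention it follows is one common way to make every record belong to exactly one split: a reader whose split does not start at offset 0 discards its first (possibly partial) line, and every reader keeps going past the end of its split until the record it has started is complete.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: simulates record reading over byte-range splits.
class SplitBoundaryDemo {

    // Returns the newline-terminated records owned by the split [start, start + length).
    static List<String> readSplit(byte[] file, int start, int length) {
        int end = start + length;
        int pos = start;
        // A split that does not begin the file skips its first line; the reader
        // of the previous split is responsible for that record.
        if (start != 0) {
            pos = skipLine(file, pos);
        }
        List<String> records = new ArrayList<>();
        // A record is read if it starts at or before 'end', even when its
        // bytes extend into the next split.
        while (pos <= end && pos < file.length) {
            int next = skipLine(file, pos);
            boolean terminated = next > pos && file[next - 1] == '\n';
            int lineEnd = terminated ? next - 1 : next;
            records.add(new String(file, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return records;
    }

    // Returns the offset just past the next newline, or the file length if none remains.
    static int skipLine(byte[] file, int pos) {
        while (pos < file.length && file[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, file.length);
    }

    public static void main(String[] args) {
        byte[] file = "R1\nR2\nR3\nR4\nR5-spans-both\nR6\n".getBytes(StandardCharsets.UTF_8);
        // The split boundary at offset 18 falls in the middle of record R5.
        System.out.println(readSplit(file, 0, 18));  // [R1, R2, R3, R4, R5-spans-both]
        System.out.println(readSplit(file, 18, 11)); // [R6]
    }
}
```

In this sketch the first reader consumes the tail of R5 from the second split's byte range, which corresponds to the non-local read discussed above; the larger the record, the more bytes cross the boundary.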

In the following diagram, a file with two HDFS blocks has the record R5 spanning both blocks. It is assumed that the minimum split size is less than the block size. In this case, RecordReader gathers the complete record by reading bytes from the next block of data.

File with two blocks and record R5 spanning blocks
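The assumption that the minimum split size is less than the block size matters because of how file-based input formats typically size their splits. The sketch below mirrors the formula used by `FileInputFormat.computeSplitSize` in Hadoop's MapReduce API; the wrapper class `SplitSizeSketch` and the 128 MB block size are assumptions chosen for illustration only.

```java
// Hypothetical wrapper class illustrating the split-size formula.
class SplitSizeSketch {

    // Split size is the block size clamped between the minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed HDFS block size: 128 MB
        long maxSize = Long.MAX_VALUE;       // assumed: no upper limit configured

        // Minimum split size below the block size: one split per block.
        System.out.println(computeSplitSize(blockSize, 1L, maxSize)); // 134217728

        // Minimum split size above the block size: a split covers more than one block.
        System.out.println(computeSplitSize(blockSize, 256L * 1024 * 1024, maxSize)); // 268435456
    }
}
```

With the minimum below the block size, each of the two blocks in the diagram becomes its own split, so the block boundary that R5 crosses is also a split boundary.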
