- Mastering Hadoop
- Sandeep Karanth
- 2021-08-06 19:53:00
The RecordReader class
Unlike InputSplit, the RecordReader class presents a record view of the data to the Map task. RecordReader works within each InputSplit and generates records from the data in the form of key-value pairs. The InputSplit boundary is a guideline for RecordReader and is not enforced. At one extreme, a custom RecordReader class can be written to read an entire file (though this is not encouraged). Most often, a RecordReader will have to read from a subsequent InputSplit to present a complete record to the Map task. This happens when a record straddles the boundary between two InputSplits.
The reading of bytes from a subsequent InputSplit happens via FSDataInputStream objects. Though this reading does not respect locality in itself, it generally gathers only a few bytes from the next split, so there is no significant performance overhead. However, when records are very large, performance can suffer because of significant byte transfers across nodes.
In the following diagram, a file with two HDFS blocks has the record R5 spanning both blocks. It is assumed that the minimum split size is less than the block size. In this case, RecordReader gathers the complete record by reading bytes off the next block of data.

File with two blocks and record R5 spanning blocks
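
The behavior described above can be illustrated with a small, in-memory sketch. This is not Hadoop code: the class SplitLineReaderSketch and its methods readSplit and skipLine are hypothetical names, a byte array stands in for the file, and records are assumed to be newline-terminated lines. It follows the convention commonly used by line-oriented readers: every split except the first discards its leading line, and every split keeps reading until a record starts beyond its end offset, so the last record may be completed from the next split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Simplified, in-memory model of split-aware line reading (not Hadoop code). */
class SplitLineReaderSketch {

    /** Returns the records owned by the split covering bytes [start, end) of data. */
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Any split but the first discards its leading line: the reader of the
        // previous split has already consumed it by reading past the boundary.
        if (start != 0) {
            pos = skipLine(data, pos);
        }
        // Keep reading while a record STARTS at or before the split end;
        // the last record may run past `end` into the next split.
        while (pos <= end && pos < data.length) {
            int next = skipLine(data, pos);
            // Exclude the trailing newline, if present, from the record text.
            int lineEnd = (next > pos && data[next - 1] == '\n') ? next - 1 : next;
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return records;
    }

    /** Returns the offset just past the next '\n', or data.length if none remains. */
    static int skipLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    public static void main(String[] args) {
        // R5-long occupies bytes 12..19, so a boundary at offset 15 cuts through it.
        byte[] file = "R1\nR2\nR3\nR4\nR5-long\nR6\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readSplit(file, 0, 15));           // [R1, R2, R3, R4, R5-long]
        System.out.println(readSplit(file, 15, file.length)); // [R6]
    }
}
```

The first split returns R5-long in full even though the record ends after offset 15, and the second split skips the tail of that record, so each record is delivered exactly once. In a real job the extra bytes would come from the next block, possibly on another node, which is the locality cost mentioned earlier.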