MapReduce Core Classes · Hadoop 2.x

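![](https://img.kancloud.cn/f7/70/f770105fd9104aabc3e424915f070811_922x555.png)

[TOC]

# 1. InputFormat

The main job of InputFormat is to decide which data each map task reads and how it reads that data. <ins>Which data each map task reads is determined by the InputSplit (data split), and how the data is read is determined by the RecordReader</ins>. InputFormat provides the methods for obtaining both the InputSplits and the RecordReader.

![](https://img.kancloud.cn/b2/b7/b2b7cd0e46a25b58a5c5fe7a8a900cb5_1129x474.png)

**InputSplit:** Before the map phase starts, InputSplits are created from the input files.

* Each InputSplit corresponds to one Mapper task
* An input split stores the split length and an array of the locations where its data is stored

![](https://img.kancloud.cn/cb/b4/cbb4dbb9868bcad5777dbeb0a89641c3_896x453.png)

**Difference between a block and a split:**

* A block is the physical representation of the data; a split is the logical representation of the data in the blocks
* Logically, a split ends on a record boundary (the RecordReader takes care of records that straddle a byte boundary)
* The number of splits should be no greater than the number of blocks (usually the two are equal)

<br/>

# 2. InputFormat implementation classes

![](https://img.kancloud.cn/dd/34/dd34c30fa28f9b5c4aeef6bb7dfd45d3_1153x501.png)

InputFormat has many implementation classes, but the ones most often used in development are the file-based one (FileInputFormat) and the database-based one (DBInputFormat). This course focuses on FileInputFormat; for DBInputFormat it is enough to know that it exists.

1. **FileInputFormat source walkthrough** (follow along in the FileInputFormat source code)

![](https://img.kancloud.cn/44/f9/44f92f16148a87147b97382c00c30987_1038x525.png)

(1) Locate the directory where the input data is stored.

(2) Iterate over every file in the directory and plan its splits.

(3) Process the first file, hello.txt:

* a) Get the file size: `fs.sizeOf(hello.txt)`.
* b) Compute the split size: <ins>`computeSplitSize()` returns `Math.max(minSize, Math.min(maxSize, blockSize))` = blockSize = 128M</ins>.
* c) <ins>By default, split size = blockSize</ins>.
* d) Start cutting. The 1st split is hello.txt, 0 to 128M; the 2nd split is hello.txt, 128M to 256M; the 3rd split is hello.txt, 256M to 300M (<ins>before each cut, check whether the remaining part is larger than 1.1 times the split size; if it is not, the whole remainder becomes a single split</ins>).
* e) Write the split information to a split plan file.
* f) The core of the whole process is in the `getSplits()` method of the FileInputFormat class; see the source code.
* g) <ins>Splitting only divides the input data logically; it does not cut the data into pieces on disk</ins>. An InputSplit records only the metadata of the split, such as its start offset, its length, and the list of nodes that hold the data.
* h) Note: <ins>a block is how HDFS physically stores the data; a split is a logical division of the data</ins>.

(4) Submit the split plan file to YARN. The MRAppMaster on YARN then uses the split plan file to work out how many MapTasks to start.

2. 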
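**FileInputFormat split-size parameters**

From the source code, FileInputFormat computes the split size as <ins>Math.max(minSize, Math.min(maxSize, blockSize))</ins>. The split size is therefore determined by these values:

```
mapreduce.input.fileinputformat.split.minsize=1                 (default: 1)
mapreduce.input.fileinputformat.split.maxsize=Long.MAX_VALUE    (default: Long.MAX_VALUE)
```

So by default, split size = blockSize.

* maxsize (maximum split size): if this is set below blockSize, splits become smaller and the split size equals the configured value.
* minsize (minimum split size): if this is set above blockSize, splits become larger than blockSize.

3. **Split information API: the MapTask context object exposes the current split**

```java
// Requires: import org.apache.hadoop.mapreduce.lib.input.FileSplit;
// For file-based input formats, the split can be cast to FileSplit
FileSplit inputSplit = (FileSplit) context.getInputSplit();
// Get the name of the file this split belongs to
String name = inputSplit.getPath().getName();
```

4. **Summary**

Default split rules of FileInputFormat:

(1) Splits are cut simply by the byte length of the file content.

(2) The split size equals the block size by default.

(3) Splitting does not consider the data set as a whole; each file is split on its own.

<br/>

# 3. FileInputFormat implementation classes

FileInputFormat is actually an abstract class with many implementations. The default one is TextInputFormat.

1. **TextInputFormat**

TextInputFormat is the default InputFormat. Each record is one line of input. <ins>The key is a LongWritable holding the byte offset of the line within the whole file. The value is the content of the line, excluding any line terminators (newline and carriage return)</ins>. As an example, suppose a split contains the following 4 text records.

```txt
Rich learning form
Intelligent learning engine
Learning more convenient
From the real demand for more close to the enterprise
```

Each record is represented as the following key/value pair.

```txt
(0,Rich learning form)
(19,Intelligent learning engine)
(47,Learning more convenient)
(72,From the real demand for more close to the enterprise)
```

Clearly, the key is not the line number. Line numbers are generally hard to obtain, because files are cut into splits by bytes rather than by lines.

2. **KeyValueTextInputFormat** (extended content)

Each line is one record, divided into a key and a value by a separator. The separator can be set in the driver class with `conf.set(KeyValueLineRecordReader.KEY_VALUE_SEPERATOR, " ");`. The default separator is a tab (\t).<br/>

As an example, the input is a split containing 4 records, where `-->` stands for a (horizontal) tab character.

```
line1 -->Rich learning form
line2 -->Intelligent learning engine
line3 -->Learning more convenient
line4 -->From the real demand for more close to the enterprise
```

Each record is represented as the following key/value pair.

```
(line1,Rich learning form)
(line2,Intelligent learning engine)
(line3,Learning more convenient)
(line4,From the real demand for more close to the enterprise)
```

Here the key is the Text sequence that precedes the tab on each line.

3. 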
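**NLineInputFormat** (extended content)

With NLineInputFormat, the InputSplit handled by each map task is no longer divided by block but by the number of lines N configured for NLineInputFormat. That is, `total lines in the input file / N = number of splits`; if the division is not exact, `number of splits = quotient + 1`. As an example, take the same 4 input lines as above.

```
Rich learning form
Intelligent learning engine
Learning more convenient
From the real demand for more close to the enterprise
```

If N is 2, each input split contains two lines, and 2 MapTasks are started. The first mapper receives the first two lines:

```
(0,Rich learning form)
(19,Intelligent learning engine)
```

The other mapper receives the last two lines:

```
(47,Learning more convenient)
(72,From the real demand for more close to the enterprise)
```

The keys and values here are the same as those produced by TextInputFormat.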
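
The split-planning rules from section 2 and the NLineInputFormat split count can be tried out in plain Java, without Hadoop on the classpath. The following is a minimal sketch and not the FileInputFormat source: the class `SplitPlanningSketch` and its methods are hypothetical names chosen for illustration, and it ignores details such as zero-length or non-splittable files.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; class and method names are hypothetical.
final class SplitPlanningSketch {

    // Keep cutting full-size splits only while remaining / splitSize > 1.1.
    private static final double SPLIT_SLOP = 1.1;

    // Math.max(minSize, Math.min(maxSize, blockSize)), as described in section 2.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Plans the splits of a single file; each entry is {offset, length} in bytes.
    static List<long[]> planSplits(long fileLength, long blockSize,
                                   long minSize, long maxSize) {
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            // The remainder (at most 1.1 x splitSize) becomes the last split.
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    // NLineInputFormat-style count: total lines / N, rounded up.
    static long nLineSplitCount(long totalLines, long n) {
        return (totalLines + n - 1) / n;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // The 300M hello.txt example with the default min/max sizes.
        for (long[] s : planSplits(300 * mb, 128 * mb, 1L, Long.MAX_VALUE)) {
            System.out.println("offset=" + s[0] / mb + "M length=" + s[1] / mb + "M");
        }
        // A 130M file stays in one split because 130 / 128 is below 1.1.
        System.out.println(planSplits(130 * mb, 128 * mb, 1L, Long.MAX_VALUE).size());
        // 4 input lines with N = 2.
        System.out.println(nLineSplitCount(4, 2));
    }
}
```

Running `main` lists the three splits of the 300M file (0M, 128M, and 256M offsets, the last one 44M long), then prints 1 for the 130M file and 2 for the four-line NLineInputFormat example. Passing a `maxSize` below the block size, or a `minSize` above it, shows how the two parameters shrink or enlarge the splits.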