7.2 MapReduce (Part 2) · Hadoop

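1. Combiner

Suppose we have 1 billion records. The Mappers will emit 1 billion key-value pairs, all of which must be transferred over the network. If we only want the maximum value, however, each Mapper clearly only needs to output the largest value it has seen. Doing so relieves pressure on the network and can also greatly improve the efficiency of the job.

Every map task may produce a large amount of local output. The Combiner merges the multiple <KEY,VALUE> pairs that one map task produces for the same key into a new <KEY,VALUE> pair, and that new pair becomes the input to reduce. This reduces the amount of data transferred between the map and reduce nodes and improves network I/O performance.

When a Combiner applies

A Combiner cannot be used in every situation.

Suitable scenarios:

- Aggregating records (for example, summing)
- Finding a maximum or minimum

Unsuitable scenarios:

- Computing an average (the average of per-map averages is generally not the overall average, so pre-averaging on the map side changes the result)

Custom Combiner

```
public static class HotCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws java.io.IOException, InterruptedException {
        // ……
    }
}

// Add the code that sets the Combiner on the job:
job.setCombinerClass(HotCombiner.class);
```

2. Partitioner

The Partitioner decides which Reducer each key-value pair is sent to. The number of Reducers itself is set with job.setNumReduceTasks. For example, suppose we have flight data for 2018 and want one output file per month. We should define a custom Partitioner that routes each month's records to their own partition, and configure 12 reduce tasks so that each one writes one file.

```
public static class MyPartition extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int num) {
        // ……
    }
}

// Add the code that sets the Partitioner and the number of reduce tasks on the job:
job.setPartitionerClass(MyPartition.class);
job.setNumReduceTasks(12);
```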
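
To make the per-month flight example more concrete, here is a minimal plain-Java sketch of the kind of logic that might go inside `getPartition`. It is not the original author's implementation: the class name `MonthPartitionSketch` and the assumption that every key starts with a date in `yyyy-MM-dd` form are hypothetical, chosen only for illustration. It deliberately uses `String` instead of Hadoop's `Text` so it can be run without a Hadoop installation.

```java
// Plain-Java sketch of logic that could sit inside Partitioner.getPartition.
// Assumption (hypothetical): each key begins with a date such as "2018-03-21".
class MonthPartitionSketch {

    // Returns a partition index in the range [0, numPartitions).
    static int getPartition(String key, int numPartitions) {
        // Characters at index 5 and 6 hold the two-digit month, e.g. "03".
        int month = Integer.parseInt(key.substring(5, 7));
        // Months 1..12 map to indices 0..11. The modulo keeps the index valid
        // if the job is configured with fewer than 12 reduce tasks.
        return (month - 1) % numPartitions;
    }

    public static void main(String[] args) {
        String[] keys = {"2018-01-15", "2018-03-21", "2018-12-31"};
        for (String key : keys) {
            System.out.println(key + " -> partition " + getPartition(key, 12));
        }
    }
}
```

In a real job the same arithmetic would be applied to `key.toString()` inside the `Partitioner` subclass. With 12 reduce tasks, each month then ends up in its own output file.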