Grouping with group by · Hadoop2.x

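The `GROUP BY` clause is usually used together with aggregate functions: it groups rows by one or more columns and then applies the aggregation to each group.

```sql
-- Average salary of each department in the emp table.
-- Every column in the select list must either appear in group by
-- or be produced by an aggregate function.
-- For example, selecting deptno requires deptno to appear in group by.
select t.deptno, avg(t.sal) avg_sal
from emp t
group by t.deptno;

-- Highest salary for each job within each department of emp.
select t.deptno, t.job, max(t.sal) max_sal
from emp t
group by t.deptno, t.job;
```

<br/>

**Notes on `group by`:**

```sql
-- (1) group by cannot be used together with distinct in the same query.

-- Invalid: the query is rejected
select distinct id from student group by id;

-- (2) Apart from aggregate functions, every column in the select list
--     must appear in group by; otherwise the query is rejected.

-- Valid: both id and name appear in group by
select id, name from student group by id, name;

-- Valid: age does not appear in group by, but avg is an aggregate function
select id, avg(age) as avg_age from student group by id;

-- Invalid: age does not appear in group by, and floor is not an aggregate function
select id, floor(age) as floor_age from student group by id;

-- Valid: age now appears in group by
select id, floor(age) as floor_age from student group by id, age;
```

**Properties of `group by`:**

(1) It runs a reduce phase, so its parallelism is bounded by the number of reducers. The number of reducers is set with the parameter `mapred.reduce.tasks` (in Hadoop 2.x this property is named `mapreduce.job.reduces`; the older name is deprecated).

(2) The number of output files equals the number of reducers, and the size of each output file depends on how much data the corresponding reducer processes.

<br/>

**Problems with `group by`:**

(1) Heavy network load, because the rows of each group have to be shuffled from the map tasks to the reducers.

(2) Data skew may occur when a few keys account for most of the rows. Setting the parameter `hive.groupby.skewindata` to `true` can mitigate this.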
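
The reducer count and the skew setting mentioned above can be applied per session before running a query. The following is a minimal sketch, not a tuning recommendation: the value `10` is purely illustrative, and `hive.map.aggr` is an additional Hive setting not discussed in the text, included here on the assumption that map-side partial aggregation is an acceptable way to reduce shuffle traffic for the workload.

```sql
-- Illustrative value only: fix the number of reducers for this session.
-- (On Hadoop 2.x the preferred property name is mapreduce.job.reduces.)
set mapred.reduce.tasks=10;

-- Let Hive plan a skewed group by as two MapReduce jobs:
-- the first spreads rows randomly across reducers for partial aggregation,
-- the second groups the partial results by key for the final aggregation.
set hive.groupby.skewindata=true;

-- Assumption, not from the text above: aggregate partially on the map side
-- so that less data is shuffled across the network.
set hive.map.aggr=true;

-- Same query as in the first example; with 10 reducers it would be
-- expected to write up to 10 output files.
select t.deptno, avg(t.sal) avg_sal
from emp t
group by t.deptno;
```

Because `set` only affects the current session, these values can be tried on a single query without changing cluster-wide configuration.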