For a simple Hive SQL query, reading the execution plan tells you precisely where to optimize.
Take the simple query below and look at what EXPLAIN produces:
select prov_id
      ,deep
      ,count(1) as num
from dim.dim_city
where prov_id = 110000
  and deep = 1
group by prov_id
        ,deep
;
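The plan shown below is obtained by prefixing the query with EXPLAIN, which prints the plan without running the job. (EXPLAIN EXTENDED prints additional detail such as file paths and extra operator attributes.)

```sql
-- Prefix the query with EXPLAIN to get its plan without executing it.
explain
select prov_id
      ,deep
      ,count(1) as num
from dim.dim_city
where prov_id = 110000
  and deep = 1
group by prov_id
        ,deep
;
```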
The execution plan:
// Dependencies between the stages of the job
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
// Details of each stage
STAGE PLANS:
  // Details of Stage-1
  Stage: Stage-1
    Map Reduce
      // Operations performed in the map phase
      Map Operator Tree:
          TableScan
            // The table being scanned
            alias: dim_city
            // filterExpr: the filter predicate pushed down to the table scan
            filterExpr: ((prov_id = 110000) and (deep = 1)) (type: boolean)
            // Statistics for this operator: the number of rows read and the data size; both are estimates
            Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
            // The actual filter operation, which drops the non-matching rows
            Filter Operator
              predicate: ((prov_id = 110000) and (deep = 1)) (type: boolean)
              // Estimated row count and data size after filtering
              Statistics: Num rows: 943 Data size: 130444 Basic stats: COMPLETE Column stats: NONE
              // Column projection
              Select Operator
                Statistics: Num rows: 943 Data size: 130444 Basic stats: COMPLETE Column stats: NONE
                // Group-by aggregation over the filtered rows
                Group By Operator
                  // The aggregate function; mode 'hash' below means a map-side hash aggregation
                  aggregations: count()
                  // The grouping key; the optimizer has folded prov_id to the constant 110000 because the filter fixes it
                  keys: 110000 (type: int)
                  mode: hash
                  // Internal column names assigned to the outputs (_col0 = key, _col1 = partial count)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 943 Data size: 130444 Basic stats: COMPLETE Column stats: NONE
                  // What the map-side aggregation emits to the shuffle
                  Reduce Output Operator
                    // Map and reduce output are key-value pairs; here key -> _col0, value -> _col1
                    key expressions: _col0 (type: int)
                    // Sort order of the output; '+' means ascending
                    sort order: +
                    // Partition column used when shuffling map output to the reducers, i.e. partition by _col0
                    Map-reduce partition columns: _col0 (type: int)
                    Statistics: Num rows: 943 Data size: 130444 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col1 (type: bigint)
      // Now the reduce phase
      Reduce Operator Tree:
        Group By Operator
          // The aggregation merges the partial counts carried in VALUE._col0
          aggregations: count(VALUE._col0)
          // The grouping key
          keys: KEY._col0 (type: int)
          mode: mergepartial
          // Two output columns: _col0 is the grouping key, _col1 is the final count
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 471 Data size: 65152 Basic stats: COMPLETE Column stats: NONE
          // Final projection: the constants 110000 (prov_id) and 1 (deep) are re-inserted alongside the count (_col1)
          Select Operator
            expressions: 110000 (type: int), 1 (type: int), _col1 (type: bigint)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 471 Data size: 65152 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 471 Data size: 65152 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  // SerDe used to serialize the output rows
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  // Stage-0: the client-side fetch of the final result; limit -1 means the query has no LIMIT clause
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
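One practical takeaway from this plan: the map-side Group By Operator with mode: hash is a partial (map-side) aggregation, and the reduce-side mode: mergepartial combines those partial counts. This behavior is controlled by the Hive setting hive.map.aggr (default true); as a sketch, you can disable it and re-run EXPLAIN to watch the map-side Group By Operator disappear:

```sql
-- hive.map.aggr controls map-side partial aggregation (default: true).
-- With it disabled, the map phase shuffles raw rows and the reduce-side
-- Group By Operator does all the counting (typically shown as mode: complete).
set hive.map.aggr=false;
explain
select prov_id
      ,deep
      ,count(1) as num
from dim.dim_city
where prov_id = 110000
  and deep = 1
group by prov_id
        ,deep
;
```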