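Requirements:
- Show the average rating of the movie with ID 2116 for each age group.
- Find the 10 movies rated highest by male users that have more than 50 ratings, showing the movie name, average rating, and number of ratings.
First, create all the tables used below:
(HDFS paths are omitted to protect the data.)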
CREATE EXTERNAL TABLE `t_movie_dcx`(
`movieid` int COMMENT 'movie ID',
`moviename` string COMMENT 'movie name',
`movietype` string COMMENT 'movie type')
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
'field.delim'='::')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://'
TBLPROPERTIES (
'bucketing_version'='2',
'transient_lastDdlTime'='1648967349');
CREATE EXTERNAL TABLE `t_user_dcx`(
`userid` int COMMENT '',
`sex` string COMMENT '',
`age` int COMMENT '',
`occupation` int COMMENT 'occupation',
`zipcode` int COMMENT 'zip code')
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
'field.delim'='::')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://xxx'
TBLPROPERTIES (
'bucketing_version'='2',
'transient_lastDdlTime'='1648967215');
CREATE EXTERNAL TABLE `t_rating_dcx`(
`userid` int COMMENT '',
`movieid` int COMMENT '',
`rate` int COMMENT 'rating',
`times` bigint COMMENT 'rating timestamp')
ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
'field.delim'='::')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://xxx'
TBLPROPERTIES (
'bucketing_version'='2',
'transient_lastDdlTime'='1648967291');
1. Show the average rating of the movie with ID 2116 for each age group.
select u.age as age , avg(r.rate) as avgrate
from t_rating_dcx r
join t_user_dcx u on r.userid = u.userid
where r.movieid = 2116 group by u.age;
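Analysis:
A specific condition (movie ID 2116) calls for where.
An average rating calls for avg().
"For each age group" calls for group by.
Data from more than one table calls for a join.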
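As a hedged variation on the query above, the sketch below also reports how many ratings each age group contributed and sorts the output by age, which makes it easier to judge how much weight each average deserves. The total column, the rounding to two decimals, and the ordering are additions for illustration, not part of the original requirement.

```sql
-- Sketch: average rating of movie 2116 per age group, plus the number
-- of ratings behind each average (the "total" column is an assumption).
select u.age                 as age,
       round(avg(r.rate), 2) as avgrate,
       count(*)              as total
from t_rating_dcx r
join t_user_dcx u on r.userid = u.userid
where r.movieid = 2116
group by u.age
order by age;
```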
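2. Find the 10 movies rated highest by male users that have more than 50 ratings, showing the movie name, average rating, and number of ratings.
Two ways to write it; they differ in the group by columns: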
select collect_set(u.sex)[0], collect_set(m.moviename)[0], avg(r.rate) as avgrate, count(m.movieid) as total
from t_rating_dcx r
join t_user_dcx u on r.userid = u.userid
join t_movie_dcx m on r.movieid = m.movieid
where u.sex = "M"
group by m.movieid
having total > 50
order by avgrate desc
limit 10;
select u.sex, m.moviename, avg(r.rate) as avgrate, count(m.moviename) as total
from t_rating_dcx r
join t_user_dcx u on r.userid = u.userid
join t_movie_dcx m on r.movieid = m.movieid
where u.sex = "M"
group by m.moviename, u.sex
having total > 50
order by avgrate desc
limit 10;
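Analysis:
Top 10 calls for order by ... desc together with limit.
Filtering on the number of ratings per movie calls for a filter on the aggregate (not a subquery): having count(...) > 50.
The core task is finding the top 10 movies, so group by movie. Either movieid or moviename works, provided no two movies share a name; otherwise group by movieid.
Columns from all three tables are used, so all three must be joined. With inner joins the join order does not change the result; either way the outcome is one wide joined table.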
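A third way to write the same query, offered here only as a sketch and not as the original author's method, is to aggregate the ratings per movieid in a subquery and join the movie table afterwards. This avoids both collect_set() and grouping by name. The subquery alias s is introduced for illustration.

```sql
-- Sketch: aggregate male ratings per movie first, then look up the name.
select m.moviename as moviename,
       s.avgrate   as avgrate,
       s.total     as total
from (
    select r.movieid   as movieid,
           avg(r.rate) as avgrate,
           count(*)    as total
    from t_rating_dcx r
    join t_user_dcx u on r.userid = u.userid
    where u.sex = 'M'
    group by r.movieid
    having count(*) > 50      -- keep only movies with more than 50 ratings
) s
join t_movie_dcx m on s.movieid = m.movieid
order by avgrate desc
limit 10;
```

Because the grouping key is movieid alone, two movies that happen to share a name would stay separate in the result.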
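Problem encountered:
Expression Not In Group By Key
Solution:
In Hive, every column in the select list that is not inside an aggregate function must also appear in group by. If you do not want to group by such a column, wrap it in collect_set(), which returns an array of its distinct values, and read a value by array index.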
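The rule can be illustrated with a minimal sketch on the tables defined earlier. The first statement is left commented out because it is the kind of query that triggers the error. The two statements after it show the fixes described above. The last one uses max(), which is an alternative not mentioned in the original notes; it works here on the assumption that each movieid maps to exactly one moviename.

```sql
-- Triggers "Expression Not In Group By Key": m.moviename is neither
-- aggregated nor listed in group by.
-- select m.moviename, avg(r.rate) as avgrate
-- from t_rating_dcx r
-- join t_movie_dcx m on r.movieid = m.movieid
-- group by m.movieid;

-- Fix 1: list the column in group by.
select m.movieid, m.moviename, avg(r.rate) as avgrate
from t_rating_dcx r
join t_movie_dcx m on r.movieid = m.movieid
group by m.movieid, m.moviename;

-- Fix 2: wrap the column in collect_set() and take the first element.
select m.movieid, collect_set(m.moviename)[0] as moviename, avg(r.rate) as avgrate
from t_rating_dcx r
join t_movie_dcx m on r.movieid = m.movieid
group by m.movieid;

-- Alternative (assumption: one name per movieid): use an aggregate such as max().
select m.movieid, max(m.moviename) as moviename, avg(r.rate) as avgrate
from t_rating_dcx r
join t_movie_dcx m on r.movieid = m.movieid
group by m.movieid;
```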