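Let's look at bucketed tables. Partitioning and bucketing are both hard for beginners to grasp, yet once you understand them they turn out to be quite simple.
First we create a bucketed table and try loading a data file into it directly.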
create table student4(sno int,sname string,sex string,sage int, sdept string) clustered by(sno) into 3 buckets row format delimited fields terminated by ',';
-- enforce bucketing
set hive.enforce.bucketing = true;
load data local inpath '/home/hadoop/hivedata/students.txt' overwrite into table student4;
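Even though we enabled enforced bucketing, the student4 table directory contains just one file, students.txt. Bucketing works like partitioning in MapReduce: every bucket should be its own file, so 3 buckets should produce 3 files. Here load data simply placed the file in the table directory without redistributing any rows, so this approach did not bucket the data.
Now we use an insert to put the same data into another bucketed table.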
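To see what the plain load skipped, it helps to look at how a row would be routed to a bucket. The following is a minimal Python sketch, not Hive's actual code. It assumes the classic rule for an int column, where the hash of the value is the value itself and the bucket is that hash modulo the bucket count (newer Hive versions may use a different hash function). The function name bucket_for and the student numbers are made up for illustration and are not taken from students.txt.

```python
def bucket_for(sno: int, num_buckets: int) -> int:
    # Assumption: classic Hive rule for an int column. The hash of an int is
    # the int itself, and the sign bit is masked off before the modulo.
    return (sno & 0x7FFFFFFF) % num_buckets


# Hypothetical student numbers, not read from students.txt.
sample_snos = [95001, 95002, 95003, 95004, 95005, 95006]

# Group the rows the way a table "clustered by(sno) into 3 buckets" would.
buckets = {}
for sno in sample_snos:
    buckets.setdefault(bucket_for(sno, 3), []).append(sno)

for bucket_id in sorted(buckets):
    print(bucket_id, buckets[bucket_id])
```

Each of the three groups would end up in its own file. A single file in the table directory therefore means no such routing took place.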
-- Create a second bucketed table
create table stu_buck(sno int,sname string,sex string,sage int,sdept string) clustered by(sno) sorted by(sno DESC) into 4 buckets row format delimited fields terminated by ',';
-- Enable enforced bucketing and set the number of reducers to the number of buckets
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
-- Start inserting data into the bucketed table (the inserted rows must already be bucketed and sorted)
-- Use distribute by(sno) sort by(sno asc), or cluster by(column) when the bucketing column and the sort column are the same
-- Note: cluster by is equivalent to bucketing plus sorting (distribute by + sort by), always ascending
-- Note: the table declares sorted by(sno DESC), while this query sorts ascending
insert into table stu_buck select sno,sname,sex,sage,sdept from student distribute by(sno) sort by(sno asc);

Query ID = root_20171109145012_7088af00-9356-46e6-a988-f1fc5f6d2e13
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1510197346181_0014, Tracking URL = http://server71:8088/proxy/application_1510197346181_0014/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1510197346181_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
2017-11-09 14:50:59,642 Stage-1 map = 0%, reduce = 0%
2017-11-09 14:51:38,682 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.04 sec
2017-11-09 14:52:31,935 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 7.91 sec
2017-11-09 14:52:33,467 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 15.51 sec
2017-11-09 14:52:39,420 Stage-1 map = 100%, reduce = 83%, Cumulative CPU 22.5 sec
2017-11-09 14:52:40,953 Stage-1 map = 100%, reduce = 92%, Cumulative CPU 25.86 sec
2017-11-09 14:52:42,243 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 28.01 sec
MapReduce Total cumulative CPU time: 28 seconds 10 msec
Ended Job = job_1510197346181_0014
Loading data to table default.stu_buck
Table default.stu_buck stats: [numFiles=4, numRows=22, totalSize=527, rawDataSize=505]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 4 Cumulative CPU: 28.01 sec HDFS Read: 18642 HDFS Write: 819 SUCCESS
Total MapReduce CPU Time Spent: 28 seconds 10 msec
OK
Time taken: 153.794 seconds
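We set the number of reducers to 4. Anyone who has studied MapReduce knows that the number of reducers equals the number of partitions, which in turn equals the number of output files. The table stats in the job log confirm this: numFiles=4.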
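The relationship between reducers, partitions, and output files can be simulated in a few lines. This is a rough Python sketch under stated assumptions, not what Hive actually executes: the key is hashed as its own int value, each reducer writes exactly one file, and the files are named 000000_0, 000001_0 and so on. The function distribute_and_sort and the sample rows are hypothetical.

```python
from collections import defaultdict

NUM_REDUCERS = 4  # mirrors: set mapreduce.job.reduces=4


def distribute_and_sort(rows, num_reducers):
    """Mimic 'distribute by(sno) sort by(sno asc)' on (sno, sname) tuples."""
    partitions = defaultdict(list)
    for row in rows:
        sno = row[0]
        # Assumption: an int key hashes to itself, then modulo reducer count.
        partitions[(sno & 0x7FFFFFFF) % num_reducers].append(row)
    # One output "file" per reducer, even if empty, sorted by sno ascending.
    return {
        f"{rid:06d}_0": sorted(partitions[rid], key=lambda r: r[0])
        for rid in range(num_reducers)
    }


# Hypothetical rows in arbitrary order, not the real contents of student.
rows = [
    (95008, "h"), (95003, "c"), (95001, "a"), (95006, "f"),
    (95005, "e"), (95002, "b"), (95007, "g"), (95004, "d"),
]

files = distribute_and_sort(rows, NUM_REDUCERS)
for name, content in files.items():
    print(name, [sno for sno, _ in content])
```

Four reducers yield four files, and every row with the same sno modulo 4 lands in the same file. That is the layout the stu_buck table definition asks for.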