Integrating Hive with HBase




 # by coco 

 # 2014-07-25 



  This article sets out to achieve the following: 

    1. A table created in Hive is stored directly in HBase. 

    2. Rows inserted into the Hive table are synchronized to the corresponding HBase table. 

    3. Changes to values in the mapped HBase column family show up in the corresponding Hive table. 

    4. Mapping multiple columns and multiple column families (example: 3 Hive columns mapped to 2 HBase column families). 

     



  Integrating Hive with HBase 

  1. Create an HBase-backed table: 

 hive>  CREATE TABLE hbase_table_1(key int, value string)     

     > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'   

     > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")    

     > TBLPROPERTIES ("hbase.table.name" = "xyz"); 

 OK 

 Time taken: 1.833 seconds 

 hbase.table.name sets the name of the table in HBase. 

 hbase.columns.mapping maps Hive columns, in order, to HBase columns; the leading ":key" entry binds to the HBase row key. 
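To make the mapping concrete, here is an illustrative sketch (the table and column names are made up and are not part of this walkthrough) pairing three Hive columns with a row key and two column families:

```sql
-- Illustrative only: key -> HBase row key, a -> cf1:a, b -> cf2:b.
-- Entries in hbase.columns.mapping are matched to Hive columns by position.
CREATE TABLE hbase_mapping_demo(key int, a string, b string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:a,cf2:b")
TBLPROPERTIES ("hbase.table.name" = "mapping_demo");
```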

 The table as seen from the HBase shell: 

 hbase(main):007:0> list 

 TABLE                                                                                                                         

 hivetest                                                                                                                      

 student                                                                                                                       

 test                                                                                                                          

 xyz                                                                                                                           

 4 row(s) in 0.1050 seconds 



 => ["hivetest", "student", "test", "xyz"] 



 2. Import data with SQL 

 i. Prepare the data 

 a) Create a plain Hive table 

 hive> create table ccc(foo int,bar string) row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile; 

 OK 

 Time taken: 2.563 seconds 

 [root@db96 ~]# cat kv1.txt  

 1       val_1 

 2       val_2 

 The file lives in root's home directory, /root/kv1.txt 

    

 [root@db96 ~]#  

 hive> load data local inpath '/root/kv1.txt' overwrite into table ccc; 

 Copying data from file:/root/kv1.txt 

 Copying file: file:/root/kv1.txt 

 Loading data to table default.ccc 

 rmr: DEPRECATED: Please use 'rm -r' instead. 

 Deleted hdfs://db96:9000/hive/warehousedir/ccc 

 [Warning] could not update stats. 

 OK 

 Time taken: 2.796 seconds 

 hive> select * from ccc; 

 OK 

 1       val_1 

 2       val_2 

 NULL    NULL 

 Time taken: 0.348 seconds, Fetched: 3 row(s) 

 (The trailing NULL row most likely comes from an empty last line in kv1.txt.) 

 hive> 

 Load it into hbase_table_1 with SQL: 

 hive> insert overwrite table hbase_table_1 select * from ccc where foo=1; 

 Total jobs = 1 

 Launching Job 1 out of 1 

 Number of reduce tasks is set to 0 since there's no reduce operator 

 Starting Job = job_1406161997851_0002, Tracking URL = http://db96:8088/proxy/application_1406161997851_0002/ 

 Kill Command = /usr/local/hadoop//bin/hadoop job  -kill job_1406161997851_0002 

 Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0 

 2014-07-24 16:04:48,938 Stage-0 map = 0%,  reduce = 0% 

 2014-07-24 16:04:57,571 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 2.54 sec 

 MapReduce Total cumulative CPU time: 2 seconds 540 msec 

 Ended Job = job_1406161997851_0002 

 MapReduce Jobs Launched:  

 Job 0: Map: 1   Cumulative CPU: 2.54 sec   HDFS Read: 217 HDFS Write: 0 SUCCESS 

 Total MapReduce CPU Time Spent: 2 seconds 540 msec 

 OK 

 Time taken: 27.648 seconds 
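 One caveat: with the HBase storage handler, INSERT OVERWRITE does not truncate the HBase table first; it only puts the selected rows, so existing rows whose keys are not produced by the query remain. A sketch against the same table:

```sql
-- Despite OVERWRITE, rows already in "xyz" whose keys this query does not
-- produce are left untouched; only matching row keys are overwritten.
insert overwrite table hbase_table_1 select * from ccc where foo=2;
```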



 Query the data 

 The newly inserted row is returned: 

 1       val_1 

 hive> select * from hbase_table_1; 

 OK 

 1       val_1 

 Time taken: 1.143 seconds, Fetched: 1 row(s) 



 Log in to the HBase shell 

 and inspect the loaded data: 

 hbase(main):008:0> scan "xyz" 

 ROW                           COLUMN+CELL                                                                           

  1                            column=cf1:val, timestamp=1406189096793, value=val_1                                  

 1 row(s) in 0.1090 seconds 



 hbase(main):009:0>  

 As you can see, the row added from Hive is now in HBase. 

 Now add data on the HBase side: 

 hbase(main):009:0> put 'xyz','100','cf1:val','www.gongchang.com' 

 hbase(main):011:0> put 'xyz','200','cf1:val','hello,word!' 

 hbase(main):012:0> scan "xyz" 

 ROW                           COLUMN+CELL                                                                           

  1                            column=cf1:val, timestamp=1406189096793, value=val_1                                  

  100                          column=cf1:val, timestamp=1406189669476, value=www.gongchang.com                      

  200                          column=cf1:val, timestamp=1406189704742, value=hello,word!                            

 3 row(s) in 0.0240 seconds 



 Back in Hive, check the data: 

 hive> select * from hbase_table_1; 

 OK 

 1       val_1 

 100     www.gongchang.com 

 200     hello,word! 

 Time taken: 1.097 seconds, Fetched: 3 row(s) 

 hive>  

 The rows just inserted in HBase are now visible in Hive. 
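 Since the first mapped column is the HBase row key, equality filters on it can be answered without a full scan; recent Hive versions push such predicates down to HBase, though behavior varies by version. For example:

```sql
-- Filter on the row-key column; Hive can translate this into an HBase
-- row-key lookup rather than a full table scan (version-dependent).
select * from hbase_table_1 where key = 100;
```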



 Accessing an existing HBase table from Hive 

 Prepare some data in the existing HBase table: 

 hbase(main):014:0> describe "student" 

 DESCRIPTION                                                                ENABLED                                  

  'student', {NAME => 'info', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW',  true                             

  REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE',                                                  

  MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false',                                           

  BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}                                                  

 1 row(s) in 0.1380 seconds 



 hbase(main):015:0> put "student",'1','info:name','tom' 

 hbase(main):017:0> put "student",'2','info:name','lily' 

 hbase(main):018:0> put "student",'3','info:name','wwn' 

 hbase(main):019:0> scan "student" 

 ROW                           COLUMN+CELL                                                                           

  1                            column=info:name, timestamp=1406189948888, value=tom                                  

  2                            column=info:name, timestamp=1406190005724, value=lily                                 

  3                            column=info:name, timestamp=1406190016967, value=wwn                                  

 3 row(s) in 0.0420 seconds 



 To access the existing HBase table from Hive, 

 use CREATE EXTERNAL TABLE: 

 CREATE EXTERNAL TABLE hbase_table_3(key int, value string)     

 STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'   

 WITH SERDEPROPERTIES ("hbase.columns.mapping" = "info:name")    

 TBLPROPERTIES("hbase.table.name" = "student");  

 hive> CREATE EXTERNAL TABLE hbase_table_3(key int, value string)     

     > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'   

     > WITH SERDEPROPERTIES ("hbase.columns.mapping" = "info:name")    

     > TBLPROPERTIES("hbase.table.name" = "student");  

 OK 

 Time taken: 1.21 seconds 

 hive> select * from hbase_table_3; 

 OK 

 1       tom 

 2       lily 

 3       wwn 

 Time taken: 0.107 seconds, Fetched: 3 row(s) 

 As shown above, Hive can now query the data that already existed in HBase. 

 Note: if values in the mapped info:name column change in HBase, the Hive query results change 

     accordingly; updates to column families that are not mapped do not show up in Hive. 
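 A useful property of EXTERNAL tables: dropping one removes only Hive's metadata, while the underlying HBase table survives. For reference only (not run here, since hbase_table_3 is used again below):

```sql
-- Removes only the Hive metadata entry; the HBase table "student"
-- and its data are left intact because the table is EXTERNAL.
DROP TABLE hbase_table_3;
```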

      

 3. Multiple Columns and Column Families 

 1. Create the table 



 CREATE TABLE hbase_table_add1(key int, value1 string, value2 int, value3 int)     

 STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'   

 WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:col1,info:col2,city:nu") 

 TBLPROPERTIES("hbase.table.name" = "student_info");    

 Run it in the Hive CLI: 

 hive> CREATE TABLE hbase_table_add1(key int, value1 string, value2 int, value3 int)     

     > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'   

     > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:col1,info:col2,city:nu") 

     > TBLPROPERTIES("hbase.table.name" = "student_info");  

 OK 

 Time taken: 2.957 seconds 

 hive> select * from hbase_table_2;                    

 OK 

 Time taken: 1.16 seconds 

 hive> select * from hbase_table_3; 

 OK 

 1       tom 

 2       lily 

 3       wwn 

 4       marry 

 Time taken: 0.117 seconds, Fetched: 4 row(s) 

 hive> set hive.cli.print.header=true;                 

 hive> select * from hbase_table_3;    

 OK 

 hbase_table_3.key       hbase_table_3.value 

 1       tom 

 2       lily 

 3       wwn 

 4       marry 

 Time taken: 1.132 seconds, Fetched: 4 row(s) 

 hive> desc hbase_table_3; 

 OK 

 col_name        data_type       comment 

 key                     int                     from deserializer    

 value                   string                  from deserializer    

 Time taken: 0.19 seconds, Fetched: 2 row(s) 

 hive> insert overwrite table hbase_table_add1 select key,value,key+1,value from hbase_table_3; 

 Total jobs = 1 

 Launching Job 1 out of 1 

 Number of reduce tasks is set to 0 since there's no reduce operator 

 Starting Job = job_1406161997851_0003, Tracking URL = http://db96:8088/proxy/application_1406161997851_0003/ 

 Kill Command = /usr/local/hadoop//bin/hadoop job  -kill job_1406161997851_0003 

 Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0 

 2014-07-25 08:42:46,068 Stage-0 map = 0%,  reduce = 0% 

 2014-07-25 08:42:56,218 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 2.77 sec 

 MapReduce Total cumulative CPU time: 2 seconds 770 msec 

 Ended Job = job_1406161997851_0003 

 MapReduce Jobs Launched:  

 Job 0: Map: 1   Cumulative CPU: 2.77 sec   HDFS Read: 239 HDFS Write: 0 SUCCESS 

 Total MapReduce CPU Time Spent: 2 seconds 770 msec 

 OK 

 _col0   _col1   _col2   _col3 

 Time taken: 28.01 seconds 

 hive> select * from  hbase_table_add1; 

 OK 

 hbase_table_add1.key    hbase_table_add1.value1 hbase_table_add1.value2 hbase_table_add1.value3 

 1       tom     2       NULL 

 2       lily    3       NULL 

 3       wwn     4       NULL 

 4       marry   5       NULL 

 Time taken: 1.105 seconds, Fetched: 4 row(s) 

 value3 is NULL here because the query selected the string column value into the int column value3; the next attempt uses a numeric expression (key+100) instead. 

 hive> insert overwrite table hbase_table_add1 select key,value,key+1,key+100 from hbase_table_3; 

 Total jobs = 1 

 Launching Job 1 out of 1 

 Number of reduce tasks is set to 0 since there's no reduce operator 

 Starting Job = job_1406161997851_0004, Tracking URL = http://db96:8088/proxy/application_1406161997851_0004/ 

 Kill Command = /usr/local/hadoop//bin/hadoop job  -kill job_1406161997851_0004 

 Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0 

 2014-07-25 08:45:15,164 Stage-0 map = 0%,  reduce = 0% 

 2014-07-25 08:45:25,609 Stage-0 map = 100%,  reduce = 0%, Cumulative CPU 2.69 sec 

 MapReduce Total cumulative CPU time: 2 seconds 690 msec 

 Ended Job = job_1406161997851_0004 

 MapReduce Jobs Launched:  

 Job 0: Map: 1   Cumulative CPU: 2.69 sec   HDFS Read: 239 HDFS Write: 0 SUCCESS 

 Total MapReduce CPU Time Spent: 2 seconds 690 msec 

 OK 

 key     value   _c2     _c3 

 Time taken: 25.587 seconds 

 hive> select * from hbase_table_add1; 

 OK 

 hbase_table_add1.key    hbase_table_add1.value1 hbase_table_add1.value2 hbase_table_add1.value3 

 1       tom     2       101 

 2       lily    3       102 

 3       wwn     4       103 

 4       marry   5       104 

 Time taken: 1.122 seconds, Fetched: 4 row(s) 



 Log in to HBase and check: 

 hbase(main):001:0> list 

 TABLE                                                                                                               

 SLF4J: Class path contains multiple SLF4J bindings. 

 SLF4J: Found binding in [jar:file:/usr/local/hbase-0.96.2-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class] 

 SLF4J: Found binding in [jar:file:/usr/local/hadoop2/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] 

 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 

 hivetest                                                                                                            

 student                                                                                                             

 student_info                                                                                                        

 test                                                                                                                

 xyz                                                                                                                 

 5 row(s) in 2.4090 seconds 



 => ["hivetest", "student", "student_info", "test", "xyz"] 

 hbase(main):002:0> scan "student_info" 

 ROW                           COLUMN+CELL                                                                           

  1                            column=city:nu, timestamp=1406249125147, value=101                                    

  1                            column=info:col1, timestamp=1406249125147, value=tom                                  

  1                            column=info:col2, timestamp=1406249125147, value=2                                    

  2                            column=city:nu, timestamp=1406249125147, value=102                                    

  2                            column=info:col1, timestamp=1406249125147, value=lily                                 

  2                            column=info:col2, timestamp=1406249125147, value=3                                    

  3                            column=city:nu, timestamp=1406249125147, value=103                                    

  3                            column=info:col1, timestamp=1406249125147, value=wwn                                  

  3                            column=info:col2, timestamp=1406249125147, value=4                                    

  4                            column=city:nu, timestamp=1406249125147, value=104                                    

  4                            column=info:col1, timestamp=1406249125147, value=marry                                

  4                            column=info:col2, timestamp=1406249125147, value=5                                    

 4 row(s) in 0.1110 seconds 



 hbase(main):003:0>




Here there are three Hive columns (value1, value2, value3) and two HBase column families (info and city).
Two of the Hive columns (value1 and value2) map to the single info family (as qualifiers col1 and col2),
and the remaining Hive column (value3) maps to the qualifier nu in the city family.
This shows how multiple Hive columns can be stored in a small, fixed set of HBase column families.
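The mapping can also be left open-ended: a family name with no qualifier exposes the whole family as a single Hive MAP column, so qualifiers need not be fixed up front. A sketch (table name is hypothetical):

```sql
-- ":key,info:" maps the row key plus every qualifier of the info family
-- into one MAP<string,string> column; new qualifiers appear automatically.
CREATE TABLE hbase_table_map(key int, info map<string,string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:")
TBLPROPERTIES ("hbase.table.name" = "student_map");
```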
