The following uses Hive 3.x, with a matching Hadoop 3.x.
Installation
Download
➜ ~ wget https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
Extract
➜ ~ tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt/Apache/
Configure environment variables
vim /etc/profile
...
export HIVE_HOME=/opt/Apache/apache-hive-3.1.2-bin
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HIVE_HOME/bin:$PATH
...
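Reload the profile so the variables take effect, then sanity-check the installation (hive --version should print the Hive build info once HIVE_HOME/bin is on the PATH):
➜ ~ source /etc/profile
➜ ~ hive --version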
Startup
Initialize the database
By default, Hive manages its metadata with an embedded Derby database. This has a serious flaw: only one Hive client can be connected at a time. We will switch to MySQL for metadata management later on.
- Initialize the metastore schema
➜ apache-hive-3.1.2-bin bin/schematool -dbType derby -initSchema
Note: this step will very likely fail; for the fix, see "Derby database initialization error" in the Troubleshooting section below.
- Start the client
➜ apache-hive-3.1.2-bin bin/hive
which: no hbase in (/opt/Java/jdk1.8.0_261/bin:/opt/Apache/apache-maven-3.6.3/bin:/opt/node-v12.18.4-linux-x64/bin:/opt/Apache/apache-ant-1.9.15/bin:/opt/Apache/hadoop-3.2.1/bin:/opt/Apache/apache-hive-3.1.2-bin/bin:/usr/local/bin:/usr/bin:/home/sairo/bin:/usr/local/sbin:/usr/sbin)
Hive Session ID = 9ea641f6-4c3b-49db-877e-93cf945cea77
Logging initialized using configuration in jar:file:/opt/Apache/apache-hive-3.1.2-bin/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Hive Session ID = 5dcd28fe-648a-4e5c-99cd-853719674c78
hive>
Basic SQL operations
hive> show databases;
OK
default
Time taken: 0.864 seconds, Fetched: 1 row(s)
hive> use default;
OK
Time taken: 0.056 seconds
hive> show tables;
OK
Time taken: 0.054 seconds
hive> create table test (id string, name string);
OK
Time taken: 0.837 seconds
hive> insert into test values('aaa', 'Tom');
Query ID = sairo_20201126194635_ab618e32-953d-4fd2-983b-dfc2d8abd4d2
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1606377510588_0001, Tracking URL = http://dev-jsj.com:8088/proxy/application_1606377510588_0001/
Kill Command = /opt/Apache/hadoop-3.2.1/bin/mapred job -kill job_1606377510588_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-11-26 19:46:51,787 Stage-1 map = 0%, reduce = 0%
2020-11-26 19:46:58,055 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.96 sec
2020-11-26 19:47:05,333 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 9.81 sec
MapReduce Total cumulative CPU time: 9 seconds 810 msec
Ended Job = job_1606377510588_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://dev-jsj.com:9000/user/hive/warehouse/test/.hive-staging_hive_2020-11-26_19-46-35_648_9001647310701991693-1/-ext-10000
Loading data to table default.test
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 9.81 sec HDFS Read: 14565 HDFS Write: 243 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 810 msec
OK
Time taken: 32.519 seconds
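To confirm the insert, query the table back; the single row aaa/Tom should come back almost immediately, since a plain select * is usually served by a fetch task without launching a MapReduce job:
hive> select * from test;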
Using MySQL to manage metadata
Install MySQL
Routine work; skipped here.
Create the Hive metastore database in advance:
mysql> create database hive_metastore default charset utf8;
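The configuration below connects as root for simplicity. If you would rather use a dedicated account, a sketch like this works (hive_user and its password are placeholders, not from the original setup); point javax.jdo.option.ConnectionUserName/Password at that account instead:
mysql> create user 'hive_user'@'%' identified by '123456';
mysql> grant all privileges on hive_metastore.* to 'hive_user'@'%';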
Add the MySQL driver JAR to Hive
➜ apache-hive-3.1.2-bin cp /opt/Apache/repository/mysql/mysql-connector-java/8.0.13/mysql-connector-java-8.0.13.jar ./lib
Configure the JDBC connection parameters
The conf directory contains many configuration file templates. Here we edit hive-default.xml.template and rename it to hive-site.xml.
hive-site.xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- JDBC connection URL -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://dev-jsj.com:3306/hive_metastore?useSSL=false&amp;characterEncoding=utf8</value>
</property>
<!-- JDBC connection driver -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<!-- JDBC connection username -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<!-- JDBC connection password -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
<!-- Hive metastore schema version verification -->
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<!-- Metastore authorization -->
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
</configuration>
Note:
- When configuring the JDBC URL, write & as &amp; (the & character has special meaning in XML and must be escaped).
- If you start from the official configuration template, delete the <description> child tag of every <property>. These only explain what each property does, but many descriptions contain special characters and Hive will fail to start if they are left in.
- Some tutorials may ask you to set Hive's storage path in HDFS, for example:
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
Unless you have special requirements, there is no need to set this; the default is already /user/hive/warehouse.
Initialize the database
➜ apache-hive-3.1.2-bin bin/schematool -dbType mysql -initSchema -verbose
-verbose: prints the execution progress instead of the long stretch of silence you would otherwise get.
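If initialization succeeded, the metastore schema now lives in MySQL. A quick way to verify (expect dozens of tables, including names like DBS, TBLS and VERSION that schematool typically creates):
mysql> use hive_metastore;
mysql> show tables;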
At this point the very basic setup is done and you can work with Hive directly from the command line. Next we configure remote access to Hive, similar to running Hadoop's background services.
Accessing Hive via the metastore service
- Step 1: edit the configuration file hive-site.xml
➜ apache-hive-3.1.2-bin vim conf/hive-site.xml
# add the following configuration
<!-- Address to connect to for metadata storage -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://dev-jsj.com:9083</value>
</property>
- Step 2: start the metastore service
➜ apache-hive-3.1.2-bin bin/hive --service metastore
# or
➜ apache-hive-3.1.2-bin nohup bin/hive --service metastore &
Tip: the metastore service is a foreground process and occupies the current session window by default; use the nohup command to run it in the background.
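To confirm the service is up, check that something is listening on port 9083 (the port from hive.metastore.uris); this assumes ss is available, but netstat works too:
➜ apache-hive-3.1.2-bin ss -lnt | grep 9083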
Accessing Hive via JDBC
JDBC access means starting a Hive service on the server side, i.e. exposing a port that clients can connect to remotely, much like MySQL and Hadoop.
- Step 1: edit the configuration file hive-site.xml
➜ apache-hive-3.1.2-bin vim conf/hive-site.xml
# add the following configuration
<!-- Host that hiveserver2 binds to -->
<property>
<name>hive.server2.thrift.bind.host</name>
<value>dev-jsj.com</value>
</property>
<!-- Port that hiveserver2 listens on -->
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
- Step 2: start the hiveserver2 service
➜ apache-hive-3.1.2-bin bin/hive --service hiveserver2
# or
➜ apache-hive-3.1.2-bin nohup bin/hive --service hiveserver2 &
Tips:
- Start the metastore service before starting hiveserver2.
- The hiveserver2 service is a foreground process and occupies the current session window by default; use the nohup command to run it in the background.
- Step 3: simulate a remote connection from the command line
➜ apache-hive-3.1.2-bin bin/beeline -u jdbc:hive2://dev-jsj.com:10000 -n sairo
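Beyond the interactive shell, beeline's -e flag runs a single statement and exits, which makes a handy end-to-end check of the metastore plus hiveserver2 chain:
➜ apache-hive-3.1.2-bin bin/beeline -u jdbc:hive2://dev-jsj.com:10000 -n sairo -e "show databases;"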
Other configuration
Logging configuration
The $HIVE_HOME/conf/ directory contains many configuration file templates. Find hive-log4j2.properties.template, set the relevant properties, and rename it to hive-log4j2.properties when you are done.
This is where you configure where logs are stored; by default they go under the /tmp/{user} directory.
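A minimal sketch of the change (the property name below matches the 3.1.2 template, but double-check your copy; the log directory is just an example path):
➜ apache-hive-3.1.2-bin cp conf/hive-log4j2.properties.template conf/hive-log4j2.properties
➜ apache-hive-3.1.2-bin vim conf/hive-log4j2.properties
...
property.hive.log.dir = /opt/Apache/apache-hive-3.1.2-bin/logs
...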
Command-line prompt configuration
By default the prompt in the Hive CLI is bare, just hive>. You can add configuration so that the prompt carries the current database name (and query results include column headers). The configuration is as follows:
➜ apache-hive-3.1.2-bin vim conf/hive-site.xml
...
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
...
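With hive.cli.print.current.db enabled, the prompt shows the current database, along the lines of:
hive (default)>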
Aside
After finishing the steps above, interested readers can try connecting HUE to Hive. Expect plenty of pitfalls and all sorts of errors; it took me an entire afternoon to get it working, so I recommend it only if you have time and enjoy tinkering. When I find the time, I will also write an article on setting up a HUE environment.
Troubleshooting
- Derby database initialization error
➜ apache-hive-3.1.2-bin bin/schematool -dbType derby -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/Apache/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/Apache/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:536)
at org.apache.hadoop.mapred.JobConf.setJarByClass(JobConf.java:554)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:448)
at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:5141)
at org.apache.hadoop.hive.conf.HiveConf.<init>(HiveConf.java:5104)
at org.apache.hive.beeline.HiveSchemaTool.<init>(HiveSchemaTool.java:96)
at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:1473)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
Cause: the guava*.jar bundled with Hive conflicts with Hadoop's guava jar. Delete the lower version and replace it with the higher one.
Fix:
- Locate guava*.jar inside Hive
➜ apache-hive-3.1.2-bin find ./ -name "*guava*"
./lib/guava-19.0.jar
./lib/jersey-guava-2.25.1.jar
- Locate guava*.jar inside Hadoop, usually under the /opt/Apache/hadoop-3.2.1/share/hadoop/hdfs/lib/ directory
➜ apache-hive-3.1.2-bin find /opt/Apache/hadoop-3.2.1 -name "*guava*"
/opt/Apache/hadoop-3.2.1/share/hadoop/common/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar
/opt/Apache/hadoop-3.2.1/share/hadoop/hdfs/lib/guava-27.0-jre.jar
/opt/Apache/hadoop-3.2.1/share/hadoop/hdfs/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar
- Remove the low-version guava from Hive and replace it with Hadoop's higher version
➜ apache-hive-3.1.2-bin mv ./lib/guava-19.0.jar ./lib/guava-19.0.jar.bak
➜ apache-hive-3.1.2-bin cp /opt/Apache/hadoop-3.2.1/share/hadoop/hdfs/lib/guava-27.0-jre.jar ./lib
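After swapping the jar, rerun the initialization; it should now finish without the NoSuchMethodError:
➜ apache-hive-3.1.2-bin bin/schematool -dbType derby -initSchema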