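Configuring Hive to Use the Spark Execution Engine
Hive Engines
Overview
MapReduce engine:
Tez engine:
Spark engine:
Using Spark as Hive's execution engine offers the following benefits:
Faster execution: Spark computes in memory and can cache data during execution, which speeds up queries
Better interactivity: Spark supports iterative queries and real-time data processing, which suits workloads that need shorter response times
Better resource management: Spark can share resources with other Spark applications, which improves resource management and utilization
Compatibility Issues
In the directory where Hive was extracted, check which Spark version Hive supports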
[root@node01 hive]# ls lib/spark-*
lib/spark-core_2.11-2.3.0.jar lib/spark-launcher_2.11-2.3.0.jar lib/spark-network-shuffle_2.11-2.3.0.jar lib/spark-unsafe_2.11-2.3.0.jar
lib/spark-kvstore_2.11-2.3.0.jar lib/spark-network-common_2.11-2.3.0.jar lib/spark-tags_2.11-2.3.0.jar
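The bundled JARs are Spark 2.3.0 (built for Scala 2.11), so other Spark versions may not be compatible with this Hive build.
Solutions:
1. Download the Spark version that matches the one the current Hive release was built against
2. Recompile Hive so that it supports a newer Spark version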
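The version check above can also be scripted. The following is a minimal sketch, assuming the JAR names follow the `spark-core_<scala>-<spark>.jar` pattern seen in the listing; the function name `spark_versions_from_jar` is illustrative and not part of Hive or Spark.

```shell
#!/usr/bin/env bash
# Extract the Scala and Spark versions from a spark-core JAR file name,
# e.g. spark-core_2.11-2.3.0.jar -> scala=2.11 spark=2.3.0
spark_versions_from_jar() {
  local name scala spark
  name="$(basename "$1" .jar)"   # spark-core_2.11-2.3.0
  scala="${name#*_}"             # 2.11-2.3.0
  scala="${scala%%-*}"           # 2.11
  spark="${name##*-}"            # 2.3.0
  echo "scala=${scala} spark=${spark}"
}

# File name taken from the listing above.
spark_versions_from_jar "lib/spark-core_2.11-2.3.0.jar"
```

Comparing the printed Spark version with the version of the Spark distribution you plan to install shows quickly whether the two differ.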
Installing Spark
Download Spark
https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-3.4.0/spark-3.4.0-bin-without-hadoop.tgz
Extract and rename
tar -zxvf spark-3.4.0-bin-without-hadoop.tgz
mv spark-3.4.0-bin-without-hadoop spark
Spark Configuration
Rename the template file
mv conf/spark-env.sh.template conf/spark-env.sh
vim conf/spark-env.sh
Add the following configuration:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
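The without-hadoop build ships no Hadoop classes, so `SPARK_DIST_CLASSPATH` is how Spark locates them. As a hedged variation on the line above, the sketch below sets the variable only when the `hadoop` command is actually on `PATH`; the wording of the warning message is an assumption made for illustration.

```shell
# Sketch for conf/spark-env.sh: call `hadoop classpath` only when the
# hadoop launcher is available, and warn otherwise instead of exporting
# an empty value.
if command -v hadoop >/dev/null 2>&1; then
  SPARK_DIST_CLASSPATH="$(hadoop classpath)"
  export SPARK_DIST_CLASSPATH
else
  echo "WARN: hadoop not found on PATH; SPARK_DIST_CLASSPATH not set" >&2
fi
```

If the warning appears, make sure the Hadoop `bin` directory is on `PATH` before Spark is started.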
Configure the SPARK_HOME environment variable in /etc/profile
# Spark
export SPARK_HOME=/usr/local/program/spark
export PATH=$PATH:$SPARK_HOME/bin
Apply the configuration
source /etc/profile
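Sourcing the profile more than once appends the same directory to `PATH` each time. A minimal sketch of an idempotent variant follows, assuming the install location `/usr/local/program/spark` used above; it also reports whether `spark-submit` is reachable afterwards.

```shell
# Assumed install location, taken from the export above.
SPARK_HOME="${SPARK_HOME:-/usr/local/program/spark}"
export SPARK_HOME

# Append $SPARK_HOME/bin to PATH only if it is not already present,
# so that re-sourcing the profile does not duplicate the entry.
case ":${PATH}:" in
  *":${SPARK_HOME}/bin:"*) ;;
  *) PATH="${PATH}:${SPARK_HOME}/bin" ;;
esac
export PATH

# Report whether the Spark launcher can be found.
if command -v spark-submit >/dev/null 2>&1; then
  echo "spark-submit found at $(command -v spark-submit)"
else
  echo "spark-submit not found; check SPARK_HOME=${SPARK_HOME}"
fi
```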
Hive Configuration
Create a Spark configuration file in Hive's conf directory
vim conf/spark-defaults.conf
The parameters below determine how jobs are executed:
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://node01:9000/spark/history
spark.executor.memory 1g
spark.driver.memory 1g
Create a directory in HDFS to store the history logs
hadoop fs -mkdir -p /spark/history
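The directory created here has to match the path in `spark.eventLog.dir`. A small sketch of deriving the HDFS path from the configuration file instead of typing it twice is given below; the helper name `conf_get` is hypothetical, and the sample file is written to a temporary location so that no real installation is touched.

```shell
# Sample file mirroring the spark-defaults.conf entries shown above.
conf_file="$(mktemp)"
cat > "$conf_file" <<'EOF'
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://node01:9000/spark/history
EOF

# Print the value of a whitespace-separated key in a properties file.
conf_get() {
  awk -v key="$1" '$1 == key { print $2 }' "$2"
}

event_dir_uri="$(conf_get spark.eventLog.dir "$conf_file")"
# Strip "scheme://host:port/" to obtain the path inside HDFS.
event_dir_path="/${event_dir_uri#*://*/}"

# Print the command that would create the directory.
echo "hadoop fs -mkdir -p ${event_dir_path}"
rm -f "$conf_file"
```

Pointing `conf_file` at the real `conf/spark-defaults.conf` gives the same result for an actual installation.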
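Upload the Spark JARs to HDFS
Why upload the Spark JARs to HDFS?
- The spark-3.4.0-bin-without-hadoop.tgz build is used, which ships without the Hadoop and Hive dependencies
- Hive jobs are executed by Spark, and YARN schedules the resources for Spark jobs, so a job may be assigned to any node in the cluster
- Therefore the Spark dependencies must be uploaded to a path on HDFS so that every node in the cluster can fetch them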
hadoop fs -mkdir -p /spark/jars
hadoop fs -put spark/jars/* /spark/jars
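After the upload it can be worth confirming that every JAR arrived. The sketch below compares the local and remote JAR counts; it assumes the local JARs live under `$SPARK_HOME/jars`, and the helper name `jar_counts_match` is illustrative. The check is skipped when no Hadoop client is available.

```shell
# Return success when the local and remote JAR counts are equal.
jar_counts_match() {
  [ "$1" -eq "$2" ]
}

# Assumed local location of the Spark JARs.
local_jars_dir="${SPARK_HOME:-/usr/local/program/spark}/jars"

if command -v hadoop >/dev/null 2>&1 && [ -d "$local_jars_dir" ]; then
  local_count="$(find "$local_jars_dir" -maxdepth 1 -name '*.jar' | wc -l | tr -d ' ')"
  remote_count="$(hadoop fs -ls /spark/jars | grep -c '\.jar$')"
  if jar_counts_match "$local_count" "$remote_count"; then
    echo "OK: ${remote_count} JARs present in /spark/jars"
  else
    echo "MISMATCH: local=${local_count} remote=${remote_count}"
  fi
else
  echo "Skipping check: hadoop client or ${local_jars_dir} not available"
fi
```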
Modify hive-site.xml
<!-- Location of the Spark dependencies. Note: port 9000 must match the NameNode port -->
<property>
<name>spark.yarn.jars</name>
<value>hdfs://node01:9000/spark/jars/*</value>
</property>
<!-- Hive execution engine -->
<property>
<name>hive.execution.engine</name>
<value>spark</value>
</property>
<!-- Connection timeout between Hive and Spark -->
<property>
<name>hive.spark.client.connect.timeout</name>
<value>10000ms</value>
</property>
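Because the port in `spark.yarn.jars` has to match the NameNode address, one option is to build the value from `fs.defaultFS` rather than typing it by hand. The following is a minimal sketch: `hdfs getconf -confKey fs.defaultFS` reads the configured address when an HDFS client is present, and otherwise the article's `hdfs://node01:9000` is used as a fallback. The function name `yarn_jars_uri` is hypothetical.

```shell
# Join a NameNode URI and an HDFS directory into a spark.yarn.jars value,
# normalizing one trailing slash on the URI and the slashes around the directory.
yarn_jars_uri() {
  local fs dir
  fs="${1%/}"        # drop one trailing slash from the NameNode URI
  dir="/${2#/}"      # ensure exactly one leading slash
  dir="${dir%/}"     # drop one trailing slash
  printf '%s%s/*\n' "$fs" "$dir"
}

if command -v hdfs >/dev/null 2>&1; then
  default_fs="$(hdfs getconf -confKey fs.defaultFS)"
else
  default_fs="hdfs://node01:9000"   # fallback: the address used in this article
fi

yarn_jars_uri "$default_fs" "/spark/jars"
```

The printed value can be pasted into the `<value>` element of the `spark.yarn.jars` property.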
Running a Test
hive (default)> create table tb_user(id int,name string,age int);
hive (default)> insert into tb_user values(2,'hive',20);
Check the YARN console:
Note:
Configuration example:
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>0.5</value>
</property>
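`yarn.scheduler.capacity.maximum-am-resource-percent` caps the fraction of cluster resources that ApplicationMasters may use; the Capacity Scheduler default is 0.1. To illustrate what raising it to 0.5 means, here is a back-of-the-envelope sketch. The 8192 MB cluster memory figure is a made-up assumption, and `am_budget_mb` is an illustrative helper, not a YARN tool.

```shell
# Memory (in MB) available to ApplicationMasters for a given cluster
# memory and maximum-am-resource-percent value, truncated to an integer.
am_budget_mb() {
  awk -v total="$1" -v pct="$2" 'BEGIN { printf "%d\n", total * pct }'
}

am_budget_mb 8192 0.1   # default setting
am_budget_mb 8192 0.5   # setting from the example above
```

On a small test cluster a budget that is too low can leave an application waiting because its ApplicationMaster cannot be started.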
Speed Comparison
MapReduce engine:
2023-08-07 20:11:22,834 INFO [2704e498-c1b3-4dd5-8658-1f0a1393a3bb main] ql.Driver (SessionState.java:printInfo(1227)) - MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.51 sec HDFS Read: 16233 HDFS Write: 276 SUCCESS
2023-08-07 20:11:22,834 INFO [2704e498-c1b3-4dd5-8658-1f0a1393a3bb main] ql.Driver (SessionState.java:printInfo(1227)) - Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.51 sec HDFS Read: 16233 HDFS Write: 276 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 510 msec
2023-08-07 20:11:22,834 INFO [2704e498-c1b3-4dd5-8658-1f0a1393a3bb main] ql.Driver (SessionState.java:printInfo(1227)) - Total MapReduce CPU Time Spent: 3 seconds 510 msec
2023-08-07 20:11:22,834 INFO [2704e498-c1b3-4dd5-8658-1f0a1393a3bb main] ql.Driver (Driver.java:execute(2531)) - Completed executing command(queryId=root_20230807200946_06634674-a1f5-4cfa-ae34-166bfda3d90e); Time taken: 92.685 seconds
OK
2023-08-07 20:11:22,834 INFO [2704e498-c1b3-4dd5-8658-1f0a1393a3bb main] ql.Driver (SessionState.java:printInfo(1227)) - OK
2023-08-07 20:11:22,834 INFO [2704e498-c1b3-4dd5-8658-1f0a1393a3bb main] ql.Driver (Driver.java:checkConcurrency(285)) - Concurrency mode is disabled, not creating a lock manager
col1 col2 col3
Time taken: 96.059 seconds
Spark engine (on YARN):
--------------------------------------------------------------------------------------
STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
--------------------------------------------------------------------------------------
Stage-0 ........ 0 FINISHED 1 1 0 0 0
Stage-1 ........ 0 FINISHED 1 1 0 0 0
--------------------------------------------------------------------------------------
STAGES: 02/02 [==========================>>] 100% ELAPSED TIME: 10.24 s
--------------------------------------------------------------------------------------
From this we can roughly conclude:
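As a rough back-of-the-envelope comparison, the two timings logged above can be turned into a ratio. This is only a sketch: the MapReduce figure (96.059 s) is the total time reported for the query, the Spark figure (10.24 s) is the elapsed time of the stages, and both come from a single run, so the result is indicative at best. The helper name `speedup` is illustrative.

```shell
# Ratio of the slower timing to the faster one, rounded to one decimal.
speedup() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

# 96.059 s (MapReduce, "Time taken") vs 10.24 s (Spark, "ELAPSED TIME")
speedup 96.059 10.24
```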
Note: