Spark WebUI: Working with Files in HDFS

1. Working with Files in HDFS

  1. Introduction to the Spark WebUI
[root@node1 bin]# ./spark-shell --master spark://node1:7077 --name yqq
2021-12-12 18:53:10 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://node1:4040
Spark context available as 'sc' (master = spark://node1:7077, app id = app-20211212185338-0004).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_221)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
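With the shell up, the driver WebUI from the startup log above is at http://node1:4040. As a quick sanity check, a few REPL one-liners (Spark 2.x API) confirm the UI address and application id without leaving the shell:

scala> sc.uiWebUrl              // Some(http://node1:4040), the WebUI for this driver
scala> sc.applicationId         // app-20211212185338-0004, as shown on the Master UI
scala> sc.setLogLevel("ERROR")  // optional: quiet the shell while experimenting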
  2. Prepare the data source
[root@node4 ~]# vim mydata
hello java
hello spark
hello php
hello ptyhon
hello scala
hello scala
hello java
hello hive
hello hive
hello hbase
hello spark
hello spark
  3. Upload the data source to HDFS
[root@node4 ~]# hdfs dfs -mkdir -p /spark/data
[root@node4 ~]# hdfs dfs -put mydata /spark/data
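Before pointing Spark at the file, the upload can be verified from the spark-shell with the Hadoop FileSystem API; a minimal sketch, using the path from the commands above:

scala> import org.apache.hadoop.fs.{FileSystem, Path}
scala> val fs = FileSystem.get(sc.hadoopConfiguration)
scala> fs.exists(new Path("/spark/data/mydata"))   // true once the put has succeeded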

[Screenshot: Spark WebUI]
  4. Read the file from HDFS and count the words

scala> sc.textFile("hdfs://mycluster/spark/data/mydata").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect()
res0: Array[(String, Int)] = Array((scala,2), (hive,2), (php,1), (hello,12), (java,2), (ptyhon,1), (spark,3), (hbase,1))
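The one-liner chains flatMap, map, and reduceByKey. Split into named steps and written back to HDFS, the same job is easier to match against the stages shown on the WebUI; /spark/out/wordcount below is a hypothetical output path (saveAsTextFile fails if it already exists):

scala> val words  = sc.textFile("hdfs://mycluster/spark/data/mydata").flatMap(_.split(" "))
scala> val counts = words.map((_, 1)).reduceByKey(_ + _)
scala> counts.saveAsTextFile("hdfs://mycluster/spark/out/wordcount")  // hypothetical path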
  5. Persist without serialization
scala> var rdd=sc.textFile("hdfs://mycluster/spark/data/mydata")
rdd: org.apache.spark.rdd.RDD[String] = hdfs://mycluster/spark/data/mydata MapPartitionsRDD[1] at textFile at <console>:24

scala> rdd.cache()
res1: org.apache.spark.rdd.RDD[String] = hdfs://mycluster/spark/data/mydata MapPartitionsRDD[1] at textFile at <console>:24

scala> rdd.count
res2: Long = 12
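cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), and persistence is lazy: it is the count above that actually materializes the blocks that then appear on the WebUI Storage tab. Two REPL checks (Spark 2.x API) to confirm:

scala> rdd.getStorageLevel        // MEMORY_ONLY: in memory, deserialized, 1 replica
scala> sc.getPersistentRDDs.size  // number of RDDs currently marked persistent; 1 here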

[Screenshot: Spark WebUI Storage tab]
  6. Unpersist the cached RDD

scala> rdd.unpersist()
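unpersist() blocks until the cached blocks are dropped (pass blocking = false for asynchronous removal); the entry disappears from the Storage tab and the RDD's storage level reverts to NONE:

scala> rdd.getStorageLevel == org.apache.spark.storage.StorageLevel.NONE  // true after the unpersist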
  7. Persist with serialization
scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel
scala> rdd.persist(StorageLevel.MEMORY_ONLY_SER)
res5: org.apache.spark.rdd.RDD[String] = hdfs://mycluster/spark/data/mydata MapPartitionsRDD[1] at textFile at <console>:24
scala> rdd.count()
res6: Long = 12

[Screenshot: Spark WebUI Storage tab]
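MEMORY_ONLY_SER keeps each partition as one serialized byte array, so the Storage tab shows a smaller footprint than plain MEMORY_ONLY, at the cost of CPU to deserialize on every access. Spark will not change the level of an already-persisted RDD, so switching requires an unpersist first; a sketch with a spill-to-disk variant:

scala> rdd.unpersist()                               // required before changing the storage level
scala> rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) // spills serialized blocks to disk under memory pressure
scala> rdd.count()                                   // re-materialize the cache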

2. Configuring the HistoryServer

  1. Configure spark-defaults.conf
[root@node1 conf]# vim spark-defaults.conf
# spark.master spark://master:7077
# enable event logging
spark.eventLog.enabled true
# directory where event logs are written
spark.eventLog.dir hdfs://mycluster/spark/log
# directory from which the HistoryServer loads event logs
spark.history.fs.logDirectory hdfs://mycluster/spark/log
# compress the event logs to save space
spark.eventLog.compress true

# Note: create the log directory on HDFS first, or an error will be thrown
[root@node4 ~]# hdfs dfs -mkdir -p /spark/log
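After restarting the spark-shell, the values from spark-defaults.conf can be read back through SparkConf to confirm they were picked up:

scala> sc.getConf.get("spark.eventLog.enabled")  // expect: true
scala> sc.getConf.get("spark.eventLog.dir")      // expect: hdfs://mycluster/spark/log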
  2. Start the HistoryServer:
[root@node1 sbin]# ./start-history-server.sh

Visit the HistoryServer at node4:18080; from then on, the runtime status of every submitted application will be recorded there.
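The HistoryServer also exposes the same data as JSON over its REST API under /api/v1; a quick check from the REPL, using the host and port above:

scala> scala.io.Source.fromURL("http://node4:18080/api/v1/applications").mkString
// JSON array describing each recorded application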
[Screenshots: HistoryServer UI]

