Adapted from the book《Hadoop+Spark 大数据巨量分析与机器学习》(Hadoop + Spark: Big Data Analytics and Machine Learning).
Environment requirements:
jdk 1.7
hadoop 2.8.4
scala 2.11.6
spark 2.1.2
1 Install Scala
$ wget https://www.scala-lang.org/files/archive/scala-2.11.6.tgz
$ tar xvf scala-2.11.6.tgz
$ sudo mv scala-2.11.6 /usr/local/scala
// Add the Scala environment variables
$ vim ~/.bashrc // append the following lines
#SCALA
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
$ source ~/.bashrc // apply the changes
2 Test Scala
$ scala
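If Scala is installed correctly, the REPL starts and shows its version banner. A quick check inside the REPL (a minimal example):
scala> println("Hello, Scala " + util.Properties.versionString)  // prints the running Scala version
scala> :quit  // leave the REPL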
3 Install Spark
Check the Hadoop version
$ hadoop version // 2.8.4 in my case
Download the Spark build that matches your Hadoop version (from the official site / archive)
$ wget https://archive.apache.org/dist/spark/spark-2.1.2/spark-2.1.2-bin-hadoop2.7.tgz
$ tar zxf spark-2.1.2-bin-hadoop2.7.tgz
$ sudo mv spark-2.1.2-bin-hadoop2.7 /usr/local/spark
// Add the Spark environment variables
$ vim ~/.bashrc // append the following lines
#SPARK
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
$ source ~/.bashrc // apply the changes
4 Start the spark-shell interactive shell
$ spark-shell
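Once the shell is up, a SparkContext is already available as sc. A quick sanity check (a minimal sketch; without further configuration the shell runs in local mode):
scala> sc.master                        // shows which master the shell is connected to
scala> sc.parallelize(1 to 100).sum()   // runs a trivial job; should return 5050.0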
5 Adjust the spark-shell logging output
$ cd /usr/local/spark/conf
$ cp log4j.properties.template log4j.properties
$ sudo vim log4j.properties
Change log4j.rootCategory=INFO to log4j.rootCategory=WARN
$ spark-shell // the shell now prints far less output and is much less noisy
6 Start Hadoop, then run spark-shell locally
Read a file from HDFS
> val textFile=sc.textFile("hdfs://192.168.80.100:9000/user/hduser/wordcount/input/LICENSE.txt") // the IP is the Hadoop master's; see the Hadoop core-site.xml configuration
> textFile.count
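Since this file sits under a wordcount input directory, a minimal word-count sketch on the same RDD could look like this (the path and variable are the ones defined above; take(10) is just for illustration):
> val counts = textFile.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(word => (word, 1)).reduceByKey(_ + _)
> counts.take(10).foreach(println)  // print the first 10 (word, count) pairs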
7 Run spark-shell on Hadoop YARN
$ SPARK_JAR=/usr/local/spark/yarn/spark-2.1.2-yarn-shuffle.jar HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop MASTER=yarn-client /usr/local/spark/bin/spark-shell
You should then see the scala> prompt as below.
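Note: SPARK_JAR and the yarn-client master string are Spark 1.x leftovers (Spark 2.x ignores SPARK_JAR). On Spark 2.x an equivalent invocation should simply be the following (a sketch, assuming HADOOP_CONF_DIR points at your Hadoop configuration directory):
$ HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop /usr/local/spark/bin/spark-shell --master yarn --deploy-mode client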
Read a local file
scala>
> val textFile=sc.textFile("file:/home/hadoop/hadoop/LICENSE.txt")
> textFile.count
Read an HDFS file
> val textFile=sc.textFile("hdfs://192.168.80.100:9000/user/hduser/wordcount/input/LICENSE.txt") // the IP is the Hadoop master's; see the Hadoop core-site.xml configuration
> textFile.count
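Optionally, cache the RDD so repeated actions do not re-read the file from HDFS (a small sketch reusing the textFile defined above):
> textFile.cache()   // keep the RDD in executor memory after the next action
> textFile.count     // this count reads from HDFS and fills the cache
> textFile.count     // this count is served from the cache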
Then open http://192.168.80.100:8088/cluster/apps in a browser (192.168.80.100 is the IP of the Hadoop master); the YARN applications page lists the running spark-shell application.
8 Build a Spark Standalone Cluster execution environment
// Configure spark-env.sh on the master VM
$ cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
$ sudo vim /usr/local/spark/conf/spark-env.sh // append the following lines
export SPARK_MASTER_IP=192.168.80.100 # the master's IP or hostname; here it matches the master entry in /etc/hosts
export SPARK_WORKER_CORES=1 # CPU cores used by each worker
export SPARK_WORKER_MEMORY=500m # memory used by each worker (800m recommended)
export SPARK_WORKER_INSTANCES=2 # number of worker instances per node
Then copy Spark to the corresponding Hadoop slave server
$ ssh 192.168.80.101 // the slave1 server
$ sudo mkdir /usr/local/spark
$ sudo chown hadoop:hadoop /usr/local/spark
$ exit
// back on the master server after exiting
$ sudo scp -r /usr/local/spark hadoop@192.168.80.101:/usr/local // copy Spark to the slave server over the network
$ cp /usr/local/spark/conf/slaves.template /usr/local/spark/conf/slaves
$ vim /usr/local/spark/conf/slaves // edit the file and add your slave server IPs
192.168.80.101
Start the Spark Standalone cluster
$ /usr/local/spark/sbin/start-all.sh // you should then see 1 master and 2 workers start up
$ /usr/local/spark/sbin/start-master.sh -h 192.168.80.100
$ /usr/local/spark/sbin/start-slave.sh spark://192.168.80.100:7077
------------- The following commands are optional ------------------
$ /usr/local/spark/sbin/start-master.sh // start only the master service
$ /usr/local/spark/sbin/start-slaves.sh // start only the slave (worker) services
$ /usr/local/spark/sbin/stop-all.sh // stop all services
Note: why are there two workers with the IP 192.168.80.101? Because spark-env.sh sets
SPARK_WORKER_INSTANCES=2, each slave server launches two worker instances.
Run spark-shell on the Spark Standalone cluster
$ spark-shell --master spark://192.168.80.100:7077
Then open http://192.168.80.100:8080 in a browser to view the Spark master web UI, which lists the workers and the running application.
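Back in the shell, you can confirm it is attached to the standalone master and run a trivial job (a minimal sketch; the numbers are arbitrary):
scala> sc.master                          // should print spark://192.168.80.100:7077
scala> sc.parallelize(1 to 1000).count()  // a small job that will also appear in the 8080 web UI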