步骤一:配置环境,特别注意虚拟机的hosts文件和window中的hosts文件,以及关闭防火墙
linux: sudo vim /etc/hosts 添加ip到name的映射,从而让外网访问你时用name便可
window: C:\Windows\System32\drivers\etc 将linux /etc/hosts 中的映射写进去,这用于在window上可以访问linux
测试:ping inux虚拟机ip /name (测试网络互通性)
window cmd : telnet +linux虚拟机ip /name + 端口 (测试端口是否能被访问到)
步骤二:在idea上编写wordcount案例!
package com.imooc.spark
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
/**
* Hello world!
*
*/
/**
* Spark Streaming处理Socket数据
*
* 测试: nc
*/
object NetworkWordCount {
/**官方介绍
* Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
*
* Usage: NetworkWordCount <hostname> <port>
* <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data.
*
* To run this on your local machine, you need to first run a Netcat server
* `$ nc -lk 9999`
* and then run the example
* `$ bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999`
*/
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
/**
* 创建StreamingContext需要两个参数:SparkConf和batch interval
*/
val ssc = new StreamingContext(sparkConf, Seconds(5))
val lines = ssc.socketTextStream("hadoop002", 6789)
val result = lines.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
result.print()
ssc.start()
ssc.awaitTermination()
}
}
pom.xml
<properties>
<scala.version>2.11.8</scala.version>
<spark.version>2.2.0</spark.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<!-- Spark Streaming 依赖-->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>3.5</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.module</groupId>
<artifactId>jackson-module-scala_2.11</artifactId>
<version>2.6.5</version>
</dependency>
<dependency>
<groupId>net.jpountz.lz4</groupId>
<artifactId>lz4</artifactId>
<version>1.3.0</version>
</dependency>
</dependencies>
步骤三:启动测试:
linux 输入nc -lk 6789
window上的idea启动项目
在linux nc -lk 6789 下输入 a b a b c d
在window的idea上看到统计的结果
可能异常:connnect time out!
原因:1.防火墙没有关闭,
2. linux 机器ip name映射没有添加到windows的hosts文件中,
3.linux 上机器name 没生成成功,ip 到name 映射没有生效
4.代码上ip/name 和端口写错了