For more code, see: https://github.com/xubo245/SparkLearning
1. Understanding: HdfsWordCount reads a stream of text files from HDFS. You specify a directory, and every batch interval Spark Streaming scans that path for newly created files (files in subdirectories are not scanned).
Any newly added files are fed into the stream computation. Here the batch interval is 2 seconds:
val ssc = new StreamingContext(sparkConf, Seconds(2))
The rest of the processing is much the same as in the earlier examples; see the sketch of a more general file-stream variant below.
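As an aside, textFileStream(dir) is shorthand for the more general fileStream API with Hadoop's TextInputFormat. A minimal sketch, assuming the same ssc as above (the directory path and filename filter are illustrative, not from the original post), that also processes files already present in the directory at startup:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Equivalent to textFileStream, but with an explicit filename filter and
// newFilesOnly = false, so files already in the directory are read too.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "/xubo/spark/data/Streaming/hdfsWordCount/", // hypothetical directory
  (path: Path) => !path.getName.startsWith("."), // skip hidden/temp files
  newFilesOnly = false
).map(_._2.toString) // Text value -> String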
2. Running:
Input:
hadoop@Master:~/cloud/testByXubo/spark/Streaming/data$ hadoop fs -put 2.txt /xubo/spark/data/Streaming/hdfsWordCount/
hadoop@Master:~/cloud/testByXubo/spark/Streaming/data$ hadoop fs -put 3.txt /xubo/spark/data/Streaming/hdfsWordCount/
hadoop@Master:~/cloud/testByXubo/spark/Streaming/data$ cat 3.txt
hello world
hello world
hello world
hello world
hello world
hello world
hello world
a
a
a
a
a
a
a b b b
Output:
16/04/26 21:26:06 INFO scheduler.DAGScheduler: Job 19 finished: print at HdfsWordCount.scala:52, took 0.023056 s
-------------------------------------------
Time: 1461677166000 ms
-------------------------------------------
(hello,1)
(world,1)
After the new files are added:
-------------------------------------------
Time: 1461677550000 ms
-------------------------------------------
(b,3)
(hello,7)
(world,7)
(a,7)
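Note that each batch is counted independently: the output above reflects only the files that arrived during that batch interval (here, 3.txt), not a running total. To accumulate counts across batches you could use updateStateByKey instead of reduceByKey. A minimal sketch (the checkpoint path is hypothetical, and lines is the DStream from the source below):

// Hypothetical variant that accumulates counts across batches.
// updateStateByKey requires a checkpoint directory for its state.
ssc.checkpoint("/xubo/spark/checkpoint")

val totals = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .updateStateByKey[Int] { (newCounts: Seq[Int], running: Option[Int]) =>
    Some(newCounts.sum + running.getOrElse(0)) // add this batch to the total
  }
totals.print()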
3. Source code:
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
// scalastyle:off println
package org.apache.spark.Streaming.learning
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions
/**
* Counts words in new text files created in the given directory
* Usage: HdfsWordCount <directory>
* <directory> is the directory that Spark Streaming will use to find and read new text files.
*
* To run this on your local machine on directory `localdir`, run this example
* $ bin/run-example \
* org.apache.spark.examples.streaming.HdfsWordCount localdir
*
* Then create a text file in `localdir` and the words in the file will get counted.
*/
object HdfsWordCount {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: HdfsWordCount <directory>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()
    val sparkConf = new SparkConf().setAppName("HdfsWordCount")
    // Create the context
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create the FileInputDStream on the directory and use the
    // stream to count words in new files created
    val lines = ssc.textFileStream(args(0))
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
// scalastyle:on println
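The StreamingExamples.setStreamingLogLevels() call above comes from a small helper object in Spark's examples package (org.apache.spark.examples.streaming). A minimal sketch of it, modeled on the Spark 1.x version and assuming the org.apache.spark.Logging trait is available:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.Logging

object StreamingExamples extends Logging {
  /** Set reasonable logging levels for streaming if the user has not configured log4j. */
  def setStreamingLogLevels() {
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      // Log something first to trigger Spark's default log4j initialization,
      // then lower the level so streaming output is not drowned in INFO logs.
      logInfo("Setting log level to [WARN] for streaming example." +
        " To override add a custom log4j.properties to the classpath.")
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}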