For more code, see: https://github.com/xubo245/SparkLearning
1. Understanding: HdfsWordCount reads a stream of text files from HDFS. You specify a directory, and every batch interval Spark Streaming scans that path for newly created files (files in subdirectories are not scanned).
Any newly added files are fed into the stream computation. Here the batch interval is 2 seconds:
val ssc = new StreamingContext(sparkConf, Seconds(2))
The rest of the processing is much the same as in the earlier examples; see the sketch of a more general file-stream variant below.
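As an aside, textFileStream(dir) is shorthand for the more general fileStream API with Hadoop's TextInputFormat. A minimal sketch, assuming the same ssc as above (the directory path and filename filter are illustrative, not from the original post), that also processes files already present in the directory at startup:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Equivalent to textFileStream, but with an explicit filename filter and
// newFilesOnly = false, so files already in the directory are read too.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "/xubo/spark/data/Streaming/hdfsWordCount/", // hypothetical directory
  (path: Path) => !path.getName.startsWith("."), // skip hidden/temp files
  newFilesOnly = false
).map(_._2.toString) // Text value -> String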
2. Running:
Input:
hadoop@Master:~/cloud/testByXubo/spark/Streaming/data$ hadoop fs -put 2.txt /xubo/spark/data/Streaming/hdfsWordCount/
hadoop@Master:~/cloud/testByXubo/spark/Streaming/data$ hadoop fs -put 3.txt /xubo/spark/data/Streaming/hdfsWordCount/
hadoop@Master:~/cloud/testByXubo/spark/Streaming/data$ cat 3.txt
hello world
hello world
hello world
hello world
hello world
hello world
hello world
a
a
a
a
a
a
a b b b
Output:
16/04/26 21:26:06 INFO scheduler.DAGScheduler: Job 19 finished: print at HdfsWordCount.scala:52, took 0.023056 s
-------------------------------------------
Time: 1461677166000 ms
-------------------------------------------
(hello,1)
(world,1)
After the new files are added:
-------------------------------------------
Time: 1461677550000 ms
-------------------------------------------
(b,3)
(hello,7)
(world,7)
(a,7)
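Note that each batch is counted independently: the output above reflects only the files that arrived during that batch interval (here, 3.txt), not a running total. To accumulate counts across batches you could use updateStateByKey instead of reduceByKey. A minimal sketch (the checkpoint path is hypothetical, and lines is the DStream from the source below):

// Hypothetical variant that accumulates counts across batches.
// updateStateByKey requires a checkpoint directory for its state.
ssc.checkpoint("/xubo/spark/checkpoint")

val totals = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .updateStateByKey[Int] { (newCounts: Seq[Int], running: Option[Int]) =>
    Some(newCounts.sum + running.getOrElse(0)) // add this batch to the total
  }
totals.print()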
3. Source code:
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
// scalastyle:off println
package org.apache.spark.Streaming.learning
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions
/**
* Counts words in new text files created in the given directory
* Usage: HdfsWordCount <directory>
* <directory> is the directory that Spark Streaming will use to find and read new text files.
*
* To run this on your local machine on directory `localdir`, run this example
* $ bin/run-example \
* org.apache.spark.examples.streaming.HdfsWordCount localdir
*
* Then create a text file in `localdir` and the words in the file will get counted.
*/
object HdfsWordCount {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: HdfsWordCount <directory>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()
    val sparkConf = new SparkConf().setAppName("HdfsWordCount")
    // Create the context
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create the FileInputDStream on the directory and use the
    // stream to count words in new files created
    val lines = ssc.textFileStream(args(0))
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
// scalastyle:on println
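The StreamingExamples.setStreamingLogLevels() call above comes from a small helper object in Spark's examples package (org.apache.spark.examples.streaming). A minimal sketch of it, modeled on the Spark 1.x version and assuming the org.apache.spark.Logging trait is available:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.Logging

object StreamingExamples extends Logging {
  /** Set reasonable logging levels for streaming if the user has not configured log4j. */
  def setStreamingLogLevels() {
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      // Log something first to trigger Spark's default log4j initialization,
      // then lower the level so streaming output is not drowned in INFO logs.
      logInfo("Setting log level to [WARN] for streaming example." +
        " To override add a custom log4j.properties to the classpath.")
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}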