007 - Spark WordCount


Test file contents


[hadoop@mycluster ~]$ cat /home/hadoop/wc.txt
hello   me
hello   you
hello   china
hello   you


1. Read a local file or a file on HDFS



When spark-shell starts, it creates a SparkContext object named sc; the file is read through this Spark context object sc.
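
If you run outside of spark-shell, sc is not created automatically. A minimal sketch of building one in a standalone Scala program (the application name and master URL below are illustrative assumptions):

import org.apache.spark.{SparkConf, SparkContext}

// Assumed app name and master; point setMaster at your own cluster
val conf = new SparkConf().setAppName("WordCountExample").setMaster("local[*]")
val sc = new SparkContext(conf)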



scala> var textFile = sc.textFile("/home/hadoop/wc.txt").collect



Execution result: textFile: Array[String] = Array(hello   me, hello       you, hello      china, hello    you)



2. Process the file



2.1 flatMap: flatten each line of the input by splitting it on the tab character



scala> var textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(line => line.split("\t")).collect

or


scala> var textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).collect

Result:

textFile: Array[String] = Array(hello, me, hello, you, hello, china, hello, you)
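
For comparison, using map instead of flatMap keeps each line as its own array, so the collected type becomes Array[Array[String]] rather than one flat Array[String]. A hedged sketch (not part of the original example):

scala> var lines = sc.textFile("/home/hadoop/wc.txt").map(_.split("\t")).collect

// lines: Array[Array[String]] -- one array of words per input line;
// flatMap concatenates these arrays into a single Array[String]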



2.2 map(word => (word,1)): word stands for each individual word, and each word is paired with the count 1


scala> var textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(line => line.split("\t")).map(word => (word,1)).collect

or


scala> var textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).map((_,1)).collect


Result:


textFile: Array[(String, Int)] = Array((hello,1), (me,1), (hello,1), (you,1), (hello,1), (china,1), (hello,1), (you,1))



2.3 Apply the reduceByKey function


scala> var textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey( (a,b) => a + b ).collect

or


scala> var textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey( _ + _ ).collect

Result:

textFile: Array[(String, Int)] = Array((hello,4), (me,1), (you,2), (china,1))
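
The same counts could also be produced with groupByKey followed by summing the values; reduceByKey is usually preferred because it combines values inside each partition before shuffling. A rough equivalent, shown only as a sketch:

scala> var counts = sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).map((_,1)).groupByKey().mapValues(_.sum).collect

// groupByKey ships every (word, 1) pair across the network before summing,
// so reduceByKey( _ + _ ) scales better on large inputs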



2.4 Sort by the key field


scala> var textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey( _ + _ ).sortByKey(true).collect   

 

Result:

 

textFile: Array[(String, Int)] = Array((china,1), (hello,4), (me,1), (you,2))
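
sortByKey(true) sorts the keys in ascending order; passing false reverses it. A sketch of the descending variant:

scala> var textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey( _ + _ ).sortByKey(false).collect

// false = descending key order: (you,2), (me,1), (hello,4), (china,1)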


2.5 Save the output locally or to HDFS


scala> var textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey( _ + _ ).sortByKey(true).saveAsTextFile("/home/hadoop/output")

Execution result:


[hadoop@mycluster output]$ more part-00000
(hello,4)
(me,1)
[hadoop@mycluster output]$ more part-00001
(you,2)
(china,1)
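
Two part files appear because the RDD ended up with two partitions, and saveAsTextFile writes one part-NNNNN file per partition. A quick sketch for checking the partition count before saving:

scala> sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey( _ + _ ).partitions.length

// returns the number of partitions, i.e. the number of part files that saveAsTextFile will create

Note also that saveAsTextFile fails if the output directory already exists, so delete /home/hadoop/output before rerunning.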




2.6 Produce the output as a single file


scala> var textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).repartition(1).saveAsTextFile("/home/hadoop/output")


Execution result:


[hadoop@mycluster output]$ more part-00000
(hello,4)
(me,1)
(you,2)
(china,1)
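
repartition(1) forces a full shuffle just to merge everything into one partition; coalesce(1) performs the same merge without a shuffle and is usually cheaper for a small result like this. An alternative sketch:

scala> var textFile = sc.textFile("/home/hadoop/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).coalesce(1).saveAsTextFile("/home/hadoop/output")

// coalesce(1) narrows the existing partitions into one without a full shuffle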




2.7 Put the most frequently occurring word first


scala> var textFile = sc.textFile("hdfs://mycluster:9000/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).map(x=>(x._2,x._1)).sortByKey(false).map( x => (x._2,x._1) ).collect




Result: textFile: Array[(String, Int)] = Array((hello,4), (you,2), (me,1), (china,1))
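
Swapping the tuple with map, sorting, and swapping back works everywhere, but sortBy can sort directly on the count field and avoid the double swap. An equivalent sketch:

scala> var textFile = sc.textFile("hdfs://mycluster:9000/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).sortBy(_._2, false).collect

// sortBy(_._2, false) orders the (word, count) pairs by count, descending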








Note:


That is the complete WordCount example. Below is a case of reading the data from HDFS and writing the result back to HDFS.


scala> var textFile = sc.textFile("hdfs://mycluster:9000/wc.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).repartition(1).saveAsTextFile("hdfs://mycluster:9000/output")


Execution result:


[hadoop@mycluster output]$ hdfs dfs -cat hdfs://mycluster:9000/output/part-00000
(hello,4)
(me,1)
(you,2)
(china,1)
