MapReduce Beginner Example: WordCount (Word Counting)

Contents

0x00 Tutorial overview
0x01 Word count

1. Workflow
2. Source code
3. Brief explanation of the source code

0x02 Viewing the Web UI

1. YARN

0xFF Summary

0x00 Tutorial overview

The word count workflow
Writing the MapReduce word count code, with a brief explanation
Viewing the job in the YARN Web UI

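0x01 Word count

1. Workflow

a. Create a Maven project

b. Import the dependencies

PS: for steps a and b, see the section "0x01 新建maven工程" (creating a new Maven project) of this article:
Java API实现HDFS的相关操作 (HDFS operations with the Java API)

c. Write the code

d. Package the project and upload the jar to the server

e. Prepare a file whose words are separated by single spaces and put it on HDFS (adjust the path as needed):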

/files/put.txt

My data:

shao nai yi
nai nai yi yi
shao nai nai

f. Start HDFS and YARN on the server

g. Run the job (adjust the jar name, class name, and paths as needed):

hadoop jar hadoop-learning-1.0.jar com.shaonaiyi.hadoop.WordCount hdfs://master:9999/files/put.txt hdfs://master:9999/output/wc/
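
To get a feel for what the job should produce before running it on the cluster, the map, shuffle, and reduce steps can be imitated in plain Java on the sample data above. This is only a local sketch, not Hadoop code: the class name `WordCountSimulation` and the method `count` are made up for illustration, and the `TreeMap` stands in for the sorted, grouped keys that a single reducer would receive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSimulation {

    // Imitates map -> shuffle -> reduce for the word count, entirely in memory.
    static Map<String, Long> count(List<String> lines) {
        // "Shuffle": group every emitted 1 under its word, with keys kept sorted.
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String line : lines) {
            // "Map": emit (word, 1) for each space-separated token.
            for (String word : line.split(" ")) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1L);
            }
        }
        // "Reduce": sum the values collected for each word.
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, List<Long>> entry : grouped.entrySet()) {
            long sum = 0;
            for (long one : entry.getValue()) {
                sum += one;
            }
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("shao nai yi", "nai nai yi yi", "shao nai nai");
        // Prints one "word<TAB>count" line per word: nai 5, shao 2, yi 3.
        for (Map.Entry<String, Long> entry : count(sample).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

With a single reduce task, the result file under the output directory should contain the same three word and count pairs.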

2. Source code

package com.shaonaiyi.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @Author: 邵奈一
 * @Date: 2019/03/21 7:02 PM
 * @Description: WordCount beginner example: word counting (Java version)
 * Usage: hadoop jar hadoop-learning-1.0.jar com.shaonaiyi.hadoop.WordCount hdfs://master:9999/files/put.txt hdfs://master:9999/output/wc/
 */
public class WordCount {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        LongWritable one = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            String lines = value.toString();
            String[] words = lines.split(" ");
            for (String word: words){
                context.write(new Text(word), one);
            }
        }

    }

    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {

            long sum = 0; // long, to match the type returned by LongWritable.get()
            for (LongWritable value: values){
                sum += value.get();
            }
            context.write(key, new LongWritable(sum));

        }
    }

    public static void main(String[] args) throws Exception{

        Configuration configuration = new Configuration();

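        // If the output path already exists, delete it first.
        // The file system is resolved from the path itself, so a fully qualified
        // hdfs:// URI works even when fs.defaultFS points somewhere else.
        Path outputPath = new Path(args[1]);
        FileSystem fileSystem = outputPath.getFileSystem(configuration);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
            System.out.println("Output path already existed and has been deleted");
        }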

        Job job = Job.getInstance(configuration, "WordCount");

        job.setJarByClass(WordCount.class);

        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);

    }

}

3. Brief explanation of the source code

a. The delimiter can be changed; it is currently a single space:

String[] words = lines.split(" ");
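
Splitting on a single space is fine for the sample data, but consecutive spaces or tabs produce empty or merged tokens, and an empty line yields one empty "word". A minimal sketch of a more tolerant splitter is shown here; `SplitDemo` and `tokenize` are hypothetical names, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {

    // Hypothetical helper: split on any run of whitespace and drop empty tokens.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "shao  nai\tyi"; // two spaces, then a tab

        // split(" ") yields 3 tokens here: "shao", "" and "nai\tyi".
        System.out.println(line.split(" ").length);

        // The helper yields the three real words: [shao, nai, yi]
        System.out.println(tokenize(line));

        // An empty line yields no words at all: []
        System.out.println(tokenize(""));
    }
}
```

Inside the mapper, the same loop could replace `lines.split(" ")` if the input is not guaranteed to be cleanly space-separated.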

b. The first argument, args[0], is the input path, and the second argument, args[1], is the output path:

FileInputFormat.setInputPaths(job, new Path(args[0]));

FileOutputFormat.setOutputPath(job, new Path(args[1]));
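
Because main reads args[0] and args[1] directly, starting the job with fewer than two arguments fails with an ArrayIndexOutOfBoundsException. One possible guard, sketched in plain Java, checks the arguments first; the class `WordCountArgs`, the method `check`, and the usage text are assumptions for illustration.

```java
public class WordCountArgs {

    // Returns null when the arguments look usable, otherwise a message describing the problem.
    static String check(String[] args) {
        if (args == null || args.length != 2) {
            return "Usage: WordCount <input path> <output path>";
        }
        if (args[0].isEmpty() || args[1].isEmpty()) {
            return "Input and output paths must not be empty";
        }
        return null;
    }

    public static void main(String[] args) {
        String problem = check(args);
        if (problem != null) {
            System.err.println(problem);
            System.exit(2);
        }
        System.out.println("input  = " + args[0]);
        System.out.println("output = " + args[1]);
    }
}
```

The same check could be placed at the top of the WordCount main method, before the Path objects are created.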

0x02 Viewing the Web UI

1. YARN

a. Open the YARN Web UI

b. Click RUNNING or FINISHED to see the jobs that are in progress or already completed, respectively


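0xFF Summary

In this tutorial the job is not run directly from IDEA on the local machine; package it and upload the jar to the server first.
Exercise: make the program run long enough to observe, then watch how the processes on the three servers change while the job is executing.
Hint: add a 3-second delay in the Reducer class, set the number of reduce tasks to 2 in the main class, and have the job process several more files by using a * wildcard in the input path. Then keep watching the processes on the three servers; in the meantime you can also check in the YARN Web UI how many Map and Reduce tasks there are.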
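
For the exercise, the delay can be a `Thread.sleep(3000)` inside reduce, and the number of reduce tasks can be set with `job.setNumReduceTasks(2)` in main. With two reduce tasks, each one writes its own result file, and the partitioner decides which words go to which task. The sketch below imitates that assignment in plain Java. It assumes the default hash partitioner takes the non-negative key hash modulo the number of reduce tasks, and that a Text key is hashed byte by byte as shown; both formulas are reproduced from memory here, so treat the result as an illustration rather than a guarantee. `PartitionSketch` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

public class PartitionSketch {

    // Assumption: mirrors how Hadoop hashes the raw bytes of a Text key.
    static int hashBytes(byte[] bytes) {
        int hash = 1;
        for (byte b : bytes) {
            hash = 31 * hash + b;
        }
        return hash;
    }

    // Assumption: mirrors the default hash partitioner formula.
    static int partition(String word, int numReduceTasks) {
        int hash = hashBytes(word.getBytes(StandardCharsets.UTF_8));
        return (hash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2; // the value suggested in the hint
        for (String word : new String[] {"shao", "nai", "yi"}) {
            // Shows which reduce task index each sample word would be sent to.
            System.out.println(word + " -> reduce task " + partition(word, numReduceTasks));
        }
    }
}
```

Comparing this output with the files the real job writes under the output directory is a quick way to check whether the assumption holds on your cluster.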

About the author: 邵奈一

University big data lecturer, university market observer, column editor

WeChat official account and Weibo: 邵奈一
