Hadoop Source Code Explained: The Mapper Class

河南妞 2022-01-26



1. Class description


Maps input key/value pairs to a set of intermediate key/value pairs.



Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.



The Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. Mapper implementations can access the Configuration for the job via JobContext.getConfiguration().

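As a minimal sketch (not from the original post), here is how a Mapper implementation might read a job setting in setup() through context.getConfiguration(); the class name ConfigAwareMapper and the property my.wordcount.case.sensitive are made up for illustration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfigAwareMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private boolean caseSensitive;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // the job Configuration is reachable from the Context (JobContext.getConfiguration())
    Configuration conf = context.getConfiguration();
    caseSensitive = conf.getBoolean("my.wordcount.case.sensitive", false);  // hypothetical property
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = caseSensitive ? value.toString() : value.toString().toLowerCase();
    context.write(new Text(line), new IntWritable(1));
  }
}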


The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, org.apache.hadoop.mapreduce.Mapper.Context) for each key/value pair in the InputSplit. Finally, cleanup(org.apache.hadoop.mapreduce.Mapper.Context) is called.
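A small hypothetical example of that call order, with setup() run once, map() once per record in the split, and cleanup() once at the end (the class RecordCountMapper is invented here for illustration):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  private long records;

  @Override
  protected void setup(Context context) {
    records = 0;   // called once, before the first map() call
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    records++;     // called once per key/value pair in the InputSplit
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // called once, after the last map() call: emit a single summary pair per map task
    context.write(new Text("records-in-split"), new LongWritable(records));
  }
}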



All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to a Reducer to determine the final output. Users can control the sorting and grouping by specifying two key RawComparator classes. [What are these two key RawComparator classes? See the sketch below.]
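On that question: the two hooks the Javadoc refers to are the sort comparator and the grouping comparator, both registered on the Job as RawComparator implementations. A hedged sketch, where MySortComparator and MyGroupingComparator stand for hypothetical WritableComparator subclasses:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance(new Configuration(), "comparator-demo");
// RawComparator #1: controls how the map output keys are sorted before they reach the Reducer
job.setSortComparatorClass(MySortComparator.class);
// RawComparator #2: controls which (already sorted) keys are treated as equal,
// i.e. grouped into a single reduce() call
job.setGroupingComparatorClass(MyGroupingComparator.class);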



The Mapper outputs are partitioned per Reducer. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.

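For example, a custom Partitioner might look like the following sketch (FirstLetterPartitioner is a hypothetical class, not from the original post); it would be registered with job.setPartitionerClass(FirstLetterPartitioner.class), and without a custom Partitioner the default HashPartitioner is used.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // all keys starting with the same character go to the same Reducer
    String s = key.toString();
    char first = s.isEmpty() ? '\0' : s.charAt(0);
    return (first & Integer.MAX_VALUE) % numPartitions;
  }
}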


Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
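A common word-count style configuration, sketched here on the assumption that the reducer simply sums IntWritable values (job is an already created Job instance, WordCountMapper is the mapper discussed later in this post, and IntSumReducer comes from org.apache.hadoop.mapreduce.lib.reduce):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(IntSumReducer.class);   // local, map-side aggregation of the <word, 1> pairs
job.setReducerClass(IntSumReducer.class);    // final aggregation on the reduce side
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);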



Applications can specify if and how the intermediate outputs are to be compressed, and which CompressionCodecs are to be used, via the Configuration.
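A hedged sketch of what that looks like in code; the property names are the Hadoop 2.x+ ones (older releases used the mapred.* equivalents), and Snappy is just one possible codec:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

Configuration conf = new Configuration();
// compress the intermediate (map-side) output ...
conf.setBoolean("mapreduce.map.output.compress", true);
// ... using this CompressionCodec implementation
conf.setClass("mapreduce.map.output.compress.codec",
              SnappyCodec.class, CompressionCodec.class);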


2. Class source code

package org.apache.hadoop.mapreduce;

import java.io.IOException;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.task.MapContextImpl;

/*
* @see InputFormat
* @see JobContext
* @see Partitioner
* @see Reducer
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  /**
   * The <code>Context</code> passed on to the {@link Mapper} implementations.
   */
  public abstract class Context
    implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }

  /**
   * Called once at the beginning of the task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}

3. Class methods

3.1 Inner class Context
  • Class description


The Context passed on to the Mapper implementations.


  • Class code
public abstract class Context
    implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
}

As you can see, the abstract class Context implements the MapContext interface, parameterized as <KEYIN,VALUEIN,KEYOUT,VALUEOUT>.

MapContext, in turn, extends TaskInputOutputContext, and TaskInputOutputContext declares a write() method. In the WordCountMapper class, it is exactly this write() method that is used to emit the intermediate key/value pairs:

// emit <word, 1>
for (String word : words) {
  // write(): Generate an output key/value pair => (KeyOut: Text, ValueOut: IntWritable)
  context.write(new Text(word), new IntWritable(1));
}
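The full WordCountMapper is not shown in this post; a hedged reconstruction of a class the snippet above could belong to looks roughly like this:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // split the input line into words
    String[] words = value.toString().split(" ");
    // emit <word, 1> for every word, exactly as in the snippet above
    for (String word : words) {
      context.write(new Text(word), new IntWritable(1));
    }
  }
}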

??? What I am still not clear about, though, is where the actual implementation of this write() method lives. ???
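(Not from the original post, but as a pointer: judging by the import of org.apache.hadoop.mapreduce.task.MapContextImpl in the source above, the Context a running mapper actually receives is a WrappedMapper.Context wrapping a MapContextImpl; MapContextImpl inherits write() from TaskInputOutputContextImpl, which forwards the key/value pair to the RecordWriter created by the map task.)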


