Hadoop Source Code Explained: The Mapper Class

河南妞 2022-01-26



1. Class description


Maps input key/value pairs to a set of intermediate key/value pairs.



Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.



The Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. Mapper implementations can access the Configuration for the job via JobContext.getConfiguration().

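As a minimal sketch (not from the original post), here is how a Mapper implementation might read a job setting in setup() through context.getConfiguration(); the class name ConfigAwareMapper and the property my.wordcount.case.sensitive are made up for illustration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfigAwareMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private boolean caseSensitive;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // the job Configuration is reachable from the Context (JobContext.getConfiguration())
    Configuration conf = context.getConfiguration();
    caseSensitive = conf.getBoolean("my.wordcount.case.sensitive", false);  // hypothetical property
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = caseSensitive ? value.toString() : value.toString().toLowerCase();
    context.write(new Text(line), new IntWritable(1));
  }
}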


The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, org.apache.hadoop.mapreduce.Mapper.Context) for each key/value pair in the InputSplit. Finally, cleanup(org.apache.hadoop.mapreduce.Mapper.Context) is called.
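A small hypothetical example of that call order, with setup() run once, map() once per record in the split, and cleanup() once at the end (the class RecordCountMapper is invented here for illustration):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  private long records;

  @Override
  protected void setup(Context context) {
    records = 0;   // called once, before the first map() call
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    records++;     // called once per key/value pair in the InputSplit
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // called once, after the last map() call: emit a single summary pair per map task
    context.write(new Text("records-in-split"), new LongWritable(records));
  }
}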



All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to a Reducer to determine the final output. Users can control the sorting and grouping by specifying two key RawComparator classes. [What are these two key RawComparator classes? See the sketch below.]
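On that question: the two hooks the Javadoc refers to are the sort comparator and the grouping comparator, both registered on the Job as RawComparator implementations. A hedged sketch, where MySortComparator and MyGroupingComparator stand for hypothetical WritableComparator subclasses:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance(new Configuration(), "comparator-demo");
// RawComparator #1: controls how the map output keys are sorted before they reach the Reducer
job.setSortComparatorClass(MySortComparator.class);
// RawComparator #2: controls which (already sorted) keys are treated as equal,
// i.e. grouped into a single reduce() call
job.setGroupingComparatorClass(MyGroupingComparator.class);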



The Mapper outputs are partitioned per Reducer. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.

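For example, a custom Partitioner might look like the following sketch (FirstLetterPartitioner is a hypothetical class, not from the original post); it would be registered with job.setPartitionerClass(FirstLetterPartitioner.class), and without a custom Partitioner the default HashPartitioner is used.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // all keys starting with the same character go to the same Reducer
    String s = key.toString();
    char first = s.isEmpty() ? '\0' : s.charAt(0);
    return (first & Integer.MAX_VALUE) % numPartitions;
  }
}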


Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
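A common word-count style configuration, sketched here on the assumption that the reducer simply sums IntWritable values (job is an already created Job instance, WordCountMapper is the mapper discussed later in this post, and IntSumReducer comes from org.apache.hadoop.mapreduce.lib.reduce):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(IntSumReducer.class);   // local, map-side aggregation of the <word, 1> pairs
job.setReducerClass(IntSumReducer.class);    // final aggregation on the reduce side
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);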



Applications can specify if and how the intermediate outputs are to be compressed, and which CompressionCodecs are to be used, via the Configuration.
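A hedged sketch of what that looks like in code; the property names are the Hadoop 2.x+ ones (older releases used the mapred.* equivalents), and Snappy is just one possible codec:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

Configuration conf = new Configuration();
// compress the intermediate (map-side) output ...
conf.setBoolean("mapreduce.map.output.compress", true);
// ... using this CompressionCodec implementation
conf.setClass("mapreduce.map.output.compress.codec",
              SnappyCodec.class, CompressionCodec.class);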


2. Class source code

package org.apache.hadoop.mapreduce;

import java.io.IOException;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.task.MapContextImpl;

/*
* @see InputFormat
* @see JobContext
* @see Partitioner
* @see Reducer
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  /**
   * The <code>Context</code> passed on to the {@link Mapper} implementations.
   */
  public abstract class Context
    implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }

  /**
   * Called once at the beginning of the task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}

3. Class methods

3.1 Inner class Context
  • Class description


The Context passed on to the Mapper implementations.


  • Class code
public abstract class Context
    implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
}

As you can see, the abstract class Context implements the MapContext interface, parameterized as <KEYIN,VALUEIN,KEYOUT,VALUEOUT>.

MapContext, in turn, extends TaskInputOutputContext, and TaskInputOutputContext declares a write() method. In the WordCountMapper class, it is exactly this write() method that is used to emit the intermediate key/value pairs:

// emit <word, 1>
for (String word : words) {
  // write(): Generate an output key/value pair => (KeyOut: Text, ValueOut: IntWritable)
  context.write(new Text(word), new IntWritable(1));
}
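The full WordCountMapper is not shown in this post; a hedged reconstruction of a class the snippet above could belong to looks roughly like this:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // split the input line into words
    String[] words = value.toString().split(" ");
    // emit <word, 1> for every word, exactly as in the snippet above
    for (String word : words) {
      context.write(new Text(word), new IntWritable(1));
    }
  }
}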

??? What I am still not clear about, though, is where the actual implementation of this write() method lives. ???
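(Not from the original post, but as a pointer: judging by the import of org.apache.hadoop.mapreduce.task.MapContextImpl in the source above, the Context a running mapper actually receives is a WrappedMapper.Context wrapping a MapContextImpl; MapContextImpl inherits write() from TaskInputOutputContextImpl, which forwards the key/value pair to the RecordWriter created by the map task.)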


