Hadoop Reading Notes, Part 8 — Chapter 8

zhongjh, 2022-01-26

A close look at the FileInputFormat class

FileInputFormat is the base class for all implementations of InputFormat that use files as their data source.


FileInputFormat offers four static convenience methods for setting a Job’s input paths:


public static void addInputPath(Job job, Path path)
public static void addInputPaths(Job job, String commaSeparatedPaths)
public static void setInputPaths(Job job, Path... inputPaths)
public static void setInputPaths(Job job, String commaSeparatedPaths)


A path representing a directory includes all the files in the directory as input to the job.


In other words: if a path refers to a directory, all the files in that directory become input to the job.
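The four convenience methods above can be sketched in a driver like this (the paths are illustrative, not from the notes):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputPathSetup {
    public static void configure(Job job) {
        // addInputPath appends: calling it repeatedly accumulates input paths.
        FileInputFormat.addInputPath(job, new Path("/data/logs"));     // a directory: all its files become input
        FileInputFormat.addInputPaths(job, "/data/a.txt,/data/b.txt"); // comma-separated list of paths
        // setInputPaths replaces any previously set paths in a single call.
        FileInputFormat.setInputPaths(job, new Path("/data/x"), new Path("/data/y"));
    }
}
```

Note the append-versus-replace distinction: the add* methods build up a list across calls, while setInputPaths overwrites it.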

2. Why does Hadoop work better with a small number of large files than a large number of small files?

01. FileInputFormat generates splits in such a way that each split is all or part of a single file.

If the files are very small ("small" meaning significantly smaller than an HDFS block) and there are a lot of them, each map task will process very little input, and there will be a lot of tasks (one per file), each of which imposes extra bookkeeping overhead.

02. Processing many small files increases the number of seeks needed to run a job.

03. Also, storing large numbers of small files in HDFS is wasteful of the namenode's memory.

How to mitigate the impact of small files



01. CombineFileInputFormat, which was designed to work well with small files. Where FileInputFormat creates a split per file, CombineFileInputFormat packs many files into each split so that each mapper has more to process.



CombineFileInputFormat takes node and rack locality into account when deciding which blocks to place in the same split, so it does not compromise the speed at which it can process the input in a typical MapReduce job.



02. One technique for avoiding the many-small-files case is to merge small files into larger files by using a sequence file. With this approach, the keys can act as filenames (or a constant such as NullWritable, if not needed) and the values as file contents.
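A minimal sketch of that technique (the class name, the local-file source, and the BytesWritable value type are my assumptions, not from the notes):

```java
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Pack many small local files into one sequence file:
// key = original filename, value = raw file contents.
public class SmallFilePacker {
    public static void pack(File[] smallFiles, Path out, Configuration conf) throws Exception {
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (File f : smallFiles) {
                byte[] contents = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(contents));
            }
        }
    }
}
```

The resulting single large, splittable file can then be consumed with SequenceFileInputFormat instead of one map task per small file.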

3. Performance of a MapReduce job

MapReduce works best when it can operate at the transfer rate of the disks in the cluster.

In other words: a MapReduce job performs best when it can run at the transfer rate of the cluster's disks. (Reading data costs seek time as well as transfer time, so transfer time alone is not the whole story; but when the seek time is small enough to be negligible, the transfer time approximates the total time the job spends acquiring its data.)

4. Preventing splitting

01. Sometimes you do not want an input file to be split, so that a single mapper processes the whole file. How can that be done?

02. For example, a simple way to check whether all the records in a file are sorted is to go through them in order, verifying that each record is not less than the previous one. Implemented as a map task, this only works if one map processes the whole file, which is why the file must not be split.

What are the ways to ensure a file will not be split?

01. The first (quick-and-dirty) way is to increase the minimum split size to be larger than the largest file in your system. Setting it to its maximum value, Long.MAX_VALUE, has this effect.

02. The second is to subclass the concrete subclass of FileInputFormat that you want to use, overriding the isSplitable() method to return false. For example:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
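With that class in place, the driver only needs one line (a sketch; the surrounding job setup is assumed, not from the notes):

```java
import org.apache.hadoop.mapreduce.Job;

public class NonSplitDriver {
    public static void configure(Job job) {
        // Every input file now yields exactly one split, hence one mapper per file.
        job.setInputFormatClass(NonSplittableTextInputFormat.class);
    }
}
```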

5. File information in the mapper

A mapper processing a file input split can find information about the split by calling the getInputSplit() method on the Mapper's Context object.

6. How to use a FileSplit when we need to access the split's filename
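One common answer (a sketch; the mapper's types and output are illustrative assumptions): cast the generic InputSplit returned by getInputSplit() to FileSplit, which exposes the path of the file being processed.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // For file-based input formats, getInputSplit() returns a FileSplit.
        FileSplit split = (FileSplit) context.getInputSplit();
        String fileName = split.getPath().getName();
        context.write(new Text(fileName), value); // tag each line with its source file
    }
}
```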

7. Text Input

Hadoop excels at processing unstructured text. This section looks at the InputFormats that Hadoop provides for processing text.

7.1 TextInputFormat

TextInputFormat is the default InputFormat.

The value is the contents of the line, excluding any line terminators (newline or carriage return).

The keys are not line numbers. Why not use line numbers as keys?

This would be impossible to implement in general, in that a file is broken into splits at byte, not line, boundaries. Splits are processed independently. Line numbers are really a sequential notion. You have to keep a count of lines as you consume them, so knowing the line number within a split would be possible, but not within the file.

The main reasons:

01. A file is broken into splits at byte boundaries, not line boundaries, and splits are processed independently. Line numbers are really a sequential notion: you would have to keep a count of lines as you consume them, so knowing the line number within a split is possible, but not within the whole file.

02. Byte offsets, on the other hand, can be computed by each split independently: since every split (except possibly the last) has a fixed size such as 128 MB, split i begins at byte offset i × 128 × 2^20, and each record's key is that offset plus the record's position within the split.
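That arithmetic can be checked directly (a sketch; the 128 MB split size is the assumed default, and real splits follow file and block boundaries):

```java
public class SplitOffsets {
    static final long SPLIT_SIZE = 128L * 1024 * 1024; // 128 MB = 128 * 2^20 bytes

    // Byte offset at which split i begins, assuming every split is exactly one block.
    static long splitStart(int i) {
        return i * SPLIT_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(splitStart(0)); // 0
        System.out.println(splitStart(2)); // 268435456 = 2 * 128 * 2^20
    }
}
```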

Controlling the maximum line length

The problem: if a line is extremely long, the value can exceed available memory and the task fails.

01. By setting mapreduce.input.linerecordreader.line.maxlength to a value in bytes that fits in memory (and is comfortably greater than the length of lines in your input data), you ensure that the record reader will skip the (long) corrupt lines without the task failing.
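In code this is a single configuration property (the 10 MB cap is an illustrative value, not from the notes):

```java
import org.apache.hadoop.conf.Configuration;

public class MaxLineLength {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Lines longer than this many bytes are skipped instead of failing the task.
        conf.setInt("mapreduce.input.linerecordreader.line.maxlength", 10 * 1024 * 1024);
        return conf;
    }
}
```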

7.2 KeyValueTextInputFormat

It is common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character. For example, this is the kind of output produced by TextOutputFormat, Hadoop’s default OutputFormat. To interpret such files correctly, KeyValueTextInputFormat is appropriate.
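The record reader splits each line at the first occurrence of the separator (a tab by default). A pure-Java sketch of that rule, outside Hadoop:

```java
public class KeyValueLine {
    // Split a line at the FIRST separator, as KeyValueTextInputFormat's reader does;
    // if the separator is absent, the whole line is the key and the value is empty.
    static String[] parse(String line, char sep) {
        int pos = line.indexOf(sep);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = parse("line1\tvalue one", '\t');
        System.out.println(kv[0]); // line1
        System.out.println(kv[1]); // value one
    }
}
```

(In a real job the separator is set with the property mapreduce.input.keyvaluelinerecordreader.key.value.separator.)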

7.3 NLineInputFormat

With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input. The number depends on the size of the split and the length of the lines.

01. Why is it called NLineInputFormat?

N refers to the number of lines of input that each mapper receives.

02. N defaults to 1: each mapper receives exactly one line of input.

The property mapreduce.input.lineinputformat.linespermap controls the value of N.

03. N changes only the number of lines each mapper receives; the keys and values are the same as with TextInputFormat.
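A sketch of wiring this up in a driver (1000 lines per mapper is an illustrative value, not from the notes):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineSetup {
    public static void configure(Job job) {
        job.setInputFormatClass(NLineInputFormat.class);
        // Equivalent to setting mapreduce.input.lineinputformat.linespermap.
        NLineInputFormat.setNumLinesPerSplit(job, 1000);
    }
}
```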

7.4 How do NLineInputFormat and TextInputFormat differ?

01. In the way the splits are constructed.

7.5 XML

If a large XML document is made up of multiple input splits, it is a challenge to parse them individually. One option is to process the entire XML document in one mapper (by preventing splitting, as described above).

Binary Input

01. SequenceFileInputFormat

Sequence files are well suited as a format for MapReduce data because they are splittable.

Characteristics of sequence files: they support compression as part of the format, and they can store arbitrary types using a variety of serialization frameworks.

02. To read sequence files as MapReduce input, use SequenceFileInputFormat.

03. Why is there no MapFileInputFormat class? (The book's answer: SequenceFileInputFormat can read map files too; when it finds a directory where it expected a sequence file, it assumes it is reading the data file of a MapFile.)

SequenceFileAsTextInputFormat

01. A variant of SequenceFileInputFormat that converts the sequence file's keys and values to Text objects.

02. This makes sequence files suitable input for Streaming.

MultipleInputs

You may have data sources that provide the same type of data but in different formats.

01. Multiple mappers, and multiple inputs.

02. Or: one mapper, but multiple input formats.
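The second case can be sketched with MultipleInputs (the paths and formats are illustrative assumptions):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultiFormatSetup {
    public static void configure(Job job) {
        // One mapper class for the whole job, but each path gets its own InputFormat.
        MultipleInputs.addInputPath(job, new Path("/data/plain"), TextInputFormat.class);
        MultipleInputs.addInputPath(job, new Path("/data/seq"), SequenceFileInputFormat.class);
        // An overload that also takes a Mapper class covers case 01
        // (a different mapper per source).
    }
}
```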

Sqoop

To move data between relational databases and HDFS, consider using Sqoop.

TableInputFormat is designed to allow a MapReduce program to operate on data stored in an HBase table.

Output

TextOutputFormat

Its keys and values may be of any type, since TextOutputFormat turns them into strings by calling toString() on them.


