Hadoop: The Definitive Guide, Reading Notes 9 — Chapter 9

有态度的萌狮子 2022-01-26

Counters

There are often things that you would like to know about the data you are analyzing but that are peripheral to the analysis you are performing.

Counters are a useful channel for gathering statistics about the job: for quality control or for application-level statistics.

Task Counters

Task counters are sent in full every time, rather than sending the counts since the last transmission, since this guards against errors due to lost messages.

Do counter values only ever increase?

No: if a task fails while the job is running, its counters will decrease.

Job Counters

Job counters (Table 9-6) are maintained by the application master, so they don't need to be sent across the network, unlike all other counters, including user-defined ones.

01.They measure job-level statistics, not values that change while an individual task is running.

User-Defined Java Counters

01.The name of the enum is the group name, and the enum’s fields are the counter names.

02.Counters are global: the MapReduce framework aggregates them across all maps and reduces to produce a grand total at the end of the job
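As a minimal sketch of both points (the mapper, its input layout, and the enum fields are my own illustrative choices, loosely modeled on the book's MaxTemperatureWithCounters example):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TemperatureQualityMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // The enum's name ("Temperature") becomes the counter group;
  // its fields (MISSING, MALFORMED) become the counter names.
  enum Temperature { MISSING, MALFORMED }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Hypothetical input: each line is a temperature reading, possibly empty.
    String line = value.toString().trim();
    if (line.isEmpty()) {
      context.getCounter(Temperature.MISSING).increment(1);
      return;
    }
    try {
      int temperature = Integer.parseInt(line);
      context.write(new Text("temperature"), new IntWritable(temperature));
    } catch (NumberFormatException e) {
      context.getCounter(Temperature.MALFORMED).increment(1);
    }
  }
}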

Dynamic counters

01.The code makes use of a dynamic counter — one that isn’t defined by a Java enum.

Because a Java enum’s fields are defined at compile time, you can’t create new counters on the fly using enums.

public Counter getCounter(String groupName, String counterName)

The key question: how exactly are dynamic counters used in practice?
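To answer that, a hedged sketch, assuming a hypothetical record layout in which each line ends in a one-character quality code that is not known at compile time; getCounter(String, String) creates the counter on first use:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class QualityCodeCountMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    if (line.isEmpty()) {
      return;
    }
    // Hypothetical record layout: the last character is a quality code.
    String qualityCode = line.substring(line.length() - 1);

    // Dynamic counter: group and counter names are plain strings, so a new
    // counter is created the first time an unseen quality code appears,
    // which an enum (fixed at compile time) could not do.
    context.getCounter("TemperatureQuality", qualityCode).increment(1);
  }
}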

Retrieving counters

What are the options?

01.using the web UI

02.using the command line

03.using the Java API

How does each of these three approaches work in practice?
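In rough outline: the web UI lists counters on each job's page; on the command line, something like "mapred job -counter <job-id> <group> <counter>" prints a single counter (check your version's mapred job help for the exact flags); and the Java API can read counters from a client program. A sketch of the Java route, using a built-in task counter for simplicity:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.TaskCounter;

public class PrintCounters {
  public static void main(String[] args) throws Exception {
    String jobId = args[0];  // a job ID string passed on the command line

    Cluster cluster = new Cluster(new Configuration());
    Job job = cluster.getJob(JobID.forName(jobId));
    if (job == null) {
      System.err.println("No job with ID " + jobId + " found");
      System.exit(1);
    }

    Counters counters = job.getCounters();
    // A built-in task counter; user-defined counters can be looked up
    // with findCounter(groupName, counterName) instead.
    long inputRecords =
        counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
    System.out.println("Map input records: " + inputRecords);
  }
}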

Sort

The ability to sort data is at the heart of MapReduce.

01.we examine different ways of sorting datasets

02.how you can control the sort order in MapReduce

03.why numeric keys need a binary representation such as IntWritable rather than Text, because signed integers don't sort lexicographically

Controlling the sort order

The sort order for keys is controlled by a RawComparator.

01.If the property mapreduce.job.output.key.comparator.class is set, either explicitly or by calling setSortComparatorClass() on Job, then an instance of that class is used.

02.Otherwise, keys must be a subclass of WritableComparable, and the registered comparator for the key class is used.

03.If there is no registered comparator, then a RawComparator is used that deserializes the objects being compared and delegates to the WritableComparable's compareTo() method.
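As a concrete illustration of rule 01, a minimal sketch of a comparator that reverses the natural IntWritable ordering (the class name is my own):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Reverses the natural (ascending) ordering of IntWritable keys.
public class DescendingIntComparator extends WritableComparator {

  public DescendingIntComparator() {
    super(IntWritable.class, true);  // true: create instances for compare()
  }

  @Override
  @SuppressWarnings("rawtypes")
  public int compare(WritableComparable a, WritableComparable b) {
    return -super.compare(a, b);     // flip the result of compareTo()
  }
}

Registering it with job.setSortComparatorClass(DescendingIntComparator.class) corresponds to setting mapreduce.job.output.key.comparator.class, as in rule 01.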

Partial Sort

Total Sort

01.The naive answer is to use a single partition.

The problem: this is incredibly inefficient for large files.

02.It is possible to produce a set of sorted files that, if concatenated, would form a globally sorted file.

The secret to doing this is to use a partitioner that respects the total order of the output.

For example, suppose we have four partitions, with the keys divided among them as follows (a minimal partitioner for these ranges is sketched after the list):

001.< -10

002.[-10,0)

003.[0,10)

004.>= 10
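A minimal sketch of a partitioner enforcing exactly these four ranges (IntWritable keys are assumed, the reducer count is assumed to be 4, and the boundaries are hard-coded here only for illustration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys into four ranges: (< -10), [-10, 0), [0, 10), (>= 10),
// so partition i only ever holds keys smaller than those in partition i+1.
public class RangePartitioner extends Partitioner<IntWritable, Text> {

  @Override
  public int getPartition(IntWritable key, Text value, int numPartitions) {
    int k = key.get();
    if (k < -10) {
      return 0;
    } else if (k < 0) {
      return 1;
    } else if (k < 10) {
      return 2;
    } else {
      return 3;
    }
  }
}

Concatenating the sorted outputs of partitions 0 through 3 then yields a globally sorted file; the hard part, addressed next, is choosing boundaries that keep the partitions balanced.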

03.To build well-balanced partitions, however, you need a good estimate of the distribution of the data being analyzed.

There are many ways to estimate that distribution; the classic approach is to take a sample.

It’s possible to get a fairly even set of partitions by sampling the key space.

Hadoop comes with a selection of samplers. The InputSampler class defines a nested Sampler interface whose implementations return a sample of keys given an InputFormat and Job:

public interface Sampler<K, V> {
  K[] getSample(InputFormat<K, V> inf, Job job)
      throws IOException, InterruptedException;
}

Using the various samplers
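As a hedged, driver-side sketch of how a sampler and TotalOrderPartitioner typically fit together (input/output setup omitted; a SequenceFile input with IntWritable keys and Text values is assumed, as in the book's sorting example):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSortDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "total sort");
    // ... input/output formats, paths, and key/value classes omitted;
    // assumes SequenceFile input with IntWritable keys and Text values.

    job.setNumReduceTasks(4);                 // number of output partitions
    job.setPartitionerClass(TotalOrderPartitioner.class);

    // RandomSampler(freq, numSamples, maxSplitsSampled): pick each key with
    // probability 0.1, up to 10,000 keys, from at most 10 input splits.
    InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<>(0.1, 10000, 10);

    // Derive partition boundaries from the sample, write them to the
    // partition file, and ship that file to every task via the cache.
    InputSampler.writePartitionFile(job, sampler);
    String partitionFile =
        TotalOrderPartitioner.getPartitionFile(job.getConfiguration());
    job.addCacheFile(new URI(partitionFile));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Other samplers that ship with Hadoop, such as InputSampler.SplitSampler and InputSampler.IntervalSampler, can be dropped in where RandomSampler is used above.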


