Flume Series 1: Why Flume Is Needed and How It Works

1. Why do we need Flume?

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.

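In other words: Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large volumes of log data from many sources into a central data store. It is not limited to log aggregation. Because its data sources are customizable, it can carry almost any kind of event data, such as network traffic, social media data, or email messages.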

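Flume is designed to bulk-load large volumes of event-based data into distributed clusters such as Hadoop. A typical example is collecting log files from a group of web servers and moving the log events in those files into a new aggregated file on HDFS for further processing. For this reason the terminal sink of a Flume flow is usually HDFS, but Flume is flexible enough to deliver the collected data to many other external storage systems as well, such as HBase, Hive, and Kafka.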

2. What is Flume, in essence?

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).

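In other words: an event is the unit of data that flows through Flume, consisting of a byte payload and an optional set of string attributes (headers). An agent is a JVM process hosting the components that move events from an external source to the next destination (hop).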
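The event definition above can be pictured with a small data structure. This is only a Python sketch for illustration, not Flume's Java API; the class name `Event`, its field names, and the sample header are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Toy model of a Flume event: a byte payload plus optional string headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


# A log line from a hypothetical web server, tagged with an illustrative header.
event = Event(body=b"GET /index.html 200", headers={"host": "web01"})

# The attributes are optional, so an event may carry a payload only.
bare = Event(body=b"payload only")
```

The payload is kept as raw bytes because Flume does not interpret it; only the headers carry structured, string-valued metadata.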

A Flume source consumes events delivered to it by an external source like a web server. The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in the flow that send events from an Avro sink.

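In other words: a Flume source consumes events that an external source, such as a web server, delivers to it, and the external source must send them in a format the target Flume source recognizes. An Avro source, for example, can receive Avro events from Avro clients, or from the Avro sink of another Flume agent in the flow.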

In short, to use Flume you run a Flume agent: a Java process made up of continuously running sources (where data comes in), sinks (where data goes out), and channels (which connect sources to sinks).


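The agent is the central role in a distributed Flume system: a Flume collection system is formed by connecting agents to one another. Each agent (a process) acts as a data courier and contains three components:

① Source: the collection end. It connects to the data source and obtains data from it.

② Sink: the delivery end. It passes the collected data on to the next agent or to the final storage system.

③ Channel: the data pipeline inside the agent. It carries data from the source to the sink.

Note: data moves from source to channel to sink in the form of events; an event is the unit of data flow. Flume provides different source implementation classes for different kinds of data sources, and the same is true of sinks, so the common cases need no programming on our part.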
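The source, channel, and sink roles described above can be sketched as a toy pipeline. This is a simplified Python model under stated assumptions, not Flume code: the class names `MemoryChannel`, `ListSource`, and `CollectingSink` are invented for illustration, and a plain list stands in for HDFS or the next agent.

```python
from collections import deque


class MemoryChannel:
    """Buffers events between a source and a sink (toy stand-in for a channel)."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take(self):
        # Return the oldest event, or None when the channel is empty.
        return self._queue.popleft() if self._queue else None


class ListSource:
    """Hypothetical source: turns raw lines from a data source into events."""

    def __init__(self, lines, channel):
        self.lines = lines
        self.channel = channel

    def run(self):
        for line in self.lines:
            self.channel.put({"headers": {}, "body": line.encode("utf-8")})


class CollectingSink:
    """Hypothetical sink: drains the channel into a list standing in for storage."""

    def __init__(self, channel):
        self.channel = channel
        self.delivered = []

    def run(self):
        while True:
            event = self.channel.take()
            if event is None:
                break
            self.delivered.append(event["body"])


channel = MemoryChannel()
source = ListSource(["line 1", "line 2"], channel)
sink = CollectingSink(channel)
source.run()
sink.run()
```

The source and the sink never talk to each other directly; the channel decouples them, which is what lets a real agent absorb differences in speed between the two ends.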

3. Flume features

3.1 Reliability

The events are staged in a channel on each agent. The events are then delivered to the next agent or terminal repository (like HDFS) in the flow. The events are removed from a channel only after they are stored in the channel of next agent or in the terminal repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.

Flume uses a transactional approach to guarantee the reliable delivery of the events. The sources and sinks encapsulate in a transaction the storage/retrieval, respectively, of the events placed in or provided by a transaction provided by the channel. This ensures that the set of events are reliably passed from point to point in the flow. In the case of a multi-hop flow, the sink from the previous hop and the source from the next hop both have their transactions running to ensure that the data is safely stored in the channel of the next hop.

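In other words: events are staged in each agent's channel, then delivered to the next agent or to a terminal repository such as HDFS. An event is removed from a channel only after it has been stored in the next agent's channel or in the terminal repository; this per-hop guarantee is what gives the whole flow its end-to-end reliability.
Flume guarantees delivery with transactions. A source wraps the storing of events into a channel in a transaction provided by that channel, and a sink does the same for retrieving them, so a set of events is passed reliably from point to point in the flow. In a multi-hop flow, the sink of the previous hop and the source of the next hop each run their own transaction to make sure the data is safely stored in the next hop's channel.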
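The rule that an event leaves a channel only after the next hop has stored it can be illustrated with a toy transaction. This Python sketch is an assumption-laden model and does not reproduce Flume's actual transaction interface; `TransactionalChannel`, `deliver`, and the `fail` flag are hypothetical names used only here.

```python
from collections import deque


class TransactionalChannel:
    """Toy channel: taken events stay reserved until commit and return on rollback."""

    def __init__(self):
        self._queue = deque()
        self._in_flight = []

    def put(self, event):
        self._queue.append(event)

    def take(self):
        event = self._queue.popleft()
        self._in_flight.append(event)
        return event

    def commit(self):
        # Delivery confirmed: only now are the events really gone.
        self._in_flight.clear()

    def rollback(self):
        # Delivery failed: put the events back at the front, in original order.
        self._queue.extendleft(reversed(self._in_flight))
        self._in_flight.clear()

    def __len__(self):
        return len(self._queue) + len(self._in_flight)


def deliver(channel, store, fail=False):
    """Hypothetical sink step: move one event to `store` inside a transaction."""
    event = channel.take()
    try:
        if fail:
            raise IOError("next hop unavailable")
        store.append(event)
        channel.commit()
    except IOError:
        channel.rollback()


channel = TransactionalChannel()
channel.put(b"event-1")
next_hop = []

deliver(channel, next_hop, fail=True)   # failure: the event stays in the channel
remaining_after_failure = len(channel)
deliver(channel, next_hop)              # success: the event is removed
```

Because a failed delivery rolls back instead of dropping the event, a retry can send the same event again; the sketch makes no attempt to detect duplicates.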

3.2 Recoverability

The events are staged in the channel, which manages recovery from failure. Flume supports a durable file channel which is backed by the local file system. There’s also a memory channel which simply stores the events in an in-memory queue, which is faster but any events still left in the memory channel when an agent process dies can’t be recovered.

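In other words: events are staged in the channel, and the channel manages recovery from failure. The file channel is durable because it is backed by the local file system. The memory channel keeps events in an in-memory queue, which is faster, but any events still in it when the agent process dies cannot be recovered.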
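The trade-off between the two channel types can be shown with a small simulation of an agent restart. This is a deliberately naive Python sketch: the real file channel is considerably more sophisticated, and the classes `ToyMemoryChannel` and `ToyFileChannel` and the file name `channel.log` are assumptions made for this example.

```python
import os
import tempfile


class ToyMemoryChannel:
    """Events live only in this process's memory."""

    def __init__(self):
        self.events = []

    def put(self, line):
        self.events.append(line)


class ToyFileChannel:
    """Events are appended to a local file and reloaded on startup."""

    def __init__(self, path):
        self.path = path
        self.events = []
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.events = [line.rstrip("\n") for line in f]

    def put(self, line):
        # Write to disk first, so the event survives a crash of this process.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(line + "\n")
        self.events.append(line)


path = os.path.join(tempfile.mkdtemp(), "channel.log")
memory, durable = ToyMemoryChannel(), ToyFileChannel(path)
memory.put("event-1")
durable.put("event-1")

# Simulate an agent restart: fresh objects, same disk.
memory_after = ToyMemoryChannel()
durable_after = ToyFileChannel(path)
```

After the simulated restart the memory-backed channel is empty while the file-backed one still holds its event, at the cost of a disk write on every `put`.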

4. Common collection topologies in practice

Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination. It also allows fan-in and fan-out flows, contextual routing and backup routes (fail-over) for failed hops.

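In other words: events can pass through several agents before reaching their final destination, and flows can fan in, fan out, route events by context, and fail over to backup routes when a hop fails.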
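A multi-hop, fan-in flow can be sketched as agents that forward to one another. This Python model is illustrative only: the `Agent` class, the agent names `web01`, `web02`, and `collector`, and the in-process method calls standing in for network hops are all assumptions.

```python
from collections import deque


class Agent:
    """Toy agent: a channel plus a forward step acting as its sink."""

    def __init__(self, name, next_hop=None):
        self.name = name
        self.channel = deque()
        self.next_hop = next_hop  # another Agent, or None for the last hop
        self.stored = []          # stands in for the final data store

    def receive(self, event):
        self.channel.append(event)

    def forward(self):
        while self.channel:
            event = self.channel.popleft()
            if self.next_hop is not None:
                self.next_hop.receive(event)
            else:
                self.stored.append(event)


# Fan-in: two first-tier agents feed one collector agent.
collector = Agent("collector")
web01 = Agent("web01", next_hop=collector)
web02 = Agent("web02", next_hop=collector)

web01.receive(b"log from web01")
web02.receive(b"log from web02")
for agent in (web01, web02, collector):
    agent.forward()
```

Each event crosses two hops here, first through the agent on its web server and then through the collector, before it reaches the store.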

1. Simple structure: a single agent collects the data.

2. Complex structure: multiple agents chained in series

Note when wiring agents together: a source can feed multiple channels, but each sink reads from exactly one channel.
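The wiring between sources, channels, and sinks can be made concrete with a fan-out sketch in which one source copies each event to two channels and each sink drains its own channel. This is a Python illustration, not Flume configuration; the class names and the `to_hdfs` and `to_kafka` variables are hypothetical.

```python
from collections import deque


class Channel:
    def __init__(self, name):
        self.name = name
        self.queue = deque()


class ReplicatingSource:
    """Hypothetical source that copies each event to every attached channel."""

    def __init__(self, channels):
        self.channels = list(channels)

    def emit(self, event):
        for channel in self.channels:
            channel.queue.append(event)


class Sink:
    """A sink is bound to exactly one channel."""

    def __init__(self, channel):
        self.channel = channel
        self.received = []

    def drain(self):
        while self.channel.queue:
            self.received.append(self.channel.queue.popleft())


to_hdfs, to_kafka = Channel("c1"), Channel("c2")
source = ReplicatingSource([to_hdfs, to_kafka])
hdfs_sink, kafka_sink = Sink(to_hdfs), Sink(to_kafka)

source.emit(b"event-1")
hdfs_sink.drain()
kafka_sink.drain()
```

Giving each destination its own channel means a slow or failed destination only backs up its own queue rather than blocking the other one.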

The next article covers how to configure and use Flume and analyzes its core mechanisms.
