大数据-Flume知识点介绍-CFANZ编程社区

Flume简介

Flume 定义
Flume 是 Cloudera 提供的一个高可用的，高可靠的，分布式的海量日志采集、聚合和传输的系统。Flume 基于流式架构，灵活简单。
Flume最主要的作用就是，实时读取服务器本地磁盘的数据，将数据写入到HDFS。
Apache软件基金顶级项目
Apache Flume是一个分布式、可信任的弹性系统，用于高效收集、汇聚和移动大规模日志信息从多种不同的数据源到一个集中的数据存储中心
• 功能：
– 支持在日志系统中定制各类数据发送方，用于收集数据
– Flume提供对数据进行简单处理，并写到各种数据接收方（可定制）的能力
• 多种数据源：
– Console、RPC、Text、Tail、Syslog、Exec等
特点
• Flume可以高效率的将多个网站服务器中收集的日志信息存入HDFS/HBase中
• 使用Flume，我们可以将从多个服务器中获取的数据迅速的移交给Hadoop中
• 除了日志信息，Flume同时也可以用来接入收集规模宏大的社交网络节点事件数据，比如facebook,twitter,电商网站如亚马逊，flipkart等
• 支持各种接入资源数据的类型以及接出数据类型
• 支持多路径流量，多管道接入流量，多管道接出流量，上下文路由等
• 可以被水平扩展

Flume外部架构

在这里插入图片描述
• 数据发生器（如：facebook,twitter）产生的数据被单个的运行在数据发生器所在服务器上的agent所收集，之后数据收容器从各个agent上汇集数据并将采集到的数据存入到HDFS或者HBase中。
事件(Flume Event)
• Flume使用Event对象来作为传递数据的格式，是内部数据传输的最基本单元
• 由两部分组成：转载数据的字节数组+可选头部
在这里插入图片描述
• Header 是 key/value 形式的，可以用来制定路由决策或携带其他结构化信息(如事件的时间戳或事件来源的服务器主机名)。你可以把它想象成和 HTTP 头一样提供相同的功能——通过该方法来传输正文之外的额外信息。Flume提供的不同source会给其生成的event添加不同的header
• Body是一个字节数组，包含了实际的内容

代理(Flume Agent)
• Flume内部有一个或者多个Agent
• 每一个Agent是一个独立的守护进程(JVM)
• 从客户端那儿接收收集，或者从其他的Agent那儿接收，然后迅速的将获取的数据传给下一个目的节点Agent
在这里插入图片描述
• Agent主要由source、channel、sink三个组件组成。

Source

Source 是负责接收数据到 Flume Agent 的组件。Source 组件可以处理各种类型、各种格式的日志数据，包括 avro、thrift、exec、jms、spooling directory、netcat、sequence generator、syslog、http、legacy。
• 一个Flume源
• 负责一个外部源（数据发生器），如一个web服务器传递给他的事件
• 该外部源将它的事件以Flume可以识别的格式发送到Flume中
• 当一个Flume源接收到一个事件时，其将通过一个或者多个通道存储该事件

Channel

Channel 是位于 Source 和 Sink 之间的缓冲区。因此，Channel 允许 Source 和 Sink 运作在不同的速率上。Channel 是线程安全的，可以同时处理几个 Source 的写入操作和几个Sink 的读取操作。
Flume 自带两种 Channel：Memory Channel 和 File Channel。
Memory Channel 是内存中的队列，吞吐率极高，但存在丢数据风险。Memory Channel 在不需要关心数据丢失的情景下适用。如果需要关心数据丢失，那么 Memory Channel 就不应该使用，因为程序死亡、机器宕机或者重启都会导致数据丢失。
File Channel 将所有事件写到磁盘（WAL实现）。因此在程序关闭或机器宕机的情况下不会丢失数据。
• 通道：采用被动存储的形式，即通道会缓存该事件直到该事件被sink组件处理
• 所以Channel是一种短暂的存储容器，它将从source处接收到的event格式的数据缓存起来,直到它们被sinks消费掉,它在source和sink间起着一个桥梁的作用,channal是一个完整的事务,这一点保证了数据在收发的时候的一致性. 并且它可以和任意数量的source和sink链接
• 可以通过参数设置event的最大个数
• Flume通常选择FileChannel，而不使用Memory Channel

Sink

Sink 不断地轮询 Channel 中的事件且批量地移除它们，并将这些事件批量写入到存储或索引系统、或者被发送到另一个 Flume Agent。
Sink 组件目的地包括 hdfs、logger、avro、thrift、ipc、file、HBase、solr、自定义。
Sink会将事件从Channel中移除，并将事件放置到外部数据介质上
例如：通过Flume HDFS Sink将数据放置到HDFS中，或者放置到下一个Flume的Source，等到下一个Flume处理。
– 对于缓存在通道中的事件，Source和Sink采用异步处理的方式
• Sink成功取出Event后，将Event从Channel中移除
• Sink必须作用于一个确切的Channel
• 不同类型的Sink：
– 存储Event到最终目的的终端：HDFS、Hbase
– 自动消耗：Null Sink
– 用于Agent之间通信：Avro

Agent Interceptor

Interceptor用于Source的一组拦截器，按照预设的顺序必要地对events进行过滤和自定义的处理逻辑实现
在app(应用程序日志)和 source 之间的，对app日志进行拦截处理的。也即在日志进入到source之前，对日志进行一些包装、清新过滤等等动作
官方上提供的已有的拦截器有：
– Timestamp Interceptor：在event的header中添加一个key叫：timestamp,value为当前的时间戳
– Host Interceptor：在event的header中添加一个key叫：host,value为当前机器的hostname或者ip
– Static Interceptor：可以在event的header中添加自定义的key和value
– Regex Filtering Interceptor：通过正则来清洗或包含匹配的events
– Regex Extractor Interceptor：通过正则表达式来在header中添加指定的key,value则为正则匹配的部分
flume的拦截器也是chain形式的，可以对一个source指定多个拦截器，按先后顺序依次处理

Agent Selector

channel selectors 有两种类型:
Replicating Channel Selector (default)：将source过来的events发往所有channel
Multiplexing Channel Selector：而Multiplexing 可以选择该发往哪些channel
对于有选择性选择数据源，明显需要使用Multiplexing 这种分发方式
Multiplexing 需要判断header里指定key的值来决定分发到某个具体的channel，我们现在demo1和demo2同时运行在同一个服务器上，如果在不同的服务器上运行，我们可以在 source1上加上一个 host 拦截器，这样可以通过header中的host来判断event该分发给哪个channel，而这里是在同一个服务器上，由host是区分不出来日志的来源的，我们必须想办法在header中添加一个key来区分日志的来源
–通过设置上游不同的Source就可以解决

可靠性

• flume保证单次跳转可靠性的方式：传送完成后，该事件才会从通道中移除
• Flume使用事务性的方法来保证事件交互的可靠性。
• 整个处理过程中，如果因为网络中断或者其他原因，在某一步被迫结束了，这个数据会在下一次重新传输。
• Flume可靠性还体现在数据可暂存上面，当目标不可访问后，数据会暂存在Channel中，等目标可访问之后，再进行传输
• Source和Sink封装在一个事务的存储和检索中，即事件的放置或者提供由一个事务通过通道来分别提供。这保证了事件集在流中可靠地进行端到端的传递。
– Sink开启事务
– Sink从Channel中获取数据
– Sink把数据传给另一个Flume Agent的Source中
– Source开启事务
– Source把数据传给Channel
– Source关闭事务
– Sink关闭事务可靠性

Flume实践

netcat方式
exec方式
HDFS方式
spooldir

Flume事务

在这里插入图片描述

Flume 的事务机制（类似数据库的事务机制）：Flume 使用两个独立的事务分别负责从Soucrce 到 Channel，以及从 Channel 到 Sink 的事件传递。比如 spooling directory source为文件的每一行创建一个事件，一旦事务中所有的事件全部传递到 Channel 且提交成功，那么 Soucrce 就将该文件标记为完成。同理，事务以类似的方式处理从 Channel 到 Sink 的传递过程，如果因为某种原因使得事件无法记录，那么事务将会回滚。且所有的事件都会保持到 Channel 中，等待重新传递。
根据 Flume 的架构原理，Flume 是不可能丢失数据的，其内部有完善的事务机制，Source 到 Channel 是事务性的，Channel 到 Sink 是事务性的，因此这两个环节不会出现数据的丢失，唯一可能丢失数据的情况是 Channel 采用 memory Channel，agent 宕机导致数据丢失，或者 Channel 存储数据已满，导致 Source 不再写入，未写入的数据丢失。Flume 不会丢失数据，但是有可能造成数据的重复，例如数据已经成功由 Sink 发出，但是没有接收到响应，Sink 会再次发送数据，此时可能会导致数据的重复。

Flume Agent内部原理

在这里插入图片描述

重要组件：
1）ChannelSelector
ChannelSelector 的作用就是选出 Event 将要被发往哪个 Channel。其共有两种类型，分别是 Replicating（复制）和 Multiplexing（多路复用）。ReplicatingSelector 会将同一个 Event 发往所有的 Channel，Multiplexing 会根据相应的原则，将不同的 Event 发往不同的 Channel。
2）SinkProcessor
SinkProcessor
共有三种类型，分别是DefaultSinkProcessor 、LoadBalancingSinkProcessor 和 FailoverSinkProcessor
DefaultSinkProcessor 对应的是单个的 Sink ， LoadBalancingSinkProcessor 和FailoverSinkProcessor 对应的是 Sink Group，LoadBalancingSinkProcessor 可以实现负载均衡的功能，FailoverSinkProcessor 可以实现故障转移的功能。

自定义Interceptor

自定义类实现Interceptor接口

package com.atguigu.flume.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.util.List;

public class CustomInterceptor implements Interceptor {

    @Override
    public void initialize() {
    }

    @Override
    public Event intercept(Event event) {
        byte[] body = event.getBody();
        if (body[0] < 'z' && body[0] > 'a') {
            event.getHeaders().put("type", "letter");
        } else if (body[0] > '0' && body[0] < '9') {
            event.getHeaders().put("type", "number");
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new CustomInterceptor();
        }
        @Override
        public void configure(Context context) {
        }
    }
}

将写好的代码打包，并放到 flume 的 lib 目录（/opt/module/flume）下
编辑配置文件，指定过滤器

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type =com.atguigu.flume.interceptor.CustomInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.letter = c1
a1.sources.r1.selector.mapping.number = c2

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141
a1.sinks.k2.type=avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4242

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Use a channel which buffers events in memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

自定义Source

自定义类继承 AbstractSource 类并实现 Configurable 和 PollableSource 接口
实现相应方法：
getBackOffSleepIncrement()//暂不用
getMaxBackOffSleepInterval()//暂不用
configure(Context context)//初始化 context（读取配置文件内容）
process()//获取数据封装成 event 并写入 channel，这个方法将被循环调用。

package com.atguigu;

import org.apache.flume.Context;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.SimpleEvent;
import org.apache.flume.source.AbstractSource;
import java.util.HashMap;

public class MySource extends AbstractSource implements Configurable, PollableSource {

    //定义配置文件将来要读取的字段
    private Long delay;
    private String field;
    
    //初始化配置信息
    @Override
    public void configure(Context context) {
        delay = context.getLong("delay");
        field = context.getString("field", "Hello!");
    }

    @Override
    public Status process() throws EventDeliveryException {
        try {
            //创建事件头信息
            HashMap<String, String> hearderMap = new HashMap<>();
            //创建事件
            SimpleEvent event = new SimpleEvent();
            //循环封装事件
            for (int i = 0; i < 5; i++) {
                //给事件设置头信息
                event.setHeaders(hearderMap);
                //给事件设置内容
                event.setBody((field + i).getBytes());
                //将事件写入 channel
                getChannelProcessor().processEvent(event);
                Thread.sleep(delay);
            }
        } catch (Exception e) {
            e.printStackTrace();
            return Status.BACKOFF;
        }
        return Status.READY;
    }

    @Override
    public long getBackOffSleepIncrement() {
        return 0;
    }

    @Override
    public long getMaxBackOffSleepInterval() {
        return 0;
    }
}

将写好的代码打包，并放到 flume 的 lib 目录（/opt/module/flume）下
配置文件

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = com.atguigu.MySource
a1.sources.r1.delay = 1000
a1.sources.r1.field = atguigu

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

自定义Sink

自定义类继承 AbstractSink 类并实现 Configurable 接口
实现相应方法：
configure(Context context)//初始化 context（读取配置文件内容）
process()//从 Channel 读取获取数据（event），这个方法将被循环调用。

package com.atguigu;
import org.apache.flume.*;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MySink extends AbstractSink implements Configurable {
    
    //创建 Logger 对象
    private static final Logger LOG = LoggerFactory.getLogger(AbstractSink.class);
    private String prefix;
    private String suffix;

    @Override
    public Status process() throws EventDeliveryException {
        //声明返回值状态信息
        Status status;
        //获取当前 Sink 绑定的 Channel
        Channel ch = getChannel();
        //获取事务
        Transaction txn = ch.getTransaction();
        //声明事件
        Event event;
        //开启事务
        txn.begin();
        //读取 Channel 中的事件，直到读取到事件结束循环
        while (true) {
            event = ch.take();
            if (event != null) {
                break;
            }
        }
        try {
            //处理事件（打印）
            LOG.info(prefix + new String(event.getBody()) + suffix);
            //事务提交
            txn.commit();
            status = Status.READY;
        } catch (Exception e) {
            //遇到异常，事务回滚
            txn.rollback();
            status = Status.BACKOFF;
        } finally {
            //关闭事务
            txn.close();
        }
        return status;
    }

    @Override
    public void configure(Context context) {
        //读取配置文件内容，有默认值
        prefix = context.getString("prefix", "hello:");
        //读取配置文件内容，无默认值
        suffix = context.getString("suffix");
    }
}

将写好的代码打包，并放到 flume 的 lib 目录（/opt/module/flume）下
配置文件

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = com.atguigu.MySink
a1.sinks.k1.prefix = atguigu:
a1.sinks.k1.suffix = :atguigu

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1