Hystrix: Isolation, Circuit Breaking, and Fallback for Service Interfaces

This article is a set of study notes on the Hystrix material in Kaitao's book 《亿级流量网站架构核心技术》.

Preface: a partial failure must not affect overall availability

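When one part of a system becomes unavailable, it must not take the whole system down with it. The usual ways to solve this are isolation (with circuit breaking) or asynchronous non-blocking calls.
Take the scenario "remote call to a dependent service" as an example. One option is to put the dependency in its own Java process (its own application), and use a service-discovery mechanism to poll that process's health and sync the status to callers, so that failures are detected promptly and the circuit is broken. This approach is called process isolation. Alternatively, the caller can use a reactive, asynchronous non-blocking stack such as WebClient, so that the call does not block the calling thread and an unavailable peer cannot hang the caller's worker thread pool.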

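Hystrix takes the isolation approach, using in-process thread-pool isolation, and it also provides circuit breaking and fallback for service interfaces. The examples below illustrate each.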

1. Isolation with Hystrix

Add the hystrix dependency:

implementation 'com.netflix.hystrix:hystrix-core:1.5.18'

A method in the project's service layer, which calls a remote dependency:

    // Simulates a call to a remote inventory service
    public String getStock() {
        logger.info("执行service, getStock...");
        return restTemplate.getForObject("http://localhost:8081/serverside/backend/test2", String.class);
    }

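Suppose the backend endpoint on port 8081 goes down. Until the timeout expires, the Tomcat worker threads of this project block on that network call, that is, inside getStock. Under heavy concurrency the worker pool fills up and the wait queue backs up, at which point every other service in the application becomes unusable too.
Hystrix uses the command pattern to wrap the business method and runs it in a thread pool that Hystrix allocates for that service alone. This gives thread-pool isolation: when the service fails, only its dedicated pool is affected, and the other services in the project remain usable (on their own pools or on Tomcat's worker pool). The code is as follows: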

public class ServersideServiceCommand extends HystrixCommand<String> {
    private Logger logger = LoggerFactory.getLogger(ServersideServiceCommand.class);
    private ServerSideService serversideService;

    public ServersideServiceCommand(ServerSideService serversideService) {
        super(setter());
        this.serversideService = serversideService;
    }

    private static Setter setter() {
        // Service group and command identifier
        HystrixCommandGroupKey groupKey = HystrixCommandGroupKey.Factory.asKey("serverside");
        HystrixCommandKey commandKey = HystrixCommandKey.Factory.asKey("getStock");
        // Thread pool configuration
        HystrixThreadPoolKey threadPoolKey = HystrixThreadPoolKey.Factory.asKey("serverside-pool");
        HystrixThreadPoolProperties.Setter threadpoolPropertiesSetter = HystrixThreadPoolProperties.Setter()
                                                    .withCoreSize(10)
                                                    .withKeepAliveTimeMinutes(5)
                                                    .withMaxQueueSize(10)
                                                    .withQueueSizeRejectionThreshold(10);
        // Command property configuration
        HystrixCommandProperties.Setter commandPropertiesSetter = HystrixCommandProperties.Setter()
                                                    .withExecutionIsolationStrategy(HystrixCommandProperties.ExecutionIsolationStrategy.THREAD);
        // Assemble everything into a HystrixCommand.Setter and return it
        return HystrixCommand.Setter.withGroupKey(groupKey)
                                    .andCommandKey(commandKey)
                                    .andThreadPoolKey(threadPoolKey)
                                    .andThreadPoolPropertiesDefaults(threadpoolPropertiesSetter)
                                    .andCommandPropertiesDefaults(commandPropertiesSetter);
    }

    @Override
    protected String run() throws Exception{
        logger.info("执行command.run...");
        return this.serversideService.getStock();
    }
}

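To summarize the code above: we implement a command class that extends HystrixCommand<R>, where R is the return type of the interface being wrapped. The main work in the command class is the @Override of run(), which calls our business logic. We also provide a constructor that builds a HystrixCommand.Setter and passes it to the parent constructor HystrixCommand(Setter setter), which completes the main configuration and instantiation of the command.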

A few notes on the Setter configuration parameters:

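HystrixCommandGroupKey and HystrixCommandKey can be thought of as the dependent subsystem and a service method of that subsystem, respectively. In the example above they mean "the getStock service of the serverside subsystem".
HystrixThreadPoolKey is the name of the isolated thread pool. It defaults to the HystrixCommandGroupKey name, so by default one subsystem shares one isolated pool. Isolation can be made finer-grained, for example one pool per HystrixCommandKey, that is, per service method. Commands configured with the same pool name share the same pool.
HystrixThreadPoolProperties configures the thread pool and will look familiar. By default the core size is also the maximum size (maximumSize is only honored when allowMaximumSizeToDivergeFromCoreSize is enabled), so the keep-alive time is meaningless by default because the pool never grows beyond coreSize. MaxQueueSize fixes the queue capacity, but the limit that actually triggers rejection is QueueSizeRejectionThreshold.
HystrixCommandProperties configures the command's properties, for example whether to use thread-pool isolation or semaphore isolation. This group has many other options, which are used by the circuit-breaking and fallback features.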
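
To make the effect of the pool size and queue limit concrete, here is a minimal sketch that uses only java.util.concurrent rather than Hystrix itself. The class name BulkheadSketch and the sizes are assumptions chosen for illustration. It simulates a hung dependency and counts how many submissions a bounded pool turns away once its threads and queue are full, which is the behavior the Hystrix pool settings above aim for.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration class; not part of Hystrix.
class BulkheadSketch {
    /** Submits `requests` tasks that hang, and returns how many the bounded pool rejected. */
    static int countRejected(int poolSize, int queueSize, int requests) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                new ThreadPoolExecutor.AbortPolicy()); // reject instead of blocking the caller
        CountDownLatch release = new CountDownLatch(1); // stands in for a hung dependency
        int rejected = 0;
        try {
            for (int i = 0; i < requests; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // every task blocks until released
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // pool threads and queue are both full
                }
            }
        } finally {
            release.countDown();
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 2 running + 2 queued are accepted, the remaining 6 are rejected at once
        System.out.println("rejected = " + countRejected(2, 2, 10));
    }
}
```

The point of the design is that callers beyond the limit fail fast instead of waiting, so the damage from the hung dependency stays inside its own pool.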

Invoking it:

@RequestMapping(value = "/getStock", method = RequestMethod.GET)
public String getStock() {
    logger.info("开始调用controller, getStock");
    ServersideServiceCommand command = new ServersideServiceCommand(serverSideService);
    String response = command.execute();
    logger.info("controller返回" + response);
    return response;
}

[2021-09-14 21:21:23] [ INFO ] [http-nio-8080-exec-2] [traceId:bb2159cbef0942d1bf6985e9be7f5345] com.wangan.controller.WanganController [codeline:33] - 开始调用controller, getStock
[2021-09-14 21:21:24] [ INFO ] [hystrix-serverside-pool-1] [traceId:] com.hfi.service.ServersideServiceCommand [codeline:49] - 执行command.run...
[2021-09-14 21:21:24] [ INFO ] [hystrix-serverside-pool-1] [traceId:] com.hfi.service.ServerSideService [codeline:49] - 执行service, getStock...
[2021-09-14 21:21:24] [ INFO ] [http-nio-8080-exec-2] [traceId:bb2159cbef0942d1bf6985e9be7f5345] com.wangan.controller.WanganController [codeline:36] - controller返回response from backend service

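The log shows execution switching from the Tomcat worker thread to a Hystrix thread and then back to the Tomcat worker thread. This is still synchronous and blocking, but only a bounded number of worker threads can be blocked: 10 on running commands, plus up to 10 more waiting on queued ones with the queue settings above, after which further requests are rejected immediately. That is isolation. The log also shows that the traceId is missing in the lines written from the Hystrix thread. This is easy to explain: logback's MDC relies on a ThreadLocal of the current thread, and our project writes the traceId into that ThreadLocal before a worker thread starts handling the request. Once execution moves to a Hystrix thread, the value is naturally absent.
The fix is similar to what is done for Spring's ThreadPoolTaskExecutor in the same situation: wrap the runnable/callable being executed and delegate to it, which is an AOP-style approach. Reference: https://www.jianshu.com/p/2b070568ff89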
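
The underlying idea can be sketched without Hystrix or logback. In the following minimal example, the class ContextPropagationSketch and the TRACE_ID ThreadLocal are hypothetical stand-ins for the MDC. The sketch shows the value being lost on a pool thread, and then being carried over by a wrapper that captures it on the submitting thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration; TRACE_ID stands in for the MDC's thread-local map.
class ContextPropagationSketch {
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    /** Captures the caller's value now and reinstalls it on whichever thread runs the task. */
    static <T> Callable<T> wrap(Callable<T> delegate) {
        final String captured = TRACE_ID.get(); // read on the submitting thread
        return () -> {
            TRACE_ID.set(captured);
            try {
                return delegate.call();
            } finally {
                TRACE_ID.remove(); // do not leak into the next task on this pooled thread
            }
        };
    }

    /** Reads TRACE_ID from a pool thread, with or without the wrapper. */
    static String readOnPoolThread(boolean wrapped) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            TRACE_ID.set("abc123");
            Callable<String> task = () -> String.valueOf(TRACE_ID.get());
            Callable<String> toRun = wrapped ? wrap(task) : task;
            return pool.submit(toRun).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(readOnPoolThread(false)); // null
        System.out.println(readOnPoolThread(true));  // abc123
    }
}
```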

Decorating HystrixConcurrencyStrategy:

public class MdcHystrixConcurrencyStrategy extends HystrixConcurrencyStrategy{
    private Logger logger = LoggerFactory.getLogger(MdcHystrixConcurrencyStrategy.class);
    
    @Override
    public <T> Callable<T> wrapCallable(Callable<T> callable) {
        return new MdcAwareCallable<>(callable, MDC.getCopyOfContextMap());
    }
    
    private class MdcAwareCallable<T> implements Callable<T> {

        private final Callable<T> delegate;

        private final Map<String, String> contextMap;

        public MdcAwareCallable(Callable<T> callable, Map<String, String> contextMap) {
            this.delegate = callable;
            this.contextMap = contextMap != null ? contextMap : new HashMap<String, String>();
        }

        @Override
        public T call() throws Exception {
            try {
                MDC.setContextMap(contextMap);
                return delegate.call();
            } finally {
                MDC.clear();
            }
        }
    }
}

Registering this Hystrix plugin:

/**
 * Registers the Hystrix plugin
 * */
@Configuration
public class HystrixPluginsRegister {
    private static Logger logger = LoggerFactory.getLogger(HystrixPluginsRegister.class);

    /*
        // Registering from a static block did not get loaded and executed here
    static {
        System.out.println("开始注册hystrix插件...");
        HystrixPlugins.getInstance().registerConcurrencyStrategy(new MdcHystrixConcurrencyStrategy());
    }
    */
    @PostConstruct
    public void init() {
        logger.info("开始注册hystrix插件....");
        HystrixPlugins.getInstance().registerConcurrencyStrategy(new MdcHystrixConcurrencyStrategy());
    }
}

2. Fallback with Hystrix

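The run() method of the command class wraps our business logic. When it fails (it throws an exception, times out, is rejected by the thread pool, or the circuit is open), getFallback() is called to run the fallback logic.
First configure HystrixCommandProperties: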

// Command property configuration
HystrixCommandProperties.Setter commandPropertiesSetter = HystrixCommandProperties.Setter()
                                            .withExecutionIsolationStrategy(HystrixCommandProperties.ExecutionIsolationStrategy.THREAD)
                                            .withFallbackEnabled(true)
                                            .withFallbackIsolationSemaphoreMaxConcurrentRequests(100) //default 10
                                            .withExecutionTimeoutEnabled(true)
                                            .withExecutionTimeoutInMilliseconds(3000) //default 1000
                                            .withExecutionIsolationThreadInterruptOnTimeout(true)
                                            .withExecutionIsolationThreadInterruptOnFutureCancel(true); //default false

Then override getFallback() in the command class:

@Override
protected String getFallback() {
    return "response from backend service（降级）";
}

The HystrixCommandProperties options related to fallback:

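withFallbackEnabled: whether fallback is enabled. Default true.
withFallbackIsolationSemaphoreMaxConcurrentRequests: maximum concurrency of the fallback logic. Default 10. Beyond this limit the fallback is not executed either and the call fails directly.
withExecutionTimeoutEnabled: whether execution timeout is enabled. Default true.
withExecutionTimeoutInMilliseconds: execution timeout in milliseconds. Default 1000. With thread isolation the thread is interrupted. With semaphore isolation the thread is not interrupted and only the operation is terminated, which matters when network access is involved.
withExecutionIsolationThreadInterruptOnTimeout: interrupt the thread when execution times out. Default true.
withExecutionIsolationThreadInterruptOnFutureCancel: interrupt the thread when Future.cancel() is called. Default false.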

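When a command executes, a thread is taken from the Hystrix thread pool to run the task, and a hystrix-timer thread watches the execution. On timeout, the timer interrupts the pool thread, and the timer thread itself calls getFallback to run the fallback logic.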
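
The timeout-then-fallback flow can be approximated with plain java.util.concurrent. This is only a sketch under assumptions: the class TimeoutFallbackSketch and its method names are hypothetical, and unlike Hystrix, which uses a separate timer thread, here the calling thread itself waits with a timeout and then runs the fallback.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical illustration class; not the Hystrix implementation.
class TimeoutFallbackSketch {
    /** Runs `primary` on the pool; on timeout or failure returns the fallback value instead. */
    static <T> T callWithFallback(ExecutorService pool, Callable<T> primary,
                                  long timeoutMs, Supplier<T> fallback) {
        Future<T> future = pool.submit(primary);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the pool thread, similar in spirit to interrupt-on-timeout
            return fallback.get();
        } catch (Exception e) {
            return fallback.get(); // primary threw, or the caller was interrupted
        }
    }

    /** Simulates a dependency that takes workMs to answer. */
    static String demo(long workMs, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            return TimeoutFallbackSketch.<String>callWithFallback(pool, () -> {
                Thread.sleep(workMs);
                return "real";
            }, timeoutMs, () -> "fallback");
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(10, 1000)); // fast dependency
        System.out.println(demo(1000, 50)); // slow dependency, timeout wins
    }
}
```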

3. Circuit breaking with Hystrix

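We finally arrive at Hystrix's signature feature. Isolation and fallback have been covered, but when a dependency fails, both of them still go through the sequence "hand the task to the isolated pool -> try to run the dependency's business logic -> time out -> run the fallback logic". In other words, compute is still being spent on a service that has already failed.
It would be better to detect, by some rule, that the dependency is unavailable and then fail directly or go straight to the fallback without attempting the call at all, while probing the dependency at a fixed interval to see whether its state has changed, for example whether it is available again. Under heavy concurrency the benefit is substantial, because many pointless attempts are avoided. Attempts have a cost of their own, so this clearly protects the system. That is the role of the circuit breaker.
The Hystrix circuit breaker needs the following main settings, which are also configured on HystrixCommandProperties. Together with the isolation and fallback features introduced earlier, the complete command configuration is: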

/**
 * Wraps a specific service class in a command class using the command pattern; run() invokes the original service method
 * */
public class ServersideServiceCommand extends HystrixCommand<String> {
    private Logger logger = LoggerFactory.getLogger(ServersideServiceCommand.class);
    private ServerSideService serversideService;

    public ServersideServiceCommand(ServerSideService serversideService) {
        super(setter());
        this.serversideService = serversideService;
    }

    private static Setter setter() {
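        // Service group and command identifier
        HystrixCommandGroupKey groupKey = HystrixCommandGroupKey.Factory.asKey("serverside"); // the serverside subsystem we depend on
        HystrixCommandKey commandKey = HystrixCommandKey.Factory.asKey("getStock"); // the getStock service of the serverside subsystem
        // Thread pool configuration
        HystrixThreadPoolKey threadPoolKey = HystrixThreadPoolKey.Factory.asKey("serverside-pool"); // the same threadPoolKey maps to the same pool
        HystrixThreadPoolProperties.Setter threadpoolPropertiesSetter = HystrixThreadPoolProperties.Setter()
                                                    .withCoreSize(10) // if the dependency returns in 50ms on average, this command handles about 200 tps
                                                    .withMaximumSize(50) // max threads, sized by hardware (CPU) and how critical the dependency is; pointless above Tomcat's max threads in synchronous mode. Only honored if withAllowMaximumSizeToDivergeFromCoreSize(true) is also set, so as written the pool stays at coreSize
                                                    .withKeepAliveTimeMinutes(5) // meaningless for a fixed-size pool
                                                    .withMaxQueueSize(100) // maximum queue size
                                                    .withQueueSizeRejectionThreshold(10); // what actually limits how many tasks may sit in the queue
        // Command property configuration
        HystrixCommandProperties.Setter commandPropertiesSetter = HystrixCommandProperties.Setter()
                                                    .withExecutionIsolationStrategy(HystrixCommandProperties.ExecutionIsolationStrategy.THREAD) // thread-pool isolation is the Hystrix default
                                                    // Fallback
                                                    .withFallbackEnabled(true)
                                                    .withFallbackIsolationSemaphoreMaxConcurrentRequests(100) // max concurrency of the fallback logic, default 10
                                                    // Execution timeout
                                                    .withExecutionTimeoutEnabled(true)
                                                    .withExecutionTimeoutInMilliseconds(3000) // command execution timeout, default 1s
                                                    .withExecutionIsolationThreadInterruptOnTimeout(true) // whether to interrupt the thread on timeout
                                                    .withExecutionIsolationThreadInterruptOnFutureCancel(true) // whether Future.cancel() may interrupt the thread during async execution, default false
                                                    // Circuit breaker
                                                    .withCircuitBreakerEnabled(true) // enable the circuit breaker, default true
                                                    .withCircuitBreakerForceClosed(false) // force the breaker closed, default false
                                                    .withCircuitBreakerForceOpen(false) // force the breaker open, default false
                                                    .withCircuitBreakerErrorThresholdPercentage(100) // the breaker opens when the failure rate in one sampling window (10s by default) reaches this value, default 50%
                                                    .withCircuitBreakerRequestVolumeThreshold(20)  // the failure rate is only evaluated once this many requests arrive in a sampling window, default 20
                                                    .withCircuitBreakerSleepWindowInMilliseconds(30000)  // once open, one trial request is allowed per sleep window; success closes the breaker, failure keeps it open, default 5s
                                                    ;
        // Assemble everything into a HystrixCommand.Setter and return it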
        return HystrixCommand.Setter.withGroupKey(groupKey)
                                    .andCommandKey(commandKey)
                                    .andThreadPoolKey(threadPoolKey)
                                    .andThreadPoolPropertiesDefaults(threadpoolPropertiesSetter)
                                    .andCommandPropertiesDefaults(commandPropertiesSetter);
    }

    @Override
    protected String run() throws Exception{
        logger.info("hystrix-pool线程执行command.run...");
        return this.serversideService.getStock();
    }
    
    @Override
    protected String getFallback() {
        logger.info("hystrix-timer线程执行降级逻辑");
        return "response from backend service（降级）";
    }
}

Notes on the circuit-breaker options:

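withCircuitBreakerEnabled: the circuit breaker is enabled by default.
withCircuitBreakerForceClosed, withCircuitBreakerForceOpen: force the breaker closed or open. Both default to false.
withCircuitBreakerErrorThresholdPercentage: within one sampling window (10 seconds by default), the breaker opens when the failure rate reaches this percentage. Default 50%.
withCircuitBreakerRequestVolumeThreshold: the sampling window must contain at least this many requests (default 20) before the failure rate is used to decide whether to open the breaker.
withCircuitBreakerSleepWindowInMilliseconds: after the breaker opens, one trial request is let through every sleep window (default 5s). If it succeeds the breaker closes, and if it fails the breaker stays open. This state is also called half-open: "the breaker is open, but not completely, and lets one request through every 5s".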
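
The interplay of these options can be sketched as a small state machine. This is a simplified illustration, not Hystrix's implementation: the class CircuitBreakerSketch is hypothetical, it counts requests in a single window that is never aged out (Hystrix uses rolling statistics), and the clock is injected so the behavior can be exercised without sleeping.

```java
import java.util.function.LongSupplier;

// Hypothetical, simplified breaker; counters are never aged out in this sketch.
class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int requestVolumeThreshold;
    private final int errorThresholdPercentage;
    private final long sleepWindowMs;
    private final LongSupplier clock; // current time in ms, injectable for testing

    private State state = State.CLOSED;
    private int total;
    private int failures;
    private long openedAt;

    CircuitBreakerSketch(int requestVolumeThreshold, int errorThresholdPercentage,
                         long sleepWindowMs, LongSupplier clock) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
        this.sleepWindowMs = sleepWindowMs;
        this.clock = clock;
    }

    /** Returns true if the caller may attempt the real call, false if it should fail fast. */
    synchronized boolean allowRequest() {
        if (state == State.CLOSED) return true;
        if (state == State.OPEN && clock.getAsLong() - openedAt >= sleepWindowMs) {
            state = State.HALF_OPEN; // let exactly one trial request through
            return true;
        }
        return false; // still inside the sleep window, or a trial is already in flight
    }

    synchronized void recordSuccess() {
        if (state == State.HALF_OPEN) { // trial succeeded: close and start counting afresh
            state = State.CLOSED;
            total = 0;
            failures = 0;
            return;
        }
        total++;
    }

    synchronized void recordFailure() {
        if (state == State.HALF_OPEN) { // trial failed: open again for another sleep window
            trip();
            return;
        }
        total++;
        failures++;
        boolean enoughVolume = total >= requestVolumeThreshold;
        if (enoughVolume && failures * 100 / total >= errorThresholdPercentage) trip();
    }

    private void trip() {
        state = State.OPEN;
        openedAt = clock.getAsLong();
    }

    synchronized State state() { return state; }
}
```

A caller would check allowRequest() before each call, go straight to the fallback when it returns false, and otherwise report the outcome through recordSuccess() or recordFailure().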

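If the value of circuit breaking, or fail-fast, is still not clear, here is an example:
Suppose a call to a dependent service normally returns in 50ms and the call timeout is set to 2s, which is already fairly short. With 200 worker threads working at full concurrency, and ignoring costs such as CPU context switching, the ideal throughput of this call is 4000 tps. But if the dependency goes down and every call waits 2s for the timeout, throughput drops to only 100 tps. The gap is that large.
This shows how important fail-fast is.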
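
The arithmetic behind these two figures is simply threads divided by per-call latency. A tiny sketch, with a hypothetical helper name, under the same idealized assumption that context-switch cost is ignored:

```java
// Hypothetical helper for the back-of-the-envelope estimate above.
class ThroughputSketch {
    /** Upper bound on calls per second when each thread handles one blocking call at a time. */
    static long maxTps(int workerThreads, long latencyMs) {
        return workerThreads * 1000L / latencyMs;
    }

    public static void main(String[] args) {
        System.out.println(maxTps(200, 50));   // healthy dependency: 4000
        System.out.println(maxTps(200, 2000)); // every call waits for the 2s timeout: 100
    }
}
```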

4. Why Hystrix semaphore isolation is not recommended for remote calls

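An experiment illustrates this:

First try thread isolation, HystrixCommandProperties.ExecutionIsolationStrategy.THREAD. When the 3s timeout elapses, the Hystrix timer interrupts the isolated thread, and the timer thread then runs the fallback logic. Client log: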

[2021-09-15 22:19:19] [ INFO ] [http-nio-8080-exec-3] [traceId:825ac8dfe175438a8850d488797bd3b9] com.hfi.controller.WanganController [codeline:33] - 开始调用controller, getStock
[2021-09-15 22:19:19] [ INFO ] [hystrix-serverside-pool-2] [traceId:825ac8dfe175438a8850d488797bd3b9] com.hfi.service.ServersideServiceCommand [codeline:66] - hystrix-pool线程执行command.run...
[2021-09-15 22:19:19] [ INFO ] [hystrix-serverside-pool-2] [traceId:825ac8dfe175438a8850d488797bd3b9] com.hfi.service.ServerSideService [codeline:49] - 执行service, getStock...
[2021-09-15 22:19:22] [ INFO ] [HystrixTimer-1] [traceId:825ac8dfe175438a8850d488797bd3b9] com.hfi.service.ServersideServiceCommand [codeline:72] - hystrix-timer线程执行降级逻辑
[2021-09-15 22:19:22] [ INFO ] [http-nio-8080-exec-3] [traceId:825ac8dfe175438a8850d488797bd3b9] com.hfi.controller.WanganController [codeline:36] - controller返回response from backend service（降级）

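22:19:19 the Tomcat worker thread runs the controller
22:19:19 the Hystrix isolated thread runs the command and the wrapped business method getStock
22:19:22 3 seconds later, HystrixTimer detects the timeout and starts running the fallback logic
22:19:22 the worker thread returns from the call and sends the fallback response back to the front end (postman)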

Now switch to semaphore mode, HystrixCommandProperties.ExecutionIsolationStrategy.SEMAPHORE, and the behavior becomes interesting. Here is the client log:

[2021-09-15 21:52:51] [ INFO ] [http-nio-8080-exec-5] [traceId:da11e88e8efd48f39d397431e6c26058] com.hfi.controller.WanganController [codeline:33] - 开始调用controller, getStock
[2021-09-15 21:52:51] [ INFO ] [http-nio-8080-exec-5] [traceId:da11e88e8efd48f39d397431e6c26058] com.hfi.service.ServersideServiceCommand [codeline:66] - 执行command.run...
[2021-09-15 21:52:51] [ INFO ] [http-nio-8080-exec-5] [traceId:da11e88e8efd48f39d397431e6c26058] com.hfi.service.ServerSideService [codeline:49] - 执行service, getStock...
[2021-09-15 21:52:54] [ INFO ] [HystrixTimer-1] [traceId:da11e88e8efd48f39d397431e6c26058] com.hfi.service.ServersideServiceCommand [codeline:72] - hystrix-timer线程执行降级逻辑
[2021-09-15 21:53:46] [ INFO ] [http-nio-8080-exec-5] [traceId:da11e88e8efd48f39d397431e6c26058] com.hfi.controller.WanganController [codeline:36] - controller返回response from backend service（降级）

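21:52:51 the worker thread runs the controller, the command, and the wrapped business method getStock
21:52:54 3 seconds later, the HystrixTimer thread detects the timeout and starts running the fallback logic. Nothing is interrupted, however, because the call is being executed by the worker thread rather than by a hystrix-pool thread as in thread-pool isolation mode.
21:53:46 55 seconds after 21:52:51, the worker thread (the main request thread) returns from the service call, having waited 55 seconds, and only then returns the fallback result to the front end.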
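
The difference between the two modes can be reproduced in miniature with plain java.util.concurrent. The sketch below is an illustration under assumptions, not Hystrix code: the class IsolationTimeoutSketch and its methods are hypothetical, and Thread.sleep stands in for a blocking network read. In the thread-style variant the caller only waits on a Future and gets control back at the timeout. In the semaphore-style variant the caller runs the work itself, so a timer can note the timeout but cannot give the caller its thread back.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration class; not the Hystrix implementation.
class IsolationTimeoutSketch {
    /** Stand-in for a blocking remote call. */
    static void blockingCall(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Thread-style isolation: work runs on a pool thread, the caller waits at most timeoutMs. */
    static long threadIsolatedElapsedMs(long workMs, long timeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        long start = System.nanoTime();
        Future<?> future = pool.submit(() -> blockingCall(workMs));
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // the pool thread is interrupted, the caller moves on
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        pool.shutdownNow();
        return elapsed;
    }

    /** Semaphore-style isolation: the caller runs the work itself; the timer can only flag the timeout. */
    static long semaphoreIsolatedElapsedMs(long workMs, long timeoutMs) {
        Semaphore permits = new Semaphore(10);
        AtomicBoolean timedOut = new AtomicBoolean(false);
        long start = System.nanoTime();
        if (!permits.tryAcquire()) {
            return 0L; // no permit: a real command would go straight to fallback
        }
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            timer.schedule(() -> timedOut.set(true), timeoutMs, TimeUnit.MILLISECONDS);
            blockingCall(workMs); // the caller is stuck here no matter what the timer does
        } finally {
            permits.release();
            timer.shutdownNow();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("thread-style caller waited ms:    " + threadIsolatedElapsedMs(1000, 100));
        System.out.println("semaphore-style caller waited ms: " + semaphoreIsolatedElapsedMs(1000, 100));
    }
}
```

With semaphore isolation, the only effective bound on how long the caller blocks is the timeout of the underlying client itself, which is why the experiment above waited 55 seconds.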