Course2-Week4-Decision Trees - CFANZ编程社区

Course2-Week4-Decision Trees

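1. Decision Trees: The Intuition

Both neural networks and decision trees / tree ensembles are widely used in commercial applications and have done well in many machine learning competitions. Compared with neural networks, however, decision trees and tree ensembles have attracted much less attention in academia. This week introduces decision trees and tree ensembles, a very powerful tool.

A decision tree uses a binary tree structure: starting from the root node, it applies a series of tests until it reaches a leaf node, which gives the prediction. The "cat recognition" problem runs through this week to help explain the concepts behind decision trees. For the cat recognition dataset, for example, we can build the decision tree shown below; when a new input arrives, inference simply follows the tree:

Figure 2-4-1 Building a decision tree and using it for inference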
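The inference step can be sketched in a few lines of Python. This is only an illustration: the tree layout, the feature names (`ear_shape`, `face_shape`, `whiskers`) and the `predict` helper are assumptions made up here, not read off the figure.

```python
# Hypothetical tree: inner nodes are dicts, leaves are plain label strings.
tree = {
    "feature": "ear_shape", "value": "pointy",
    "yes": {"feature": "face_shape", "value": "round",
            "yes": "cat", "no": "not cat"},
    "no": {"feature": "whiskers", "value": "present",
           "yes": "cat", "no": "not cat"},
}

def predict(node, example):
    """Walk from the root to a leaf and return the label stored in that leaf."""
    while isinstance(node, dict):
        # Take the "yes" branch when the example passes the node's test
        branch = "yes" if example[node["feature"]] == node["value"] else "no"
        node = node[branch]
    return node

new_animal = {"ear_shape": "pointy", "face_shape": "round", "whiskers": "present"}
print(predict(tree, new_animal))  # cat
```

Each prediction visits one node per level, so its cost grows with the depth of the tree rather than with the size of the training set.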

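Following the tree's tests, the new input is finally predicted to be a cat, as expected. For a given training set, however, and absent any other constraints, there is clearly more than one possible decision tree, as shown below:

Figure 2-4-2 A variety of decision trees

Some of these trees perform well and some perform poorly. A decision tree learning algorithm therefore searches, among all possible decision trees, for the one that performs best on the current dataset. Two criteria need to be considered: which feature to split on at each node, so that each split reduces uncertainty as much as possible, and when to stop splitting.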

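2. Building a Single Decision Tree

2.1 Entropy and Information Gain

Entropy (impurity)

As mentioned earlier, each split at a node should reduce the "uncertainty" of the information as much as possible, that is, increase its "purity" as much as possible. This uncertainty is measured with entropy; its formula and curve are given below. Because entropy physically means the uncertainty, or degree of impurity, of the information, machine learning also likes to call it "impurity". These are all just fancy names; all you need to remember is the following: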

$\text{Entropy:} \quad H(P) = -\sum_{\text{all } i} p_i \log_2(p_i) \;\overset{\text{binary classification}}{\Longrightarrow}\; H(p_1) = -p_1 \log_2(p_1) - (1-p_1) \log_2(1-p_1)$

Figure 2-4-3 Illustration of entropy
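As a quick check of the formula, here are three values worked by hand, using the usual convention $0 \cdot \log_2(0) = 0$. The fractions are arbitrary illustrations, not values read off the figure.

$H(0.5) = -0.5\log_2(0.5) - 0.5\log_2(0.5) = 0.5 + 0.5 = 1$
$H(5/6) = -\tfrac{5}{6}\log_2\left(\tfrac{5}{6}\right) - \tfrac{1}{6}\log_2\left(\tfrac{1}{6}\right) \approx 0.219 + 0.431 = 0.650$
$H(1) = -1 \cdot \log_2(1) - 0 \cdot \log_2(0) = 0$

A 50/50 node is therefore maximally impure, and a pure node has zero entropy.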

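Note that besides entropy there are other ways to measure impurity; open-source packages, for example, also offer the Gini index. It is the probability that two samples drawn at random (with replacement) from the dataset have different class labels, and its formula is given below. To focus on how a decision tree is built rather than getting bogged down in side concepts, this course uses only entropy to measure impurity.
$\text{Gini index:} \quad G(P) = 1 -\sum_{\text{all } i} p_i^2 \;\overset{\text{binary classification}}{\Longrightarrow}\; G(p_1) = 1- p_1^2 - (1-p_1)^2 = 2p_1(1-p_1)$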
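To see how the two impurity measures relate, the sketch below evaluates both binary formulas at a few values of $p_1$. The helper names `entropy` and `gini` are chosen here for illustration only.

```python
import math

def entropy(p1):
    """Binary entropy in bits; 0*log2(0) is treated as 0."""
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def gini(p1):
    """Binary Gini index: 1 - p1^2 - (1-p1)^2 = 2*p1*(1-p1)."""
    return 2 * p1 * (1 - p1)

# Print both measures side by side for a few class fractions
for p1 in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"p1={p1:.1f}  entropy={entropy(p1):.3f}  gini={gini(p1):.3f}")
```

Both measures are zero for a pure node and peak at $p_1 = 0.5$, where the entropy is 1 and the Gini index is 0.5, so they generally rank candidate splits in a similar way.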

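Information gain

Each split should reduce the uncertainty, i.e. the entropy, as much as possible, and the amount by which the entropy drops is called the information gain. Speaking as someone with a communications background, these are yet again just fancy names: simply remember that information gain is the reduction in uncertainty from before the split to after it. Note that the uncertainty after the split is the weighted average of the two branches' entropies, where each weight is the fraction of the parent node's samples that falls into that branch. The formula is: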

$\text{Information Gain} = H(p_1^{\text{root}}) - \left(w^{\text{left}} H(p_1^{\text{left}}) + w^{\text{right}} H(p_1^{\text{right}})\right)$

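Figure 2-4-4 Choosing the root node: information gain of the three features

Clearly, among the three binary input features, "Ear shape" has the largest information gain, so it is chosen for the root node.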
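A worked example of the information gain formula, with assumed counts (hypothetical, not necessarily those in the figure): a node holds 10 samples, 5 of them positive, and a split sends 5 samples left (4 positive) and 5 samples right (1 positive).

$H(p_1^{\text{root}}) = H(0.5) = 1$
$H(p_1^{\text{left}}) = H(0.8) \approx 0.722, \quad H(p_1^{\text{right}}) = H(0.2) \approx 0.722$
$\text{Information Gain} \approx 1 - \left(\tfrac{5}{10} \cdot 0.722 + \tfrac{5}{10} \cdot 0.722\right) = 0.278$

The same proportions arise for the "Solitary" feature of the mushroom dataset in Section 2.6, whose printed information gain is 0.278.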

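2.2 Building a Decision Tree: Binary Input Features

Note 1: Stopping the splitting keeps the tree from growing too large and unwieldy, which lowers the risk of overfitting.
Note 2: The algorithm above can be implemented recursively.
Note 3: During construction, the left and right branches may select the same feature.

Figure 2-4-5 Building a decision tree for the cat recognition problem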

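2.3 Building a Decision Tree: Multi-Valued Input Features

To build a decision tree on multi-valued (categorical) input features, the binary-feature approach above can be adapted as follows:

For example, "Ear shape" below has three possible values: "Pointy", "Floppy" and "Oval". After one-hot encoding, one three-valued input feature becomes three binary input features, so with one small preprocessing step on the training set we can reuse the binary-feature approach above:

Figure 2-4-6 Converting a multi-valued feature to a one-hot encoding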
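The preprocessing step can be sketched with NumPy as follows. The sample values in `ear_shape` are made up for illustration; only the three category names come from the text.

```python
import numpy as np

# Hypothetical column of a three-valued feature
ear_shape = np.array(["Pointy", "Floppy", "Oval", "Pointy", "Oval"])
categories = np.array(["Pointy", "Floppy", "Oval"])

# Compare every sample with every category: one 0/1 column per category
one_hot = (ear_shape[:, None] == categories[None, :]).astype(int)
print(one_hot)
```

Each row contains exactly one 1, so the three new columns can be treated as three independent binary features.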

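2.4 Building a Decision Tree: Continuous Input Features

As in the previous subsection, a continuous input feature is converted to a binary input feature before construction continues. The difference is that a multi-valued feature only needs preprocessing once at the start, whereas a continuous feature has to be re-evaluated at every node for as long as the current branch has not yet used it. Concretely, we choose the threshold that maximizes the information gain when the samples at the current node are split by it; different nodes may end up with different thresholds. The continuous feature thus becomes a binary feature that tests whether the value is at most the threshold. For example, in the cat recognition problem we introduce the continuous feature "weight"; since a threshold of 9 gives the largest information gain, "weight" is converted to the binary feature "Is the weight less than or equal to 9 lbs?":

Figure 2-4-7 Using information gain to turn a continuous feature into a test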
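A minimal sketch of the threshold search, assuming candidate thresholds are the midpoints between consecutive distinct values (a common choice, not stated in the text). The `weight` and `is_cat` arrays are hypothetical data, and `entropy` and `best_threshold` are names chosen here.

```python
import numpy as np

def entropy(y):
    """Entropy of a 0/1 label array; empty and pure arrays give 0."""
    if len(y) == 0:
        return 0.0
    p1 = np.mean(y)
    if p1 == 0 or p1 == 1:
        return 0.0
    return float(-p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1))

def best_threshold(x, y):
    """Return (threshold, gain) maximizing the gain of the test x <= threshold."""
    values = np.unique(x)                        # sorted distinct values
    candidates = (values[:-1] + values[1:]) / 2  # midpoints between neighbours
    h_root = entropy(y)
    best_t, best_gain = None, 0.0
    for t in candidates:
        left, right = y[x <= t], y[x > t]
        w_left = len(left) / len(y)
        gain = h_root - (w_left * entropy(left) + (1 - w_left) * entropy(right))
        if gain > best_gain:
            best_t, best_gain = float(t), gain
    return best_t, best_gain

weight = np.array([7.0, 8.0, 8.5, 9.5, 10.0, 12.0])  # hypothetical weights in lbs
is_cat = np.array([1, 1, 1, 0, 0, 0])                # hypothetical labels
print(best_threshold(weight, is_cat))
```

With this made-up data the search selects the midpoint 9.0, which separates the two classes perfectly. In a full tree, the gain found here would be compared with the gains of the other features at the same node.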

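2.5 Building a Regression Tree: Continuous Outputs (Optional)

This subsection generalizes decision trees to regression trees, extending the prediction to continuous values. As before, each split should reduce uncertainty as much as possible, but a regression tree measures uncertainty with the variance. The variance after a split is the weighted average of the variances of the left and right branches, with each weight again being the fraction of the parent node's samples in that branch. The corresponding "information gain", i.e. the reduction in variance, is: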

$\text{Information gain} = V(s^{\text{root}}) - \left(w^{\text{left}} V(s^{\text{left}}) + w^{\text{right}} V(s^{\text{right}})\right)$

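Returning to the cat recognition problem, we now treat "weight" as the value to predict (left panel below). At each split we compute the information gain from the variance and choose the feature with the largest gain as the splitting criterion for the current node. Once a stopping condition is met and the tree is complete, each leaf predicts the mean of the training samples that ended up in it:

Figure 2-4-8 Each split reduces the variance as much as possible

Figure 2-4-9 Using the mean as the prediction for each branch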
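The variance-based gain and the leaf prediction can be sketched as below. The weights and the `pointy_ears` mask are hypothetical, `variance_reduction` is a name chosen here, and the sketch assumes the population variance (NumPy's default for `np.var`); the course figures may use a different variance convention.

```python
import numpy as np

def variance_reduction(y, left_mask):
    """Variance at the node minus the weighted variance of the two branches."""
    left, right = y[left_mask], y[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # one branch is empty, so nothing was actually split
    w_left = len(left) / len(y)
    w_right = len(right) / len(y)
    return float(np.var(y) - (w_left * np.var(left) + w_right * np.var(right)))

weights = np.array([7.0, 8.0, 9.0, 15.0, 16.0, 17.0])               # hypothetical targets
pointy_ears = np.array([True, True, True, False, False, False])     # hypothetical binary feature

print("variance reduction:", variance_reduction(weights, pointy_ears))
print("left leaf predicts:", weights[pointy_ears].mean())
print("right leaf predicts:", weights[~pointy_ears].mean())
```

For this data the node variance is 100/6 and each branch has variance 2/3, so the split removes almost all of the variance, and the two leaves predict 8.0 and 16.0.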

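2.6 Code Implementation: Recursively Building a Single Decision Tree

The last subsection builds a single decision tree in code. Note that this exercise implements the principles from the previous subsections entirely by hand, without calling any packaged machine learning library functions (only NumPy is used). The problem setup and code structure are as follows:

Table 2-4-1 Dataset

Cap Color	Stalk Shape	Solitary?	Edible?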
Brown	Tapering	Yes	1
Brown	Enlarging	Yes	1
Brown	Enlarging	No	0
Brown	Enlarging	No	0
Brown	Tapering	Yes	1
Red	Tapering	Yes	0
Red	Enlarging	No	0
Brown	Enlarging	Yes	1
Red	Tapering	No	1
Brown	Enlarging	No	0

Below are the Python code and its printed output:

import numpy as np
#################################################################################
# Function 1: compute the entropy of a 0/1 label array
def compute_entropy(y):
    """
    Computes the entropy of the binary labels at a node
    
    Args:
       y (ndarray): Numpy array indicating whether each example at a node is
           edible (`1`) or poisonous (`0`)
       
    Returns:
        entropy (float): Entropy at that node
        
    """
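    # Special case: an empty node has zero entropy
    if len(y) == 0:
        return 0.0

    # Fraction of edible examples at this node
    p1 = y.sum() / y.size
    # A pure node has zero entropy (this also avoids evaluating log2(0))
    if p1 == 0 or p1 == 1:
        return 0.0
    entropy = -p1*np.log2(p1) - (1-p1)*np.log2(1-p1)
    return entropy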

#################################################################################
# Function 2: split on the given feature and return the index lists of the left and right child nodes
def split_dataset(X, node_indices, feature):
    """
    Splits the data at the given node into left and right branches
    
    Args:
        X (ndarray):             Data matrix of shape(n_samples, n_features)
        node_indices (ndarray):  List containing the active indices. I.e, the samples being considered at this step.
        feature (int):           Index of feature to split on
    
    Returns:
        left_indices (list):  Indices with feature value == 1
        right_indices (list): Indices with feature value == 0
    """
    # Index lists for the two branches
    left_indices = []
    right_indices = []
    # Assign each active sample to a branch according to its feature value
    for i in node_indices:
        if X[i][feature]:
            left_indices.append(i)
        else:
            right_indices.append(i)
    # Return the left and right index lists
    return left_indices, right_indices

#################################################################################
# Function 3: compute the information gain
def compute_information_gain(X, y, node_indices, feature):
    """
    Compute the information gain of splitting the node on a given feature
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (ndarray):            Array with n_samples containing the target variable
        node_indices (list):    List containing the active indices. I.e, the samples being considered in this step.
        feature (int):          Index of feature to split on
   
    Returns:
        information_gain (float): Information gain of the split
    
    """
    # Guard against an empty node
    if len(node_indices) == 0:
        return 0.0
    # Split dataset
    left_indices, right_indices = split_dataset(X, node_indices, feature)
    # root entropy
    H_root = compute_entropy(y[node_indices])
    # Weights 
    w_left = len(left_indices) / len(node_indices)
    w_right = len(right_indices) / len(node_indices)
    # Weighted entropy
    H_left = compute_entropy(y[left_indices])
    H_right = compute_entropy(y[right_indices])
    #Information gain                                                   
    information_gain = H_root - (w_left*H_left + w_right*H_right)    
    return information_gain

#################################################################################
# Function 4: find the feature with the largest information gain
def get_best_split(X, y, node_indices):
    """
    Returns the best feature to split the node data on
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (ndarray):            Array with n_samples containing the target variable
        node_indices (list):    List containing the active indices. I.e, the samples being considered in this step.

    Returns:
        best_feature (int):     The index of the best feature to split on, or -1 if no split gives a positive gain
    """
    best_feature = -1           # best feature to split on (-1: none found)
    max_info_gain = 0.0         # largest information gain seen so far
    num_features = X.shape[1]   # total number of features
    # Compute the information gain of every feature and keep the largest
    for feature in range(num_features):
        info_gain = compute_information_gain(X, y, node_indices, feature)
        if info_gain > max_info_gain:
            max_info_gain = info_gain
            best_feature = feature
    return best_feature

#################################################################################
# Function 5: build the decision tree recursively
def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
    """
    Build a tree using the recursive algorithm that split the dataset into 2 subgroups at each node.
    This function just prints the tree.
    
    Args:
        X (ndarray):            Data matrix of shape(n_samples, n_features)
        y (array like):         list or ndarray with n_samples containing the target variable
        node_indices (ndarray): List containing the active indices. I.e, the samples being considered in this step.
        branch_name (string):   Name of the branch. ['Root', 'Left', 'Right']
        max_depth (int):        Max depth of the resulting tree. 
        current_depth (int):    Current depth. Parameter used during recursive call.
    """ 

    # Get the best feature at this node (skipped once the maximum depth is reached)
    best_feature = -1
    if current_depth < max_depth:
        best_feature = get_best_split(X, y, node_indices)

    # Stop splitting: maximum depth reached, or no feature gives a positive information gain.
    # Without this check, a best_feature of -1 would silently index the last feature column.
    if best_feature == -1:
        formatting = " "*current_depth + "-"*current_depth
        print(formatting, "%s leaf node with indices" % branch_name, node_indices)
        return
    
    formatting = "-"*current_depth
    print("%s Depth %d, %s: Split on feature: %d" % (formatting, current_depth, branch_name, best_feature))
    
    # Split the dataset at the best feature
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)
    
    # Continue splitting the left and the right child. Increment current depth
    build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth+1)
    build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth+1)

#################################### Main ######################################
# Define the training set (columns: Brown cap = 1, Tapering stalk = 1, Solitary = 1)
X_train = np.array([[1,1,1],[1,0,1],[1,0,0],[1,0,0],[1,1,1],[0,1,1],[0,0,0],[1,0,1],[0,1,0],[1,0,0]])
y_train = np.array([1,1,0,0,1,0,0,1,1,0])
# print ('The shape of X_train is:', X_train.shape)
# print ('The shape of y_train is: ', y_train.shape)
# print ('Number of training examples (m):', len(X_train))
# Indices of the active samples
root_indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] # listing every index means all samples are active
# Build the decision tree recursively
build_tree_recursive(X_train, y_train, root_indices, "Root", max_depth=2, current_depth=0)

# Test: compute the entropy
print("\nTest: compute the entropy")
print("Entropy at root node: ", compute_entropy(y_train))
# Test: split on a given feature
print("\nTest: split on a given feature")
feature = 0
left_indices, right_indices = split_dataset(X_train, root_indices, feature)
print("Left indices: ", left_indices)
print("Right indices: ", right_indices)
# Test: information gain of a given feature
print("\nTest: information gain of a given feature")
info_gain0 = compute_information_gain(X_train, y_train, root_indices, feature=0)
print("Information Gain from splitting the root on brown cap: ", info_gain0)
info_gain1 = compute_information_gain(X_train, y_train, root_indices, feature=1)
print("Information Gain from splitting the root on tapering stalk shape: ", info_gain1)
info_gain2 = compute_information_gain(X_train, y_train, root_indices, feature=2)
print("Information Gain from splitting the root on solitary: ", info_gain2)
# Test: feature with the largest information gain
print("\nTest: feature with the largest information gain")
best_feature = get_best_split(X_train, y_train, root_indices)
print("Best feature to split on: %d" % best_feature)

 Depth 0, Root: Split on feature: 2
- Depth 1, Left: Split on feature: 0
  -- Left leaf node with indices [0, 1, 4, 7]
  -- Right leaf node with indices [5]
- Depth 1, Right: Split on feature: 1
  -- Left leaf node with indices [8]
  -- Right leaf node with indices [2, 3, 6, 9]

Test: compute the entropy
Entropy at root node:  1.0

Test: split on a given feature
Left indices:  [0, 1, 2, 3, 4, 7, 9]
Right indices:  [5, 6, 8]

Test: information gain of a given feature
Information Gain from splitting the root on brown cap:  0.034851554559677034
Information Gain from splitting the root on tapering stalk shape:  0.12451124978365313
Information Gain from splitting the root on solitary:  0.2780719051126377

Test: feature with the largest information gain
Best feature to split on: 2

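3. Decision Tree Ensembles

The previous section discussed in detail how to build a single decision tree. In fact, training many decision trees to form a decision tree ensemble gives more accurate and more stable predictions. Below are several ways to build a tree ensemble.

3.1 Using a Tree Ensemble

A single decision tree has one major drawback: it is very sensitive to small changes in the training set. In the figure below, for example, replacing just one sample in the training set produces a completely different tree: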

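Figure 2-4-10 A single decision tree is sensitive to the training set

For a new input, different decision trees may well give different predictions. To make the algorithm more robust, we therefore create different training sets and build a different tree on each, forming a decision tree ensemble. For a new input, every tree in the ensemble votes and the majority result is chosen, which reduces the dependence on any single tree and hence the sensitivity to the data:

Figure 2-4-11 Voting with a decision tree ensemble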
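The voting step itself is simple; a minimal sketch, where `majority_vote` and the list of per-tree predictions are illustrative assumptions:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three trees for the same input
votes = ["cat", "not cat", "cat"]
print(majority_vote(votes))  # cat
```

Using an odd number of trees avoids ties in binary classification.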

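The next three subsections introduce three common ways of building a tree ensemble; they differ mainly in how the training set of each individual tree is chosen.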

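3.2 Bagged Decision Trees

The simplest way to construct different training sets is sampling with replacement. Suppose the original training set has size $m$; before each tree is trained, $m$ samples are drawn at random with replacement to form that tree's training set:

Figure 2-4-12 Building one training set by sampling with replacement

This lets us create multiple training sets that differ slightly from one another. Note that a single draw may contain repeated samples. This method is called a bagged decision tree.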
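Sampling with replacement can be sketched with NumPy's random generator. The helper name `bootstrap_indices`, the seed and the ensemble size of 3 are assumptions for illustration; the tree-training call is left as a comment.

```python
import numpy as np

def bootstrap_indices(m, rng):
    """Draw m indices from 0..m-1 uniformly at random, with replacement."""
    return rng.integers(low=0, high=m, size=m)

rng = np.random.default_rng(seed=0)  # fixed seed only to make the run repeatable
m = 10                               # size of the original training set
num_trees = 3                        # hypothetical ensemble size

for b in range(num_trees):
    idx = bootstrap_indices(m, rng)
    # A tree would be trained on X_train[idx], y_train[idx] at this point
    print(f"tree {b}: indices {sorted(idx.tolist())}, "
          f"{len(set(idx.tolist()))} distinct samples")
```

Because some indices repeat, each resampled set usually leaves out some of the original samples, which is what makes the training sets differ from one another.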

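3.3 Random Forest

Bagged decision trees have a drawback: many of the trees end up very similar at or near the root node. To avoid this, building on bagged decision trees, at every node of each tree a random subset of $k$ features is drawn from the $n$ features ($k < n$), and the split is made on the feature with the largest information gain within that subset. A common choice is $k=\sqrt{n}$. So each tree's training set is randomly sampled, and the feature at each node is chosen from a randomly selected subset; this is called a random forest.
It is these small, randomly generated variations combined together that make a random forest more robust than a single decision tree, so that a small change in the training set has little effect on the forest's output.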
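The per-node feature sampling can be sketched as follows. `random_feature_subset` is a name chosen here, and rounding $\sqrt{n}$ down to an integer is an assumption, since the text does not say how a non-integer $k$ is handled.

```python
import math
import numpy as np

def random_feature_subset(n_features, rng):
    """Pick k = floor(sqrt(n)) distinct feature indices (at least one)."""
    k = max(1, int(math.sqrt(n_features)))
    return rng.choice(n_features, size=k, replace=False)

rng = np.random.default_rng(seed=0)  # fixed seed only to make the run repeatable
candidates = random_feature_subset(16, rng)
# At this node, only the features in `candidates` would be compared by information gain
print(sorted(candidates.tolist()))
```

A fresh subset is drawn at every node, so two trees trained on similar data can still split on different features near the root.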

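3.4 The XGBoost Algorithm

Next comes an algorithm that often outperforms random forests: gradient boosted trees. Each time a new training set is drawn, the samples that the previously trained trees got wrong are selected with higher probability. This is what "boosting" means, similar to "deliberate practice". The mathematics of exactly how much to raise the probability is quite involved and is not covered in this course; knowing how to use it is enough. XGBoost (eXtreme Gradient Boosting) is one implementation of gradient boosted trees, with the following characteristics:

Over the years, researchers have proposed many ways of building decision trees and selecting their training samples. Today, XGBoost is one of the most widely used ways to build a tree ensemble. It is fast, open source and easy to use, and is widely used in machine learning competitions and commercial applications. XGBoost's internals are very complex, so in practice everyone simply calls the packaged XGBoost library: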

# Classification
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Regression
from xgboost import XGBRegressor
model = XGBRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

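3.5 When to Use Decision Trees

The last subsection of this week summarizes when decision trees / tree ensembles and neural networks are each appropriate: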