[Machine Learning in Action] Decision Trees in Python - CFANZ编程社区

Contents

Building the decision tree
Testing the classifier

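Building the decision tree

Decision trees

Pros: computationally cheap, output that is easy to interpret, tolerant of missing values, and able to handle irrelevant features.
Cons: prone to overfitting.
Works with: numeric and nominal values.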

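The pseudocode for a branch-building function, createBranch(), looks like this:

Check whether every item in the dataset belongs to the same class:

    If so, return the class label
    Else
        find the best feature for splitting the dataset
        split the dataset
        create a branch node
        for each subset produced by the split
            call createBranch and add the result to the branch node
        return the branch node

The createBranch pseudocode above is recursive: it calls itself on the second-to-last line.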

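General approach to decision trees

Collect data: any method.
Prepare data: the tree-building algorithm works only on nominal values, so numeric values must be discretized.
Analyze data: any method; once the tree is built, inspect the plotted tree to check that it matches expectations.
Train the algorithm: construct the tree data structure.
Test the algorithm: compute the error rate with the learned tree.
Use the algorithm: this step applies to any supervised learning algorithm, and a decision tree additionally helps you understand the inner meaning of the data.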
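The data preparation step says numeric values have to be discretized before the tree can use them. One simple way to do that is fixed cut points, sketched below. The helper name discretize, the cut points, and the bin names are all made up for illustration and are not part of the original code.

```python
from bisect import bisect_right

def discretize(value, cut_points, names):
    """Map a numeric value to a nominal bin name.

    cut_points must be sorted ascending; names needs len(cut_points) + 1 entries.
    """
    return names[bisect_right(cut_points, value)]

# Hypothetical example: turn an age in years into one of three nominal bins.
AGE_CUTS = [18, 65]
AGE_NAMES = ['minor', 'adult', 'senior']

ages = [4, 18, 30, 70]
binned = [discretize(a, AGE_CUTS, AGE_NAMES) for a in ages]
print(binned)  # ['minor', 'adult', 'adult', 'senior']
```

A value equal to a cut point falls into the bin above it, because bisect_right is used. Where to place the cut points is a modeling decision that this sketch does not address.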

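Background: information theory
Entropy is defined as the expected value of the information. So what is information? If the items to be classified can fall into several classes, the information of the symbol $x_i$ is defined as
$$l(x_i) = -\log_2 p(x_i)$$
where $p(x_i)$ is the probability of choosing that class.
To compute the entropy, we take the expected value of the information over all possible values of all classes, given by
$$H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$
where $n$ is the number of classes.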

# Compute the Shannon entropy of a dataset
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2)
    return shannonEnt

def createDataSet():
    dataSet = [[1,1,'yes'],
              [1,1,'yes'],
              [1,0,'no'],
              [0,1,'no'],
              [0,1,'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

myDat,labels = createDataSet()

myDat

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

labels

['no surfacing', 'flippers']

calcShannonEnt(myDat)

0.9709505944546686

myDat[0][-1] = 'maybe'

myDat

[[1, 1, 'maybe'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

calcShannonEnt(myDat)

1.3709505944546687

As the results above show, the higher the entropy, the more mixed the data.
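To make the link between entropy and mixing concrete, here is a small standalone sketch that compares three label lists. The function label_entropy is an illustrative rewrite that uses collections.Counter; it is not the calcShannonEnt function from this post, although it should compute the same quantity.

```python
from collections import Counter
from math import log2

def label_entropy(class_labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(class_labels)
    counts = Counter(class_labels)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

pure = ['no'] * 5                                   # a single class
two_classes = ['yes', 'yes', 'no', 'no', 'no']      # the sample dataset's labels
three_classes = ['maybe', 'yes', 'no', 'no', 'no']  # after relabeling one row

for name, sample in [('pure', pure),
                     ('two classes', two_classes),
                     ('three classes', three_classes)]:
    print(name, round(label_entropy(sample), 3))
```

The pure list has entropy 0, the two-class list comes out at about 0.971, and the three-class list at about 1.371, in line with the values printed earlier.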

# Split the dataset on a given feature
def splitDataSet(dataSet,axis,value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reduceFeatVec = featVec[:axis]  # drop the feature at index axis
            reduceFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reduceFeatVec)
    return retDataSet

myDat[0][-1] = 'yes' # restore the dataset

myDat

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

splitDataSet(myDat,0,1)

[[1, 'yes'], [1, 'yes'], [0, 'no']]

splitDataSet(myDat,0,0)

[[1, 'no'], [1, 'no']]

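Next we loop over the features of the dataset, calling calcShannonEnt() and splitDataSet() for each one to find the best feature to split on. The entropy calculation tells us which split organizes the data best.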

Information gain
Information gain $IG(S, A)$ is the measure of the difference in entropy from before to after the set $S$ is split on an attribute $A$. In other words, how much uncertainty in $S$ was reduced after splitting set $S$ on attribute $A$.
$$IG(S, A) = H(S) - \sum_{t \in T} p(t)\, H(t)$$
Where,

- $H(S)$: Entropy of set $S$
- $T$: The subsets created from splitting set $S$ by attribute $A$ such that $S = \bigcup_{t \in T} t$
- $p(t)$: The proportion of the number of elements in $t$ to the number of elements in set $S$
- $H(t)$: Entropy of subset $t$
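As a worked check of the definition, the sketch below computes the information gain of each feature on the five-row sample dataset. The helpers entropy and info_gain are illustrative names written for this example and are separate from the functions defined in this post.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Shannon entropy of the class labels stored in the last column."""
    counts = Counter(row[-1] for row in rows)
    return sum(-(c / len(rows)) * log2(c / len(rows)) for c in counts.values())

def info_gain(rows, feature_index):
    """Entropy of rows minus the size-weighted entropy of the subsets after the split."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[feature_index], []).append(row)
    weighted = sum(len(t) / len(rows) * entropy(t) for t in subsets.values())
    return entropy(rows) - weighted

data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
for i in range(2):
    print(i, round(info_gain(data, i), 3))
```

Feature 0 ('no surfacing') yields a gain of about 0.420 and feature 1 ('flippers') about 0.171, so splitting on feature 0 removes more uncertainty.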

# Choose the best feature to split the dataset on
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet,i,value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if (infoGain > bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

chooseBestFeatureToSplit(myDat)

myDat

[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

len(myDat[0])

import operator

# Decide the class of a leaf node by majority vote
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]
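The same majority vote can be written more compactly with the standard library's collections.Counter. This is an alternative sketch, not the version used in the rest of this post, and majority_class is a name chosen for the example.

```python
from collections import Counter

def majority_class(class_list):
    """Return the most frequent label in class_list."""
    return Counter(class_list).most_common(1)[0][0]

print(majority_class(['yes', 'no', 'no', 'yes', 'no']))  # no
```

When two labels are tied, most_common returns them in the order they were first seen, so the result depends on the input order.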

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # all samples share one class: stop splitting
    if len(dataSet[0]) == 1:  # every feature has been used, but the classes are still mixed
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])  # note: this modifies the caller's labels list
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy so the recursive calls do not disturb each other
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet,bestFeat,value),subLabels)
    return myTree

myTree = createTree(myDat,labels)

myTree

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
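The tree is a nested dictionary: each internal node is a dict with a single feature name as its key, and each leaf is a class label. One way to read such a structure is to flatten it into if-rules, as in the sketch below. The helper tree_to_rules is hypothetical and written only for this illustration.

```python
def tree_to_rules(tree, path=()):
    """Yield (conditions, class_label) pairs, one per leaf of a nested-dict tree."""
    if not isinstance(tree, dict):
        yield path, tree            # reached a leaf: emit the path that led here
        return
    feature = next(iter(tree))      # each internal node has a single feature key
    for value, subtree in tree[feature].items():
        yield from tree_to_rules(subtree, path + ((feature, value),))

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
for conditions, label in tree_to_rules(tree):
    tests = ' and '.join(f'{name} == {value}' for name, value in conditions)
    print(f'if {tests}: {label}')
```

For this tree the sketch prints three rules, one per leaf, and only the path with no surfacing == 1 and flippers == 1 leads to 'yes'.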

Testing the classifier

def classify(inputTree,featLabels,testVec):
    firstStr = list(inputTree.keys())[0]  # feature tested at this node
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # position of that feature in testVec
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if isinstance(secondDict[key], dict):
                classLabel = classify(secondDict[key],featLabels,testVec)
            else:
                classLabel = secondDict[key]
    return classLabel

myDat,labels = createDataSet()

labels

['no surfacing', 'flippers']

myTree

{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

classify(myTree,labels,[1,0])

'no'

classify(myTree,labels,[1,1])

'yes'
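The general approach lists computing an error rate as the testing step. A minimal, self-contained sketch of that measurement follows. The helpers predict and error_rate are illustrative names: predict is a compact iterative stand-in for classify, and the tree literal is copied from the output above.

```python
def predict(tree, feat_labels, test_vec):
    """Walk a nested-dict tree until a leaf (a non-dict value) is reached."""
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature][test_vec[feat_labels.index(feature)]]
    return tree

def error_rate(tree, feat_labels, rows):
    """Fraction of rows whose predicted class differs from the last column."""
    wrong = sum(predict(tree, feat_labels, row[:-1]) != row[-1] for row in rows)
    return wrong / len(rows)

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

print(error_rate(tree, labels, data))  # 0.0
```

The rate here is measured on the same five rows the tree was built from, so it says nothing about unseen data; a held-out test set would be needed for that. Note also that predict raises a KeyError for a feature value the tree has never seen.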