Random Forest Feature Importance in Python - CFANZ编程社区

1 Disclaimer

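The data in this article comes from the internet, and parts of the code are adapted from other sources; I have added comments and extensions. The article is intended for technical exchange. If anything here infringes on your work, please contact the author so it can be handled promptly.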

2 Introduction to Random Forest Feature Importance

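An advantage of a decision tree is that its tree structure lets us inspect the model's inner workings as a set of rules. A random forest, however, consists of tens, hundreds, or even thousands of decision trees, which makes its structure hard to visualize. What we can do instead is examine the feature importances that the random forest computes.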
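As a rough illustration of this contrast, the sketch below prints one shallow tree as text rules with export_text and then counts the trees inside a forest. The iris dataset, max_depth=2, and n_estimators=100 are assumptions chosen only for this demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A single shallow tree can be read directly as a set of if/else rules.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# A forest holds many such trees, so reading them one by one is impractical.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(iris.data, iris.target)
n_trees = len(forest.estimators_)
print("trees in the forest:", n_trees)
```

Each entry of forest.estimators_ is itself a fitted decision tree, so any one of them could be printed the same way, but no single tree describes the forest as a whole.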

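Two points about feature importance deserve attention:

First, scikit-learn requires nominal categorical features to be broken up into multiple binary features (one-hot encoding, a common way to convert nominal variables to numeric ones), so the importance of the original feature is spread across those binary columns. Second, if two features are highly correlated, one of them tends to claim most of the importance and the other appears far less important than it really is; interpreting the model without taking this into account can be misleading.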
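The second point can be checked with a small experiment. The sketch below is only an illustration under assumed settings: it appends an exact copy of the petal length column to the iris data (an extreme case of correlation) and compares the importances before and after. The exact numbers depend on the scikit-learn version and the random seed, so none are quoted here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Baseline: importances on the four original features.
base = RandomForestClassifier(random_state=0).fit(X, y)
base_importances = base.feature_importances_

# Append an exact copy of column 2 (petal length) as a perfectly correlated feature.
X_dup = np.hstack([X, X[:, [2]]])
dup = RandomForestClassifier(random_state=0).fit(X_dup, y)
dup_importances = dup.feature_importances_

print("petal length alone:      ", round(base_importances[2], 3))
print("petal length with a copy:", round(dup_importances[2], 3))
print("the copy itself:         ", round(dup_importances[4], 3))
```

Because the trees can split on either column interchangeably, the credit that petal length would otherwise receive is typically shared between the original and its copy.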

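In scikit-learn, classification and regression decision trees and random forests expose a feature_importances_ attribute after fitting, which gives the relative importance of each feature.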
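A minimal sketch of this, assuming the iris dataset: the four estimator classes below are fitted in turn and each one's feature_importances_ is collected. Fitting the regressors on the class labels is done purely to show that the attribute exists; it is not a sensible modelling choice.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

iris = load_iris()

estimators = [
    DecisionTreeClassifier(random_state=0),
    DecisionTreeRegressor(random_state=0),
    RandomForestClassifier(random_state=0),
    RandomForestRegressor(random_state=0),
]

results = {}
for estimator in estimators:
    # The regressors treat the class labels 0, 1, 2 as numbers; demonstration only.
    estimator.fit(iris.data, iris.target)
    results[type(estimator).__name__] = estimator.feature_importances_

for name, importances in results.items():
    print(name, importances.round(3))
```

In every case the importances are normalized, so each array sums to 1 and the values are best read relative to one another.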

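Selecting features by importance takes two steps:

First, a random forest is trained on all features, which yields an importance score for each feature. Second, SelectFromModel filters the feature matrix using its threshold parameter, keeping only the features whose importance is at or above the threshold; a new model trained on this reduced matrix serves as the final model.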
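One possible way to package these two steps is a scikit-learn Pipeline, so that selection and the final model are fitted together and can be cross-validated as a unit. This is a sketch under assumptions, not the approach the original code takes: the step names "select" and "model", the threshold of 0.3, and the 5-fold split are all chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

iris = load_iris()

pipeline = Pipeline([
    # Step 1: fit a forest on all features and keep those with importance >= 0.3.
    ("select", SelectFromModel(RandomForestClassifier(random_state=0), threshold=0.3)),
    # Step 2: train the final forest on the reduced feature matrix.
    ("model", RandomForestClassifier(random_state=0)),
])

scores = cross_val_score(pipeline, iris.data, iris.target, cv=5)
print("mean cross-validated accuracy:", scores.mean())
```

Wrapping the selector in the pipeline means the features are re-selected inside each training fold, which keeps the held-out fold from influencing which features are kept.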

3 Code Example: Random Forest Feature Importance

# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
# Load the data
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create the random forest model and compute the feature importances
randomforest = RandomForestClassifier(random_state=0, n_jobs=-1)
model = randomforest.fit(features, target)
importances = model.feature_importances_
# Sort the features by importance, from highest to lowest
indices = np.argsort(importances)[::-1]
names = [iris.feature_names[i] for i in indices]
#print(names)
#print(range(features.shape[1]), importances[indices])
# Plot the importances as a bar chart
plt.figure()
plt.suptitle('Feature Importance')
plt.bar(range(features.shape[1]), importances[indices])
plt.xticks(range(features.shape[1]), names, rotation=90)
plt.show()
# Select features by an importance threshold
# Keep only the features whose importance is at least 0.3
selector = SelectFromModel(randomforest, threshold=0.3)
features_important = selector.fit_transform(features, target)
# Train a new model on the selected features
model = randomforest.fit(features_important, target)
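To see which features survive the threshold, the selector's get_support() method can be used. The sketch below rebuilds a selector with the same threshold of 0.3 so that it runs on its own; the variable names mask and kept are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

selector = SelectFromModel(RandomForestClassifier(random_state=0), threshold=0.3)
features_important = selector.fit_transform(iris.data, iris.target)

# get_support() returns a boolean mask with one entry per original feature.
mask = selector.get_support()
kept = [name for name, keep in zip(iris.feature_names, mask) if keep]
print("kept features:", kept)
print("reduced shape:", features_important.shape)
```

Mapping the mask back to feature names is useful because the reduced matrix itself carries no column labels.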

4 Summary

None.