My teacher asked me to classify a dataset with some machine-learning algorithms. Below is the basic code that will be needed; it does not cover the project-specific model handling and tweaks, and is kept here as study notes.
The splitting of the data into training and test sets is omitted here; a minimal setup sketch follows.
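The split itself is routine. The sketch below assumes a feature matrix X and a label vector y (both names are mine), and also pulls in the imports that the snippets further down rely on:
# Assumed setup: X is the feature matrix, y the labels (hypothetical names)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Produce the Xtrain/Xtest/Ytrain/Ytest used throughout these notes
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3, random_state=0)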
def Tree_score(depth=3, criterion='entropy', samples_split=2):
    # Build the tree
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=depth,
                                  min_samples_split=samples_split)
    # Fit it on the training set
    tree.fit(Xtrain, Ytrain)
    # Accuracy on the training and test sets
    train_score = tree.score(Xtrain, Ytrain)
    test_score = tree.score(Xtest, Ytest)
    # The plotting code below indexes the result, so the pair must be returned
    return train_score, test_score
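A single call then returns the (train, test) accuracy pair, for example:
# Quick sanity check of the helper (the values depend on your data)
train_acc, test_acc = Tree_score(depth=4, criterion='gini')
print('train:', train_acc, 'test:', test_acc)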
Below is the plotting of the score curves for the tree, which makes it possible to pick the best parameters by eye. I found it on Baidu a while ago and have forgotten the reference page.
p, k = 0, 0  # counters used to number the saved figures

def tree_best_plot(picture_path):
    global p, k
    depths = range(2, 25)
    # First use 'gini' and vary the depth
    scores = [Tree_score(d, 'gini') for d in depths]
    train_scores = [s[0] for s in scores]
    test_scores = [s[1] for s in scores]
    plt.figure(figsize=(6, 6), dpi=144)
    plt.grid()
    plt.xlabel("max_depth of decision tree")
    plt.ylabel("score")
    plt.title("'gini'")
    plt.plot(depths, train_scores, '.g-', label='training score')
    plt.plot(depths, test_scores, '.r--', label='testing score')
    plt.legend()
    path = picture_path + 'gini_' + str(k) + '.jpg'
    k += 1
    plt.savefig(path, bbox_inches='tight', dpi=450)
    plt.close()

    # Information entropy ('entropy'): effect of depth on accuracy
    scores = [Tree_score(d) for d in depths]
    train_scores = [s[0] for s in scores]
    test_scores = [s[1] for s in scores]
    plt.figure(figsize=(6, 6), dpi=144)
    plt.grid()
    plt.xlabel("max_depth of decision tree")
    plt.ylabel("score")
    plt.title("'entropy'")
    plt.plot(depths, train_scores, '.g-', label='training score')
    plt.plot(depths, test_scores, '.r--', label='testing score')
    plt.legend()
    path = picture_path + 'entropy_' + str(p) + '.jpg'
    p += 1
    plt.savefig(path, bbox_inches='tight', dpi=450)
    plt.close()
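One call writes both curves to disk; the output directory below is only an example and has to exist beforehand:
# Hypothetical output directory; adjust to your own project layout
tree_best_plot('./figures/')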
There is also a way to tune the parameters automatically, using GridSearchCV; param holds the candidate values for each hyperparameter. Note that the grid below has 2 × 6 × 6 × 4 × 7 = 2016 combinations, each fitted 5 times by cross-validation, so the search can take a while.
from sklearn.model_selection import GridSearchCV
# Candidate values for every hyperparameter; drop 'gini' from criterion if you want to save time
param = {'criterion': ['entropy', 'gini'],
         'max_depth': [2, 3, 4, 5, 6, 7],
         'min_samples_leaf': [2, 3, 4, 5, 6, 7],
         'min_impurity_decrease': [0.1, 0.2, 0.3, 0.5],
         'min_samples_split': [2, 3, 4, 5, 6, 7, 8]}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid=param, cv=5)
# Train on the data
grid.fit(Xtrain, Ytrain)
# Report the best parameters and the corresponding cross-validation score
print('best parameters:', grid.best_params_, 'best score:', grid.best_score_)
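Since GridSearchCV refits the best parameter combination on the whole training set by default, the tuned tree can be pulled out and checked on the held-out data; best_tree is a name I introduce here:
# Best estimator, refit on Xtrain/Ytrain with the winning parameters
best_tree = grid.best_estimator_
print('test accuracy of the best tree:', best_tree.score(Xtest, Ytest))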
The trained estimator needs to be saved as a model file (clf stands for whichever fitted estimator you want to persist, and joblib has to be imported first):
import joblib
joblib.dump(clf, 'predictor.pkl')
Loading the model: model = joblib.load('predictor.pkl')
If the object that was saved is itself an sklearn model, model.predict and model.score can be called directly.
## Accuracy on each split
score1 = model.score(Xtrain, Ytrain)
score2 = model.score(Xtest, Ytest)
This returns the predicted class probabilities:
y1 = model.predict_proba(Xtest)
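predict_proba gives one column of probabilities per class, ordered as in model.classes_; taking the row-wise argmax recovers the hard labels (a small sketch, equivalent to model.predict(Xtest) for a decision tree):
import numpy as np
# Map the index of the largest probability in each row back to its class label
y_pred = model.classes_[np.argmax(y1, axis=1)]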