My teacher asked me to classify a dataset with some machine-learning algorithms. Below is the basic code that will be needed; it does not cover the project-specific model handling and tweaks, and is kept here as study notes.
The splitting of the data into training and test sets is omitted here; a minimal setup sketch follows.
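The split itself is routine. The sketch below assumes a feature matrix X and a label vector y (both names are mine), and also pulls in the imports that the snippets further down rely on:
# Assumed setup: X is the feature matrix, y the labels (hypothetical names)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Produce the Xtrain/Xtest/Ytrain/Ytest used throughout these notes
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3, random_state=0)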
def Tree_score(depth=3, criterion='entropy', samples_split=2):
    # Build the tree
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=depth,
                                  min_samples_split=samples_split)
    # Fit it on the training set
    tree.fit(Xtrain, Ytrain)
    # Accuracy on the training and test sets
    train_score = tree.score(Xtrain, Ytrain)
    test_score = tree.score(Xtest, Ytest)
    # The plotting code below indexes the result, so the pair must be returned
    return train_score, test_score
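A single call then returns the (train, test) accuracy pair, for example:
# Quick sanity check of the helper (the values depend on your data)
train_acc, test_acc = Tree_score(depth=4, criterion='gini')
print('train:', train_acc, 'test:', test_acc)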
Below is the plotting of the score curves for the tree, which makes it possible to pick the best parameters by eye. I found it on Baidu a while ago and have forgotten the reference page.
p, k = 0, 0  # counters used to number the saved figures

def tree_best_plot(picture_path):
    global p, k
    depths = range(2, 25)
    # First use 'gini' and vary the depth
    scores = [Tree_score(d, 'gini') for d in depths]
    train_scores = [s[0] for s in scores]
    test_scores = [s[1] for s in scores]
    plt.figure(figsize=(6, 6), dpi=144)
    plt.grid()
    plt.xlabel("max_depth of decision tree")
    plt.ylabel("score")
    plt.title("'gini'")
    plt.plot(depths, train_scores, '.g-', label='training score')
    plt.plot(depths, test_scores, '.r--', label='testing score')
    plt.legend()
    path = picture_path + 'gini_' + str(k) + '.jpg'
    k += 1
    plt.savefig(path, bbox_inches='tight', dpi=450)
    plt.close()

    # Information entropy ('entropy'): effect of depth on accuracy
    scores = [Tree_score(d) for d in depths]
    train_scores = [s[0] for s in scores]
    test_scores = [s[1] for s in scores]
    plt.figure(figsize=(6, 6), dpi=144)
    plt.grid()
    plt.xlabel("max_depth of decision tree")
    plt.ylabel("score")
    plt.title("'entropy'")
    plt.plot(depths, train_scores, '.g-', label='training score')
    plt.plot(depths, test_scores, '.r--', label='testing score')
    plt.legend()
    path = picture_path + 'entropy_' + str(p) + '.jpg'
    p += 1
    plt.savefig(path, bbox_inches='tight', dpi=450)
    plt.close()
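One call writes both curves to disk; the output directory below is only an example and has to exist beforehand:
# Hypothetical output directory; adjust to your own project layout
tree_best_plot('./figures/')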
There is also a way to tune the parameters automatically, using GridSearchCV; param holds the candidate values for each hyperparameter. Note that the grid below has 2 × 6 × 6 × 4 × 7 = 2016 combinations, each fitted 5 times by cross-validation, so the search can take a while.
from sklearn.model_selection import GridSearchCV
# Candidate values for every hyperparameter; drop 'gini' from criterion if you want to save time
param = {'criterion': ['entropy', 'gini'],
         'max_depth': [2, 3, 4, 5, 6, 7],
         'min_samples_leaf': [2, 3, 4, 5, 6, 7],
         'min_impurity_decrease': [0.1, 0.2, 0.3, 0.5],
         'min_samples_split': [2, 3, 4, 5, 6, 7, 8]}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid=param, cv=5)
# Train on the data
grid.fit(Xtrain, Ytrain)
# Report the best parameters and the corresponding cross-validation score
print('best parameters:', grid.best_params_, 'best score:', grid.best_score_)
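Since GridSearchCV refits the best parameter combination on the whole training set by default, the tuned tree can be pulled out and checked on the held-out data; best_tree is a name I introduce here:
# Best estimator, refit on Xtrain/Ytrain with the winning parameters
best_tree = grid.best_estimator_
print('test accuracy of the best tree:', best_tree.score(Xtest, Ytest))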
The trained estimator needs to be saved as a model file (clf stands for whichever fitted estimator you want to persist, and joblib has to be imported first):
import joblib
joblib.dump(clf, 'predictor.pkl')
Loading the model: model = joblib.load('predictor.pkl')
If the object that was saved is itself an sklearn model, model.predict and model.score can be called directly.
## Accuracy on each split
score1 = model.score(Xtrain, Ytrain)
score2 = model.score(Xtest, Ytest)
This returns the predicted class probabilities:
y1 = model.predict_proba(Xtest)
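predict_proba gives one column of probabilities per class, ordered as in model.classes_; taking the row-wise argmax recovers the hard labels (a small sketch, equivalent to model.predict(Xtest) for a decision tree):
import numpy as np
# Map the index of the largest probability in each row back to its class label
y_pred = model.classes_[np.argmax(y1, axis=1)]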