Caret model training and tuning: more parameters explained (2)

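Control parameters of the trainControl function

The trainControl function defines how train runs: the resampling (cross-validation) scheme, the function used to evaluate models, the criterion for selecting the final model, the tuning-parameter search strategy, and so on.

Some of its parameters are explained below: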

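method: the resampling method: "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none" or "oob".
"oob" (out-of-bag estimates) applies only to random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models. "none" means no resampling is performed.
"repeatedcv" is K-fold cross-validation repeated several times.
number: the K in K-fold cross-validation, or the number of resampling iterations for "boot".
repeats: applies only to "repeatedcv"; the number of times the complete K-fold cross-validation is repeated.
method = "repeatedcv", number = 10 and repeats = 3 means that 10-fold cross-validation is run in full 3 times.
search:
either "grid" (grid search) or "random" (random search) over the tuning parameters.
summaryFunction: the function that computes the model performance metrics; the default is defaultSummary.
twoClassSummary computes sensitivity, specificity and the area under the ROC curve (AUC).
sampling: how the training data are subsampled to balance the classes when they are imbalanced.
Down-sampling subsamples every class down to the size of the smallest class;
up-sampling samples with replacement so that every class matches the size of the largest class.
In practice, the results obtained with and without subsampling, and with the different subsampling methods, are often similar; the caret documentation compares them:
https://topepo.github.io/caret/subsampling-for-class-imbalances.html
classProbs: whether (TRUE) or not (FALSE) class probabilities are computed for the held-out samples.
seeds: running set.seed(1) before calling train is usually enough to make the results reproducible.
With parallel computation, however, this parameter may also need to be set, depending on the parallel technology used.
index: custom training sets, given as a list with one element per resampling iteration; each element holds the row indices used for training in that iteration.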
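
The seeds and index parameters are easier to follow with a concrete case. The sketch below assumes 10-fold cross-validation repeated 3 times (30 resamples) and a hypothetical tuning grid of 15 candidate models. The built-in iris data are used only for illustration, and the names n_resamples, n_models, control_seeded and control_indexed are made up for this example.

```r
library(caret)

n_resamples <- 10 * 3   # number = 10, repeats = 3
n_models <- 15          # hypothetical size of the tuning grid

set.seed(1)
# seeds: a list of length n_resamples + 1.
# The first n_resamples elements hold one integer seed per candidate model.
seeds <- vector(mode = "list", length = n_resamples + 1)
for (i in seq_len(n_resamples)) {
  seeds[[i]] <- sample.int(1000, n_models)
}
# The last element is a single seed for the final model fit.
seeds[[n_resamples + 1]] <- sample.int(1000, 1)

control_seeded <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                               seeds = seeds)

# index: a list of training-row indices, one element per resample.
# createMultiFolds() builds such a list for repeated K-fold cross-validation.
folds <- createMultiFolds(iris$Species, k = 10, times = 3)
control_indexed <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                                index = folds)
```

Fixing seeds this way is mainly useful when train runs in parallel, where a single set.seed() call before train may not be enough.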

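# Do not run, only for an example
# selectionFunction takes a function (or the name of one), so tolerance() is wrapped to set tol=2
tol2 <- function(x, metric, maximize) tolerance(x, metric, tol=2, maximize=maximize)
control <- trainControl(method="repeatedcv", number=10, repeats=3, 
                        search="grid", selectionFunction=tol2)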
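
As a further illustration of the parameters listed above, a control object for an imbalanced two-class problem evaluated by ROC might look like the following sketch. The name control_roc is hypothetical, and the choice of down-sampling is an assumption made for the example, not a recommendation.

```r
library(caret)

# classProbs = TRUE is needed because twoClassSummary works on class probabilities;
# the class levels must then be valid R variable names (e.g. "case", "control").
control_roc <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary,
                            sampling = "down")

# It would then be passed to train together with metric = "ROC", e.g.
# train(Class ~ ., data = dataset, method = "rf",
#       metric = "ROC", trControl = control_roc)
```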

Control parameters of the train function

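metric: the measure used to choose the optimal model. For regression, RMSE, R-squared and MAE (mean absolute error) are computed and RMSE is the default; for classification, Accuracy and Kappa are computed and Accuracy is the default. To use ROC, set summaryFunction to twoClassSummary and classProbs to TRUE in trainControl.
selectionFunction (set through trainControl): caret ships three functions. best picks the model with the best value of the chosen metric (the highest score or the lowest error); oneSE picks the simplest model whose performance is within one standard error of the best model (balancing performance against overfitting); tolerance picks the simplest model whose performance is within a given percentage of the best model. That percentage difference is computed as $(x - x_{best}) / x_{best} \times 100$. The threshold is the tol argument of tolerance (default 1.5); tol = 2 allows a 2% loss in performance. Because selectionFunction expects a function or the name of a function, a non-default tol has to be passed through a wrapper such as function(x, metric, maximize) tolerance(x, metric, tol = 2, maximize = maximize); the string "tolerance(tol=2)" is not a function name.
tuneGrid: a data frame of tuning-parameter combinations. Its column names must be the tuning parameters of the chosen method. getModelInfo()$rf$parameters lists all tunable parameters of the method rf.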
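
The tolerance rule described under selectionFunction can be checked by hand. The sketch below uses base R only and a made-up table of resampling results; it mimics the rule and is not caret's own implementation.

```r
# Hypothetical resampling results, one row per candidate model,
# ordered from the simplest (smallest mtry) to the most complex.
perf <- data.frame(mtry = c(1, 2, 4, 8),
                   Accuracy = c(0.900, 0.930, 0.945, 0.950))

best <- max(perf$Accuracy)

# Percent difference from the best model: (x - x_best) / x_best * 100
perf$pct_diff <- (perf$Accuracy - best) / best * 100

# Keep the models that lose at most tol percent, then take the simplest one
tol <- 2
within_tol <- perf[perf$pct_diff >= -tol, ]
chosen <- within_tol[which.min(within_tol$mtry), ]

print(perf)
print(chosen)
```

Here mtry = 2 loses about 2.1% relative to the best accuracy and is rejected, while mtry = 4 loses about 0.5% and is the simplest model within the 2% tolerance, so it is chosen over the best-scoring mtry = 8.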

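# Do not run, only for an example
# Assumes `dataset` is a data frame with a factor column `Class`
library(caret)

# Model evaluation metric
metric <- "Accuracy"

# selectionFunction takes a function (or the name of one),
# so tolerance() is wrapped to allow a 2% loss in performance
tol2 <- function(x, metric, maximize) tolerance(x, metric, tol=2, maximize=maximize)
control <- trainControl(method="repeatedcv", number=10, repeats=3,
                        search="grid", selectionFunction=tol2)

# Build the tuning grid
# tuneGrid accepts a data frame of tuning-parameter combinations; the column
# names must be the tuning parameters of the chosen method.
# getModelInfo()$rf$parameters lists all tunable parameters of the method rf
# mtry is the number of variables tried at each split of each tree
tunegrid <- expand.grid(mtry=1:15)

# For reference, randomForest's default mtry for classification is
# floor(sqrt(p)), where p is the number of predictors. To evaluate only that value:
# tunegrid <- expand.grid(mtry=floor(sqrt(ncol(dataset) - 1)))
# (plot() below needs more than one mtry value)

set.seed(1)
rf_gridsearch <- train(Class~., data=dataset, method="rf", metric=metric, 
                       tuneGrid=tunegrid, trControl=control)
print(rf_gridsearch)
plot(rf_gridsearch)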