Introduction to KNN
The K-nearest-neighbors (KNN) algorithm searches for the k labeled samples closest to an unlabeled sample and uses them to predict its value.
For a discrete (categorical) response, the predicted class is the most frequent class among the k nearest labeled samples; for a continuous response, the prediction is the mean of the k nearest samples' responses.
The key question is therefore how to measure similarity between samples.
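The decision rule above can be sketched in base R (a brute-force illustration; `knn_predict` and all variable names are ours, not a library API):

```r
# Brute-force KNN prediction for a single query point (illustrative sketch)
knn_predict <- function(X_train, y_train, x_new, k = 3) {
  # Euclidean distance from the query point to every training sample
  d <- sqrt(rowSums(sweep(X_train, 2, x_new)^2))
  nn <- order(d)[seq_len(k)]  # indices of the k nearest neighbors
  if (is.factor(y_train)) {
    # Discrete response: majority vote among the neighbors
    names(which.max(table(y_train[nn])))
  } else {
    # Continuous response: mean of the neighbors' responses
    mean(y_train[nn])
  }
}

X <- matrix(c(0, 0,
              0, 1,
              5, 5), ncol = 2, byrow = TRUE)
knn_predict(X, factor(c("a", "a", "b")), c(0, 0.5), k = 2)  # majority class "a"
```

A real implementation would add tie-breaking and a faster neighbor search; this sketch only shows the two prediction rules.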
Similarity measures
Euclidean distance
Manhattan distance
Cosine similarity
Jaccard similarity coefficient
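Each of these measures is a one-liner in base R. For instance (the vectors a, b, x, y are made-up examples):

```r
# Two numeric sample vectors
a <- c(1, 2, 3)
b <- c(4, 6, 8)

# Euclidean distance: square root of the sum of squared differences
euclid <- sqrt(sum((a - b)^2))

# Manhattan distance: sum of absolute differences
manhattan <- sum(abs(a - b))

# Cosine similarity: dot product divided by the product of the norms
cosine <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Jaccard similarity for binary vectors: |intersection| / |union|
x <- c(1, 0, 1, 1)
y <- c(1, 1, 0, 1)
jaccard <- sum(x & y) / sum(x | y)
```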
Methods for searching for nearest neighbors
KD-tree search
KD-tree search involves two key steps: first, constructing the binary tree; second, performing the nearest-neighbor search on it.
Ball-tree search
Ball-tree search remedies a shortcoming of the KD tree: because its partitions are spheres rather than rectangles, there are no rectangular "corners" to waste search effort on.
Applying the KNN model
In R, the knn function from the class package implements the algorithm.
knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
# k: the number of nearest neighbors to use, default 1
# l: the minimum number of votes required for a definite decision, default 0
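A minimal usage example on the built-in iris data (assuming the class package is installed; the split is a random 100/50 partition):

```r
library(class)

set.seed(1)
idx <- sample(nrow(iris), 100)

# train/test must be numeric feature matrices; cl holds the training labels
pred <- knn(train = iris[idx, 1:4],
            test  = iris[-idx, 1:4],
            cl    = iris$Species[idx],
            k     = 5, prob = TRUE)

table(pred, actual = iris$Species[-idx])  # confusion matrix on the held-out rows
```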
The kknn function from the kknn package implements a distance-weighted variant:
kknn(formula = formula(train), train, test, na.action = na.omit(), k = 7, distance = 2, kernel = "optimal", scale = TRUE)
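For example (assuming the kknn package is installed; kernel = "optimal" gives closer neighbors larger voting weight):

```r
library(kknn)

set.seed(2)
idx <- sample(nrow(iris), 100)

# Formula interface: the response and features are taken from the data frames
fit <- kknn(Species ~ ., train = iris[idx, ], test = iris[-idx, ],
            k = 7, distance = 2, kernel = "optimal")

head(fitted(fit))  # predicted classes for the test rows
```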
Finding an ideal value of k
Cross-validation via the train and trainControl functions of the caret package can be used to select k.
train(x, ...)
## Default S3 method:
train(
x,
y,
method = "rf",
preProcess = NULL,
...,
weights = NULL,
metric = ifelse(is.factor(y), "Accuracy", "RMSE"),
maximize = ifelse(metric %in% c("RMSE", "logLoss", "MAE"), FALSE, TRUE),
trControl = trainControl(),
tuneGrid = NULL,
tuneLength = ifelse(trControl$method == "none", 1, 3)
)
## S3 method for class 'recipe':
train(
x,
data,
method = "rf",
...,
metric = ifelse(is.factor(y_dat), "Accuracy", "RMSE"),
maximize = ifelse(metric %in% c("RMSE", "logLoss", "MAE"), FALSE, TRUE),
trControl = trainControl(),
tuneGrid = NULL,
tuneLength = ifelse(trControl$method == "none", 1, 3)
)
Key arguments:
# method: the model type; defaults to random forest ("rf")
# trControl: a trainControl object supplying resampling and other control parameters for train
trainControl(
method = "boot",
number = ifelse(grepl("cv", method), 10, 25),
repeats = ifelse(grepl("[d_]cv$", method), 1, NA),
p = 0.75,
search = "grid",
initialWindow = NULL,
horizon = 1,
fixedWindow = TRUE,
skip = 0,
verboseIter = FALSE,
returnData = TRUE,
returnResamp = "final",
savePredictions = FALSE,
classProbs = FALSE,
summaryFunction = defaultSummary,
selectionFunction = "best",
preProcOptions = list(thresh = 0.95, ICAcomp = 3, k = 5, freqCut = 95/5, uniqueCut = 10, cutoff = 0.9),
sampling = NULL,
index = NULL,
indexOut = NULL,
indexFinal = NULL,
timingSamps = 0,
predictionBounds = rep(FALSE, 2),
seeds = NA,
adaptive = list(min = 5, alpha = 0.05, method = "gls", complete = TRUE),
trim = FALSE,
allowParallel = TRUE
)
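Putting the two functions together, a 10-fold cross-validation over a grid of candidate k values might look like this (a sketch assuming the caret package is installed; the grid of odd k values is an arbitrary choice):

```r
library(caret)

set.seed(3)
ctrl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

fit <- train(Species ~ ., data = iris,
             method    = "knn",
             trControl = ctrl,
             tuneGrid  = expand.grid(k = seq(1, 15, by = 2)))

fit$bestTune$k  # the k with the best cross-validated accuracy
```

train refits the model on the full data with the winning k, so predict(fit, newdata) can be used directly afterwards.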