一. Linear Regression
1. lm() returns a model fitted to the input variables; the returned object can be passed to many downstream functions.
> lm.model <- lm(wt ~ mpg, data = mtcars)
> coefficients(lm.model) # extract the model coefficients
(Intercept) mpg
6.047255 -0.140862
> confint(lm.model, level=0.95) # 95% confidence intervals for the model coefficients
2.5 % 97.5 %
(Intercept) 5.4168245 6.6776856
mpg -0.1709569 -0.1107671
> fitted(lm.model) # the model's fitted (predicted) values of wt
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant Duster 360
3.089154 3.089154 2.835602 3.032809 3.413136 3.497653 4.032929
Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL Merc 450SLC
2.610223 2.835602 3.342705 3.539912 3.737119 3.610343 3.906153
Cadillac Fleetwood Lincoln Continental Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
4.582291 4.582291 3.976584 1.483327 1.765051 1.272034 3.018723
Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
3.863894 3.906153 4.173791 3.342705 2.201723 2.384844 1.765051
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
3.821636 3.272274 3.934325 3.032809
> residuals(lm.model) # residuals: observed wt minus fitted wt
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant Duster 360
-0.46915365 -0.21415365 -0.51560210 0.18219114 0.02686382 -0.03765336 -0.46292884
Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL Merc 450SLC
0.57977705 0.31439790 0.09729481 -0.09991195 0.33288129 0.11965707 -0.12615307
Cadillac Fleetwood Lincoln Continental Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
0.66770947 0.84170947 1.36841594 0.71667281 -0.15005113 0.56296577 -0.55372266
Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
-0.34389448 -0.47115307 -0.33379081 0.50229481 -0.26672324 -0.24484380 -0.25205113
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
-0.65163589 -0.50227421 -0.36432547 -0.25280886
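As a quick sanity check (a minimal sketch refitting the same model as above), the fitted values and residuals always recombine exactly into the observed response:

```r
# fitted() + residuals() reproduces the observed wt, up to numerical precision
lm.model <- lm(wt ~ mpg, data = mtcars)
reconstructed <- fitted(lm.model) + residuals(lm.model)
all.equal(unname(reconstructed), mtcars$wt)   # TRUE
```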
> anova(lm.model) # ANOVA table
Analysis of Variance Table
Response: wt
Df Sum Sq Mean Sq F value Pr(>F)
mpg 1 22.3431 22.3431 91.375 1.294e-10 ***
Residuals 30 7.3356 0.2445
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> vcov(lm.model) # variance-covariance matrix of the coefficient estimates
(Intercept) mpg
(Intercept) 0.095289963 -0.0043626666
mpg -0.004362667 0.0002171494
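The confint() output shown earlier can be reproduced by hand from vcov(): the square roots of its diagonal are the coefficient standard errors (a small sketch, not part of the original session):

```r
lm.model <- lm(wt ~ mpg, data = mtcars)
est   <- coef(lm.model)
se    <- sqrt(diag(vcov(lm.model)))        # standard errors of the estimates
tcrit <- qt(0.975, df.residual(lm.model))  # t quantile for a 95% interval
cbind(`2.5 %` = est - tcrit * se, `97.5 %` = est + tcrit * se)
# matches confint(lm.model, level = 0.95)
```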
> influence(lm.model) # leave-one-out influence diagnostics: hat values, coefficient changes, per-case sigma, residuals
$hat
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant Duster 360
0.03198439 0.03198439 0.03776901 0.03277255 0.03296737 0.03476902 0.06102792
Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL Merc 450SLC
0.04774195 0.03776901 0.03195442 0.03590963 0.04334604 0.03816586 0.05249086
Cadillac Fleetwood Lincoln Continental Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
0.11464634 0.11464634 0.05705606 0.16580983 0.12563611 0.20060244 0.03301399
Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
0.04996488 0.05249086 0.07220085 0.03195442 0.07740711 0.06226177 0.12563611
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
0.04759875 0.03138551 0.05426366 0.03277255
$coefficients
(Intercept) mpg
Mazda RX4 -0.007282030 -3.913985e-04
Mazda RX4 Wag -0.003324014 -1.786609e-04
Datsun 710 0.009157470 -1.289282e-03
Hornet 4 Drive 0.001485910 2.190313e-04
Hornet Sportabout 0.001557359 -3.430679e-05
Valiant -0.002604528 6.896128e-05
Duster 360 -0.066342655 2.535306e-03
Merc 240D -0.027785644 2.330044e-03
Merc 230 -0.005583936 7.861634e-04
Merc 280 0.004737902 -7.949360e-05
Merc 280C -0.007473905 2.108129e-04
Merc 450SE 0.033786317 -1.140454e-03
Merc 450SL 0.010081735 -3.083068e-04
Merc 450SLC -0.015778260 5.782587e-04
Cadillac Fleetwood 0.153962430 -6.490318e-03
Lincoln Continental 0.194083866 -8.181645e-03
Chrysler Imperial 0.184925710 -6.947280e-03
Fiat 128 -0.161833613 9.391507e-03
Honda Civic 0.026202892 -1.571169e-03
Toyota Corolla -0.151504789 8.636477e-03
Toyota Corona -0.003495500 -7.167077e-04
Dodge Challenger -0.040959820 1.475709e-03
AMC Javelin -0.058928219 2.159665e-03
Camaro Z28 -0.054830702 2.169570e-03
Pontiac Firebird 0.024459924 -4.103942e-04
Fiat X1-9 0.028152063 -1.850938e-03
Porsche 914-2 0.019369325 -1.370227e-03
Lotus Europa 0.044014788 -2.639199e-03
Ford Pantera L -0.073758568 2.607048e-03
Ferrari Dino -0.019818649 1.798843e-04
Maserati Bora -0.047027088 1.741542e-03
Volvo 142E -0.002061852 -3.039283e-04
$sigma
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant Duster 360
0.4950874 0.5013168 0.4933815 0.5017657 0.5029179 0.5028932 0.4950577
Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL Merc 450SLC
0.4906934 0.4994096 0.5026082 0.5025884 0.4989569 0.5024330 0.5023674
Cadillac Fleetwood Lincoln Continental Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
0.4853738 0.4747194 0.4295043 0.4813738 0.5020600 0.4891637 0.4919538
Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
0.4986579 0.4948469 0.4988098 0.4939281 0.5002931 0.5007472 0.5004465
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
0.4874197 0.4939342 0.4981090 0.5006732
$wt.res
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant Duster 360
-0.46915365 -0.21415365 -0.51560210 0.18219114 0.02686382 -0.03765336 -0.46292884
Merc 240D Merc 230 Merc 280 Merc 280C Merc 450SE Merc 450SL Merc 450SLC
0.57977705 0.31439790 0.09729481 -0.09991195 0.33288129 0.11965707 -0.12615307
Cadillac Fleetwood Lincoln Continental Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla Toyota Corona
0.66770947 0.84170947 1.36841594 0.71667281 -0.15005113 0.56296577 -0.55372266
Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
-0.34389448 -0.47115307 -0.33379081 0.50229481 -0.26672324 -0.24484380 -0.25205113
Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
-0.65163589 -0.50227421 -0.36432547 -0.25280886
2. Assessing goodness of fit
(1) Obtain predictions from the model:
predict(model, the data set used to fit the model)
(2) Use the R2 function from the caret package:
R2(the predict result, the y values from the original data)
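For the simple mtcars fit above, caret's R2 (with its default method) is the squared correlation between predictions and observations, so the value can be checked with base R alone (caret is assumed installed only for the commented-out call):

```r
lm.model <- lm(wt ~ mpg, data = mtcars)
pred <- predict(lm.model, mtcars)   # predictions on the training data
r2   <- cor(pred, mtcars$wt)^2      # same value as caret::R2(pred, mtcars$wt)
# for a single-predictor lm this equals summary(lm.model)$r.squared
r2
```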
二. Multiple Linear Regression
1. lm() and interaction analysis
(1) Put several variables into the formula directly, joined by plus signs:
mtcars.m <- lm(wt ~ mpg + disp + mpg:disp, data = mtcars)
## model wt as a function of mpg and disp; the mpg:disp term adds the
## interaction between the two variables, i.e. they are not treated as independent
summary(mtcars.m)
Model:
> summary(mtcars.m)
Call:
lm(formula = wt ~ mpg + disp + mpg:disp, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-0.8047 -0.2636 0.0539 0.2340 0.8817
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.3716803 0.7066020 4.772 5.18e-05 ***
mpg -0.0505361 0.0257971 -1.959 0.06014 .
disp 0.0065258 0.0020525 3.179 0.00359 **
mpg:disp -0.0001603 0.0001232 -1.301 0.20383
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4055 on 28 degrees of freedom
Multiple R-squared: 0.8449, Adjusted R-squared: 0.8283
F-statistic: 50.84 on 3 and 28 DF, p-value: 1.871e-11
(2) Relative importance of the predictors
Quantify how much each variable contributes to the model:
library(relaimpo)
calc.relimp( wt~mpg+disp+mpg:disp, data = mtcars )
> calc.relimp( wt~mpg+disp+mpg:disp, data = mtcars )
Response variable: wt
Total response variance: 0.957379
Analysis based on 32 observations
3 Regressors:
mpg disp mpg:disp
Proportion of variance explained by model: 84.49%
Metrics are not normalized (rela=FALSE).
Relative importance metrics: ## importance weights
lmg
mpg 0.399914296
disp 0.435589844
mpg:disp 0.009378141
Average coefficients for different model sizes:
1X 2Xs 3Xs
mpg -0.140861970 -0.066315339 -0.0505361268
disp 0.007010325 0.004277131 0.0065257929
mpg:disp NaN NaN -0.0001603097
(3) Visualizing how the predictors influence each other
The interaction mainly shows up as a change in slope:
library(interactions)
sim_slopes(mtcars.m, pred = mpg, modx = disp, jnplot = TRUE)
## pred: the variable whose slope is being moderated; modx: the moderating variable
Result:
JOHNSON-NEYMAN INTERVAL
When disp is INSIDE the interval [8.39, 747.98], the slope of mpg is p < .05.
Note: The range of observed values of disp is [71.10, 472.00]
SIMPLE SLOPES ANALYSIS
## the data were chosen arbitrarily, so the interval is rather implausible
2. glm()
With family = gaussian, glm fits the same model as lm(); but glm also covers other regression families, e.g. logistic regression for a binary response:
## use glm to build a model on iris: given the predictor values,
## the numeric output lets us judge the corresponding Species
library(dplyr)
dat <- iris %>% filter( Species %in% c("setosa", "virginica") )
dat$Species <- as.factor(dat$Species) ## the response column must be a factor, otherwise glm may error
bm <- glm( Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
           data = dat, family = binomial ) ## choose the binomial family
mtcars.predict1 <- predict(bm, dat, type = "response")
mtcars.predict2 <- predict(bm, dat) ## the default type ("link")
Result: note that with type = "response" the output is on the probability scale, which gives a much more intuitive binary classification.
> mtcars.predict1
1 2 3 4 5 6 7 8 9 10 11 12
2.220446e-16 4.732528e-13 1.369834e-13 3.388726e-12 2.220446e-16 3.892826e-13 7.748530e-13 1.988060e-13 5.204109e-12 3.900993e-13 2.220446e-16 1.453927e-12
13 14 15 16 17 18 19 20 21 22 23 24
2.656364e-13 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 2.220446e-16 4.472442e-13 2.682276e-13 2.220446e-16 4.132652e-11
25 26 27 28 29 30 31 32 33 34 35 36
5.282004e-11 3.493413e-12 4.614812e-12 2.220446e-16 2.220446e-16 4.976501e-12 5.082694e-12 2.857677e-13 2.220446e-16 2.220446e-16 1.032749e-12 2.220446e-16
37 38 39 40 41 42 43 44 45 46 47 48
2.220446e-16 2.220446e-16 1.035326e-12 1.337875e-13 2.220446e-16 3.421128e-11 4.494793e-13 2.131127e-11 2.126479e-11 1.861773e-12 2.220446e-16 6.741667e-13
49 50 51 52 53 54 55 56 57 58 59 60
2.220446e-16 2.220446e-16 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
61 62 63 64 65 66 67 68 69 70 71 72
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
73 74 75 76 77 78 79 80 81 82 83 84
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
85 86 87 88 89 90 91 92 93 94 95 96
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
97 98 99 100
1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
> mtcars.predict2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
-31.25726 -28.37915 -29.61892 -26.41057 -31.27837 -28.57447 -27.88610 -29.24645 -25.98157 -28.57237 -32.08233 -27.25675 -28.95665 -30.56889 -38.51083
16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
-34.24374 -33.36463 -30.28368 -30.31909 -30.33771 -28.43567 -28.94694 -34.48423 -23.90952 -23.66413 -26.38014 -26.10175 -30.45579 -31.23614 -26.02629
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
-26.00518 -28.88360 -33.93252 -35.76190 -27.59880 -32.00469 -34.03910 -31.85587 -27.59630 -29.64252 -31.08514 -24.09847 -28.43069 -24.57178 -24.57397
46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
-27.00949 -30.11375 -28.02530 -31.68625 -30.02680 42.30338 30.16757 35.29448 32.36693 37.44698 41.69689 25.43424 36.78895 34.84647 38.68465
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
26.28266 30.18619 31.69255 31.17407 34.61827 31.99455 29.96005 40.13440 48.50935 26.36952 34.80433 29.12103 42.35915 24.81853 33.23214
76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
32.34083 23.59987 24.35911 35.30878 28.83298 35.38856 33.80247 36.28236 23.87569 30.51635 37.26026 36.12244 29.93893 23.55764 29.68174
91 92 93 94 95 96 97 98 99 100
35.78971 28.03628 30.16757 37.59549 37.12645 30.44316 27.82403 28.31458 33.14986 27.54634
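The two prediction vectors above differ only by the link function: type = "response" applies the inverse logit to the default link-scale output. A quick check, refitting the same model (glm warns that fitted probabilities of 0 or 1 occurred, because these two species are perfectly separable):

```r
dat <- iris[iris$Species %in% c("setosa", "virginica"), ]
dat$Species <- factor(dat$Species)
bm <- glm(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
          data = dat, family = binomial)
link <- predict(bm)                    # log-odds (the default "link" scale)
prob <- predict(bm, type = "response") # probabilities in [0, 1]
all.equal(prob, plogis(link))          # TRUE: response = inverse logit of link
```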
三. Multivariate Nonlinear Regression
1. Machine learning -- regression analysis
(1) Analyze the data with the earth and caret packages
For example, regress Education in swiss on all the other columns:
library(earth)
nls.model <- earth(Education ~ ., data = swiss)
nls.model
plot(nls.model)
(2) Tuning the fitted model (training)
Taking n-fold cross-validation as an example, using the train function from the caret package:
train(x, y, method = "rf", preProcess = NULL, ...,
      weights = NULL, metric = ifelse(is.factor(y), "Accuracy", "RMSE"),
      maximize = ifelse(metric %in% c("RMSE", "logLoss", "MAE"), FALSE, TRUE),
      trControl = trainControl(), tuneGrid = NULL,
      tuneLength = ifelse(trControl$method == "none", 1, 3))
where:
x: the predictor variables used for fitting
y: the response variable used for fitting
weights: case weights for the model, optional
trControl: takes the output of trainControl(), which specifies the resampling/validation scheme
tuneGrid: the grid of candidate tuning-parameter values; different methods have different tuning parameters, so the required column names differ (train matches the grid's column names to the parameters it tunes)
For example, fit a model for Fertility in swiss, using all the remaining columns as predictors:
grid <- expand.grid(
  degree = 1:3,
  nprune = seq(2, 100, length.out = 8)
)
## a tuning grid built for this method: degree is the maximum interaction degree,
## nprune the maximum number of terms kept after pruning
## expand.grid generates every combination of the supplied values, one row per combination
set.seed(12222) ## fix the random seed so the model is reproducible
swiss.model1 <- train(
x = subset(swiss, select = -Fertility),
y = swiss$Fertility,
method = "earth",
metric = "RMSE",
trControl = trainControl(method = "cv", number = 10), ## 10-fold cross-validation
tuneGrid = grid
)
Result:
> grid
degree nprune
1 1 2
2 2 2
3 3 2
4 1 16
5 2 16
6 3 16
7 1 30
8 2 30
9 3 30
10 1 44
11 2 44
12 3 44
13 1 58
14 2 58
15 3 58
16 1 72
17 2 72
18 3 72
19 1 86
20 2 86
21 3 86
22 1 100
23 2 100
24 3 100
> swiss.model1
Multivariate Adaptive Regression Spline
47 samples
5 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 42, 42, 42, 42, 42, 44, ...
Resampling results across tuning parameters:
degree nprune RMSE Rsquared MAE
1 2 11.278358 0.3284905 9.209832
1 16 9.541177 0.5068536 7.839405
1 30 9.541177 0.5068536 7.839405
1 44 9.541177 0.5068536 7.839405
1 58 9.541177 0.5068536 7.839405
1 72 9.541177 0.5068536 7.839405
1 86 9.541177 0.5068536 7.839405
1 100 9.541177 0.5068536 7.839405
2 2 11.294113 0.3270052 9.216748
2 16 10.717470 0.4382987 8.802805
2 30 10.717470 0.4382987 8.802805
2 44 10.717470 0.4382987 8.802805
2 58 10.717470 0.4382987 8.802805
2 72 10.717470 0.4382987 8.802805
2 86 10.717470 0.4382987 8.802805
2 100 10.717470 0.4382987 8.802805
3 2 11.294113 0.3270052 9.216748
3 16 10.798077 0.4366048 8.891187
3 30 10.798077 0.4366048 8.891187
3 44 10.798077 0.4366048 8.891187
3 58 10.798077 0.4366048 8.891187
3 72 10.798077 0.4366048 8.891187
3 86 10.798077 0.4366048 8.891187
3 100 10.798077 0.4366048 8.891187
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nprune = 16 and degree = 1.
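The tuning grid printed earlier is simply the Cartesian product of the candidate values, which is easy to verify without caret:

```r
grid <- expand.grid(
  degree = 1:3,
  nprune = seq(2, 100, length.out = 8)
)
nrow(grid)   # 24: every combination of 3 degrees and 8 nprune values
```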
(3) Use vip to show the importance of each predictor in the final model
library(vip)
p1 <- vip(swiss.model1, num_features = 4, geom = "point", value = "gcv")
num_features controls how many variables are displayed (the most important first); value is the importance metric: "gcv" stands for generalized cross-validation, and "rss" instead ranks variables by residual sums of squares.
2. Logistic regression
> library(drc)
> m <- drm(cyl ~ wt, fct = L.3(), data = mtcars)
> plot(m, log="", main = "Logistic function")
## just an illustration: this data does not actually follow a logistic curve
3. Nonlinear least squares
(1) Using the nls function
nls(formula, data, start, control, algorithm,
    trace, subset, weights, na.action, model,
    lower, upper, ...)
formula: the model formula; the form of the regression model must be known in advance, since nls only fits the parameters
start: can be a list giving the initial value of each parameter for a model that cannot self-start
control: controls the fitting conditions
set.seed(43247)
x <- seq(0, 100, 1)
y <- rnorm(1, 0, 1) * x^2 + runif(1, 0, 1) * x^3 + rnorm(101, 0, 1)
m <- nls(y ~ (a * x^2 + b * x^3),
         control = nls.control(maxiter = 100000), ## cap the maximum number of iterations
         start = list(a = 1, b = 1))
Result:
> m
Nonlinear regression model
model: y ~ (a * x^2 + b * x^3)
data: parent.frame()
a b
-0.7908 0.3721
residual sum-of-squares: 87.8
Number of iterations to convergence: 1
Achieved convergence tolerance: 5.332e-09
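Predictions from the fitted nls object come from predict(): with no newdata it returns the fitted values, and new x values are supplied as a data frame (a sketch reusing the simulated data above):

```r
set.seed(43247)
x <- seq(0, 100, 1)
y <- rnorm(1, 0, 1) * x^2 + runif(1, 0, 1) * x^3 + rnorm(101, 0, 1)
m <- nls(y ~ a * x^2 + b * x^3, start = list(a = 1, b = 1))
length(predict(m))                                 # 101 fitted values, one per x
predict(m, newdata = data.frame(x = c(150, 200)))  # predictions at new x
```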
(2) Using the drm function for automatic fitting
There is no need to know the functional form in advance: pick one of the built-in models and drm chooses its own starting values.
library(drc)
library(ggplot2)
m <- drm( y ~ x, fct = LL.3() ) ## built-in model
dat <- data.frame(x = x, y = y, nls = predict(m)) ## predict with only the model returns the fitted values
ggplot( dat, aes(x = x, y = y) ) +
  geom_point() + geom_line( aes(x = x, y = nls), colour = "red")
Notes (from the drc documentation):
## fct: a list with three or more elements specifying the non-linear function, the accompanying self-starter function, the names of the parameters in the non-linear function and, optionally, the first and second derivatives as well as information used for calculation of ED values. Currently available functions include, among others, the four- and five-parameter log-logistic models LL.4, LL.5 and the Weibull model W1.4. Use getMeanFunctions for a full list.
LL.3:
LL.3(fixed = c(NA, NA, NA), names = c("b", "d", "e"), ...)
f(x) = 0 + (d - 0) / (1 + exp(b * (log(x) - log(e))))
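The LL.3 mean function quoted above can be written out directly in base R (a hypothetical helper, not part of drc); with the lower asymptote fixed at 0, the curve is exactly halfway to the upper asymptote d at x = e:

```r
# Three-parameter log-logistic: lower asymptote 0, upper asymptote d,
# slope parameter b, inflection point e.
ll3 <- function(x, b, d, e) d / (1 + exp(b * (log(x) - log(e))))
ll3(5, b = -2, d = 1, e = 5)   # 0.5: at x = e the exponent is 0, giving d/2
```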