- The Sigmoid function and the Logistic regression classifier
- Basics of optimization theory
- The gradient descent optimization algorithm
- Handling missing values in data
Classification based on Logistic regression and the Sigmoid function
The Sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Properties: $\sigma$ maps any real input into $(0, 1)$, with $\sigma(0) = 0.5$, $\sigma(z) \to 1$ as $z \to +\infty$, and $\sigma(z) \to 0$ as $z \to -\infty$; its derivative satisfies $\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)$.
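As a quick sanity check (a minimal sketch, not from the original text), the derivative identity can be verified numerically with central differences:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])   # arbitrary test points
s = sigmoid(z)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
print(np.allclose(numeric, s * (1 - s)))  # True: sigma'(z) = sigma(z)(1 - sigma(z))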
Plotting the sigmoid function:
import matplotlib.pyplot as plt
import numpy as np

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(4, 6))

# Near the origin the sigmoid is a smooth S-shaped curve
x = np.linspace(-5, 5)
ax1.plot(x, 1 / (1 + np.exp(-x)), color='black')
ax1.set_xlabel("$x$")
ax1.set_ylabel(r"$\mathrm{sigmoid}(x)$")

# On a wide range it looks almost like a unit step function
x = np.linspace(-60, 60, 100)
ax2.plot(x, 1 / (1 + np.exp(-x)))
ax2.set_xlabel("$x$")
ax2.set_ylabel(r"$\mathrm{sigmoid}(x)$")

plt.show()
Determining the best regression coefficients with optimization methods
Denote the input to the sigmoid function by $z$; then

$$z = w_0 x_0 + w_1 x_1 + \cdots + w_n x_n,$$

or, written as vectors,

$$z = \mathbf{w}^{\top} \mathbf{x},$$

where $\mathbf{w}$ is the parameter vector we want to learn and $\mathbf{x}$ is an input sample.
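In code, the model's predicted probability for a single sample is just $\sigma(\mathbf{w}^{\top}\mathbf{x})$. A minimal sketch with made-up numbers (the weights merely resemble the ones fitted later):

import numpy as np

w = np.array([4.12, 0.48, -0.62])   # illustrative weights
x = np.array([1.0, 0.5, 1.2])       # one sample; x0 = 1.0 acts as the bias term
z = np.dot(w, x)                    # z = w^T x
prob = 1.0 / (1.0 + np.exp(-z))     # sigmoid(z): estimated P(class = 1 | x)
print(prob)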
To find $\mathbf{w}$ we use the gradient ascent algorithm, which works like gradient descent. Its update rule is

$$\mathbf{w} := \mathbf{w} + \alpha \nabla_{\mathbf{w}} f(\mathbf{w}),$$

where $\alpha$ is the learning rate (step size). Gradient descent differs only in the sign in front of the step:

$$\mathbf{w} := \mathbf{w} - \alpha \nabla_{\mathbf{w}} f(\mathbf{w}).$$

Gradient ascent is used to find a function's maximum; gradient descent is used to find its minimum.
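As a toy illustration (not from the original text), gradient ascent climbing the concave function $f(w) = -(w - 2)^2$, whose gradient is $-2(w - 2)$, converges to the maximizer $w = 2$:

def grad(w):
    return -2.0 * (w - 2.0)   # gradient of f(w) = -(w - 2)^2

w = 0.0       # initial guess
alpha = 0.1   # learning rate
for _ in range(100):
    w = w + alpha * grad(w)   # ascend: step along the gradient
print(w)      # approximately 2.0, the maximizer of f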
Training: finding the optimal parameters with gradient ascent
import numpy as np
import matplotlib.pyplot as plt

# Each line of testSet.txt holds two feature values and a class label,
# separated by tabs
with open("./testSet.txt", "r") as fr:
    data = fr.readlines()
lstData = []
for strLine in data:
    lstData.append([float(item) for item in strLine.strip().split("\t")])
arrData = np.array(lstData)

fig = plt.figure()
ax = fig.add_subplot(111)
# Color each point by its class label
ax.scatter(x=arrData[:, 0], y=arrData[:, 1], c=15.0 * arrData[:, 2])
plt.show()
Visualizing the data
import matplotlib.pyplot as plt
import numpy as np

def loadDataSet(filename, delim='\t'):
    dataMat = []
    labelMat = []
    with open(filename, 'r') as fr:
        for line in fr.readlines():
            lineArr = line.strip().split(delim)
            # prepend the constant 1.0 so that w0 acts as the bias term
            dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
            labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

def sigmoid(inX):
    return 1.0 / (1 + np.exp(-inX))

def gradAscent(dataMatIn, classLabels):
    dataMatrix = np.array(dataMatIn)                  # m x n feature matrix
    labelMat = np.array(classLabels).reshape(-1, 1)   # m x 1 label column
    m, n = dataMatrix.shape
    alpha = 0.001     # learning rate
    maxCycles = 500   # number of full passes over the data
    weights = np.ones((n, 1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix @ weights)   # predicted probabilities
        error = labelMat - h                # prediction error for every sample
        # step in the direction of the log-likelihood gradient
        weights = weights + alpha * (dataMatrix.T @ error)
    return weights

dataArr, labelMat = loadDataSet('testSet.txt')
weights = gradAscent(dataArr, labelMat)
print(weights)
[[ 4.12414349]
[ 0.48007329]
[-0.6168482 ]]
Analyzing the data: plotting the decision boundary
# Plot the decision boundary
def plotBestFit(weights, filename):
    dataMat, labelMat = loadDataSet(filename)
    dataArr = np.array(dataMat)
    n = np.shape(dataArr)[0]
    w = np.asarray(weights).flatten()
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # plot class 1
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    # plot class 0
    ax.scatter(xcord2, ycord2, s=30, c='green')
    # the boundary is where the sigmoid's input is zero:
    # w0 + w1*x1 + w2*x2 = 0, i.e. x2 = (-w0 - w1*x1) / w2
    x = np.arange(-3.0, 3.0, 0.1)
    y = (-w[0] - w[1] * x) / w[2]
    ax.plot(x, y)
    plt.xlabel('$X1$'); plt.ylabel('$X2$')
    plt.show()
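Calling it with the weights found by batch gradient ascent above:

plotBestFit(weights, 'testSet.txt')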
As the plot shows, the boundary separates the two classes quite well.
Training the algorithm: stochastic gradient ascent

Analogous to stochastic gradient descent: plain (batch) gradient ascent has to scan the entire data set every time it updates the regression coefficients, so with large amounts of data it converges slowly and is computationally expensive.
Stochastic gradient ascent:

Initialize all regression coefficients to 1
For each sample in the data set:
    compute the gradient for that sample
    update the coefficients by alpha * gradient
Return the regression coefficient values
# Stochastic gradient ascent: one update per sample, a single pass over the data
def stocGradAscent0(dataMatrix, classLabels):
    m, n = np.shape(dataMatrix)
    alpha = 0.01
    weights = np.ones(n)
    for i in range(m):
        h = sigmoid(np.sum(dataMatrix[i, :] * weights))  # scalar prediction
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i, :]
    return weights
Try it out:
dataArr, labelMat = loadDataSet('testSet.txt')
weights = stocGradAscent0(np.array(dataArr), labelMat)
print(weights)
plotBestFit(weights, 'testSet.txt')
This classifier clearly performs worse, but the comparison is unfair: batch gradient ascent ran 500 iterations over the whole data set, while this version has made only a single pass and has not converged yet. On such a small data set we cannot tell from this alone which method is better.
import random

# Track how the coefficients evolve over 200 passes of stochastic updates
dataArray = np.array(dataArr)
classLabels = labelMat
m, n = dataArray.shape
alpha = 0.01
weights = np.ones(n)
weightsHistory = [weights.copy()]   # record the coefficients after every update

for k in range(1, m * 200):
    # i = random.randint(0, m - 1)  # pick a random sample
    i = k % m                        # cycle through the samples in order
    h = sigmoid(np.sum(dataArray[i, :] * weights))
    error = classLabels[i] - h
    weights = weights + alpha * error * dataArray[i, :]
    weightsHistory.append(weights.copy())
weights_iter = np.array(weightsHistory)
iters = range(len(weights_iter))

fig = plt.figure()
for idx, label in enumerate(["$w_0$", "$w_1$", "$w_2$"]):
    ax = fig.add_subplot(3, 1, idx + 1)
    ax.plot(iters, weights_iter[:, idx])
    ax.set_ylabel(label)
ax.set_xlabel("iteration")
plt.show()
plotBestFit(weights, 'testSet.txt')
The figure above shows how the regression coefficients change over 200 passes of stochastic gradient ascent. They reach roughly stable values after about 50 passes, whereas batch gradient ascent took about 150 iterations to roughly converge.
Improving the stochastic gradient ascent algorithm:
import random

# Improved stochastic gradient ascent
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = np.shape(dataMatrix)
    weights = np.ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            # decaying step size: shrinks with j and i but never reaches 0
            alpha = 4 / (1.0 + j + i) + 0.01
            # pick a random remaining sample (without replacement per pass)
            randIndex = int(random.uniform(0, len(dataIndex)))
            sampleIndex = dataIndex[randIndex]
            h = sigmoid(np.sum(dataMatrix[sampleIndex] * weights))
            error = classLabels[sampleIndex] - h
            weights = weights + alpha * error * dataMatrix[sampleIndex]
            del dataIndex[randIndex]
    return weights
The first improvement is the learning rate, a semi-adaptive schedule of sorts: it keeps shrinking as the iterations proceed but never decays to 0, thanks to the 0.01 floor (sketched below). The second improvement is updating the regression coefficients with randomly selected samples, which reduces periodic oscillations in the coefficients.
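A minimal sketch of the schedule alpha = 4/(1 + j + i) + 0.01 over the first few passes (m = 100 is an assumption matching the size of testSet.txt):

import matplotlib.pyplot as plt

m = 100       # samples per pass (assumed size of testSet.txt)
numIter = 5   # show only the first few passes
alphas = [4 / (1.0 + j + i) + 0.01 for j in range(numIter) for i in range(m)]
plt.plot(alphas)
plt.xlabel("update step")
plt.ylabel(r"$\alpha$")
plt.show()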
Running the improved stochastic gradient ascent:
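For example, reusing the helpers defined earlier:

dataArr, labelMat = loadDataSet('testSet.txt')
weights = stocGradAscent1(np.array(dataArr), labelMat)
plotBestFit(weights, 'testSet.txt')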
Example: predicting the mortality rate of horses with colic

Data set: 368 samples with 28 features
- Collect the data
- Prepare the data: parse the text file and fill in missing values
- Analyze the data: visualize it
- Train the algorithm: find the optimal coefficients
- Test the algorithm: measure the error rate
- Use the algorithm: output predictions
Preparing the data: handling missing values
Options (a sketch of the first one follows this list):
- Use the mean of the feature computed from all valid samples
- Fill missing entries with a special value (e.g., -1)
- Discard samples that contain missing values
- Use the mean of similar samples
- Use another machine learning algorithm to predict the missing values
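A minimal sketch of mean imputation, assuming missing entries are encoded as np.nan (the helper imputeMean is made up for illustration):

import numpy as np

def imputeMean(X):
    # replace every NaN with the mean of its column computed over valid rows
    X = X.copy()
    colMeans = np.nanmean(X, axis=0)
    missing = np.isnan(X)
    X[missing] = np.take(colMeans, np.where(missing)[1])
    return X

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
print(imputeMean(X))  # the NaNs become the column means 2.0 and 6.0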
Testing: classifying with logistic regression
import numpy as np
import matplotlib.pyplot as plt
from Logistic.logistic import *

def classifyVector(inX, weights):
    # classify as 1 when the predicted probability exceeds 0.5
    prob = sigmoid(sum(inX * weights))
    if prob > 0.5:
        return 1
    else:
        return 0

def colicTest():
    trainingSet = []
    trainingLabels = []
    with open("horseColicTraining.txt", 'r') as frTrain:
        for line in frTrain.readlines():
            currLine = line.strip().split("\t")
            trainingSet.append([float(item) for item in currLine[:-1]])
            trainingLabels.append(float(currLine[-1]))
    trainWeights = stocGradAscent1(np.array(trainingSet), trainingLabels, 500)
    errorCount = 0
    numTestVec = 0
    with open("horseColicTest.txt", 'r') as frTest:
        for line in frTest.readlines():
            numTestVec += 1
            currLine = line.strip().split("\t")
            testFeats = [float(item) for item in currLine[:-1]]
            testLabel = float(currLine[-1])
            if classifyVector(testFeats, trainWeights) != testLabel:
                errorCount += 1
    errorRate = errorCount / numTestVec
    print("the error rate of this test is: {}".format(errorRate))
    return errorRate

def multiTest():
    # average the error rate over several independent training runs
    numTests = 10
    errorSum = 0.0
    for _ in range(numTests):
        errorSum += colicTest()
    print("after {} iterations the average error rate is: {}".format(numTests, errorSum / numTests))

multiTest()
the error rate of this test is: 0.3582089552238806
the error rate of this test is: 0.373134328358209
the error rate of this test is: 0.31343283582089554
the error rate of this test is: 0.2835820895522388
the error rate of this test is: 0.3582089552238806
the error rate of this test is: 0.2537313432835821
the error rate of this test is: 0.44776119402985076
the error rate of this test is: 0.23880597014925373
the error rate of this test is: 0.31343283582089554
the error rate of this test is: 0.3880597014925373
after 10 iterations the average error rate is: 0.3328358208955223