- The Sigmoid function and the Logistic regression classifier
- Basics of optimization theory
- The gradient descent optimization algorithm
- Handling missing values in data
Classification based on Logistic regression and the Sigmoid function
The Sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Properties: $\sigma$ maps any real input into $(0, 1)$, with $\sigma(0) = 0.5$, $\sigma(z) \to 1$ as $z \to +\infty$, and $\sigma(z) \to 0$ as $z \to -\infty$; its derivative satisfies $\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)$.
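As a quick sanity check (a minimal sketch, not from the original text), the derivative identity can be verified numerically with central differences:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])   # arbitrary test points
s = sigmoid(z)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
print(np.allclose(numeric, s * (1 - s)))  # True: sigma'(z) = sigma(z)(1 - sigma(z))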
Plotting the sigmoid function:
import matplotlib.pyplot as plt
import numpy as np

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(4, 6))

# Near the origin the sigmoid is a smooth S-shaped curve
x = np.linspace(-5, 5)
ax1.plot(x, 1 / (1 + np.exp(-x)), color='black')
ax1.set_xlabel("$x$")
ax1.set_ylabel(r"$\mathrm{sigmoid}(x)$")

# On a wide range it looks almost like a unit step function
x = np.linspace(-60, 60, 100)
ax2.plot(x, 1 / (1 + np.exp(-x)))
ax2.set_xlabel("$x$")
ax2.set_ylabel(r"$\mathrm{sigmoid}(x)$")

plt.show()
Determining the best regression coefficients with optimization methods
Denote the input to the sigmoid function by $z$; then

$$z = w_0 x_0 + w_1 x_1 + \cdots + w_n x_n,$$

or, written as vectors,

$$z = \mathbf{w}^{\top} \mathbf{x},$$

where $\mathbf{w}$ is the parameter vector we want to learn and $\mathbf{x}$ is an input sample.
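In code, the model's predicted probability for a single sample is just $\sigma(\mathbf{w}^{\top}\mathbf{x})$. A minimal sketch with made-up numbers (the weights merely resemble the ones fitted later):

import numpy as np

w = np.array([4.12, 0.48, -0.62])   # illustrative weights
x = np.array([1.0, 0.5, 1.2])       # one sample; x0 = 1.0 acts as the bias term
z = np.dot(w, x)                    # z = w^T x
prob = 1.0 / (1.0 + np.exp(-z))     # sigmoid(z): estimated P(class = 1 | x)
print(prob)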
To find $\mathbf{w}$ we use the gradient ascent algorithm, which works like gradient descent. Its update rule is

$$\mathbf{w} := \mathbf{w} + \alpha \nabla_{\mathbf{w}} f(\mathbf{w}),$$

where $\alpha$ is the learning rate (step size). Gradient descent differs only in the sign in front of the step:

$$\mathbf{w} := \mathbf{w} - \alpha \nabla_{\mathbf{w}} f(\mathbf{w}).$$

Gradient ascent is used to find a function's maximum; gradient descent is used to find its minimum.
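As a toy illustration (not from the original text), gradient ascent climbing the concave function $f(w) = -(w - 2)^2$, whose gradient is $-2(w - 2)$, converges to the maximizer $w = 2$:

def grad(w):
    return -2.0 * (w - 2.0)   # gradient of f(w) = -(w - 2)^2

w = 0.0       # initial guess
alpha = 0.1   # learning rate
for _ in range(100):
    w = w + alpha * grad(w)   # ascend: step along the gradient
print(w)      # approximately 2.0, the maximizer of f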
Training: finding the optimal parameters with gradient ascent
import numpy as np
import matplotlib.pyplot as plt

# Each line of testSet.txt holds two feature values and a class label,
# separated by tabs
with open("./testSet.txt", "r") as fr:
    data = fr.readlines()
lstData = []
for strLine in data:
    lstData.append([float(item) for item in strLine.strip().split("\t")])
arrData = np.array(lstData)

fig = plt.figure()
ax = fig.add_subplot(111)
# Color each point by its class label
ax.scatter(x=arrData[:, 0], y=arrData[:, 1], c=15.0 * arrData[:, 2])
plt.show()
Visualizing the data
import matplotlib.pyplot as plt
import numpy as np

def loadDataSet(filename, delim='\t'):
    dataMat = []
    labelMat = []
    with open(filename, 'r') as fr:
        for line in fr.readlines():
            lineArr = line.strip().split(delim)
            # prepend the constant 1.0 so that w0 acts as the bias term
            dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
            labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

def sigmoid(inX):
    return 1.0 / (1 + np.exp(-inX))

def gradAscent(dataMatIn, classLabels):
    dataMatrix = np.array(dataMatIn)                  # m x n feature matrix
    labelMat = np.array(classLabels).reshape(-1, 1)   # m x 1 label column
    m, n = dataMatrix.shape
    alpha = 0.001     # learning rate
    maxCycles = 500   # number of full passes over the data
    weights = np.ones((n, 1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix @ weights)   # predicted probabilities
        error = labelMat - h                # prediction error for every sample
        # step in the direction of the log-likelihood gradient
        weights = weights + alpha * (dataMatrix.T @ error)
    return weights

dataArr, labelMat = loadDataSet('testSet.txt')
weights = gradAscent(dataArr, labelMat)
print(weights)
[[ 4.12414349]
[ 0.48007329]
[-0.6168482 ]]
Analyzing the data: plotting the decision boundary
# Plot the decision boundary
def plotBestFit(weights, filename):
    dataMat, labelMat = loadDataSet(filename)
    dataArr = np.array(dataMat)
    n = np.shape(dataArr)[0]
    w = np.asarray(weights).flatten()
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # plot class 1
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    # plot class 0
    ax.scatter(xcord2, ycord2, s=30, c='green')
    # the boundary is where the sigmoid's input is zero:
    # w0 + w1*x1 + w2*x2 = 0, i.e. x2 = (-w0 - w1*x1) / w2
    x = np.arange(-3.0, 3.0, 0.1)
    y = (-w[0] - w[1] * x) / w[2]
    ax.plot(x, y)
    plt.xlabel('$X1$'); plt.ylabel('$X2$')
    plt.show()
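Calling it with the weights found by batch gradient ascent above:

plotBestFit(weights, 'testSet.txt')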
As the plot shows, the boundary separates the two classes quite well.
Training the algorithm: stochastic gradient ascent

Analogous to stochastic gradient descent: plain (batch) gradient ascent has to scan the entire data set every time it updates the regression coefficients, so with large amounts of data it converges slowly and is computationally expensive.
Stochastic gradient ascent:

Initialize all regression coefficients to 1
For each sample in the data set:
    compute the gradient for that sample
    update the coefficients by alpha * gradient
Return the regression coefficient values
# Stochastic gradient ascent: one update per sample, a single pass over the data
def stocGradAscent0(dataMatrix, classLabels):
    m, n = np.shape(dataMatrix)
    alpha = 0.01
    weights = np.ones(n)
    for i in range(m):
        h = sigmoid(np.sum(dataMatrix[i, :] * weights))  # scalar prediction
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i, :]
    return weights
Try it out:
dataArr, labelMat = loadDataSet('testSet.txt')
weights = stocGradAscent0(np.array(dataArr), labelMat)
print(weights)
plotBestFit(weights, 'testSet.txt')
This classifier clearly performs worse, but the comparison is unfair: batch gradient ascent ran 500 iterations over the whole data set, while this version has made only a single pass and has not converged yet. On such a small data set we cannot tell from this alone which method is better.
import random

# Track how the coefficients evolve over 200 passes of stochastic updates
dataArray = np.array(dataArr)
classLabels = labelMat
m, n = dataArray.shape
alpha = 0.01
weights = np.ones(n)
weightsHistory = [weights.copy()]   # record the coefficients after every update

for k in range(1, m * 200):
    # i = random.randint(0, m - 1)  # pick a random sample
    i = k % m                        # cycle through the samples in order
    h = sigmoid(np.sum(dataArray[i, :] * weights))
    error = classLabels[i] - h
    weights = weights + alpha * error * dataArray[i, :]
    weightsHistory.append(weights.copy())
weights_iter = np.array(weightsHistory)
iters = range(len(weights_iter))

fig = plt.figure()
for idx, label in enumerate(["$w_0$", "$w_1$", "$w_2$"]):
    ax = fig.add_subplot(3, 1, idx + 1)
    ax.plot(iters, weights_iter[:, idx])
    ax.set_ylabel(label)
ax.set_xlabel("iteration")
plt.show()
plotBestFit(weights, 'testSet.txt')
The figure above shows how the regression coefficients change over 200 passes of stochastic gradient ascent. They reach roughly stable values after about 50 passes, whereas batch gradient ascent took about 150 iterations to roughly converge.
Improving the stochastic gradient ascent algorithm:
import random

# Improved stochastic gradient ascent
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = np.shape(dataMatrix)
    weights = np.ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            # decaying step size: shrinks with j and i but never reaches 0
            alpha = 4 / (1.0 + j + i) + 0.01
            # pick a random remaining sample (without replacement per pass)
            randIndex = int(random.uniform(0, len(dataIndex)))
            sampleIndex = dataIndex[randIndex]
            h = sigmoid(np.sum(dataMatrix[sampleIndex] * weights))
            error = classLabels[sampleIndex] - h
            weights = weights + alpha * error * dataMatrix[sampleIndex]
            del dataIndex[randIndex]
    return weights
The first improvement is the learning rate, a semi-adaptive schedule of sorts: it keeps shrinking as the iterations proceed but never decays to 0, thanks to the 0.01 floor (sketched below). The second improvement is updating the regression coefficients with randomly selected samples, which reduces periodic oscillations in the coefficients.
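A minimal sketch of the schedule alpha = 4/(1 + j + i) + 0.01 over the first few passes (m = 100 is an assumption matching the size of testSet.txt):

import matplotlib.pyplot as plt

m = 100       # samples per pass (assumed size of testSet.txt)
numIter = 5   # show only the first few passes
alphas = [4 / (1.0 + j + i) + 0.01 for j in range(numIter) for i in range(m)]
plt.plot(alphas)
plt.xlabel("update step")
plt.ylabel(r"$\alpha$")
plt.show()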
Running the improved stochastic gradient ascent:
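For example, reusing the helpers defined earlier:

dataArr, labelMat = loadDataSet('testSet.txt')
weights = stocGradAscent1(np.array(dataArr), labelMat)
plotBestFit(weights, 'testSet.txt')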
Example: predicting the mortality rate of horses with colic

Data set: 368 samples with 28 features
- Collect the data
- Prepare the data: parse the text file and fill in missing values
- Analyze the data: visualize it
- Train the algorithm: find the optimal coefficients
- Test the algorithm: measure the error rate
- Use the algorithm: output predictions
Preparing the data: handling missing values
Options (a sketch of the first one follows this list):
- Use the mean of the feature computed from all valid samples
- Fill missing entries with a special value (e.g., -1)
- Discard samples that contain missing values
- Use the mean of similar samples
- Use another machine learning algorithm to predict the missing values
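A minimal sketch of mean imputation, assuming missing entries are encoded as np.nan (the helper imputeMean is made up for illustration):

import numpy as np

def imputeMean(X):
    # replace every NaN with the mean of its column computed over valid rows
    X = X.copy()
    colMeans = np.nanmean(X, axis=0)
    missing = np.isnan(X)
    X[missing] = np.take(colMeans, np.where(missing)[1])
    return X

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
print(imputeMean(X))  # the NaNs become the column means 2.0 and 6.0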
Testing: classifying with logistic regression
import numpy as np
import matplotlib.pyplot as plt
from Logistic.logistic import *

def classifyVector(inX, weights):
    # classify as 1 when the predicted probability exceeds 0.5
    prob = sigmoid(sum(inX * weights))
    if prob > 0.5:
        return 1
    else:
        return 0

def colicTest():
    trainingSet = []
    trainingLabels = []
    with open("horseColicTraining.txt", 'r') as frTrain:
        for line in frTrain.readlines():
            currLine = line.strip().split("\t")
            trainingSet.append([float(item) for item in currLine[:-1]])
            trainingLabels.append(float(currLine[-1]))
    trainWeights = stocGradAscent1(np.array(trainingSet), trainingLabels, 500)
    errorCount = 0
    numTestVec = 0
    with open("horseColicTest.txt", 'r') as frTest:
        for line in frTest.readlines():
            numTestVec += 1
            currLine = line.strip().split("\t")
            testFeats = [float(item) for item in currLine[:-1]]
            testLabel = float(currLine[-1])
            if classifyVector(testFeats, trainWeights) != testLabel:
                errorCount += 1
    errorRate = errorCount / numTestVec
    print("the error rate of this test is: {}".format(errorRate))
    return errorRate

def multiTest():
    # average the error rate over several independent training runs
    numTests = 10
    errorSum = 0.0
    for _ in range(numTests):
        errorSum += colicTest()
    print("after {} iterations the average error rate is: {}".format(numTests, errorSum / numTests))

multiTest()
the error rate of this test is: 0.3582089552238806
the error rate of this test is: 0.373134328358209
the error rate of this test is: 0.31343283582089554
the error rate of this test is: 0.2835820895522388
the error rate of this test is: 0.3582089552238806
the error rate of this test is: 0.2537313432835821
the error rate of this test is: 0.44776119402985076
the error rate of this test is: 0.23880597014925373
the error rate of this test is: 0.31343283582089554
the error rate of this test is: 0.3880597014925373
after 10 iterations the average error rate is: 0.3328358208955223