[sklearn Study Notes] Support Vector Machines (SVM) - CFANZ Programming Community

class sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)

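kernel: the SVM kernel function

A kernel function is a mathematical device that expresses the dot product of two points in the higher-dimensional (mapped) space using only computations on the vectors in the original data space. This function, evaluated in the original space, is called the "kernel function".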

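What the kernel function does:

1. It guarantees that the dot product of any two vectors in the high-dimensional space can be expressed through some computation on the corresponding two vectors in the low-dimensional space.

2. Computing the relationship between vectors through a kernel function in the low-dimensional space is simpler.

Because the computation takes place in the original space, the cost of explicitly working in the high-dimensional space (the "curse of dimensionality") is avoided.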
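As a small illustration of the idea above, the sketch below checks numerically that a degree-2 polynomial kernel evaluated in the original 2-D space gives the same value as an ordinary dot product in an explicitly constructed 3-D feature space. The feature map phi and the helper poly2_kernel are names chosen here for illustration; they are not part of sklearn.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the kernel K(x, z) = (x . z)^2 on 2-D inputs."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel, computed in the original space."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))  # dot product after mapping both points to 3-D
implicit = poly2_kernel(x, z)             # same quantity without ever building phi

print(explicit, implicit)
```

Both numbers agree (up to floating-point rounding), which is the property the kernel function relies on: the high-dimensional dot product never has to be formed explicitly.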

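Input	Meaning	Problem type
"linear"	Linear kernel	Linear
"poly"	Polynomial kernel	Mostly linear
"sigmoid"	Hyperbolic tangent kernel	Nonlinear
"rbf"	Gaussian radial basis function (RBF) kernel	Mostly nonlinear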
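To get a feel for the table above, the following sketch fits SVC with each of the four kernels on a toy data set of two concentric rings. The data set (make_circles) and its settings are assumptions made here for illustration and do not come from the original notes; results on other data will differ.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: a data set that no straight line can separate.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

kernel_scores = {}
for kernel in ("linear", "poly", "sigmoid", "rbf"):
    clf = SVC(kernel=kernel, gamma="scale")
    clf.fit(X_train, y_train)
    # score() returns the mean accuracy on the held-out split
    kernel_scores[kernel] = clf.score(X_test, y_test)

for kernel, score in kernel_scores.items():
    print(f"{kernel:>8}: {score:.2f}")
```

On this kind of ring-shaped data the "rbf" kernel should clearly outperform "linear", which matches the linear versus nonlinear distinction in the table.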

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Note: load_boston was removed in scikit-learn 1.2. load_diabetes is used here
# as a stand-in regression data set (a substitution, not the original data).
from sklearn.datasets import load_breast_cancer, load_wine, load_diabetes
from sklearn.model_selection import train_test_split
from sklearn import svm

cancer = load_breast_cancer()
wine = load_wine()
diabetes = load_diabetes()

# Classification: linear-kernel SVC on the breast cancer data
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2)
cls = svm.SVC(kernel='linear')
cls.fit(X_train, y_train)
print('Coefficients:%s, intercept %s'%(cls.coef_,cls.intercept_))
print('Score: %.2f' % cls.score(X_test, y_test))  # mean accuracy

# Regression: LinearSVR on the stand-in regression data
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2)
reg = svm.LinearSVR()
reg.fit(X_train, y_train)
print('Coefficients:%s, intercept %s'%(reg.coef_,reg.intercept_))
print('Score: %.2f' % reg.score(X_test, y_test))  # R^2 for regressors

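A support vector machine classifier finds a hyperplane in the data space to serve as the decision boundary, uses this boundary to classify the data, and aims to keep the classification error as small as possible.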
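The hyperplane described above can be made concrete for a linear kernel. The sketch below, on a toy two-cluster data set assumed here for illustration, rebuilds the decision values by hand as w . x + b from coef_ and intercept_ and compares them with what SVC itself reports.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy binary problem (an assumption for this sketch, not data from the notes).
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

w = clf.coef_[0]       # normal vector of the separating hyperplane
b = clf.intercept_[0]  # offset of the hyperplane

manual = X @ w + b     # signed score of each sample relative to the hyperplane
same_as_sklearn = np.allclose(manual, clf.decision_function(X))

# A positive score maps to classes_[1], a negative score to classes_[0].
manual_pred = clf.classes_[(manual > 0).astype(int)]
same_prediction = np.array_equal(manual_pred, clf.predict(X))

print(same_as_sklearn, same_prediction)
```

Note that coef_ is only available with kernel="linear"; for other kernels the boundary is not a hyperplane in the original space.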

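Soft margin: let the decision boundary tolerate training errors on a small portion of the data instead of purely pursuing the maximum margin.

We need to find a balance between the "maximum margin" and the "number of misclassified samples".

This is done by introducing the penalty coefficient C (the weight placed on the slack terms).

When C is large, SVC chooses a decision boundary with a smaller margin that classifies the training points more accurately.

When C is small, SVC favors a larger margin even if some training points end up on the wrong side of the boundary; the decision function is simpler, at the cost of training accuracy.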
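The effect of C on the margin can be observed directly with a linear kernel, where the margin width equals 2 / ||w||. The following is a minimal sketch on overlapping toy clusters; the data set and the three C values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so a perfect separation is impossible.
X, y = make_blobs(
    n_samples=200, centers=[[0, 0], [2, 2]], cluster_std=1.2, random_state=0
)

margin_widths = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # For a linear SVM the distance between the two margin lines is 2 / ||w||.
    margin_widths[C] = 2.0 / np.linalg.norm(clf.coef_)
    print(
        f"C={C:<6} margin width={margin_widths[C]:.3f} "
        f"support vectors={clf.n_support_.sum()} "
        f"train accuracy={clf.score(X, y):.3f}"
    )
```

The smallest C should give the widest margin and the largest number of support vectors; increasing C narrows the margin as the model tries harder to classify each training point correctly.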

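Class imbalance in binary SVC: class_weight

A larger weight is placed on the minority-class samples, which forces the model to pay more attention to the minority class.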
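As a rough sketch of how class_weight changes the model, the code below fits SVC with and without class_weight="balanced" on an imbalanced toy data set and compares accuracy and minority-class recall. The data set (500 majority versus 50 minority samples) is an assumption made for this example, and scores are computed on the training data for simplicity.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, recall_score
from sklearn.svm import SVC

# Imbalanced toy data: 500 majority samples (label 0), 50 minority samples (label 1).
X, y = make_blobs(
    n_samples=[500, 50],
    centers=[[0.0, 0.0], [2.0, 2.0]],
    cluster_std=[1.5, 0.5],
    random_state=0,
)

plain = SVC(kernel="linear").fit(X, y)
# "balanced" weights each class inversely proportional to its frequency.
weighted = SVC(kernel="linear", class_weight="balanced").fit(X, y)

recall_plain = recall_score(y, plain.predict(X), pos_label=1)
recall_weighted = recall_score(y, weighted.predict(X), pos_label=1)
acc_plain = accuracy_score(y, plain.predict(X))
acc_weighted = accuracy_score(y, weighted.predict(X))

print(f"no weighting: accuracy={acc_plain:.3f} minority recall={recall_plain:.3f}")
print(f"balanced:     accuracy={acc_weighted:.3f} minority recall={recall_weighted:.3f}")
```

An explicit dictionary such as class_weight={1: 10} can be used instead of "balanced" when a specific weight ratio is wanted.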

def test_LinearSVC_C(*data):
    '''
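    Test how the predictive performance of LinearSVC changes with the parameter C

    :param data: variable arguments. A tuple whose elements are, in order: training samples, test samples, training labels, test labels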
    :return:   None
    '''
    X_train,X_test,y_train,y_test=data
    Cs=np.logspace(-2,1)
    train_scores=[]
    test_scores=[]
    for C in Cs:
        cls=svm.LinearSVC(C=C)
        cls.fit(X_train,y_train)
        train_scores.append(cls.score(X_train,y_train))
        test_scores.append(cls.score(X_test,y_test))

    ## Plotting
    fig=plt.figure()
    ax=fig.add_subplot(1,1,1)
    ax.plot(Cs,train_scores,label="Training score")
    ax.plot(Cs,test_scores,label="Testing score")
    ax.set_xlabel(r"C")
    ax.set_ylabel(r"score")
    ax.set_xscale('log')
    ax.set_title("LinearSVC")
    ax.legend(loc='best')
    plt.show()

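# LinearSVC is a classifier, so split the classification data (cancer) again before the call
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2)
test_LinearSVC_C(X_train,X_test,y_train,y_test) # call test_LinearSVC_C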

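In terms of accuracy, the model scores higher without class balancing and lower with it.

This is because, after balancing, the model misclassifies many majority-class samples in order to capture the minority class more effectively. The number of majority samples that become misclassified exceeds the number of minority samples that become correctly classified, so the overall accuracy of the model drops.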

Accuracy

Precision

Recall

Confusion matrix
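The four metrics listed above can be computed with sklearn.metrics. The sketch below uses a small hand-made set of labels (an assumption for illustration, with 1 as the minority class) so that each number can be checked by hand.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# 3 positive (minority) samples and 7 negative (majority) samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes, in label order (0, 1).
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / all samples
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)

print(cm)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Here accuracy is 0.8 while precision and recall for the minority class are both 2/3, which shows why accuracy alone can look good on imbalanced data.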

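Probability prediction with SVM

Important parameter: probability

Important interfaces: predict_proba, decision_function

predict_proba is available only when probability is set to True before training (that is, before calling fit); decision_function is available regardless of this setting.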
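A minimal sketch of the two interfaces on the breast cancer data follows. The split settings and random_state values are assumptions added here for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# probability=True must be set before fit(); it adds an internal
# cross-validated calibration step, so training is slower.
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X_train, y_train)

proba = clf.predict_proba(X_test)       # shape (n_samples, 2), each row sums to 1
scores = clf.decision_function(X_test)  # signed scores, shape (n_samples,)

# decision_function does not need probability=True.
clf_default = SVC(kernel="rbf").fit(X_train, y_train)
scores_default = clf_default.decision_function(X_test)

print(proba[:3])
print(scores[:3])
print(hasattr(clf_default, "predict_proba"))  # whether predict_proba is exposed
```

Because the probabilities come from a separate calibration step, the class with the highest predict_proba value can occasionally differ from the label returned by predict.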