[Machine Learning] Task 3: Iris Classification and Boston Housing Price Prediction with Logistic Regression and Linear Regression

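Linear regression: assumes a linear relationship between the independent variables and the dependent variable; suited to simple problems where that linear relationship is clear.
Ridge regression: adds L2 regularization to linear regression, which stabilizes the coefficient estimates when the features are multicollinear.
LASSO regression (Least Absolute Shrinkage and Selection Operator): adds L1 regularization, which can shrink some coefficients to exactly 0 and therefore performs feature selection.
Polynomial regression: suited to nonlinear data; adding polynomial terms of the features lets a linear model fit curved relationships.
Support vector regression (SVR): handles nonlinear regression by using a kernel function to implicitly map the data into a high-dimensional space and fitting a linear regression there.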
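
To make the differences between these model families concrete, here is a minimal sketch that scores each of them with 5-fold cross-validation. The data is synthetic and purely an assumption for illustration (a noisy quadratic curve, not the Boston data), and the hyperparameter values are arbitrary starting points.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVR

# Synthetic data (illustrative assumption): y is a noisy quadratic function of x
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = 0.5 * X_demo[:, 0] ** 2 + X_demo[:, 0] + rng.normal(0, 0.3, size=200)

# One candidate per model family from the list above
candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "svr_rbf": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)),
}

# Mean R^2 over 5 folds for each candidate
scores = {}
for name, estimator in candidates.items():
    scores[name] = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")
```

On data like this, the plain linear models cannot follow the curvature, while the polynomial pipeline can; on data that really is linear, the ranking would look different.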

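Steps of a regression analysis:

Data preprocessing: check the data for completeness, handle missing values, and normalize or standardize the features.
Feature selection: from the many candidate independent variables, keep those most strongly related to the target variable, which reduces model complexity.
Dataset splitting: split the data into a training set and a test set (80:20 is a common ratio) so that the model's generalization can be measured on data it has not seen.
Model selection: choose a regression model that suits the characteristics of the data.
Model evaluation: assess the model with metrics such as mean squared error (MSE), R², and mean absolute error (MAE).
Model optimization: tune the model's parameters (such as the regularization strength) using cross-validation and grid search.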
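
The steps above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the data comes from `make_regression` rather than a real dataset, and the names `X_syn`, `y_syn`, and `pipeline` are made up for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative assumption, not the Boston dataset)
X_syn, y_syn = make_regression(
    n_samples=300, n_features=8, n_informative=5, noise=10.0, random_state=0
)

# Dataset splitting: 80% training, 20% test
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Preprocessing and model in one pipeline, so the scaler is fitted on training data only
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipeline.fit(X_tr, y_tr)

# Model evaluation on the held-out test set
y_hat = pipeline.predict(X_te)
print(f"MSE: {mean_squared_error(y_te, y_hat):.2f}")
print(f"MAE: {mean_absolute_error(y_te, y_hat):.2f}")
print(f"R^2: {r2_score(y_te, y_hat):.3f}")
```

Putting the scaler inside the pipeline is a design choice that keeps test-set statistics from leaking into training.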

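1.3 Feature importance analysis, feature selection, and model optimization

Feature importance analysis:

In a regression model, feature importance can be assessed in the following ways:

Linear regression coefficients: each coefficient gives the change in the prediction per one-unit increase in its feature. Coefficients can be compared as importances only when the features are on a common scale, for example after standardization.
Tree-based models (such as random forests and gradient-boosted trees): these models report feature importances directly, typically computed from how much the splits on each feature reduce the impurity (for regression, the variance of the target).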

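# Feature importance example (assumes a fitted linear `model`, the feature DataFrame `X`, and the Chinese `font` defined in the steps below)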
coefficients = pd.Series(model.coef_, index=X.columns)
coefficients = coefficients.sort_values(ascending=False)
plt.figure(figsize=(10, 6))
coefficients.plot(kind='barh')
plt.title('特征重要性分析', fontproperties=font)
plt.xlabel('系数值', fontproperties=font)
plt.ylabel('特征', fontproperties=font)
plt.show()
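
For the tree-based route mentioned above, a minimal sketch could look like the following. The three-column dataset is synthetic and the column names `signal`, `weak`, and `noise` are hypothetical; `feature_importances_` is the impurity-based importance that scikit-learn's random forest exposes after fitting.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (illustrative assumption): the target depends strongly on "signal",
# weakly on "weak", and not at all on "noise"
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "weak", "noise"])
y_demo = 3.0 * X_demo["signal"] + 0.5 * X_demo["weak"] + rng.normal(0, 0.1, size=300)

# Fit a random forest and read its impurity-based feature importances
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances are normalized to sum to 1, so they describe relative contributions rather than effect sizes in the units of the target.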

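Feature selection:

Feature selection simplifies the model by dropping unimportant features, which can improve predictive accuracy and generalization. Common approaches are:

Filter methods: rank features with a statistic (such as the Pearson correlation coefficient or the chi-squared test) and keep those most strongly related to the target variable.
Wrapper methods: evaluate candidate feature subsets with a model and keep the subset that gives the best performance, for example recursive feature elimination (RFE).
Embedded methods: the model selects important features as part of training, for example LASSO regression.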
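
The three approaches can be compared side by side with scikit-learn. The sketch below is only an illustration: the data is synthetic (10 features, 3 of them informative), and the choices `k=3`, `n_features_to_select=3`, and `alpha=1.0` are assumptions picked for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (illustrative assumption): 10 features, only 3 carry signal
X_syn, y_syn = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Filter method: keep the 3 features with the highest univariate F-statistic
filter_selector = SelectKBest(score_func=f_regression, k=3).fit(X_syn, y_syn)
filter_mask = filter_selector.get_support()

# Wrapper method: recursive feature elimination around a linear regression
rfe_selector = RFE(estimator=LinearRegression(), n_features_to_select=3).fit(X_syn, y_syn)
rfe_mask = rfe_selector.get_support()

# Embedded method: LASSO sets the coefficients of unhelpful features to exactly 0
lasso = Lasso(alpha=1.0).fit(X_syn, y_syn)
lasso_mask = lasso.coef_ != 0

print("filter  :", filter_mask.nonzero()[0])
print("wrapper :", rfe_mask.nonzero()[0])
print("embedded:", lasso_mask.nonzero()[0])
```

The three masks do not have to agree; comparing them is a quick way to see which features every method considers useful.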

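Model optimization:

Model optimization is key to improving predictive performance. Common techniques are:

Regularization: add an L1 or L2 penalty term to keep the model from overfitting.
Hyperparameter tuning: use grid search or random search to find the best combination of hyperparameters.
Cross-validation: split the data into several folds and train and validate on each in turn, so that the performance estimate does not depend on one particular train/validation split.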

# Cross-validation example
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(model, X, y, cv=10, scoring='neg_mean_squared_error')
cv_scores_mean = -cv_scores.mean()
print(f'10折交叉验证的平均MSE：{cv_scores_mean}')
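
Cross-validation and hyperparameter tuning are often combined in a grid search. The following is a minimal sketch, assuming synthetic data and a hypothetical grid of `alpha` values for ridge regression; it is not the tuning used for the results in this article.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data (illustrative assumption)
X_syn, y_syn = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

# Candidate regularization strengths (hypothetical grid)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validated grid search; scikit-learn maximizes the score,
# so MSE is exposed as its negative
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_syn, y_syn)

best_alpha = search.best_params_["alpha"]
best_mse = -search.best_score_
print(f"best alpha: {best_alpha}, cross-validated MSE: {best_mse:.2f}")
```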

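Summary:

Regression analysis: examines the relationship between the features and the target variable and is commonly used to predict continuous variables.
Prediction: machine-learning regression models are selected, trained, and evaluated to improve predictive performance.
Feature importance and model optimization: feature selection, regularization, hyperparameter tuning, and cross-validation simplify the model and improve its predictive ability.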

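2. Boston housing price prediction and feature analysis

2.1 Step 1: Import the required modules and packages

We start by importing the libraries commonly used in machine-learning projects for data handling, modeling, and visualization.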

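# Import the required libraries
import pandas as pd  # data handling
import numpy as np  # numerical computing
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # statistical plots
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.linear_model import LinearRegression  # linear regression model
from sklearn.metrics import mean_squared_error, r2_score  # model evaluation
from matplotlib.font_manager import FontProperties  # Chinese font for plot labels

# Chinese font used by the plots below; change the path to a Chinese font on your system
font = FontProperties(fname=r'C:\Windows\Fonts\simhei.ttf')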

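Explanation:

pandas loads, processes, and analyzes the data.
numpy performs the numerical computations.
matplotlib.pyplot and seaborn visualize the data.
train_test_split splits the dataset into a training set and a test set.
LinearRegression is the class used to build the linear regression model.
mean_squared_error and r2_score evaluate model performance.
FontProperties sets the Chinese font used for the plot labels.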

2.2 Step 2: Load the Boston housing dataset

We load the Boston housing dataset from a local CSV file with pandas and then take a first look at its contents.

# Load the local dataset
file_path = r'C:\Users\Administrator\Desktop\ML\机器学习\实验任务二\data\boston_housing.csv'
df = pd.read_csv(file_path)

# Data exploration
print(df.head())  # first 5 rows
print(df.describe())  # summary statistics
print(df.columns)  # column names

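Explanation:

pd.read_csv() loads the data from a CSV file.
df.head() shows the first 5 rows, which gives a feel for the basic structure of the data.
df.describe() reports descriptive statistics for the numeric columns, such as the mean and standard deviation.
df.columns prints the column names so you can confirm they are what the later steps expect.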

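2.3 Step 3: Preprocess and split the data

Before modeling, we check the data, separate the features from the target variable, and split the dataset into a training set and a test set.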

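# Check for missing values
print(df.isnull().sum())  # number of missing values in each column

# Separate the features from the target variable
X = df.drop('MEDV', axis=1)  # 'MEDV' is the target variable: the house price (median home value)
y = df['MEDV']

# Split the data into training and test sets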
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

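Explanation:

df.isnull().sum() counts the missing values in each column as a data-quality check.
X = df.drop('MEDV', axis=1) removes the target variable (the house price), leaving the feature matrix X.
y = df['MEDV'] extracts the target variable.
train_test_split splits the dataset into a training set and a test set, with 20% of the data held out for testing.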

2.4 Step 4: Build and train the linear regression model

We fit a linear regression model on the training set.

# Create the linear regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

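Explanation:

LinearRegression() creates an instance of the linear regression model.
model.fit(X_train, y_train) fits the model to the training data, estimating the coefficients that the model then uses to predict house prices.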

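2.5 Step 5: Predict and evaluate the model

After training, we predict on the test set and evaluate the model with the mean squared error (MSE) and R².

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the evaluation results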
print(f'均方误差（MSE）：{mse}')
print(f'R²值：{r2}')

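Explanation:

model.predict(X_test) uses the trained model to predict on the test set.
mean_squared_error() computes the mean squared error, the average squared difference between predictions and actual values; the smaller it is, the better the model.
r2_score() computes R², the proportion of the variance in the target that the model explains; the closer it is to 1, the better the model.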
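
To see exactly what these two metrics measure, the sketch below computes them by hand on four made-up values and compares the results with scikit-learn. The arrays `y_true` and `y_est` are hypothetical numbers chosen for easy arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_est = np.array([2.5, 5.0, 8.0, 9.5])

# MSE: mean of the squared errors
mse_manual = np.mean((y_true - y_est) ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_est) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE = {mse_manual:.3f}")  # MSE = 0.375
print(f"R^2 = {r2_manual:.3f}")   # R^2 = 0.925

# The manual values agree with scikit-learn's implementations
print(np.isclose(mse_manual, mean_squared_error(y_true, y_est)))
print(np.isclose(r2_manual, r2_score(y_true, y_est)))
```

Here the squared errors are 0.25, 0, 1, and 0.25 (sum 1.5), and the total sum of squares around the mean of 6 is 20, which gives MSE = 1.5 / 4 and R² = 1 - 1.5 / 20.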

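2.6 Step 6: Visualize actual versus predicted values

We use a scatter plot to show the relationship between the actual and the predicted house prices; ideally, the points lie close to the diagonal.

# Plot actual versus predicted house prices
plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)  # draw the diagonal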
plt.xlabel('实际房价', fontproperties=font)
plt.ylabel('预测房价', fontproperties=font)
plt.title('实际房价 vs 预测房价', fontproperties=font)
plt.show()

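Explanation:

plt.scatter() draws the scatter plot, with the actual price on the x-axis and the predicted price on the y-axis.
plt.plot() draws the diagonal, the line on which a prediction equals the actual value.
The closer the points are to the diagonal, the better the model's predictions.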

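2.7 Step 7: Residual analysis

A residual is the difference between the actual value and the predicted value. We plot a histogram of the residuals to check whether the model has a systematic bias.

# Residual analysis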
residuals = y_test - y_pred
plt.hist(residuals, bins=20)
plt.xlabel('残差', fontproperties=font)
plt.ylabel('频数', fontproperties=font)
plt.title('残差分布', fontproperties=font)
plt.show()

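Explanation:

residuals = y_test - y_pred computes the residuals, the actual values minus the predicted values.
plt.hist() draws a histogram of the residuals so we can check whether they are roughly normally distributed and centered on 0.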
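
Besides looking at the histogram, two quick numeric checks can hint at systematic bias: the mean of the residuals and their correlation with the fitted values. The sketch below uses synthetic data and training-set residuals, both assumptions for illustration; for ordinary least squares with an intercept these two quantities are zero on the training data up to rounding, whereas on a test set they are only approximately zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative assumption): a noisy straight line
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 2.0 * X_demo[:, 0] + 1.0 + rng.normal(0, 1.0, size=200)

# Fit the model and compute residuals on the same data
reg = LinearRegression().fit(X_demo, y_demo)
fitted = reg.predict(X_demo)
resid = y_demo - fitted

# A mean far from 0 would indicate consistent over- or under-prediction
resid_mean = resid.mean()

# A clear correlation with the fitted values would indicate structure the model missed
resid_corr = np.corrcoef(fitted, resid)[0, 1]

print(f"mean residual: {resid_mean:.2e}")
print(f"correlation between residuals and fitted values: {resid_corr:.2e}")
```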

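2.8 Step 8: Feature correlation analysis

We use a heatmap to examine the correlations between features and see which pairs have a strong linear relationship.

# Feature correlation analysis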
corr_matrix = df.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('特征相关性热力图', fontproperties=font)
plt.show()

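Explanation:

df.corr() computes the pairwise correlation coefficients between all columns of the dataset.
sns.heatmap() draws the heatmap. With the coolwarm colormap, red cells mark positive correlations and blue cells negative ones; roughly, the more saturated the color, the stronger the correlation.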
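
A heatmap with many cells can be hard to read, so it sometimes helps to rank the features by the strength of their correlation with the target. This is a minimal sketch under one loud assumption: it uses scikit-learn's bundled diabetes dataset as a stand-in, because the Boston CSV from this article is a local file; with the article's DataFrame the target column would be 'MEDV' instead of 'target'.

```python
from sklearn.datasets import load_diabetes

# Stand-in dataset (assumption): bundled with scikit-learn, target column is "target"
demo_df = load_diabetes(as_frame=True).frame

# Correlation of every feature with the target
corr_with_target = demo_df.corr()["target"].drop("target")

# Rank by absolute value so strong negative correlations are not pushed to the bottom
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```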

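2.9 Step 9: Feature importance analysis

We use the coefficients of the linear regression model to examine how each feature affects the predicted house price, and show the result as a bar chart.

# Feature importance analysis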
coefficients = pd.Series(model.coef_, index=X.columns)
coefficients = coefficients.sort_values(ascending=False)
plt.figure(figsize=(10, 6))
coefficients.plot(kind='barh')
plt.title('特征重要性分析', fontproperties=font)
plt.xlabel('系数值', fontproperties=font)
plt.ylabel('特征', fontproperties=font)
plt.show()

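Explanation:

model.coef_ returns the coefficient of each feature. A coefficient is the change in the predicted price per one-unit increase in that feature, so the larger its absolute value, the larger the effect per unit. Because the features here are not standardized, coefficients of features measured on different scales are not directly comparable.
coefficients.plot(kind='barh') draws a horizontal bar chart of the features sorted by coefficient value.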
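
Raw coefficients depend on the units of each feature, which matters when they are read as importances. The sketch below illustrates this with two hypothetical features, `small_scale` and `large_scale`, on synthetic data: the raw coefficients and the coefficients after standardization rank the two features in opposite order.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption): two features on very different scales
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "small_scale": rng.normal(0, 0.1, size=500),    # standard deviation about 0.1
    "large_scale": rng.normal(0, 100.0, size=500),  # standard deviation about 100
})
target = 10.0 * demo["small_scale"] + 0.05 * demo["large_scale"] + rng.normal(0, 0.5, size=500)

# Raw coefficients: effect per one unit of each feature
raw_model = LinearRegression().fit(demo, target)
raw_coef = pd.Series(raw_model.coef_, index=demo.columns)

# Standardized coefficients: effect per one standard deviation of each feature
scaled = StandardScaler().fit_transform(demo)
std_model = LinearRegression().fit(scaled, target)
std_coef = pd.Series(std_model.coef_, index=demo.columns)

print("raw coefficients:")
print(raw_coef)
print("standardized coefficients:")
print(std_coef)
```

In this example `small_scale` has the larger raw coefficient, but a typical change in `large_scale` moves the target more, which is what the standardized coefficients show.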

3. Logistic regression analysis of the iris dataset

3.1 Step 1: Import the required modules and packages

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from matplotlib.font_manager import FontProperties

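Explanation:

sklearn.datasets: loads the iris dataset.
train_test_split: splits the dataset into a training set and a test set.
LogisticRegression: the logistic regression model, used here for a multi-class task.
classification_report and confusion_matrix: evaluate the model by producing a classification report and a confusion matrix.
matplotlib and seaborn: visualization, including the feature pair plot and the confusion matrix.
numpy and pandas: data manipulation and processing.
FontProperties: sets the Chinese font used in the charts.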

3.2 Step 2: Load the iris dataset

# Set the Chinese font
font = FontProperties(fname=r'C:\Windows\Fonts\simhei.ttf')  # change to the path of a Chinese font on your system

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

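Explanation:

Loading the dataset: load_iris() loads the iris dataset. X is the feature matrix and y holds the labels. feature_names and target_names are the names of the features and of the classes.
Chinese font: FontProperties sets a Chinese font so that Chinese text in the charts displays correctly.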

3.3 Step 3: Data exploration and visualization

# Data exploration
df = pd.DataFrame(X, columns=feature_names)
df['Species'] = y
print(df.head())

# Visualize the relationships between features
sns.pairplot(df, hue='Species')
plt.show()

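Explanation:

Data exploration: the dataset is converted to a DataFrame, which makes it easier to inspect and manipulate. print(df.head()) shows the first few rows.
Feature relationships: sns.pairplot() plots every pair of features against each other, colored by iris species (Species), which helps in understanding how the features are distributed and how they relate to one another.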

3.4 Step 4: Split the dataset

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Explanation:

Train/test split: train_test_split() splits the dataset into 70% training data and 30% test data. Setting random_state=42 makes the split reproducible.

3.5 Step 5: Train the logistic regression model

# Train the logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

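Explanation:

Creating and training the model: we create a LogisticRegression() object and train it on the training set. max_iter=200 raises the maximum number of solver iterations to 200 (the default is 100), which gives the solver more room to converge.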
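
A fitted logistic regression model does more than return labels: it also estimates a probability for each class. The sketch below repeats the split and training from the steps above under hypothetical names (`iris_demo`, `clf`, `proba`) so that it runs on its own, and then inspects the class probabilities of the first five test samples.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data, split, and model settings as in the article, under separate names
iris_demo = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_demo.data, iris_demo.target, test_size=0.3, random_state=42
)
clf = LogisticRegression(max_iter=200).fit(X_tr, y_tr)

# One row per sample, one column per class; each row sums to 1
proba = clf.predict_proba(X_te[:5])
print(np.round(proba, 3))

# The predicted label is the class with the highest probability
predicted = clf.predict(X_te[:5])
print(predicted)
print(iris_demo.target_names[predicted])
```

Looking at the probabilities shows how confident the model is, which the hard labels alone do not reveal.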

3.6 Step 6: Predict and evaluate the model

# Predict and evaluate the model
y_pred = model.predict(X_test)

# Confusion matrix and classification report
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=target_names, yticklabels=target_names)
plt.xlabel('预测值', fontproperties=font)
plt.ylabel('实际值', fontproperties=font)
plt.title('混淆矩阵', fontproperties=font)
plt.show()

print(classification_report(y_test, y_pred, target_names=target_names))

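Explanation:

Prediction: the trained model predicts on the test set, giving the predicted labels y_pred.
Confusion matrix: confusion_matrix() builds the confusion matrix, which shows how the predictions line up with the actual labels. sns.heatmap() draws it as a heatmap for a quick visual reading.
Classification report: classification_report() prints a report with precision, recall, F1 score, and related metrics.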
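
The numbers in the classification report can be derived directly from the confusion matrix, which makes both easier to interpret. The sketch below does this for ten made-up labels (`y_actual` and `y_guess` are hypothetical, not the model output from this article) and checks the result against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels for three classes
y_actual = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_guess = np.array([0, 0, 0, 1, 1, 2, 2, 2, 1, 2])

# Rows are actual classes, columns are predicted classes
cm_demo = confusion_matrix(y_actual, y_guess)
print(cm_demo)

# Precision: correct predictions of a class / all predictions of that class (column sums)
precision = np.diag(cm_demo) / cm_demo.sum(axis=0)

# Recall: correct predictions of a class / all actual members of that class (row sums)
recall = np.diag(cm_demo) / cm_demo.sum(axis=1)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print("precision:", np.round(precision, 3))
print("recall:   ", np.round(recall, 3))
print("f1:       ", np.round(f1, 3))

# Same values as scikit-learn's per-class scores
print(np.allclose(precision, precision_score(y_actual, y_guess, average=None)))
print(np.allclose(recall, recall_score(y_actual, y_guess, average=None)))
```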

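3.7 Step 7: Visualize the logistic regression decision boundary

# Visualize the logistic regression decision boundary (using two features)
X_two_features = X[:, :2]  # keep the first two features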
X_train, X_test, y_train, y_test = train_test_split(X_two_features, y, test_size=0.3, random_state=42)
model.fit(X_train, y_train)

x_min, x_max = X_two_features[:, 0].min() - 1, X_two_features[:, 0].max() + 1
y_min, y_max = X_two_features[:, 1].min() - 1, X_two_features[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))

Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.Paired)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('花萼长度(cm)', fontproperties=font)
plt.ylabel('花萼宽度(cm)', fontproperties=font)
plt.title('逻辑回归决策边界', fontproperties=font)
plt.show()

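Explanation:

Choosing two features: to make the plot possible, only the first two features (sepal length and sepal width) are used, and the model is refitted on them.
Drawing the decision boundary: np.meshgrid() creates a grid, the model predicts a class for every grid point, and contourf() fills the resulting regions. scatter() then plots the test samples on top, showing how the model classifies in the two-dimensional plane.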

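3.8 Analysis of results

Model evaluation:
- The confusion matrix shows how the model classifies the three iris species. Most samples are classified correctly, which indicates that the model performs well.
- The classification report gives the key metrics (precision, recall, and F1 score); overall, the model performs well on this classification task.
Decision boundary visualization:
- The decision boundary divides the two-dimensional feature space (sepal length and sepal width) into one region per species and shows how the logistic regression model classifies in that space.
- Most sample points fall inside the region of their true class, which further supports the model's classification ability. Points that land in the wrong region are most likely versicolor and virginica samples, since those two species overlap in the sepal measurements.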

4. Complete code and results

4.1 Boston

4.1.1 Boston code

# Import the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.font_manager import FontProperties

# Set the Chinese font
font = FontProperties(fname=r'C:\Windows\Fonts\simhei.ttf')  # replace with the path to a Chinese font on your system

# Load the local data
file_path = r'C:\Users\Administrator\Desktop\ML\机器学习\实验任务二\data\boston_housing.csv'
df = pd.read_csv(file_path)

# Data exploration
print(df.head())
print(df.describe())

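# Separate the features from the target
X = df.drop('MEDV', axis=1)  # features
y = df['MEDV']  # target variable

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate the model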
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'均方误差（MSE）：{mse}')
print(f'R²值：{r2}')

# Figure 5: scatter plot of actual versus predicted values
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)  # ideal prediction line
plt.xlabel('实际房价', fontproperties=font)
plt.ylabel('预测房价', fontproperties=font)
plt.title('实际房价 vs 预测房价', fontproperties=font)
plt.show()

# Figure 7: residual distribution
residuals = y_test - y_pred
plt.figure(figsize=(8, 6))
sns.histplot(residuals, kde=True, bins=30)
plt.axvline(residuals.mean(), color='red', linestyle='--', lw=2)
plt.xlabel('残差', fontproperties=font)
plt.ylabel('频数', fontproperties=font)
plt.title('残差分布', fontproperties=font)
plt.show()

# Figure 10: feature correlation heatmap
corr_matrix = df.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('特征相关性热力图', fontproperties=font)
plt.show()

# Figure 12: feature importance bar chart
coefficients = pd.Series(model.coef_, index=X.columns)
coefficients = coefficients.sort_values(ascending=False)

plt.figure(figsize=(10, 6))
coefficients.plot(kind='barh')
plt.title('特征重要性排序', fontproperties=font)
plt.xlabel('系数值', fontproperties=font)
plt.ylabel('特征', fontproperties=font)
plt.show()

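4.1.2 Boston results

4.1.3 Analysis of the Boston housing price model results

1. Model evaluation

Mean squared error (MSE): the model's MSE is fairly small, meaning the gap between the predicted and actual house prices is small and the model predicts prices with reasonable accuracy.
R²: the R² value shows that the model explains most of the variation in house prices. The fit is good overall, although part of the variation remains unexplained.

2. Visual analysis

Actual versus predicted scatter plot: most points lie along the diagonal, so the predictions are fairly close to the actual prices and the model has real predictive ability. Some points are far from the diagonal, which shows that the model makes larger errors in certain cases.
Residual distribution: the residuals are close to normally distributed and centered near 0, which suggests there is no systematic bias in the errors and that the model is a reasonable fit for this dataset.

3. Feature correlation analysis

Correlation heatmap: the heatmap shows the correlations between features. For example, the number of rooms (RM) is positively correlated with the house price, so homes with more rooms tend to cost more, while the share of lower-income residents (LSTAT) is negatively correlated with it, so the higher that share, the lower the price.

4. Feature importance analysis

NOX (nitric oxides concentration): has the most negative coefficient, which points to a clear negative effect of air pollution on house prices. Because the features are not standardized and NOX varies over a narrow numeric range, part of that magnitude reflects its units.
RM (number of rooms): has a large positive effect; the more rooms, the higher the price.
LSTAT (share of lower-income residents) and DIS (distance to employment centers): both have clearly negative coefficients, which suggests that socioeconomic factors and location play an important role in house prices.

Overall, the model captures the key features in the data and predicts house prices reasonably well.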

4.2 Iris

4.2.1 Iris code

# Import the required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from matplotlib.font_manager import FontProperties

# Set the Chinese font
font = FontProperties(fname=r'C:\Windows\Fonts\simhei.ttf')  # change to the path of a Chinese font on your system

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names

# Data exploration
df = pd.DataFrame(X, columns=feature_names)
df['Species'] = y
print(df.head())

# Visualize the relationships between features
sns.pairplot(df, hue='Species')
plt.show()

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict and evaluate the model
y_pred = model.predict(X_test)

# Confusion matrix and classification report
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=target_names, yticklabels=target_names)
plt.xlabel('预测值', fontproperties=font)
plt.ylabel('实际值', fontproperties=font)
plt.title('混淆矩阵', fontproperties=font)
plt.show()

print(classification_report(y_test, y_pred, target_names=target_names))

# Visualize the logistic regression decision boundary (using two features)
X_two_features = X[:, :2]  # keep the first two features
X_train, X_test, y_train, y_test = train_test_split(X_two_features, y, test_size=0.3, random_state=42)
model.fit(X_train, y_train)

x_min, x_max = X_two_features[:, 0].min() - 1, X_two_features[:, 0].max() + 1
y_min, y_max = X_two_features[:, 1].min() - 1, X_two_features[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))

Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.Paired)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('花萼长度(cm)', fontproperties=font)
plt.ylabel('花萼宽度(cm)', fontproperties=font)
plt.title('逻辑回归决策边界', fontproperties=font)
plt.show()