Data Mining (3)
Loading the Dataset
- This example shows how to predict the winning team in an NBA game. Games are often closely matched and can be decided in the final minute, so picking the winner is hard. Many sports share this trait: the expected favorite can be beaten by another team on any given day. Research suggests that achievable accuracy varies by sport, with an upper limit of roughly 70%~80%. Sports prediction usually relies on data mining or statistical methods.
- We collected game data for the 2023–24 NBA season and saved it to a CSV file, using the pandas library to load, manage, and process the data.
import os
import numpy as np
import pandas as pd
data_filename = os.path.join(os.getcwd(), 'leagues_NBA_2024_games.csv')
# Load the dataset and print its first 5 rows
results = pd.read_csv(data_filename)
print(results.iloc[:5])
- From the first 5 rows printed above, we can see the dates are strings rather than date objects, and the first row contains no data. Visually inspecting the output also shows the header is incomplete or incorrect. These problems come from the data itself, so we can preprocess and clean the dataset with pandas: pass parameters when importing the file to fix the data, and adjust the header after importing. Before writing a prediction algorithm, we fix an accuracy figure as a baseline. Every game has a home team and a visiting team, so the most straightforward baseline is chance: guessing at random is correct 50% of the time.
results = pd.read_csv(data_filename, parse_dates=['Date'], usecols=[0, 6, 5, 2, 3, 4, 7])
results.columns = ['Date', 'Visitor Team', 'VisitorPts', 'Home Team', 'HomePts', 'Score Type', 'OT']
print(results.iloc[:5])
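The chance baseline mentioned above can be made concrete with a quick simulation. The 58% home-win rate below is an assumed figure for illustration, not one computed from this dataset:

```python
import numpy as np

rng = np.random.default_rng(14)
# Simulated outcomes: True = home team won (58% home-win rate is assumed)
y = rng.random(1000) < 0.58
# A coin-flip predictor ignores the data entirely
coin_flip = rng.random(1000) < 0.5

print('Coin flip accuracy: {0:.1f}%'.format(100 * (coin_flip == y).mean()))
print('Always-pick-home accuracy: {0:.1f}%'.format(100 * y.mean()))
```

Any model we build should beat not only the coin flip but also the stronger "always pick the home team" baseline.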
- Next we extract new features by combining and comparing the existing data. First we determine the class values: in the testing stage, the classes predicted by the algorithm are compared against them to tell whether a prediction is correct. We use 1 to indicate a home-team win and 0 a visiting-team win.
# Find the games the home team won
results['HomeWin'] = results['VisitorPts'] < results['HomePts']
# scikit-learn can read the class labels y_true directly
y_true = results['HomeWin'].values
print(results.iloc[:5])
print('Home win percentage: {0:.1f}%'.format(100*results['HomeWin'].sum()/results['HomeWin'].count()))
- We can also create features of our own for data mining. To start, we create two features that should help with prediction: whether each of the two teams won its previous game. Winning the last game roughly suggests a team is playing well. We iterate over the rows, recording the winners; when we reach a new row, we look up whether each of its two teams won their respective previous games.
from collections import defaultdict
# Store each team's last result; the key is the team, the value is whether it won its previous game
won_last = defaultdict(bool)
results['HomeLastWin'] = False
results['VisitorLastWin'] = False
# We assume the dataset is ordered by date; if not, it needs to be sorted first
for index, row in results.iterrows():
    home_team = row['Home Team']
    visitor_team = row['Visitor Team']
    row['HomeLastWin'] = won_last[home_team]
    row['VisitorLastWin'] = won_last[visitor_team]
    results.iloc[index] = row
    # Update each team's last-game result with the outcome of this game
    won_last[home_team] = row['HomeWin']
    won_last[visitor_team] = not row['HomeWin']
# Print games 20~25 of the season
print(results.iloc[20:25])
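The bookkeeping in the loop above can be checked on a tiny hypothetical schedule (team names and results here are made up):

```python
from collections import defaultdict

# Mini-schedule in date order: (home, visitor, home_won)
games = [
    ('Bulls', 'Lakers', True),
    ('Lakers', 'Celtics', False),
    ('Bulls', 'Celtics', True),
]

won_last = defaultdict(bool)  # team -> did it win its previous game?
features = []
for home, visitor, home_won in games:
    # Read each side's previous result *before* overwriting it
    features.append((won_last[home], won_last[visitor]))
    won_last[home] = home_won
    won_last[visitor] = not home_won

print(features)  # [(False, False), (False, False), (True, True)]
```

In the third game both the Bulls (won game 1) and the Celtics (won game 2) enter on a win, so both features are True; the first appearance of any team defaults to False.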
Decision Trees
Decision Tree Parameters
- The min_samples_split option specifies the minimum number of samples needed to create a new node, controlling when decision nodes are created.
- The min_samples_leaf option specifies the minimum number of samples a node must contain in order to be kept, determining whether a decision node is retained.
- Gini impurity measures how often a decision node would misclassify a new sample.
- Information gain uses entropy from information theory to express how much new information a decision node provides.
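Both impurity measures are easy to compute directly. A small sketch, using a node that contains three home wins and one home loss:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum over classes of p_k^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy in bits: -sum over classes of p_k * log2(p_k)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

labels = np.array([1, 1, 1, 0])  # three home wins, one home loss
print(gini(labels))     # 1 - (0.75^2 + 0.25^2) = 0.375
print(entropy(labels))  # ≈ 0.811 bits
```

A pure node (all one class) scores 0 under both measures; the tree picks the split that lowers impurity (or raises information gain) the most.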
Using a Decision Tree
- Import the DecisionTreeClassifier class from the scikit-learn library to create a decision tree.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
X_previouswins = results[['HomeLastWin', 'VisitorLastWin']].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy')
print('Accuracy: {0:.1f}%'.format(np.mean(scores) * 100))
Predicting NBA Game Results
- We collected data from the NBA standings and saved it to a CSV file.
data_standings_filename = os.path.join(os.getcwd(), 'leagues_NBA_2023_standings.csv')
standings = pd.read_csv(data_standings_filename)
results['HomeTeamRanksHigher'] = 0
for index, row in results.iterrows():
    home_team = row['Home Team']
    visitor_team = row['Visitor Team']
    home_rank = standings[standings['Team'] == home_team]['Rk'].values[0]
    visitor_rank = standings[standings['Team'] == visitor_team]['Rk'].values[0]
    # Compare the two teams' rankings to update the feature value
    row['HomeTeamRanksHigher'] = int(home_rank > visitor_rank)
    results.iloc[index] = row
print(results.iloc[:5])
- We extract the relevant columns from the dataset and test them with the cross_val_score function, running cross-validation with the decision tree classifier to get the accuracy, which turns out better than the previous result.
X_homehigher = results[['HomeLastWin', 'VisitorLastWin', 'HomeTeamRanksHigher']].values
scores = cross_val_score(clf, X_homehigher, y_true, scoring='accuracy')
print('Accuracy: {0:.1f}%'.format(np.mean(scores)*100))
- As another feature, we record which of the two teams won their last meeting. Although rankings help with prediction, a lower-ranked team can sometimes beat a higher-ranked one, for example when its style of play happens to exploit the stronger team's weaknesses.
last_match_winner = defaultdict(int)
results['HomeTeamWonLast'] = 0
for index, row in results.iterrows():
    home_team = row['Home Team']
    visitor_team = row['Visitor Team']
    # Regardless of which team plays at home, look at who won the last meeting;
    # sort the team names alphabetically so each pairing has a single key
    teams = tuple(sorted([home_team, visitor_team]))
    row['HomeTeamWonLast'] = 1 if last_match_winner[teams] == row['Home Team'] else 0
    results.iloc[index] = row
    winner = row['Home Team'] if row['HomeWin'] else row['Visitor Team']
    last_match_winner[teams] = winner
X_lastwinner = results[['HomeTeamRanksHigher', 'HomeTeamWonLast']].values
scores = cross_val_score(clf, X_lastwinner, y_true, scoring='accuracy')
print('Accuracy: {0:.1f}%'.format(np.mean(scores)*100))
- Finally, let's see whether a decision tree can produce an effective classification model when trained on a large amount of data. Decision trees can handle categorical features, but the implementation in the scikit-learn library requires such features to be encoded first. The LabelEncoder transformer converts string team names to integers; we then extract the home and visiting team names for every game and combine them into one matrix, which the decision tree can train on.
from sklearn.preprocessing import LabelEncoder
encoding = LabelEncoder()
# Convert team names from strings to integers
encoding.fit(results['Home Team'].values)
home_teams = encoding.transform(results['Home Team'].values)
visitor_teams = encoding.transform(results['Visitor Team'].values)
X_teams = np.vstack([home_teams, visitor_teams]).T
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print('Accuracy: {0:.1f}%'.format(np.mean(scores)*100))
Random Forests
- The decision trees we would create are largely identical: the same input always yields the same output. With only one training set, multiple decision trees would share the same input and therefore produce the same output. The solution is bagging: each tree trains on a random subset of the data.
- The features used for the first few decision nodes tend to be very prominent, so even trees trained on random subsets of the data remain very similar. The solution is to randomly choose a subset of the features as split candidates.
- Building many decision trees from randomly chosen data and randomly chosen features is what makes a random forest.
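One caveat worth noting: the integer codes produced by LabelEncoder imply an ordering between teams that does not really exist. A sketch of the alternative, one-hot encoding with scikit-learn's OneHotEncoder (team names here are made up):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Two games, columns = [home team, visitor team]
teams = np.array([['Bulls', 'Lakers'],
                  ['Lakers', 'Celtics']])

onehot = OneHotEncoder()
# One binary column per (column, team) pair; .toarray() densifies the result
X = onehot.fit_transform(teams).toarray()
print(X.shape)  # (2, 4): 2 home-team categories + 2 visitor-team categories
```

This trades the spurious ordering for a wider, sparse feature matrix; tree models often cope with the integer codes anyway, which is why the text keeps LabelEncoder.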
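The two sources of randomness can be sketched in a few lines of NumPy; the sizes and the sqrt(n_features) choice below are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(14)
n_samples, n_features = 10, 4

# Bagging: each tree trains on a bootstrap sample drawn *with* replacement,
# so some rows repeat and others are left out
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Feature subsampling: each split considers only a random subset of features
# (sqrt(n_features) is a common default for classification)
feature_subset = rng.choice(n_features, size=int(np.sqrt(n_features)), replace=False)

print(bootstrap_idx)
print(feature_subset)
```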
The Ensemble Effect of Decision Trees
- Averaging the predictions of the many decision trees in a random forest effectively reduces variance, giving a prediction model with higher overall accuracy. Generally, tree ensembles assume that prediction errors are random and differ from classifier to classifier; averaging the predictions of many models cancels out the random errors, leaving only the correct predictions.
- The RandomForestClassifier in the scikit-learn library implements the random forest algorithm and provides a range of parameters; the Joblib library is used for parallel computation.
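A quick simulation of that variance-reduction argument: 101 hypothetical classifiers, each independently correct only 60% of the time, voted together (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_classifiers, n_samples = 101, 1000

# correct[i, j] = True when classifier i predicts sample j correctly,
# each independently with probability 0.6
correct = rng.random((n_classifiers, n_samples)) < 0.6

# The majority vote is right whenever more than half the classifiers are
majority_correct = correct.sum(axis=0) > n_classifiers / 2
print('Majority vote accuracy: {0:.1f}%'.format(100 * majority_correct.mean()))
```

The vote lands well above 60%. Real trees grown from the same data are correlated rather than independent, which is exactly why the random forest injects randomness into both the samples and the features.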
Using the Random Forest Algorithm
- The random forest implementation in the scikit-learn library uses the estimator interface, so it can be called with the same cross-validation method as before.
from sklearn.ensemble import RandomForestClassifier
# Switch the classifier to a random forest
clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print('Accuracy: {0:.1f}%'.format(np.mean(scores)*100))
- A random forest may gain accuracy from using a few more features; the GridSearchCV class searches for the best parameters.
from sklearn.model_selection import cross_val_score, GridSearchCV
X_all = np.hstack([X_homehigher, X_teams])
scores = cross_val_score(clf, X_all, y_true, scoring='accuracy')
print('Accuracy: {0:.1f}%'.format(np.mean(scores)*100))
parameter_space = {
    'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9]
}
grid = GridSearchCV(clf, parameter_space)
grid.fit(X_all, y_true)
print('Accuracy: {0:.1f}%'.format(grid.best_score_*100))
# Which parameters did the best model found by the grid search use
print(grid.best_estimator_)
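The search above only tunes max_depth. A sketch of a slightly wider, hypothetical search over forest-specific parameters, run on random stand-in data so the snippet is self-contained:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Random stand-in data (100 games, 4 features) just so the snippet runs
rng = np.random.default_rng(14)
X = rng.integers(0, 30, size=(100, 4))
y = rng.integers(0, 2, size=100)

# Hypothetical wider space: number of trees and features per split
parameter_space = {
    'max_features': [2, 'sqrt'],
    'n_estimators': [50, 100],
}
grid = GridSearchCV(RandomForestClassifier(random_state=14), parameter_space, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

Grid search cost grows multiplicatively with each added parameter list, so spaces are usually kept small and refined iteratively.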
Complete Code
import os
import numpy as np
import pandas as pd
from collections import defaultdict
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
data_filename = os.path.join(os.getcwd(), 'leagues_NBA_2024_games.csv')
data_standings_filename = os.path.join(os.getcwd(), 'leagues_NBA_2023_standings.csv')
standings = pd.read_csv(data_standings_filename)
# Load the dataset and print its first 5 rows
results = pd.read_csv(data_filename, parse_dates=['Date'], usecols=[0, 6, 5, 2, 3, 4, 7])
results.columns = ['Date', 'Visitor Team', 'VisitorPts', 'Home Team', 'HomePts', 'Score Type', 'OT']
results['HomeWin'] = results['VisitorPts'] < results['HomePts']
y_true = results['HomeWin'].values
results.iloc[:5]
print('Home win percentage: {0:.1f}%'.format(100*results['HomeWin'].sum()/results['HomeWin'].count()))
won_last = defaultdict(bool)
results['HomeLastWin'] = False
results['VisitorLastWin'] = False
for index, row in results.iterrows():
    home_team = row['Home Team']
    visitor_team = row['Visitor Team']
    row['HomeLastWin'] = won_last[home_team]
    row['VisitorLastWin'] = won_last[visitor_team]
    results.iloc[index] = row
    won_last[home_team] = row['HomeWin']
    won_last[visitor_team] = not row['HomeWin']
results.iloc[20:25]
X_previouswins = results[['HomeLastWin', 'VisitorLastWin']].values
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy')
print('Accuracy: {0:.1f}%'.format(np.mean(scores) * 100))
results['HomeTeamRanksHigher'] = 0
for index, row in results.iterrows():
    home_team = row['Home Team']
    visitor_team = row['Visitor Team']
    home_rank = standings[standings['Team'] == home_team]['Rk'].values[0]
    visitor_rank = standings[standings['Team'] == visitor_team]['Rk'].values[0]
    row['HomeTeamRanksHigher'] = int(home_rank > visitor_rank)
    results.iloc[index] = row
#print(results.iloc[:5])
X_homehigher = results[['HomeLastWin', 'VisitorLastWin', 'HomeTeamRanksHigher']].values
scores = cross_val_score(clf, X_homehigher, y_true, scoring='accuracy')
print('Accuracy: {0:.1f}%'.format(np.mean(scores)*100))
last_match_winner = defaultdict(int)
results['HomeTeamWonLast'] = 0
for index, row in results.iterrows():
    home_team = row['Home Team']
    visitor_team = row['Visitor Team']
    # Regardless of which team plays at home, look at who won the last meeting;
    # sort the team names alphabetically so each pairing has a single key
    teams = tuple(sorted([home_team, visitor_team]))
    row['HomeTeamWonLast'] = 1 if last_match_winner[teams] == row['Home Team'] else 0
    results.iloc[index] = row
    winner = row['Home Team'] if row['HomeWin'] else row['Visitor Team']
    last_match_winner[teams] = winner
X_lastwinner = results[['HomeTeamRanksHigher', 'HomeTeamWonLast']].values
scores = cross_val_score(clf, X_lastwinner, y_true, scoring='accuracy')
print('Accuracy: {0:.1f}%'.format(np.mean(scores)*100))
encoding = LabelEncoder()
# Convert team names from strings to integers
encoding.fit(results['Home Team'].values)
home_teams = encoding.transform(results['Home Team'].values)
visitor_teams = encoding.transform(results['Visitor Team'].values)
X_teams = np.vstack([home_teams, visitor_teams]).T
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print('Accuracy: {0:.1f}%'.format(np.mean(scores)*100))
# Switch the classifier
clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print('Accuracy: {0:.1f}%'.format(np.mean(scores)*100))
X_all = np.hstack([X_homehigher, X_teams])
scores = cross_val_score(clf, X_all, y_true, scoring='accuracy')
print('Accuracy: {0:.1f}%'.format(np.mean(scores)*100))
parameter_space = {
    'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9]
}
grid = GridSearchCV(clf, parameter_space)
grid.fit(X_all, y_true)
print('Accuracy: {0:.1f}%'.format(grid.best_score_*100))
# Which parameters did the best model found by the grid search use
print(grid.best_estimator_)