数据读取
分别读取两个csv文件中的数据合并到一个数组中
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
PassengerId = test['PassengerId']
all_data = pd.concat([train, test], ignore_index=True)
数据可视化分析
通过打印alldata数组,可以看到test和train合并之后的数组数据
再通过discribe函数看出整体数据的特征及分布
由图可知数据集初步可分为11个特征和1个标签,且各特征值的总和、均值、标准差、最值、上下四分位点一目了然。
查看变量类型以及数量,分析得:
1.full数据集共1309行
2.浮点型变量有3个,整型变量4个,字符型5个
3.Survived列为标签,1代表获救,0代表遇难
4.Age\Cabin\Embarked\Fare\Name\Parch\PassengerId\Pclass\Sex\SibSip\Ticket11列为特征
5.Age\Cabin\Embarked\Fare数据有缺失
接下来将train中的数据可视化,分析几个特征和幸存之间的关系
可视化处理
填充缺失数据
1.Age Feature:Age缺失量为263,缺失量较大,用Sex, Title, Pclass三个特征构建随机森林模型,填充年龄缺失值。特征提取并转化。
age_df = all_data[['Age', 'Pclass', 'Sex', 'Title']]
age_df = pd.get_dummies(age_df)
known_age = age_df[age_df.Age.notnull()].as_matrix()
unknown_age = age_df[age_df.Age.isnull()].as_matrix()
y = known_age[:, 0]
X = known_age[:, 1:]
rfr = RandomForestRegressor(random_state=0, n_estimators=100, n_jobs=-1)
rfr.fit(X, y)
predictedAges = rfr.predict(unknown_age[:, 1::])
all_data.loc[(all_data.Age.isnull()), 'Age'] = predictedAges
2.Embarked Feature:Embarked缺失量为2,缺失Embarked信息的乘客的Pclass均为1,且Fare均为80,因为Embarked为C且Pclass为1的乘客的Fare中位数为80,所以缺失值填充为C。
all_data['Embarked'] = all_data['Embarked'].fillna('C')
3.Fare Feature:Fare缺失量为1,缺失Fare信息的乘客的Embarked为S,Pclass为3,所以用Embarked为S,Pclass为3的乘客的Fare中位数填充。
fare=all_data[(all_data['Embarked'] == "S") & (all_data['Pclass'] == 3)].Fare.median()
all_data['Fare']=all_data['Fare'].fillna(fare)
特征选择
把姓氏相同的乘客划分为同一组,从人数大于一的组中分别提取出每组的妇女儿童和成年男性。
all_data['Surname']=all_data['Name'].apply(lambda x:x.split(',')[0].strip())
Surname_Count = dict(all_data['Surname'].value_counts())
all_data['FamilyGroup'] = all_data['Surname'].apply(lambda x:Surname_Count[x])
Female_Child_Group=all_data.loc[(all_data['FamilyGroup']>=2) & ((all_data['Age']<=12) | (all_data['Sex']=='female'))]
Male_Adult_Group=all_data.loc[(all_data['FamilyGroup']>=2) & (all_data['Age']>12) & (all_data['Sex']=='male')]
发现绝大部分女性和儿童组的平均存活率都为1或0,即同组的女性和儿童要么全部幸存,要么全部遇难。
Female_Child=pd.DataFrame(Female_Child_Group.groupby('Surname')['Survived'].mean().value_counts())
Female_Child.columns=['GroupCount']
查看Female_Child
In[4]: Female_Child
Out[4]:
GroupCount
1.000000 115
0.000000 31
0.750000 2
0.333333 1
0.142857 1
绝大部分成年男性组的平均存活率也为1或0。
Male_Adult=pd.DataFrame(Male_Adult_Group.groupby('Surname')['Survived'].mean().value_counts())
Male_Adult.columns=['GroupCount']
查看Male_Adult
Male_Adult
Out[5]:
GroupCount
0.000000 122
1.000000 20
0.500000 6
0.333333 2
0.250000 1
因为普遍规律是女性和儿童幸存率高,成年男性幸存较低,所以我们把不符合普遍规律的反常组选出来单独处理。把女性和儿童组中幸存率为0的组设置为遇难组,把成年男性组中存活率为1的设置为幸存组,推测处于遇难组的女性和儿童幸存的可能性较低,处于幸存组的成年男性幸存的可能性较高。
Female_Child_Group = Female_Child_Group.groupby('Surname')['Survived'].mean()
Dead_List = set(Female_Child_Group[Female_Child_Group.apply(lambda x: x == 0)].index)
Male_Adult_List = Male_Adult_Group.groupby('Surname')['Survived'].mean()
Survived_List = set(Male_Adult_List[Male_Adult_List.apply(lambda x: x == 1)].index)
为了使处于这两种反常组中的样本能够被正确分类,对测试集中处于反常组中的样本的Age,Title,Sex进行惩罚修改。
test.loc[(test['Surname'].apply(lambda x: x in Dead_List)), 'Sex'] = 'male'
test.loc[(test['Surname'].apply(lambda x: x in Dead_List)), 'Age'] = 60
test.loc[(test['Surname'].apply(lambda x: x in Dead_List)), 'Title'] = 'Mr'
test.loc[(test['Surname'].apply(lambda x: x in Survived_List)), 'Sex'] = 'female'
test.loc[(test['Surname'].apply(lambda x: x in Survived_List)), 'Age'] = 5
test.loc[(test['Surname'].apply(lambda x: x in Survived_List)), 'Title'] = 'Miss'
选取特征,转换为数值变量,划分训练集和测试集。
all_data = pd.concat([train, test])
all_data = all_data[
['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Title', 'FamilyLabel', 'Deck', 'TicketGroup']]
all_data = pd.get_dummies(all_data)
train = all_data[all_data['Survived'].notnull()]
test = all_data[all_data['Survived'].isnull()].drop('Survived', axis=1)
X = train.as_matrix()[:, 1:]
y = train.as_matrix()[:, 0]
分类器构建
通过可视化方法,直观展示其中的分析过程。
from sklearn.pipeline import make_pipeline
select = SelectKBest(k = 20)
clf = RandomForestClassifier(random_state = 10, warm_start = True,
n_estimators = 26,
max_depth = 6,
max_features = 'sqrt')
pipeline = make_pipeline(select, clf)
pipeline.fit(X, y)
结果
Pipeline(memory=None,
steps=[('selectkbest', SelectKBest(k=20, score_func=<function f_classif at 0x000000000C8AE048>)), ('randomforestclassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=6, max_features='sqrt', max_leaf_nodes=None,
min_impurity_decreas...estimators=26, n_jobs=1,
oob_score=False, random_state=10, verbose=0, warm_start=True))])
模型验证
交叉验证
from sklearn.model_selection import cross_val_score
cv_score = cross_val_score(pipeline, X, y, cv=10)
print("CV Score : Mean - %.7g | Std - %.7g " % (np.mean(cv_score), np.std(cv_score)))
结果
CV Score : Mean - 0.8451402 | Std - 0.03276752
预测
predictions = pipeline.predict(test)
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": predictions.astype(np.int32)})
submission.to_csv("submission.csv", index=False)
完整代码
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
warnings.filterwarnings('ignore')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
PassengerId = test['PassengerId']
all_data = pd.concat([train, test], ignore_index=True)
# print(all_data)
# print(all_data.describe())
survive = train[train['Survived'] == 1]
die = train[train['Survived'] == 0]
sexsurvive = survive['Sex'].value_counts()
sexdie = die['Sex'].value_counts()
df = pd.DataFrame({'Survived': sexsurvive, 'Unsurvived': sexdie})
df.plot(kind='bar', stacked=True)
plt.title('sex and survive')
plt.xlabel('sex')
plt.ylabel('survivor')
plt.show()
agesurvive = survive['Age'].value_counts()
agedie = die['Age'].value_counts()
df2 = pd.DataFrame({'Survived': agesurvive, 'Unsurvived': agedie})
df2.dropna().plot( stacked=True)
plt.title('age and survive')
plt.xlabel('age')
plt.ylabel('survivor')
plt.show()
all_data['Title'] = all_data['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
Title_Dict = {}
Title_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
Title_Dict.update(dict.fromkeys(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
Title_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
Title_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
Title_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
Title_Dict.update(dict.fromkeys(['Master', 'Jonkheer'], 'Master'))
all_data['Title'] = all_data['Title'].map(Title_Dict)
all_data['FamilySize'] = all_data['SibSp'] + all_data['Parch'] + 1
Pclasssurvive=survive['Pclass'].value_counts()
Pclassdie=die['Pclass'].value_counts()
df3=pd.DataFrame({'Survived':Pclasssurvive,'Unsurvived':Pclassdie})
df3.plot(kind='bar', stacked=True)
plt.title('Pclass and survive')
plt.xlabel('Pclass')
plt.ylabel('number of people')
plt.show()
Embarksurvive=survive['Embarked'].value_counts()
Embarkdie=die['Embarked'].value_counts()
df4=pd.DataFrame({'Survived':Embarksurvive,'Unsurvived':Embarkdie})
df4.plot(kind='bar', stacked=True)
plt.title('embark and survive')
plt.xlabel('Embark')
plt.ylabel('number of people')
plt.show()
data_Pclass1=survive[survive['Pclass']==1]
data_Pclass2=survive[survive['Pclass']==2]
data_Pclass3=survive[survive['Pclass']==3]
data_Pclass1['Fare'].plot(kind='kde',color='r')
data_Pclass2['Fare'].dropna().plot(kind='kde',color='g')
data_Pclass3['Fare'].dropna().plot(kind='kde')
plt.title('Fare in survive')
plt.legend(['first class','second class','third class'])
plt.xlim(0,150)
plt.show()
def Fam_label(s):
if (s >= 2) & (s <= 4):
return 2
elif ((s > 4) & (s <= 7)) | (s == 1):
return 1
elif (s > 7):
return 0
all_data['FamilyLabel'] = all_data['FamilySize'].apply(Fam_label)
all_data['Cabin'] = all_data['Cabin'].fillna('Unknown')
all_data['Deck'] = all_data['Cabin'].str.get(0)
Ticket_Count = dict(all_data['Ticket'].value_counts())
all_data['TicketGroup'] = all_data['Ticket'].apply(lambda x: Ticket_Count[x])
def Ticket_Label(s):
if (s >= 2) & (s <= 4):
return 2
elif ((s > 4) & (s <= 8)) | (s == 1):
return 1
elif (s > 8):
return 0
all_data['TicketGroup'] = all_data['TicketGroup'].apply(Ticket_Label)
age_df = all_data[['Age', 'Pclass', 'Sex', 'Title']]
age_df = pd.get_dummies(age_df)
known_age = age_df[age_df.Age.notnull()].values
unknown_age = age_df[age_df.Age.isnull()].values
y = known_age[:, 0]
X = known_age[:, 1:]
rfr = RandomForestRegressor(random_state=0, n_estimators=100, n_jobs=-1)
rfr.fit(X, y)
predictedAges = rfr.predict(unknown_age[:, 1::])
all_data.loc[(all_data.Age.isnull()), 'Age'] = predictedAges
all_data['Embarked'] = all_data['Embarked'].fillna('C')
fare = all_data[(all_data['Embarked'] == "S") & (all_data['Pclass'] == 3)].Fare.median()
all_data['Fare'] = all_data['Fare'].fillna(fare)
all_data['Surname'] = all_data['Name'].apply(lambda x: x.split(',')[0].strip())
Surname_Count = dict(all_data['Surname'].value_counts())
all_data['FamilyGroup'] = all_data['Surname'].apply(lambda x: Surname_Count[x])
Female_Child_Group = all_data.loc[
(all_data['FamilyGroup'] >= 2) & ((all_data['Age'] <= 12) | (all_data['Sex'] == 'female'))]
Male_Adult_Group = all_data.loc[(all_data['FamilyGroup'] >= 2) & (all_data['Age'] > 12) & (all_data['Sex'] == 'male')]
Female_Child = pd.DataFrame(Female_Child_Group.groupby('Surname')['Survived'].mean().value_counts())
Female_Child.columns = ['GroupCount']
Male_Adult = pd.DataFrame(Male_Adult_Group.groupby('Surname')['Survived'].mean().value_counts())
Male_Adult.columns = ['GroupCount']
train = all_data.loc[all_data['Survived'].notnull()]
test = all_data.loc[all_data['Survived'].isnull()]
Female_Child_Group = Female_Child_Group.groupby('Surname')['Survived'].mean()
Dead_List = set(Female_Child_Group[Female_Child_Group.apply(lambda x: x == 0)].index)
Male_Adult_List = Male_Adult_Group.groupby('Surname')['Survived'].mean()
Survived_List = set(Male_Adult_List[Male_Adult_List.apply(lambda x: x == 1)].index)
test.loc[(test['Surname'].apply(lambda x: x in Dead_List)), 'Sex'] = 'male'
test.loc[(test['Surname'].apply(lambda x: x in Dead_List)), 'Age'] = 60
test.loc[(test['Surname'].apply(lambda x: x in Dead_List)), 'Title'] = 'Mr'
test.loc[(test['Surname'].apply(lambda x: x in Survived_List)), 'Sex'] = 'female'
test.loc[(test['Surname'].apply(lambda x: x in Survived_List)), 'Age'] = 5
test.loc[(test['Surname'].apply(lambda x: x in Survived_List)), 'Title'] = 'Miss'
all_data = pd.concat([train, test])
all_data = all_data[
['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Title', 'FamilyLabel', 'Deck', 'TicketGroup']]
all_data = pd.get_dummies(all_data)
train = all_data[all_data['Survived'].notnull()]
test = all_data[all_data['Survived'].isnull()].drop('Survived', axis=1)
X = train.values[:, 1:]
y = train.values[:, 0]
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest
pipe = Pipeline([('select', SelectKBest(k=20)),
('classify', RandomForestClassifier(random_state=10, max_features='sqrt'))])
param_test = {'classify__n_estimators': list(range(20, 50, 2)),
'classify__max_depth': list(range(3, 60, 3))}
gsearch = GridSearchCV(estimator=pipe, param_grid=param_test, scoring='roc_auc', cv=10)
gsearch.fit(X, y)
from sklearn.pipeline import make_pipeline
select = SelectKBest(k=20)
clf = RandomForestClassifier(random_state=10, warm_start=True,
n_estimators=26,
max_depth=6,
max_features='sqrt')
pipeline = make_pipeline(select, clf)
pipeline.fit(X, y)
from sklearn.model_selection import cross_val_score
cv_score = cross_val_score(pipeline, X, y, cv=10)
print("CV Score : Mean - %.7g | Std - %.7g " % (np.mean(cv_score), np.std(cv_score)))
predictions = pipeline.predict(test)
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": predictions.astype(np.int32)})
submission.to_csv("submission.csv", index=False)