Titanic: Machine Learning from Disaster
1. Problem statement
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic struck an iceberg and sank, killing 1502 of the 2224 passengers and crew. The tragedy shocked the international community and led to better safety regulations for ships.
One reason the shipwreck cost so many lives was that there were not enough lifeboats for the passengers and crew. Although luck played some part in surviving the sinking, some groups were more likely to survive than others, notably women, children, and the upper class.
In this challenge, the task is to analyze what kinds of people were likely to survive. Specifically, machine learning techniques are used to predict which passengers survived the tragedy.
(Excerpted from https://www.kaggle.com/c/titanic)
Concretely, a model is trained on the train dataset provided by the site, which includes the survival label, and is then used to predict survival for the passengers in the test data. The predictions are written to a CSV file with two columns, 'passenger ID, survived or not'.
2. Charts of the data
import pandas as pd
import matplotlib.pyplot as plt  # needed for the plt.* calls below
data = pd.read_csv("titanictrain.csv")
data.describe()

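This shows basic information about the provided train dataset. The PassengerId column has 891 records, which is the total number of passengers in the train set, while the Age column has only 714, so some ages are missing.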
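The gap between the two counts can also be read off directly. The following is a minimal sketch on a small made-up frame (the values and the name toy are illustrative, not the real data), showing that the count row of describe() and isnull().sum() give the same information from two sides.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the train data (not the real values)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# describe() reports the number of non-null entries per column in its "count" row
non_null = toy.describe().loc["count"]
# isnull().sum() counts the missing entries per column directly
missing = toy.isnull().sum()

print(non_null)
print(missing)
```

On the real data the same two calls on data would be expected to show the 714 versus 891 difference described above.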
Survived_0 = data.Pclass[data.Survived == 0].value_counts()
Survived_1 = data.Pclass[data.Survived == 1].value_counts()
df=pd.DataFrame({'Survived':Survived_1, 'unSurvived':Survived_0})
df.plot(kind='bar', stacked=True)
plt.title("survived in pclass")
plt.xlabel("pclass")
plt.ylabel("persons")

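The chart compares survivors and non-survivors within each passenger class: blue is the survivors and orange the non-survivors. The share of survivors differs markedly between classes, with the upper classes surviving at a much higher rate than the lower class, which suggests that Pclass is an important feature for the prediction model.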
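The stacked bars show counts, so the rate has to be judged by eye. As a rough sketch, the rate per class can be computed directly, because the mean of a 0/1 label within a group is that group's survival rate. The frame below is made up for illustration and does not reproduce the real proportions.

```python
import pandas as pd

# Made-up labels for ten passengers (illustrative only)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Mean of the 0/1 Survived label per class = survival rate per class
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_class)
```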
Survived_0 = data.Sex[data.Survived == 0].value_counts()
Survived_1 = data.Sex[data.Survived == 1].value_counts()
df=pd.DataFrame({'Survived':Survived_1, 'unSurvived':Survived_0})
df.plot(kind='bar', stacked=True)
plt.title("survived in Sex ")
plt.xlabel(" Sex ")
plt.ylabel("persons")

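This chart compares survival by sex (blue for survivors, orange for non-survivors). Female passengers survived at a markedly higher rate than male passengers, consistent with the "women and children first" practice during the evacuation. Sex is therefore clearly an important feature.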
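A row-normalized cross table is another way to express this comparison. The sketch below uses pd.crosstab with normalize="index" on made-up data, so each row sums to 1 and the column for Survived == 1 is the survival rate of that sex.

```python
import pandas as pd

# Made-up data: 4 women (3 survived) and 6 men (1 survived), for illustration only
toy = pd.DataFrame({
    "Sex":      ["female"] * 4 + ["male"] * 6,
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
})

# normalize="index" divides each row by its total, giving shares instead of counts
table = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
print(table)
```

Calling table.plot(kind='bar', stacked=True) on such a table would give bars of equal height, which makes the shares easier to compare than raw counts.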
Survived_0 = data.Parch[data.Survived == 0].value_counts()
Survived_1 = data.Parch[data.Survived == 1].value_counts()
df=pd.DataFrame({'Survived':Survived_1, 'unSurvived':Survived_0})
df.plot(kind='bar', stacked=True)
plt.title("survived in Parch ")
plt.xlabel(" Parch ")
plt.ylabel("persons")
plt.show()

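Parch is the number of parents or children travelling with the passenger, according to the Kaggle data dictionary. The chart suggests that it has some influence on survival.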
Survived_0 = data.Embarked[data.Survived == 0].value_counts()
Survived_1 = data.Embarked[data.Survived == 1].value_counts()
df=pd.DataFrame({'Survived':Survived_1, 'unSurvived':Survived_0})
df.plot(kind='bar', stacked=True)
plt.title("survived in Embarked")
plt.xlabel("Embarked")
plt.ylabel("persons")
plt.show()

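The share of survivors also differs by port of embarkation. Passengers who boarded at different ports may have had different backgrounds or cabin locations, so even on the same ship their chances of escaping differed.
(Partial code)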
age_young = []
age_middle = []
age_old = []
for i in data.Age:
    if i <= 15:
        age_young.append(i)
    elif i <= 45:
        age_middle.append(i)
    elif i <= 100:
        age_old.append(i)
Age_Y = pd.DataFrame(age_young)
Age_M = pd.DataFrame(age_middle)
Age_O = pd.DataFrame(age_old)



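Because the age values are widely dispersed, they are grouped into three bands here: young (15 and under), middle (over 15 up to 45), and old (over 45).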
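The same three-way split can be sketched with pd.cut, which keeps the band as a column instead of three separate lists. This is an alternative formulation rather than the code used in this post. The band labels are made up, and the lowest bin edge of 0 is an assumption (an age of exactly 0 would fall outside the first band).

```python
import numpy as np
import pandas as pd

# Made-up ages, including the boundary values and one missing entry
ages = pd.Series([4.0, 15.0, 16.0, 30.0, 45.0, 46.0, 80.0, np.nan])

# Right-closed bins (0, 15], (15, 45], (45, 100] mirror the <= tests in the loop
bands = pd.cut(ages, bins=[0, 15, 45, 100], labels=["young", "middle", "old"])

# Missing ages stay missing and are not counted in any band
counts = bands.value_counts()
print(counts)
```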
# Plot the fare distribution of each group directly
# (a KDE over value_counts() would describe the counts, not the fares)
Survived_0 = data.Fare[data.Survived == 0]
Survived_1 = data.Fare[data.Survived == 1]
df = pd.DataFrame({'Survived': Survived_1, 'unSurvived': Survived_0})
df.plot(kind='kde')
plt.title("survived in Fare")
plt.xlabel("Fare")
plt.ylabel("density")
plt.show()

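Looking at the fare density, most passengers are concentrated in the low-fare range (roughly 0 to 10), so this range also accounts for a large number of the rescued in absolute terms. Higher fares, however, tend to go together with higher social status and more resources, and there may be a positive correlation between fare and the share of passengers rescued.
3. Preparing the training data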
from sklearn.ensemble import RandomForestRegressor
def set_missing_ages(df):
    # Numeric columns used to predict Age; Age itself must stay in column 0
    age_df = df[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
    # as_matrix() was removed from pandas; to_numpy() is the replacement
    known_age = age_df[age_df.Age.notnull()].to_numpy()
    unknown_age = age_df[age_df.Age.isnull()].to_numpy()
    y = known_age[:, 0]
    X = known_age[:, 1:]
    # Fit a RandomForestRegressor on the rows with a known age
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(X, y)
    # Predict the unknown ages with the fitted model
    predictedAges = rfr.predict(unknown_age[:, 1:])
    # Fill the missing entries with the predictions
    df.loc[(df.Age.isnull()), 'Age'] = predictedAges
    return df, rfr
data, rfr = set_missing_ages(data)

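As noted earlier, only 714 of the 891 passengers have a recorded age, so 177 values (about 20% of the total) are missing. Leaving this feature incomplete would likely hurt prediction quality. A random forest model is therefore trained on the rows with known ages and used to estimate the missing values, which are then written back into the dataset. After this step the Age column has 891 values.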
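The effect of this imputation step can be checked on synthetic data. The sketch below follows the same idea with a smaller forest, using randomly generated values and hypothetical names (toy, features, model). It verifies only the mechanics, that no age is missing afterwards and that known ages are left untouched, and says nothing about how accurate the estimates are.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the numeric columns (random values, not the real data)
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "Fare": rng.uniform(5, 100, n),
    "Parch": rng.integers(0, 3, n),
    "SibSp": rng.integers(0, 3, n),
    "Pclass": rng.integers(1, 4, n),
})
toy["Age"] = rng.uniform(1, 70, n)
toy.loc[toy.index % 5 == 0, "Age"] = np.nan  # remove every fifth age

features = ["Fare", "Parch", "SibSp", "Pclass"]
missing_mask = toy["Age"].isnull()
n_missing = int(missing_mask.sum())
known = toy[~missing_mask]
original_known = known["Age"].copy()

# Fit on the rows with a known age, then predict only the missing rows
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(known[features], known["Age"])
toy.loc[missing_mask, "Age"] = model.predict(toy.loc[missing_mask, features])

print(n_missing, toy["Age"].isnull().sum())
```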
import sklearn.preprocessing as preprocessing
dummies_Embarked = pd.get_dummies(data['Embarked'], prefix= 'Embarked')
dummies_Sex = pd.get_dummies(data['Sex'], prefix= 'Sex')
dummies_Pclass = pd.get_dummies(data['Pclass'], prefix= 'Pclass')
df = pd.concat([data, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
df.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked','SibSp'], axis=1, inplace=True)
scaler = preprocessing.StandardScaler()
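# StandardScaler expects 2-D input; fit one scaler per column so that
# the Age parameters are not overwritten when Fare is fitted
age_scale_param = preprocessing.StandardScaler().fit(df[['Age']])
df['Age_scaled'] = age_scale_param.transform(df[['Age']]).ravel()
fare_scale_param = preprocessing.StandardScaler().fit(df[['Fare']])
df['Fare_scaled'] = fare_scale_param.transform(df[['Fare']]).ravel()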
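First the categorical columns are one-hot encoded, which gives the logistic regression model data it can work with. Then the columns that are not used (such as Name and SibSp) and the original, un-encoded columns are dropped. Finally Age and Fare are standardized to zero mean and unit variance, because features on very different scales can hurt the convergence of the model (a numerical issue).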
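A small sketch of the two transformations on made-up values may help. It shows which columns get_dummies produces and that StandardScaler centers a column to mean 0 and standard deviation 1, so individual values can lie outside the interval from -1 to 1. The frame toy and its values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows (illustrative values only)
toy = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Fare": [7.25, 71.28, 8.05, 512.33],
})

# One-hot encoding: one indicator column per category
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")

# Standardization: subtract the mean, divide by the standard deviation
scaler = StandardScaler()
fare_scaled = scaler.fit_transform(toy[["Fare"]]).ravel()

print(list(dummies.columns))
print(fare_scaled)
```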
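from sklearn import linear_model
df.drop(['PassengerId', 'Age', 'Fare'], axis=1, inplace=True)
# Convert to a float array: column 0 is Survived, the rest are the features
train_np = df.to_numpy(dtype=float)
y = train_np[:, 0]
X = train_np[:, 1:]
# The L1 penalty needs a solver that supports it, such as liblinear
clf = linear_model.LogisticRegression(C=1.0, penalty='l1', solver='liblinear', tol=1e-6)  # hyperparameters
clf.fit(X, y)  # train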

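First the columns that are not needed (PassengerId and so on) are removed. The data is then converted to a NumPy array, with y holding the survival label and X the remaining features, and passed to the logistic regression model. Training with the chosen parameters yields the model clf.
4. Processing the test data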
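After fitting, the coefficients give a first impression of which features the model relies on. The following sketch uses synthetic data with hypothetical feature names (signal, noise_a, noise_b), where only the first feature actually drives the label, and pairs each coefficient with its name.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends on the first feature only (plus a little noise)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Same kind of model as above: L1-penalized logistic regression with liblinear
clf = LogisticRegression(C=1.0, penalty="l1", solver="liblinear", tol=1e-6)
clf.fit(X, y)

# coef_ has one row per decision function; pair the values with feature names
coefs = pd.Series(clf.coef_[0], index=["signal", "noise_a", "noise_b"])
train_accuracy = clf.score(X, y)

print(coefs)
print(train_accuracy)
```

For the model in this post, the index would be the column names of df without Survived. Accuracy on the training data is optimistic, so a held-out split or cross-validation would give a fairer estimate.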
data_test = pd.read_csv("titanicttest.csv")
# Bring the test data into the form expected by the age-imputation code above
data_test.loc[(data_test.Fare.isnull()), 'Fare'] = 0
tmp_df = data_test[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
null_age = tmp_df[data_test.Age.isnull()].to_numpy()
# Fill in the missing ages in the test data with the regressor fitted on the training data
X = null_age[:, 1:]
predictedAges = rfr.predict(X)
data_test.loc[(data_test.Age.isnull()), 'Age'] = predictedAges
#data_test = set_Cabin_type(data_test)
# One-hot encode the categorical columns, as for the training data
dummies_Embarked = pd.get_dummies(data_test['Embarked'], prefix='Embarked')
dummies_Sex = pd.get_dummies(data_test['Sex'], prefix='Sex')
dummies_Pclass = pd.get_dummies(data_test['Pclass'], prefix='Pclass')
df_test = pd.concat([data_test, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
# Drop the unused columns and standardize Age and Fare
df_test.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)
# Fit the scalers on the training values (data still holds the filled-in Age and Fare)
# and only transform the test values, so both sets share one scale
age_scale_param = preprocessing.StandardScaler().fit(data[['Age']])
df_test['Age_scaled'] = age_scale_param.transform(df_test[['Age']]).ravel()
fare_scale_param = preprocessing.StandardScaler().fit(data[['Fare']])
df_test['Fare_scaled'] = fare_scale_param.transform(df_test[['Fare']]).ravel()
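Whatever processing was applied to the training data must be applied to the test data in the same way, so that both end up with the same columns and the same data types. See the comments in the code for the individual steps.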
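One way this consistency can break is when a category present in the training data does not occur in the test data, so that get_dummies produces fewer columns. As a hedged sketch on made-up frames (train_raw and test_raw are illustrative names), reindexing the test columns to the training columns restores the same layout.

```python
import pandas as pd

# Made-up frames: the toy test set happens to contain no "Q"
train_raw = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
test_raw = pd.DataFrame({"Embarked": ["S", "S", "C"]})

train_X = pd.get_dummies(train_raw["Embarked"], prefix="Embarked")
test_X = pd.get_dummies(test_raw["Embarked"], prefix="Embarked")

# Reindex to the training columns: missing dummies become 0, order is fixed
test_X = test_X.reindex(columns=train_X.columns, fill_value=0)

print(list(test_X.columns))
```

Because the model above is trained on a plain array, the column order matters as much as the column set.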
df_test.drop(['PassengerId', 'Age', 'Fare','SibSp'], axis=1, inplace=True)
predictions = clf.predict(df_test.to_numpy(dtype=float))
# Cast to int so the Survived column holds 0/1 rather than 0.0/1.0
result = pd.DataFrame({'PassengerId':data_test['PassengerId'], 'Survived':predictions.astype(int)})
result.to_csv("predictions.csv",index = False) # index=False leaves the index out of the file


The processed df_test is fed to the clf model for inference. The result is stored in the variable predictions and exported to predictions.csv.
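Before uploading, it can be worth checking that the file has the expected shape. The sketch below writes a made-up result to an in-memory buffer instead of a file and reads it back. The IDs and labels are invented for illustration.

```python
import io
import numpy as np
import pandas as pd

# Made-up IDs and float labels, as a model trained on a float y would return them
passenger_ids = pd.Series([892, 893, 894])
predictions = np.array([0.0, 1.0, 0.0])

result = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions.astype(int),  # 0/1 instead of 0.0/1.0
})

# Write to an in-memory buffer and read it back to inspect the format
buffer = io.StringIO()
result.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
check = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
```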
References:
1. http://www.cnblogs.com/zhizhan/p/5238908.html
2. <> (one-hot encoding)
