Machine Learning Methods for Heart Disease Prediction (5): Random Forest
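Contents
1. Random Forest
- 1.1 Introduction to Random Forest
- 1.2 The Random Forest Algorithm
- 1.3 Random Forest Prediction Pseudocode
2. Core Code
3. Evaluation Metrics
- 3.1 Confusion Matrix
- 3.2 Precision
- 3.3 Recall
- 3.4 F-score
- 3.5 False Negative Rate (FNR)
- 3.6 ROC Curve
- 3.7 AUC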
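1. Random Forest
1.1 Introduction to Random Forest
Random forest is a supervised learning algorithm that can be used for both classification and regression; regression problems are handled with a random forest regressor. This project uses a random forest classifier, so only classification is covered here.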
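1.2 The Random Forest Algorithm
1. Randomly select k features from the total of m features (k < m).
2. Among the k features, compute node d using the best split point.
3. Split the node into child nodes using the best split.
4. Repeat steps 1-3 until the tree has reached the target number of nodes, l.
5. Build the forest by repeating steps 1-4 n times to create n trees.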
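The steps above can be sketched roughly as follows. This is a minimal sketch, not the code used in this project. It assumes scikit-learn's DecisionTreeClassifier as the base tree: max_features=k makes each split consider k randomly chosen features (steps 1-3), and max_leaf_nodes is used as a stand-in for the size limit in step 4. Each tree is also trained on a bootstrap sample of the rows, which standard random forests do but the list above does not mention. The helper name build_forest and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=10, k=3, max_leaf_nodes=16, seed=0):
    """Hypothetical helper: grow n_trees randomized decision trees."""
    rng = np.random.default_rng(seed)
    forest = []
    for i in range(n_trees):
        # Assumption: draw a bootstrap sample of the rows (with replacement)
        rows = rng.integers(0, len(X), size=len(X))
        tree = DecisionTreeClassifier(
            max_features=k,                 # steps 1-3: best split among k random features
            max_leaf_nodes=max_leaf_nodes,  # step 4: stop growing at a size limit
            random_state=seed + i,
        )
        tree.fit(X[rows], y[rows])
        forest.append(tree)                 # step 5: collect the trees into a forest
    return forest

# Synthetic stand-in data (not the heart disease dataset)
X, y = make_classification(n_samples=200, n_features=13, random_state=0)
forest = build_forest(X, y, n_trees=10, k=3)
print(len(forest), "trees built")
```

In practice RandomForestClassifier handles all of this internally; the sketch only makes the individual steps visible.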
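1.3 Random Forest Prediction Pseudocode
- Take the test features, apply the rules of each randomly created decision tree to predict the outcome, and store each predicted outcome.
- Count the votes for each predicted target.
- Take the predicted target with the most votes as the final prediction of the random forest.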
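The voting procedure above can be illustrated with a small pure-Python sketch. The helper majority_vote and the example predictions are hypothetical and stand in for the output of real trees.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Combine per-tree predictions by majority vote.

    tree_predictions: one inner list per tree; each inner list holds
    that tree's predicted label for every test sample.
    """
    final = []
    # zip(*...) regroups the predictions sample by sample
    for votes_for_sample in zip(*tree_predictions):
        counts = Counter(votes_for_sample)          # votes per predicted target
        final.append(counts.most_common(1)[0][0])   # label with the most votes
    return final

# Three hypothetical trees predicting labels for four test samples
tree_predictions = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
print(majority_vote(tree_predictions))  # [1, 1, 1, 0]
```

Note that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so its result can differ slightly from this sketch.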
2. Core Code
First, import the required libraries and load the dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline  # Jupyter/IPython magic; remove when running as a plain script
data = pd.read_csv('heart.csv', sep=',')
data.head()
Output:
(first five rows of the dataset; not reproduced here)

Next, split the data into training and test sets (80% training, 20% test):
from sklearn.model_selection import train_test_split
predictors = data.drop("target",axis=1)
target = data["target"]
X_train,X_test,Y_train,Y_test = train_test_split(predictors,target,test_size=0.20,random_state=0)
print("Training features have {0} records and Testing features have {1} records.".format(
    X_train.shape[0], X_test.shape[0]))
Output:
Training features have 242 records and Testing features have 61 records.

The core random forest code is as follows:
from sklearn.ensemble import RandomForestClassifier
randfor = RandomForestClassifier(n_estimators=100, random_state=0)
randfor.fit(X_train, Y_train)
y_pred_rf = randfor.predict(X_test)  # predict labels for the test set
Visualizing the learning curve:
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(), X_train, Y_train,
    cv=10, scoring='accuracy', n_jobs=-1,
    train_sizes=np.linspace(0.01, 1.0, 50))
# Mean and standard deviation of the training scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
# Mean and standard deviation of the cross-validation scores
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
# Plot the mean training and cross-validation scores
plt.plot(train_sizes, train_mean, '--', color="#111111", label="Training score")
plt.plot(train_sizes, test_mean, color="#111111", label="Cross-validation score")
# Shade one standard deviation around each mean
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color="#DDDDDD")
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, color="#DDDDDD")
plt.title("Learning Curve")
plt.xlabel("Training Set Size")
plt.ylabel("Accuracy Score")
plt.legend(loc="best")
plt.tight_layout()
plt.show()
Random forest accuracy:
from sklearn.metrics import accuracy_score
score_rf = round(accuracy_score(Y_test, y_pred_rf)*100, 2)
print("Random forest accuracy: "+str(score_rf)+" %")
Output:
Random forest accuracy: 88.52 %
Random forest with 100 trees:
#Random forest with 100 trees
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, Y_train)
print("Accuracy on training set: {:.3f}".format(rf.score(X_train, Y_train)))
print("Accuracy on test set: {:.3f}".format(rf.score(X_test, Y_test)))
Output:
Accuracy on training set: 1.000
Accuracy on test set: 0.885
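The training accuracy of 1.000 above suggests that the fully grown trees overfit the training data, so next limit the tree depth (max_depth=3) and check the accuracy again: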
rf1 = RandomForestClassifier(max_depth=3, n_estimators=100, random_state=0)
rf1.fit(X_train, Y_train)
print("Accuracy on training set: {:.3f}".format(rf1.score(X_train, Y_train)))
print("Accuracy on test set: {:.3f}".format(rf1.score(X_test, Y_test)))
Output:
Accuracy on training set: 0.876
Accuracy on test set: 0.869
3. Evaluation Metrics
3.1 Confusion Matrix
from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(Y_test, y_pred_rf)
sns.heatmap(matrix, annot=True, fmt="d")
Output:
(confusion matrix heatmap; not reproduced here)
3.2 Precision
from sklearn.metrics import precision_score
precision = precision_score(Y_test, y_pred_rf)
print("Precision: ",precision)
Output:
Precision: 0.909090909090909
3.3 Recall
from sklearn.metrics import recall_score
recall = recall_score(Y_test, y_pred_rf)
print("Recall is: ",recall)
Output:
Recall is: 0.8823529411764706
3.4 F-score
# F1 score: the harmonic mean of precision and recall
print((2*precision*recall)/(precision+recall))
Output:
0.8955223880597014
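3.5 False Negative Rate (FNR)
# Cross-tabulate actual labels (rows) against predicted labels (columns)
CM = pd.crosstab(Y_test, y_pred_rf)
TN = CM.iloc[0,0]
FP = CM.iloc[0,1]
FN = CM.iloc[1,0]
TP = CM.iloc[1,1]
# False negative rate as a percentage: FN / (FN + TP)
fnr = FN*100/(FN+TP)
fnr
Output: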
11.764705882352942
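The metrics in sections 3.2 to 3.5 all derive from the four confusion-matrix counts. The counts used below (TN=24, FP=3, FN=4, TP=30) are not printed anywhere in this article. They are inferred as the combination of 61 test samples that is consistent with the reported precision, recall, and accuracy, so treat them as an assumption rather than the author's output.

```python
# Assumed counts, inferred from the reported metrics (not printed in the article)
TN, FP, FN, TP = 24, 3, 4, 30

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
fnr = 100 * FN / (FN + TP)       # false negative rate, in percent

print(round(accuracy * 100, 2))  # 88.52
print(round(precision, 4))       # 0.9091
print(round(recall, 4))          # 0.8824
print(round(f1, 4))              # 0.8955
print(round(fnr, 2))             # 11.76
```

The false negative rate is simply 100% minus recall expressed as a percentage, which is why 11.76 and 88.24 add up to 100.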
3.6 ROC Curve
from sklearn.metrics import roc_curve, auc
# Predicted class probabilities; column 1 is the positive class
y_proba = randfor.predict_proba(X_test)
fpr, tpr, thresholds = roc_curve(Y_test, y_proba[:,1])
fig, ax = plt.subplots()
ax.plot(fpr, tpr)
# Diagonal reference line: a classifier no better than chance
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.rcParams['font.size'] = 12
plt.title('ROC curve for heart disease classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
Output:
(omitted)
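3.7 AUC
Another common metric is the area under the curve (AUC). It is a convenient way to capture a model's performance in a single number, though it is not without its problems. As a rule of thumb, AUC values can be graded as follows:
0.90-1.00 = excellent
0.80-0.90 = good
0.70-0.80 = fair
0.60-0.70 = poor
0.50-0.60 = fail
auc(fpr, tpr)
Output: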
0.9389978213507625
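One way to read an AUC value: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counting as half. The following is a minimal pure-Python sketch of that interpretation. The helper pairwise_auc and the scores are hypothetical and are not the model's actual output.

```python
def pairwise_auc(y_true, scores):
    """Hypothetical helper: AUC as the fraction of correctly ranked pairs."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))  # 0.75
```

For the same inputs this should agree with the trapezoidal area that auc(fpr, tpr) computes from the ROC curve; the pairwise form is just easier to reason about.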
