Machine-Learning-Based Heart Disease Prediction (5): Random Forest

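I. Random Forest

1.1 Introduction to Random Forest

Random forest is a supervised learning algorithm that handles both classification and regression problems; regression uses a random forest regressor. This project uses a random forest classifier, so only classification is covered here.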

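1.2 The Random Forest Algorithm

1. Randomly select k features from the total of m features (k < m).
2. Among the k features, compute node d using the best split point.
3. Split the node into child nodes using the best split.
4. Repeat steps 1-3 until the tree reaches the target number of nodes.
5. Repeat steps 1-4 n times to grow n trees, which together form the forest.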
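
The steps above can be sketched with scikit-learn's `DecisionTreeClassifier`, whose `max_features` parameter makes each split consider only a random subset of features. This is a minimal illustration, not the code used in this project: the helper name `build_forest`, the synthetic data, and the bootstrap row sampling (standard in random forests, but not listed in the steps above) are all assumptions added for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def build_forest(X, y, n_trees=10, k="sqrt", seed=0):
    """Grow n_trees decision trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw len(X) row indices with replacement (assumed step).
        idx = rng.integers(0, len(X), size=len(X))
        # max_features=k: every split looks at a random subset of k features
        # (steps 1-3); the tree keeps splitting until it stops (step 4).
        tree = DecisionTreeClassifier(
            max_features=k, random_state=int(rng.integers(1_000_000))
        )
        tree.fit(X[idx], y[idx])
        forest.append(tree)  # step 5: collect n trees into a forest
    return forest


# Synthetic stand-in data, not the heart disease dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
forest = build_forest(X_demo, y_demo, n_trees=10)
print(len(forest), "trees built")
```

In practice `RandomForestClassifier` does all of this internally; the sketch only makes the individual steps visible.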

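1.3 Random Forest Prediction Pseudocode

1. Take the test features, apply the rules of each randomly built decision tree to predict an outcome, and store each prediction.
2. Count the votes for each predicted target.
3. Take the predicted target with the most votes as the forest's final prediction.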
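
The voting step can be sketched in a few lines of NumPy. The function name `majority_vote` and the toy predictions below are hypothetical, chosen only to illustrate the pseudocode.

```python
import numpy as np


def majority_vote(tree_predictions):
    """tree_predictions: shape (n_trees, n_samples), integer class labels."""
    preds = np.asarray(tree_predictions)
    n_classes = preds.max() + 1
    # Count the votes per class for every sample (one column per sample).
    votes = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    # The class with the most votes wins; ties go to the lowest label.
    return votes.argmax(axis=0)


# Three trees, four test samples.
tree_predictions = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
final = majority_vote(tree_predictions)
print(final)  # [1 1 1 0]
```

Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities instead of counting hard votes, so its result can differ from this sketch in close cases.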

II. Core Code

First, import the required libraries and load the dataset:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import warnings
    warnings.filterwarnings('ignore')
    %matplotlib inline
    data = pd.read_csv('heart.csv', sep=',')
    data.head()

Output: the first five rows of the dataset (table not shown).

Next, split the data into training and test sets (80% training, 20% test):

    from sklearn.model_selection import train_test_split
    
    predictors = data.drop("target",axis=1)
    target = data["target"]
    
    X_train,X_test,Y_train,Y_test = train_test_split(predictors,target,test_size=0.20,random_state=0)
    print("Training features have {0} records and Testing features have {1} records.".\
      format(X_train.shape[0], X_test.shape[0]))

Output:
Training features have 242 records and Testing features have 61 records.

The core random forest code:

    from sklearn.ensemble import RandomForestClassifier
    randfor = RandomForestClassifier(n_estimators=100, random_state=0)
    
    randfor.fit(X_train, Y_train)
    
    y_pred_rf = randfor.predict(X_test)  # predict on the test set

Visualizing the learning curve:

    from sklearn.model_selection import learning_curve
    train_sizes, train_scores, test_scores = learning_curve(RandomForestClassifier(), X_train, 
                                                        Y_train,cv=10,
                                                        scoring='accuracy',n_jobs=-1, 
                                                        train_sizes=np.linspace(0.01, 1.0, 50))
    
    # Mean and standard deviation of the training scores
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    
    # Mean and standard deviation of the cross-validation scores
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)
    
    # Plot the training scores and the cross-validation scores
    plt.plot(train_sizes, train_mean, '--', color="#111111",  label="Training score")
    plt.plot(train_sizes, test_mean, color="#111111", label="Cross-validation score")
    
    # Shade one standard deviation around each curve
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color="#DDDDDD")
    plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, color="#DDDDDD")
    
    plt.title("Learning Curve")
    plt.xlabel("Training Set Size")
    plt.ylabel("Accuracy Score")
    plt.legend(loc="best")
    plt.tight_layout()
    plt.show()

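Random forest accuracy:

    from sklearn.metrics import accuracy_score
    
    # accuracy_score expects (y_true, y_pred)
    score_rf = round(accuracy_score(Y_test, y_pred_rf)*100, 2)
    print("Random forest accuracy: "+str(score_rf)+" %")

Output:
Random forest accuracy: 88.52 %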

A random forest with 100 trees:

    #Random forest with 100 trees
    from sklearn.ensemble import RandomForestClassifier
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_train, Y_train)
    print("Accuracy on training set: {:.3f}".format(rf.score(X_train, Y_train)))
    print("Accuracy on test set: {:.3f}".format(rf.score(X_test, Y_test)))

Output:
Accuracy on training set: 1.000
Accuracy on test set: 0.885

Limit the tree depth and check the accuracy again:

    rf1 = RandomForestClassifier(max_depth=3, n_estimators=100, random_state=0)
    rf1.fit(X_train, Y_train)
    print("Accuracy on training set: {:.3f}".format(rf1.score(X_train, Y_train)))
    print("Accuracy on test set: {:.3f}".format(rf1.score(X_test, Y_test)))

Output:
Accuracy on training set: 0.876
Accuracy on test set: 0.869
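
To compare more than one depth, a small loop with cross-validation is one option. The sketch below runs on synthetic data from `make_classification` rather than `heart.csv`, and the depth values, `n_estimators=50`, and `cv=5` are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 13 features, not the heart disease dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

depth_scores = {}
for depth in [2, 3, 5, None]:  # None puts no limit on tree depth
    model = RandomForestClassifier(max_depth=depth, n_estimators=50, random_state=0)
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    depth_scores[depth] = scores.mean()
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")
```

Choosing the depth by cross-validation on the training data keeps the test set out of the tuning loop, so the final test accuracy remains an honest estimate.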

III. Evaluation Metrics

3.1 Confusion Matrix

    from sklearn.metrics import confusion_matrix
    matrix= confusion_matrix(Y_test, y_pred_rf)
    sns.heatmap(matrix,annot = True, fmt = "d")

Output: confusion-matrix heatmap (figure not shown).
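
For reading the matrix, it helps to know that `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns. The toy labels below are made up purely to show the layout.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, used only to show the layout of the matrix.
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)
# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN, FP, FN, TP =", tn, fp, fn, tp)  # TN, FP, FN, TP = 1 1 1 2
```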

3.2 Precision

    from sklearn.metrics import precision_score
    precision = precision_score(Y_test, y_pred_rf)
    print("Precision: ",precision)

Output:
Precision: 0.909090909090909

3.3 Recall

    from sklearn.metrics import recall_score
    recall = recall_score(Y_test, y_pred_rf)
    print("Recall is: ",recall)

Output:
Recall is: 0.8823529411764706

3.4 F1 Score

    print((2*precision*recall)/(precision+recall))

Output:
0.8955223880597014

3.5 False Negative Rate (FNR)

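    # Rows are the true labels, columns are the predicted labels
    CM = pd.crosstab(Y_test, y_pred_rf)
    TN = CM.iloc[0,0]
    FP = CM.iloc[0,1]
    FN = CM.iloc[1,0]
    TP = CM.iloc[1,1]
    # False negative rate as a percentage: FN / (FN + TP)
    fnr = FN*100/(FN+TP)
    fnr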

Output:
11.764705882352942
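
As a sanity check, all of the metrics above can be recomputed from four confusion-matrix counts. The counts used below (TP=30, FP=3, FN=4, TN=24) are not printed anywhere in this article; they are inferred as the only whole-number combination over the 61 test records that is consistent with the reported precision, recall, and accuracy, so treat them as an assumption.

```python
# Counts inferred from the reported figures (assumption, not article output).
TP, FP, FN, TN = 30, 3, 4, 24

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
fnr = FN / (FN + TP)  # equals 1 - recall

print(f"accuracy  = {accuracy:.4f}")   # 0.8852
print(f"precision = {precision:.4f}")  # 0.9091
print(f"recall    = {recall:.4f}")     # 0.8824
print(f"F1        = {f1:.4f}")         # 0.8955
print(f"FNR       = {fnr:.4f}")        # 0.1176
```

The false negative rate is simply one minus the recall, which is why 11.76% and 88.24% add up to 100%.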

3.6 ROC Curve

    from sklearn.metrics import roc_curve, auc
    y_pred=randfor.predict(X_test)
    y_proba=randfor.predict_proba(X_test)
    fpr, tpr, thresholds = roc_curve(Y_test, y_proba[:,1])
    
    fig, ax = plt.subplots()
    ax.plot(fpr, tpr)
    ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.rcParams['font.size'] = 12
    plt.title('ROC curve for heart disease classifier')
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.grid(True)

Output:
(omitted)

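3.7 AUC

Another common metric is the area under the ROC curve (AUC). It conveniently summarizes a model's performance in a single number, although it has limitations of its own. As a rule of thumb, AUC values can be graded as follows:
0.90-1.00 = excellent
0.80-0.90 = good
0.70-0.80 = fair
0.60-0.70 = poor
0.50-0.60 = fail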

    auc(fpr, tpr)

Output:
0.9389978213507625
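
One way to read the AUC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below checks this on four made-up scores (not taken from this project's model) against scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores, used only for illustration.
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores_demo[y_true_demo == 1]
neg = scores_demo[y_true_demo == 0]

# Compare every positive score with every negative score; ties count half.
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
pairwise_auc = (wins + 0.5 * ties) / (len(pos) * len(neg))

sklearn_auc = roc_auc_score(y_true_demo, scores_demo)
print(pairwise_auc, sklearn_auc)  # 0.75 0.75
```

Three of the four positive-negative pairs are ranked correctly, which gives 0.75 by both routes.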
