
Heart Disease Prediction with Machine Learning (4): Logistic Regression


Contents

    • 1. Introduction to Logistic Regression
    • 2. Core Code
    • 3. Evaluation Metrics
      • 3.1 Confusion Matrix
      • 3.2 Precision
      • 3.3 Recall
      • 3.4 F-score
      • 3.5 False Negative Rate
      • 3.6 ROC Curve and AUC

1. Introduction to Logistic Regression

Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression, which outputs continuous numeric values, logistic regression transforms its output with the logistic sigmoid function to return a probability, which can then be mapped to two or more discrete classes.
Types of logistic regression:

  • Binary (yes/no)
  • Multi-class (cat/dog/sheep)

Loss function:
J(\theta) = \frac{1}{m}\sum_{i=1}^{m}{\mathrm{Cost}\left( h_{\theta}(x^{(i)}),\, y^{(i)} \right)}

\mathrm{Cost}\left( h_{\theta}(x), y \right) = -\log\left( h_{\theta}(x) \right) \quad \text{if } y=1
\mathrm{Cost}\left( h_{\theta}(x), y \right) = -\log\left( 1-h_{\theta}(x) \right) \quad \text{if } y=0

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}{\left[ y^{(i)}\log\left( h_{\theta}(x^{(i)}) \right) + \left( 1-y^{(i)} \right)\log\left( 1-h_{\theta}(x^{(i)}) \right) \right]}
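For reference, the sigmoid hypothesis and the cost J(θ) above can be written in a few lines of NumPy. This is a minimal illustrative sketch, not the scikit-learn implementation used in the next section:

    import numpy as np

    def sigmoid(z):
        # Logistic sigmoid: squashes any real value into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def cost(theta, X, y):
        # Cross-entropy cost J(theta): X is an (m, n) feature matrix,
        # y is an (m,) vector of 0/1 labels, theta is an (n,) weight vector
        h = sigmoid(X @ theta)
        return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))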

2. Core Code

First, import the required libraries and load the dataset:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import warnings
    warnings.filterwarnings('ignore')
    %matplotlib inline

    # Load the heart disease dataset and preview the first rows
    data = pd.read_csv('heart.csv', sep=',')
    data.head()

Output: the first five rows of the dataset.
Next, split the data into a training set and a test set (80% training, 20% test):

    from sklearn.model_selection import train_test_split

    # Features are every column except the label; "target" is the label column
    predictors = data.drop("target", axis=1)
    target = data["target"]

    # Hold out 20% of the records as a test set
    X_train, X_test, Y_train, Y_test = train_test_split(predictors, target, test_size=0.20, random_state=0)
    print("Training features have {0} records and Testing features have {1} records.".\
      format(X_train.shape[0], X_test.shape[0]))

Output:
Training features have 242 records and Testing features have 61 records.
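As a side note, train_test_split also accepts a stratify argument that keeps the class ratio of target identical in both splits. A minimal sketch, not what produced the results in this post:

    # Alternative split that preserves the proportion of positive/negative cases
    X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(
        predictors, target, test_size=0.20, random_state=0, stratify=target)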

The core logistic regression code is as follows:

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Fit a logistic regression model on the training set
    logreg = LogisticRegression()
    logreg.fit(X_train, Y_train)

    # Predict labels for the held-out test set
    y_pred_lr = logreg.predict(X_test)
    print(y_pred_lr)

Accuracy:

    # Percentage of test records classified correctly, rounded to two decimals
    score_lr = round(accuracy_score(Y_test, y_pred_lr) * 100, 2)

    print("The logistic regression accuracy is: " + str(score_lr) + "%")

Output:
The logistic regression accuracy is: 85.25%

Alternatively, you can write your own training helper:

    # Define a training helper
    def train_model(x_train, y_train, x_test, y_test, classifier, **kwargs):
        """
        Fit the model and print its train/test accuracy.
        """
        model = classifier(**kwargs)
        model.fit(x_train, y_train)
        fit_accuracy = model.score(x_train, y_train)
        test_accuracy = model.score(x_test, y_test)

        print(f"Train accuracy: {fit_accuracy:0.2%}")
        print(f"Test accuracy: {test_accuracy:0.2%}")

        return model

Using the helper with logistic regression:

    # Logistic Regression
    from sklearn.linear_model import LogisticRegression
    # Train and evaluate using the helper defined above
    model = train_model(X_train, Y_train, X_test, Y_test, LogisticRegression)

Output:
Train accuracy: 84.71%
Test accuracy: 85.25%
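Because train_model forwards **kwargs to the classifier, hyper-parameters can be passed straight through. For example, if LogisticRegression raises a convergence warning (warnings are silenced at the top of this post), a larger max_iter can be supplied. The value below is an assumed example, not taken from the original run:

    # Hypothetical example: pass max_iter through train_model's **kwargs
    model = train_model(X_train, Y_train, X_test, Y_test,
                        LogisticRegression, max_iter=1000)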

Trying a different solver:

    # Note: this snippet fits on the test split, so the score below is optimistic
    clf = LogisticRegression(random_state=0, solver='newton-cg',
                             multi_class='multinomial').fit(X_test, Y_test)
    clf.score(X_test, Y_test)

Output:
0.9344262295081968
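Note that the snippet above both fits and scores on the test split, which is why the score is higher than the earlier test accuracy. The more conventional pattern is to fit on the training split and evaluate on the held-out test split; a minimal sketch (this will not reproduce the 0.93 above):

    # Fit the newton-cg solver on the training data, then score on the test data
    clf = LogisticRegression(random_state=0, solver='newton-cg').fit(X_train, Y_train)
    print(clf.score(X_test, Y_test))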

3. Evaluation Metrics

3.1 Confusion Matrix

    from sklearn.metrics import confusion_matrix

    # Rows are actual labels, columns are predicted labels
    matrix = confusion_matrix(Y_test, y_pred_lr)
    sns.heatmap(matrix, annot=True, fmt="d")

Output: a heatmap of the confusion matrix.
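To relate the heatmap to the counts used in the later metrics, the four cells of the 2x2 matrix can be unpacked directly. For binary labels 0/1, sklearn orders them as [[TN, FP], [FN, TP]]:

    # Unpack the confusion matrix into its four counts
    tn, fp, fn, tp = matrix.ravel()
    print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)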

3.2 Precision

Precision is the fraction of predicted positives that are actually positive, TP / (TP + FP):

    from sklearn.metrics import precision_score

    # Fraction of predicted positives that are actually positive
    precision = precision_score(Y_test, y_pred_lr)
    print("Precision: ", precision)

Output:
Precision: 0.8571428571428571

3.3 Recall

Recall (sensitivity) is the fraction of actual positives that are correctly identified, TP / (TP + FN):

    from sklearn.metrics import recall_score

    # Fraction of actual positives that the model correctly identifies
    recall = recall_score(Y_test, y_pred_lr)
    print("Recall is: ", recall)

Output:
Recall is: 0.8823529411764706

3.4 F-score

The F1 score is the harmonic mean of precision and recall:

    # F1 score: harmonic mean of precision and recall
    print((2 * precision * recall) / (precision + recall))

Output:
0.8695652173913043
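The same value can be obtained from sklearn directly, which is a handy cross-check:

    from sklearn.metrics import f1_score
    # F1 score computed by sklearn; should match the manual calculation above
    print(f1_score(Y_test, y_pred_lr))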

3.5 False Negative Rate

The false negative rate is the share of actual positives that the model misses, FN / (FN + TP), expressed here as a percentage:

    # Cross-tabulate actual (rows) vs. predicted (columns) labels
    CM = pd.crosstab(Y_test, y_pred_lr)
    TN = CM.iloc[0, 0]
    FP = CM.iloc[0, 1]
    FN = CM.iloc[1, 0]
    TP = CM.iloc[1, 1]
    # False negative rate as a percentage: positives the model misses
    fnr = FN * 100 / (FN + TP)
    fnr

Output:
11.764705882352942
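Note that the false negative rate is simply 1 minus recall, so this agrees with the recall computed above: 100 × (1 − 0.8824) ≈ 11.76%.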

3.6 ROC Curve and AUC

    from sklearn.metrics import roc_curve, auc

    # Predicted probabilities; column 1 is the probability of the positive class
    y_proba = logreg.predict_proba(X_test)
    fpr, tpr, thresholds = roc_curve(Y_test, y_proba[:, 1])

    # Plot the ROC curve together with the chance diagonal
    fig, ax = plt.subplots()
    ax.plot(fpr, tpr)
    ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.rcParams['font.size'] = 12
    plt.title('ROC curve for heart disease classifier')
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.grid(True)

Output: the ROC curve plot for the heart disease classifier.

Check the AUC:

    # Area under the ROC curve computed from the points above
    auc(fpr, tpr)

Output:
0.9074074074074073
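Equivalently, the AUC can be computed straight from the labels and the predicted probabilities, which should give the same value:

    from sklearn.metrics import roc_auc_score
    # AUC computed directly from labels and positive-class probabilities
    print(roc_auc_score(Y_test, y_proba[:, 1]))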
