Machine Learning Methods for Heart Disease Prediction (4): Logistic Regression
Contents
- 1. Introduction to Logistic Regression
- 2. Core Code
- 3. Evaluation Metrics
  - 3.1 Confusion Matrix
  - 3.2 Precision
  - 3.3 Recall
  - 3.4 F-score
  - 3.5 False Negative Rate (FNR)
  - 3.6 ROC Curve and AUC
1. Introduction to Logistic Regression
Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression, which outputs continuous values, logistic regression passes its output through the logistic sigmoid function to produce a probability, which can then be mapped to two or more discrete classes (a small sketch of this mapping follows the list of types below).
Types of logistic regression:
- Binary (yes/no)
- Multiclass (cat/dog/sheep)
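As a minimal sketch of that sigmoid mapping, assuming nothing beyond NumPy (the 0.5 threshold is the common default, not something fixed by the algorithm):

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# z plays the role of theta^T x for some parameters theta and features x
z = np.array([-2.0, 0.0, 3.0])
probs = sigmoid(z)                    # about [0.12, 0.50, 0.95]
labels = (probs >= 0.5).astype(int)   # map probabilities to the two classes
print(probs, labels)
```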
Loss function:

$$
J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}\left(h_{\theta}(x^{(i)}),\,y^{(i)}\right)
$$

$$
\mathrm{Cost}\left(h_{\theta}(x),\,y\right)=
\begin{cases}
-\log\left(h_{\theta}(x)\right) & y=1\\[4pt]
-\log\left(1-h_{\theta}(x)\right) & y=0
\end{cases}
$$

That is,

$$
J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log\left(h_{\theta}(x^{(i)})\right)+\left(1-y^{(i)}\right)\log\left(1-h_{\theta}(x^{(i)})\right)\right]
$$
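As an illustrative sketch only, the combined cost can be written in a few lines of NumPy (the names h and y are assumptions for this example, not part of any library API):

```python
import numpy as np

def cross_entropy_cost(h, y, eps=1e-12):
    # h: predicted probabilities h_theta(x^(i)); y: true labels in {0, 1}
    h = np.clip(h, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Confident, correct predictions give a small cost; wrong ones a large cost
y = np.array([1, 0, 1, 0])
print(cross_entropy_cost(np.array([0.9, 0.1, 0.8, 0.2]), y))  # ~0.16
print(cross_entropy_cost(np.array([0.1, 0.9, 0.2, 0.8]), y))  # ~1.96
```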
2. Core Code
First, import the required libraries and load the dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
data = pd.read_csv('heart.csv', sep=',')
data.head()
Output:

Next, split the data into a training set and a test set (80% training, 20% test):
from sklearn.model_selection import train_test_split
predictors = data.drop("target",axis=1)
target = data["target"]
X_train,X_test,Y_train,Y_test = train_test_split(predictors,target,test_size=0.20,random_state=0)
print("Training features have {0} records and Testing features have {1} records.".\
format(X_train.shape[0], X_test.shape[0]))
Output:
Training features have 242 records and Testing features have 61 records.
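As an optional refinement not used above, passing stratify=target keeps the proportion of positive and negative cases the same in both subsets, which can matter for a dataset this small. A sketch, assuming the same predictors and target variables:

```python
from sklearn.model_selection import train_test_split

# Stratified 80/20 split: class ratios in train and test mirror the full data
X_tr, X_te, y_tr, y_te = train_test_split(
    predictors, target, test_size=0.20, random_state=0, stratify=target)
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```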

The core logistic regression code is as follows:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
y_pred_lr = logreg.predict(X_test)
print(y_pred_lr)
Accuracy:
score_lr = round(accuracy_score(Y_test, y_pred_lr)*100, 2)
print("Logistic regression accuracy: " + str(score_lr) + "%")
Output:
Logistic regression accuracy: 85.25%
Alternatively, write your own training helper:
# Define a training helper
def train_model(x_train, y_train, x_test, y_test, classifier, **kwargs):
    """
    Fit the model and print train/test accuracy
    """
    model = classifier(**kwargs)
    model.fit(x_train, y_train)
    fit_accuracy = model.score(x_train, y_train)
    test_accuracy = model.score(x_test, y_test)
    print(f"Train accuracy:{fit_accuracy:0.2%}")
    print(f"Test accuracy:{test_accuracy:0.2%}")
    return model
Apply it to logistic regression:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
model = train_model(X_train, Y_train, X_test, Y_test, LogisticRegression)
Output:
Train accuracy:84.71%
Test accuracy:85.25%
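One caveat worth knowing: warnings.filterwarnings('ignore') at the top hides any convergence warnings from the default lbfgs solver, which are common on unscaled features like these. A sketch of one common remedy, scaling the features inside a pipeline (the variable name pipe is just illustrative):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Standardize the features, then fit logistic regression on the scaled data
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, Y_train)
print("Test accuracy:", pipe.score(X_test, Y_test))
```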
Solver (optimizer): note that this snippet fits and scores on the same test set, so the number it reports is optimistic; a fairer comparison that fits on the training set is sketched after the output below.
clf = LogisticRegression(random_state=0, solver='newton-cg',
                         multi_class='multinomial').fit(X_test, Y_test)
clf.score(X_test, Y_test)
Output:
0.9344262295081968
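As mentioned above, fitting on X_test inflates that score. A sketch of a fairer solver comparison under the same train/test protocol used in the rest of this article (the solver list is just an example):

```python
from sklearn.linear_model import LogisticRegression

# Fit each solver on the training set and score on the held-out test set
for solver in ["lbfgs", "newton-cg", "liblinear"]:
    clf = LogisticRegression(solver=solver, max_iter=1000, random_state=0)
    clf.fit(X_train, Y_train)
    print(solver, clf.score(X_test, Y_test))
```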
3. Evaluation Metrics
3.1 Confusion Matrix
from sklearn.metrics import confusion_matrix
matrix= confusion_matrix(Y_test, y_pred_lr)
sns.heatmap(matrix,annot = True, fmt = "d")
Output:

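The heatmap is easier to read with explicit tick and axis labels so the true/predicted orientation is unambiguous; a small sketch reusing the matrix computed above (the class names are assumptions about how target is encoded):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Annotated confusion matrix with labeled axes
sns.heatmap(matrix, annot=True, fmt="d",
            xticklabels=["no disease", "disease"],
            yticklabels=["no disease", "disease"])
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.show()
```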
3.2 Precision
from sklearn.metrics import precision_score
precision = precision_score(Y_test, y_pred_lr)
print("Precision: ",precision)
Output:
Precision: 0.8571428571428571
3.3 Recall
from sklearn.metrics import recall_score
recall = recall_score(Y_test, y_pred_lr)
print("Recall is: ",recall)
Output:
Recall is: 0.8823529411764706
3.4 F-score
The F1 score is the harmonic mean of precision and recall:
print((2*precision*recall)/(precision+recall))
Output:
0.8695652173913043
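The same value is available directly from sklearn, which makes a quick cross-check of the manual formula:

```python
from sklearn.metrics import f1_score

# F1 is the harmonic mean of precision and recall
print(f1_score(Y_test, y_pred_lr))  # should match the value above
```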
3.5 False Negative Rate (FNR)
The false negative rate is the share of actual positive cases the model misses, which is particularly costly in disease screening:
CM =pd.crosstab(Y_test, y_pred_lr)
TN=CM.iloc[0,0]
FP=CM.iloc[0,1]
FN=CM.iloc[1,0]
TP=CM.iloc[1,1]
fnr=FN*100/(FN+TP)
fnr
Output:
11.764705882352942
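An equivalent and slightly more robust way to get the four counts is to unravel sklearn's confusion matrix rather than indexing the crosstab (the crosstab indexing breaks if one class never appears in the predictions). A sketch:

```python
from sklearn.metrics import confusion_matrix

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(Y_test, y_pred_lr).ravel()
fnr = fn * 100 / (fn + tp)   # false negative rate as a percentage
print(tn, fp, fn, tp, fnr)
```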
3.6 ROC Curve and AUC
from sklearn.metrics import roc_curve, auc
y_proba=logreg.predict_proba(X_test)
fpr, tpr, thresholds = roc_curve(Y_test,y_proba[:,1])
fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.rcParams['font.size'] = 12
plt.title('ROC curve for heart disease classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
The ROC curve is shown below:

Check the AUC value:
auc(fpr, tpr)
Output:
0.9074074074074073
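roc_auc_score computes the same number directly from the labels and the positive-class probabilities, a quick sanity check on the fpr/tpr computation:

```python
from sklearn.metrics import roc_auc_score

# AUC computed directly from the predicted probabilities of the positive class
print(roc_auc_score(Y_test, y_proba[:, 1]))  # should match auc(fpr, tpr)
```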
