Logistic Regression: Breast Cancer Detection

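This dataset extracts the following 10 key attributes from images of the lesion:
• radius: the mean distance from the center of the lesion to points on its boundary.
• texture: the standard deviation of the gray-scale values.
• perimeter: the length of the lesion's boundary, which reflects the size of the lesion.
• area: another indicator of the size of the lesion.
• smoothness: the local variation in the radius lengths.
• compactness: perimeter squared divided by area, minus 1.
• concavity: the severity of the concave portions of the contour.
• concave points: the number of concave portions of the contour.
• symmetry: how symmetric the lesion is.
• fractal dimension: the fractal dimension of the contour.
Attributes such as compactness are composite indicators: they are computed from other attributes and capture a relationship between them (here, between perimeter and area). For each of the 10 attributes the dataset records the mean, the standard error, and the worst value, which is why the data has 30 feature columns.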
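
As a quick check of how the 10 attributes map onto the dataset's columns, the sketch below lists the feature names that ship with scikit-learn's copy of the data. It relies on the `feature_names` attribute of the object returned by `load_breast_cancer`; the grouping by prefix and suffix is only an illustration.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
feature_names = list(cancer.feature_names)

# Each of the 10 attributes appears three times: "mean ...", "... error", "worst ..."
mean_features = [name for name in feature_names if name.startswith("mean ")]
error_features = [name for name in feature_names if name.endswith(" error")]
worst_features = [name for name in feature_names if name.startswith("worst ")]

print(len(feature_names))
print(mean_features)
```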

    # Load the data
    from sklearn.datasets import load_breast_cancer

    cancer = load_breast_cancer()
    X = cancer.data
    y = cancer.target  # in this dataset, 1 = benign and 0 = malignant
    print('data shape:{0};no. positive:{1};no. negative:{2}'.format(
        X.shape, y[y == 1].shape[0], y[y == 0].shape[0]))
    # Output: data shape:(569, 30);no. positive:357;no. negative:212

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    import numpy as np

    # Hold out 20% of the samples as a test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model = LogisticRegression()
    model.fit(X_train, y_train)

    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print('train score:{0:.6f};test score:{1:.6f}'.format(train_score, test_score))

    # Check the prediction accuracy
    y_pred = model.predict(X_test)
    # .sum() counts the True entries; .shape[0] would always equal the test-set size
    print('matches:{0}/{1}'.format(np.equal(y_pred, y_test).sum(), y_test.shape[0]))
    
![](https://ad.itadn.com/c/weblog/blog-img/images/2025-08-17/ACgRG5i47hbjIslDMqtmZuWVXa03.png)

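Why is the test score not 1 when the match count equals the size of the test set? The count is only meaningful when it is the number of True entries in np.equal(y_pred, y_test). The length of that array (.shape[0]) always equals the test-set size, so it reports a full match no matter how many predictions are wrong. score() returns the mean accuracy, that is, the fraction of samples whose predicted label equals the true label, so a test score below 1 means that some test samples were misclassified. For a binary classification problem, the LogisticRegression model outputs two probabilities for each sample, the probability of class 0 and the probability of class 1, and predicts whichever class has the higher probability.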
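
To make the relation between the match count, the score, and the predicted probabilities concrete, here is a minimal sketch. The `random_state=0` and `max_iter=10000` settings are assumptions added for reproducibility and convergence; they are not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# random_state and max_iter are illustrative choices, not taken from the article
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)

y_pred = model.predict(X_test)
n_matches = int(np.sum(y_pred == y_test))  # number of correct predictions
accuracy = n_matches / y_test.shape[0]
score = model.score(X_test, y_test)        # mean accuracy on the test set

# The predicted label is the class with the larger predicted probability
proba = model.predict_proba(X_test)
labels_from_proba = model.classes_[np.argmax(proba, axis=1)]

print(n_matches, y_test.shape[0], accuracy, score)
```

The printed accuracy and score should agree, which is a simple way to confirm that the match count was computed correctly.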

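    # Confidence of the model's predictions
    # Find the predictions whose confidence is below 90%
    y_pred_proba = model.predict_proba(X_test)
    # Boolean mask: keep the rows where the probability of class 0 is above 0.1 ...
    result = y_pred_proba[y_pred_proba[:, 0] > 0.1]
    # ... and, of those, the rows where the probability of class 1 is also above 0.1
    print(result[result[:, 1] > 0.1])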
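
The two-step filter above can also be written as a single mask on the larger of the two class probabilities. A minimal sketch, using a small made-up array (`toy_proba` is hypothetical and stands in for the output of `predict_proba`):

```python
import numpy as np

# Hypothetical probabilities for four samples: columns are P(class 0), P(class 1)
toy_proba = np.array([[0.95, 0.05],
                      [0.60, 0.40],
                      [0.08, 0.92],
                      [0.30, 0.70]])

# Two-step filter, as in the walkthrough
two_step = toy_proba[toy_proba[:, 0] > 0.1]
two_step = two_step[two_step[:, 1] > 0.1]

# Equivalent single mask: the winning class has probability below 0.9
low_confidence = toy_proba[toy_proba.max(axis=1) < 0.9]

print(low_confidence)
```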

Model optimization

    # Model optimization
    # Use polynomial features
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.pipeline import Pipeline
    import time

    # By convention:
    # *args collects a variable number of positional arguments (as a tuple).
    # **kwarg collects a variable number of keyword arguments (as a dict).
    def polynomial_model(degree=1, **kwarg):
        # include_bias defaults to True, which adds the degree-0 (constant) term to the output
        polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
        logistic_reg = LogisticRegression(**kwarg)
        pipeline = Pipeline([('polynomial_features', polynomial_features),
                             ('logistic_reg', logistic_reg)])
        return pipeline

    # penalty selects the regularization term.
    # Which penalties are available depends on dual and solver: for the L1 norm,
    # dual must be False and the solver must support L1, such as liblinear.
    # L1 regularization makes the parameters sparse, so it effectively selects
    # the features that matter to the model.
    model = polynomial_model(degree=2, penalty='l1', solver='liblinear')

    start = time.perf_counter()
    model.fit(X_train, y_train)
    poly_train_score = model.score(X_train, y_train)
    poly_test_score = model.score(X_test, y_test)
    print('elapsed:{0:.6f};poly_train_score:{1:.6f};poly_test_score:{2:.6f}'.format(
        time.perf_counter() - start, poly_train_score, poly_test_score))
    
![](https://ad.itadn.com/c/weblog/blog-img/images/2025-08-17/xS8VERmDCtzrWLI7yFYhNg64J2Up.png)
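
To illustrate the sparsity effect of the L1 penalty, the sketch below fits the same classifier with an L1 and an L2 penalty and counts the non-zero coefficients. The `StandardScaler` step, `C=0.1`, and `max_iter=1000` are assumptions chosen for this illustration and are not part of the original walkthrough; `count_nonzero_coefs` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

def count_nonzero_coefs(penalty, solver):
    """Fit a scaled logistic regression and count its non-zero coefficients."""
    clf = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty=penalty, solver=solver, C=0.1, max_iter=1000))
    clf.fit(X, y)
    return int(np.count_nonzero(clf.named_steps['logisticregression'].coef_))

n_l1 = count_nonzero_coefs('l1', 'liblinear')
n_l2 = count_nonzero_coefs('l2', 'lbfgs')
print(n_l1, n_l2)
```

With the L1 penalty several coefficients are driven exactly to zero, while the L2 penalty only shrinks them.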

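    # Check how many features were kept, i.e. have a non-zero parameter theta
    # named_steps maps each step name to the corresponding step of the pipeline
    # coef_ holds the fitted model parameters
    coefs = model.named_steps['logistic_reg'].coef_
    print("total features:{0};non-zero features:{1}".format(
        coefs.shape[1], np.count_nonzero(coefs)))
    # The number of input features grows from 30 to 495. Most of them are discarded;
    # only 86 were kept in the original run (the exact count varies with the random split).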
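
The jump from 30 to 495 features follows from how degree-2 polynomial features are built. A short sketch of the arithmetic, checked against `PolynomialFeatures` (the variable names are illustrative):

```python
from math import comb

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import PolynomialFeatures

X, _ = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]

# Expand to all terms of degree 1 and 2, without the constant term
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# linear terms + squared terms + pairwise products of distinct features
expected = n_features + n_features + comb(n_features, 2)
print(X_poly.shape[1], expected)
```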

Plot learning curves to find the parameter values that give the highest accuracy

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.model_selection import learning_curve
    from sklearn.model_selection import ShuffleSplit

    plt.rcParams['font.sans-serif'] = ['SimHei']  # only needed to render Chinese text in figures

    def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                            n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
        plt.figure()
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)
        plt.xlabel("Training examples")
        plt.ylabel("Score")
        train_sizes, train_scores, test_scores = learning_curve(
            estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)
        plt.grid()  # draw the grid
        # Shaded bands: one standard deviation around the mean scores
        plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1, color="r")
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1, color="g")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
        plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
        plt.legend(loc="best")  # add the legend
        return plt

    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

    # Compare degree 1 and 2 under the L1 and L2 penalties
    plot_learning_curve(polynomial_model(degree=1, penalty='l1', solver='liblinear'),
                        "Learning Curves (degree=1 , penalty=l1)", X, y, ylim=(0.8, 1.01), cv=cv)
    plot_learning_curve(polynomial_model(degree=2, penalty='l1', solver='liblinear'),
                        "Learning Curves (degree=2 , penalty=l1)", X, y, ylim=(0.8, 1.01), cv=cv)
    plot_learning_curve(polynomial_model(degree=1, penalty='l2', solver='lbfgs'),
                        "Learning Curves (degree=1 , penalty=l2)", X, y, ylim=(0.8, 1.01), cv=cv)
    plot_learning_curve(polynomial_model(degree=2, penalty='l2', solver='lbfgs'),
                        "Learning Curves (degree=2 , penalty=l2)", X, y, ylim=(0.8, 1.01), cv=cv)
    plt.show()
    
![](https://ad.itadn.com/c/weblog/blog-img/images/2025-08-17/KJ079GLQqN6syAhOcRv8oZkCljDE.png)
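
The curves can also be compared numerically, without plotting. The sketch below calls `learning_curve` directly and prints the mean training and cross-validation score for each training-set size. It uses only 3 splits instead of 10 to keep the run short, which is an assumption made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

X, y = load_breast_cancer(return_X_y=True)
# Fewer splits than in the walkthrough, purely for speed
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
estimator = LogisticRegression(penalty='l1', solver='liblinear')

train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, cv=cv, train_sizes=np.linspace(0.1, 1.0, 5))

# Rows are training-set sizes, columns are cross-validation splits
for n, tr, te in zip(train_sizes,
                     train_scores.mean(axis=1),
                     test_scores.mean(axis=1)):
    print(f"{int(n):4d} samples: train={tr:.3f}, cv={te:.3f}")
```

A large, persistent gap between the two columns suggests overfitting, while two low scores that stay close together suggest underfitting.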

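How can learning curves be plotted efficiently for a large dataset?

The answer is simple: draw the learning curves on a subset of the large dataset, and once the best model has been chosen, train it on the full dataset. One thing to watch: the label distribution of the subset should stay as close as possible to that of the full dataset, so consider stratified sampling.

Useful tools:

1. from sklearn.model_selection import StratifiedShuffleSplit

2. sklearn.model_selection.train_test_split, whose stratify parameter takes the array (typically the class labels) to stratify on when generating the training and test sets.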
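
Both options can be sketched as follows. The 30% subset size and `random_state=0` are arbitrary choices for this illustration, and the breast cancer data merely stands in for a large dataset.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Option 1: train_test_split with stratify=y keeps the class ratio in the subset
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=0.3, stratify=y, random_state=0)

# Option 2: StratifiedShuffleSplit yields index arrays with the same property
sss = StratifiedShuffleSplit(n_splits=1, train_size=0.3, random_state=0)
sub_idx, _ = next(sss.split(X, y))

# Fraction of label 1 in the full data and in each subset
print(y.mean(), y_sub.mean(), y[sub_idx].mean())
```

The three printed fractions should be nearly identical, which shows that the subsets preserve the label distribution of the full dataset.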
