
Application of a Decision Tree Classifier to Breast Cancer Diagnosis


A decision tree is a decision-analysis method that, given the probabilities of the various possible outcomes, builds a tree of decisions to estimate the probability that the expected net present value is greater than or equal to zero, and thereby evaluates the risk and feasibility of a project; it is a graphical way of applying probability analysis. The method is called a decision tree because the branching diagram resembles the limbs of a tree. In machine learning, a decision tree is a predictive model that represents a mapping between object attributes and object values. Entropy, a concept from information theory, measures the disorder (impurity) of a set of samples, and tree-growing algorithms such as ID3, C4.5, and C5.0 use it to choose splits.
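To make the entropy criterion concrete, here is a minimal MATLAB sketch (the helper file name class_entropy.m is illustrative and not part of the example code below) that computes the Shannon entropy of a label vector such as T_train:

% class_entropy.m -- Shannon entropy of a label vector (illustrative helper)
function H = class_entropy(labels)
    classes = unique(labels);                 % distinct class labels, e.g. 1 = benign, 2 = malignant
    p = zeros(numel(classes),1);
    for k = 1:numel(classes)
        p(k) = sum(labels == classes(k)) / numel(labels);   % proportion of class k
    end
    H = -sum(p .* log2(p));                   % every p(k) > 0 by construction, so log2(p) is finite
end

The information gain of a candidate split is the entropy of the parent node minus the sample-weighted entropies of its child nodes; ID3 picks the split with the largest gain, while C4.5 divides the gain by the split's own entropy (the gain ratio) to avoid favoring attributes with many values.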

%% Decision tree classifier for breast cancer diagnosis (R2009a version)

%% Clear the workspace
clear all
clc
warning off

%% Load the data
load data.mat
% Randomly split into a training set and a test set
a = randperm(569);
Train = data(a(1:500),:);
Test = data(a(501:end),:);
% Training data
P_train = Train(:,3:end);
T_train = Train(:,2);
% Test data
P_test = Test(:,3:end);
T_test = Test(:,2);

%% Build the decision tree classifier
ctree = classregtree(P_train,T_train);
% Display the tree
view(ctree);

%% Predict on the test set
T_sim = eval(ctree,P_test);

%% Analyze the results
count_B = length(find(T_train == 1));
count_M = length(find(T_train == 2));
rate_B = count_B / 500;
rate_M = count_M / 500;
total_B = length(find(data(:,2) == 1));
total_M = length(find(data(:,2) == 2));
number_B = length(find(T_test == 1));
number_M = length(find(T_test == 2));
number_B_sim = length(find(T_sim == 1 & T_test == 1));
number_M_sim = length(find(T_sim == 2 & T_test == 2));
disp(['Total cases: ' num2str(569)...
      '  Benign: ' num2str(total_B)...
      '  Malignant: ' num2str(total_M)]);
disp(['Training-set cases: ' num2str(500)...
      '  Benign: ' num2str(count_B)...
      '  Malignant: ' num2str(count_M)]);
disp(['Test-set cases: ' num2str(69)...
      '  Benign: ' num2str(number_B)...
      '  Malignant: ' num2str(number_M)]);
disp(['Benign tumors correctly diagnosed: ' num2str(number_B_sim)...
      '  Misdiagnosed: ' num2str(number_B - number_B_sim)...
      '  Correct-diagnosis rate p1 = ' num2str(number_B_sim/number_B*100) '%']);
disp(['Malignant tumors correctly diagnosed: ' num2str(number_M_sim)...
      '  Misdiagnosed: ' num2str(number_M - number_M_sim)...
      '  Correct-diagnosis rate p2 = ' num2str(number_M_sim/number_M*100) '%']);
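classregtree and its eval method were deprecated in later MATLAB releases. On a newer Statistics and Machine Learning Toolbox, a rough equivalent of the block above (assuming the same data.mat layout, with class labels in column 2 and features in columns 3 onward) might look like this sketch, which is not part of the original example:

%% Sketch: the same workflow with fitctree (newer toolbox API)
ctree = fitctree(P_train, T_train);   % fit a classification tree to the training data
view(ctree, 'Mode', 'graph');         % graphical view of the tree
T_sim = predict(ctree, P_test);       % predicted class labels for the test set

The results-analysis code carries over unchanged, since predict returns labels of the same type as T_train.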
%% Decision tree classifier for breast cancer diagnosis (R2012b version)

%% Clear the workspace
clear all
clc
warning off

%% Load the data
load data.mat
% Randomly split into a training set and a test set
a = randperm(569);
Train = data(a(1:500),:);
Test = data(a(501:end),:);
% Training data
P_train = Train(:,3:end);
T_train = Train(:,2);
% Test data
P_test = Test(:,3:end);
T_test = Test(:,2);

%% Build the decision tree classifier
ctree = ClassificationTree.fit(P_train,T_train);
% Display the tree
view(ctree);
view(ctree,'mode','graph');

%% Predict on the test set
T_sim = predict(ctree,P_test);

%% Analyze the results
count_B = length(find(T_train == 1));
count_M = length(find(T_train == 2));
rate_B = count_B / 500;
rate_M = count_M / 500;
total_B = length(find(data(:,2) == 1));
total_M = length(find(data(:,2) == 2));
number_B = length(find(T_test == 1));
number_M = length(find(T_test == 2));
number_B_sim = length(find(T_sim == 1 & T_test == 1));
number_M_sim = length(find(T_sim == 2 & T_test == 2));
disp(['Total cases: ' num2str(569)...
      '  Benign: ' num2str(total_B)...
      '  Malignant: ' num2str(total_M)]);
disp(['Training-set cases: ' num2str(500)...
      '  Benign: ' num2str(count_B)...
      '  Malignant: ' num2str(count_M)]);
disp(['Test-set cases: ' num2str(69)...
      '  Benign: ' num2str(number_B)...
      '  Malignant: ' num2str(number_M)]);
disp(['Benign tumors correctly diagnosed: ' num2str(number_B_sim)...
      '  Misdiagnosed: ' num2str(number_B - number_B_sim)...
      '  Correct-diagnosis rate p1 = ' num2str(number_B_sim/number_B*100) '%']);
disp(['Malignant tumors correctly diagnosed: ' num2str(number_M_sim)...
      '  Misdiagnosed: ' num2str(number_M - number_M_sim)...
      '  Correct-diagnosis rate p2 = ' num2str(number_M_sim/number_M*100) '%']);
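% (Added sketch, not in the original script: overall test-set accuracy in one line,
%  complementing the per-class diagnosis rates printed above.)
accuracy = sum(T_sim == T_test) / numel(T_test) * 100;
disp(['Overall test-set accuracy: ' num2str(accuracy) '%']);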
    
   
    
%% Effect of the minimum number of samples per leaf on tree performance
leafs = logspace(1,2,10);

N = numel(leafs);

err = zeros(N,1);
for n = 1:N
    t = ClassificationTree.fit(P_train,T_train,'crossval','on','minleaf',leafs(n));
    err(n) = kfoldLoss(t);
end
plot(leafs,err);
xlabel('Minimum number of samples per leaf');
ylabel('Cross-validation error');
title('Effect of minimum leaf size on decision tree performance')
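% (Added sketch, not in the original script: instead of reading the best leaf size
%  off the plot, the value that minimizes the cross-validation error can be picked
%  programmatically; the variable name bestLeaf is illustrative.)
[~, idx] = min(err);              % index of the smallest cross-validation error
bestLeaf = round(leafs(idx));     % candidate value for the 'minleaf' setting used below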
    
  
    
%% Set minleaf to 28 and build the optimized tree
OptimalTree = ClassificationTree.fit(P_train,T_train,'minleaf',28);
view(OptimalTree,'mode','graph')

% Resubstitution and cross-validation errors of the optimized tree
resubOpt = resubLoss(OptimalTree)
lossOpt = kfoldLoss(crossval(OptimalTree))
% Resubstitution and cross-validation errors of the default (unoptimized) tree
resubDefault = resubLoss(ctree)
lossDefault = kfoldLoss(crossval(ctree))

%% Pruning
[~,~,~,bestlevel] = cvLoss(ctree,'subtrees','all','treesize','min')
cptree = prune(ctree,'Level',bestlevel);
view(cptree,'mode','graph')

% Resubstitution and cross-validation errors of the pruned tree
resubPrune = resubLoss(cptree)
lossPrune = kfoldLoss(crossval(cptree))
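The script above never scores the pruned tree on the held-out test set. As a closing sketch (not part of the original code; confusionmat comes from the Statistics Toolbox), its test-set confusion matrix and overall accuracy could be obtained as follows:

% Evaluate the pruned tree on the 69 held-out test cases
T_sim_prune = predict(cptree, P_test);
C = confusionmat(T_test, T_sim_prune)             % rows: true class (1 = benign, 2 = malignant); columns: predicted class
accuracy_prune = sum(diag(C)) / sum(C(:)) * 100   % overall accuracy in percent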
    
    
    
    
