Data Mining: Association Rule Mining in R

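Association Rule Mining

    > library(arules)  # provides read.transactions(), apriori(), inspect(), etc.
    > data<-read.transactions('groceries.csv',sep = ',')
    > 
    > summary(data)  # shows the number of transactions, the number of items, the most frequent items, and so on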
    transactions as itemMatrix in sparse format with
     9835 rows (elements/itemsets/transactions) and
     169 columns (items) and a density of 0.02609146 
    
    most frequent items:
      whole milk other vegetables 
            2513             1903 
      rolls/buns             soda 
            1809             1715 
          yogurt          (Other) 
            1372            34055 
    
    element (itemset/transaction) length distribution:
    sizes
       1    2    3    4    5    6    7    8    9 
    2159 1643 1299 1005  855  645  545  438  350 
      10   11   12   13   14   15   16   17   18 
     246  182  117   78   77   55   46   29   14 
      19   20   21   22   23   24   26   27   28 
      14    9   11    4    6    1    1    1    1 
      29   32 
       3    1 
    
       Min. 1st Qu.  Median    Mean 3rd Qu. 
      1.000   2.000   3.000   4.409   6.000 
       Max. 
     32.000 
    
    includes extended item information - examples:
            labels
    1 abrasive cleaner
    2 artif. sweetener
    3   baby cosmetics

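summary() returns descriptive statistics for data: the size of the transaction matrix, the most frequent items, and the distribution of transaction lengths.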
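
The figures in the summary output are tied together by simple arithmetic, which is a quick way to check how to read them. The sketch below is plain base R and uses only numbers copied from that output; the variable names are illustrative.

```r
# Numbers copied from the summary(data) output
n_transactions <- 9835
n_items        <- 169
density        <- 0.02609146

# density is the share of non-empty cells in the transactions x items matrix,
# so multiplying back gives the total number of purchased items
total_items <- round(n_transactions * n_items * density)

# The "most frequent items" counts, including "(Other)", add up to the same total
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)

# Average basket size, which should agree with the Mean of the length distribution
mean_size <- total_items / n_transactions

print(total_items)          # 43367
print(sum(item_counts))     # 43367
print(round(mean_size, 3))  # 4.409
```

So a density of about 0.026 means roughly 43,000 purchased items spread over 9835 baskets, or about 4.4 items per basket.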

    > # use inspect() to view individual transactions
    > inspect(data[1:5])
    items                      
    [1] {citrus fruit,             
     margarine,                
     ready soups,              
     semi-finished bread}      
    [2] {coffee,                   
     tropical fruit,           
     yogurt}                   
    [3] {whole milk}               
    [4] {cream cheese,             
     meat spreads,             
     pip fruit,                
     yogurt}                   
    [5] {condensed milk,           
     long life bakery product, 
     other vegetables,         
     whole milk}

    > # To look at specific items (columns of the matrix), use itemFrequency()
    > 
    > itemFrequency(data[,1:3])   # support of the first three items
    abrasive cleaner artif. sweetener 
    0.0035587189     0.0032536858 
      baby cosmetics 
    0.0006100661
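
Support is just the share of transactions that contain an item, so the values above can be turned back into raw counts. A minimal base-R sketch, assuming the 9835 transactions reported by summary(data); the vector names are illustrative.

```r
n_transactions <- 9835

# Supports printed by itemFrequency(data[, 1:3])
support <- c("abrasive cleaner" = 0.0035587189,
             "artif. sweetener" = 0.0032536858,
             "baby cosmetics"   = 0.0006100661)

# support = (transactions containing the item) / (all transactions),
# so multiplying by the number of transactions recovers the counts
counts <- round(support * n_transactions)
print(counts)

# The other direction: counts from summary(data) turned into supports
top_counts <- c("whole milk" = 2513, "other vegetables" = 1903)
print(round(top_counts / n_transactions, 4))
```

With the real transactions object, itemFrequency(data, type = 'absolute') returns the counts directly.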

    # To visualize item support, use itemFrequencyPlot()
    
    itemFrequencyPlot(data,support=0.1)  # only items with support of at least 0.1
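
[Figure: item frequency plot of the items with support of at least 0.1]

    itemFrequencyPlot(data,topN=20)   # the 20 items with the highest support, in descending order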

[Figure: item frequency plot of the top 20 items by support]

    > # Visualize the transaction data as a sparse matrix with image()
    > 
    > image(data[1:100])

[Figure: sparse matrix plot of the first 100 transactions]

    > #==-------- Train the model -----------------------------------------
    > myrules<-apriori(data = data,parameter = list(support=0.006,confidence=0.25,minlen=2))
    Apriori
    
    Parameter specification:
     confidence minval smax arem  aval
       0.25    0.1    1 none FALSE
     originalSupport maxtime support minlen
            TRUE       5   0.006      2
     maxlen target  ext
     10  rules TRUE
    
    Algorithmic control:
     filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE
    
    Absolute minimum support count: 59 
    
    set item appearances ...[0 item(s)] done [0.00s].
    set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
    sorting and recoding items ... [109 item(s)] done [0.00s].
    creating transaction tree ... done [0.00s].
    checking subsets of size 1 2 3 4 done [0.01s].
    writing ... [463 rule(s)] done [0.00s].
    creating S4 object  ... done [0.00s].
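
The support threshold is easier to judge when expressed as a number of transactions. The sketch below redoes that arithmetic in base R from the values in the call above. The log reports an absolute minimum support count of 59; how arules rounds internally is not shown here, so the ceiling step is only an assumption about the smallest whole count that satisfies the threshold.

```r
n_transactions <- 9835
min_support    <- 0.006

# Minimum support expressed as a number of transactions (about 59.01)
print(min_support * n_transactions)

# Assumption: the smallest whole count with count / n >= min_support
min_count <- ceiling(min_support * n_transactions)
print(min_count)

# Support of an itemset that occurs in exactly that many transactions
print(round(min_count / n_transactions, 6))
```

In other words, support=0.006 keeps only itemsets bought together in about 60 or more of the 9835 baskets.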
    > summary(myrules)
    set of 463 rules
    
    rule length distribution (lhs + rhs):sizes
      2   3   4 
    150 297  16 
    
       Min. 1st Qu.  Median    Mean 3rd Qu. 
      2.000   2.000   3.000   2.711   3.000 
       Max. 
      4.000 
    
    summary of quality measures:
    support           confidence    
     Min.   :0.006101   Min.   :0.2500  
     1st Qu.:0.007117   1st Qu.:0.2971  
     Median :0.008744   Median :0.3554  
     Mean   :0.011539   Mean   :0.3786  
     3rd Qu.:0.012303   3rd Qu.:0.4495  
     Max.   :0.074835   Max.   :0.6600  
    coverage             lift       
     Min.   :0.009964   Min.   :0.9932  
     1st Qu.:0.018709   1st Qu.:1.6229  
     Median :0.024809   Median :1.9332  
     Mean   :0.032608   Mean   :2.0351  
     3rd Qu.:0.035892   3rd Qu.:2.3565  
     Max.   :0.255516   Max.   :3.9565  
     count      
     Min.   : 60.0  
     1st Qu.: 70.0  
     Median : 86.0  
     Mean   :113.5  
     3rd Qu.:121.0  
     Max.   :736.0  
    
    mining info:
     data ntransactions support confidence
     data          9835   0.006       0.25
                                                                                   call
     apriori(data = data, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
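
The quality measures in this summary follow the standard definitions. For a rule X ⇒ Y over N transactions they can be written as below; the symbols X, Y, N and supp are notation introduced here for illustration, not part of the arules output.

```latex
\begin{align*}
\mathrm{supp}(Z) &= \frac{|\{t : Z \subseteq t\}|}{N}
  && \text{share of transactions containing itemset } Z \\
\mathrm{support}(X \Rightarrow Y) &= \mathrm{supp}(X \cup Y) = \frac{\mathrm{count}}{N} \\
\mathrm{coverage}(X \Rightarrow Y) &= \mathrm{supp}(X) \\
\mathrm{confidence}(X \Rightarrow Y) &= \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)} \\
\mathrm{lift}(X \Rightarrow Y) &= \frac{\mathrm{confidence}(X \Rightarrow Y)}{\mathrm{supp}(Y)}
\end{align*}
```

A lift above 1 means the rhs shows up alongside the lhs more often than it does overall; the minimum lift of 0.9932 in the summary belongs to a rule that falls just short of that.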

    > inspect(myrules[1:3])  # look at the first three rules
    lhs             rhs              
    [1] {pot plants} => {whole milk}     
    [2] {pasta}      => {whole milk}     
    [3] {herbs}      => {root vegetables}
    support     confidence coverage  
    [1] 0.006914082 0.4000000  0.01728521
    [2] 0.006100661 0.4054054  0.01504830
    [3] 0.007015760 0.4312500  0.01626843
    lift     count
    [1] 1.565460 68   
    [2] 1.586614 60   
    [3] 3.956477 69
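
The columns printed for each rule are related to one another, which can be checked by hand on rule [3]. A minimal base-R sketch using only the numbers shown above, with illustrative variable names:

```r
n_transactions <- 9835

# Values printed for rule [3] {herbs} => {root vegetables}
support    <- 0.007015760
confidence <- 0.4312500
coverage   <- 0.01626843
lift       <- 3.956477

# count = support * number of transactions
rule_count <- round(support * n_transactions)
print(rule_count)

# confidence = support / coverage, where coverage is the support of the lhs
print(round(support / coverage, 3))

# lift = confidence / support(rhs), so the support of {root vegetables}
# on its own can be backed out of the printed values
rhs_support <- confidence / lift
print(round(rhs_support, 3))
```

Read this way, about 43% of the baskets with herbs also contain root vegetables, compared with roughly 11% of all baskets, which is where the lift of nearly 4 comes from.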

    > #==---------- Improve model performance -------------------------------------
    > 
    > # Sort the rules by different criteria, extract subsets, and export them
    > #--1. Sort the rule set
    > #------- sort by lift: the by argument can also be 'support' or 'confidence'
    > inspect(sort(myrules,by='lift')[1:5])
    lhs                    rhs                      support confidence   coverage     lift count
    [1] {herbs}             => {root vegetables}    0.007015760  0.4312500 0.01626843 3.956477    69
    [2] {berries}           => {whipped/sour cream} 0.009049314  0.2721713 0.03324860 3.796886    89
    [3] {other vegetables,                                                                          
     tropical fruit,                                                                            
     whole milk}        => {root vegetables}    0.007015760  0.4107143 0.01708185 3.768074    69
    [4] {beef,                                                                                      
     other vegetables}  => {root vegetables}    0.007930859  0.4020619 0.01972547 3.688692    78
    [5] {other vegetables,                                                                          
     tropical fruit}    => {pip fruit}          0.009456024  0.2634561 0.03589222 3.482649    93
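
    > #--2. ----- Extract a subset of the rules: subset()
    > # Suppose we want to know whether a given item is bought together with other items; berries is the example here
    > 
    > berriesRules<-subset(myrules,items %in% 'berries')   # items matches either side of a rule; use lhs or rhs instead to pin the item to the left or right side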
    > inspect(berriesRules)
    lhs          rhs                      support confidence  coverage     lift count
    [1] {berries} => {whipped/sour cream} 0.009049314  0.2721713 0.0332486 3.796886    89
    [2] {berries} => {yogurt}             0.010574479  0.3180428 0.0332486 2.279848   104
    [3] {berries} => {other vegetables}   0.010269446  0.3088685 0.0332486 1.596280   101
    [4] {berries} => {whole milk}         0.011794611  0.3547401 0.0332486 1.388328   116
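
    > berriesRules<-subset(myrules,lhs %in% 'berries' & lift>2)  # conditions can be combined with and (&), or (|), not (!)
    > # When extracting rules: items matches items anywhere in a rule; %in% matches any of the listed items, %pin% does partial matching on item names, and %ain% requires all listed items to be present; conditions on support, confidence, etc. can also be added
    > 
    > #--3. Write the rules to a file or a data frame
    > write(myrules,file = 'rules.csv',sep=',',quote=TRUE,row.names=FALSE)  # write to a CSV file
    > dat<-as(myrules,'data.frame') # convert to a data frame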
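
The difference between the three matching operators used with subset() is easy to mix up. The sketch below is a rough base-R analogue of what each one tests for a single rule; it uses toy character vectors and is not how arules implements the operators.

```r
# One rule's items as a plain character vector (toy data, not an arules object)
rule_items <- c("berries", "whole milk")
wanted     <- c("berries", "yogurt")

# %in%-style test: at least one of the wanted items is present
match_any <- any(wanted %in% rule_items)

# %ain%-style test: every wanted item is present
match_all <- all(wanted %in% rule_items)

# %pin%-style test: some item name contains the pattern
match_partial <- any(grepl("milk", rule_items))

print(c(any = match_any, all = match_all, partial = match_partial))
```

Here the rule passes the any-of and partial tests but fails the all-of test, because yogurt is missing from it.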