Data Mining: Association Rule Mining in R
Association Rule Mining
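> library(arules) # provides read.transactions(), apriori(), inspect(), etc.
> data<-read.transactions('groceries.csv',sep = ',')
>
> summary(data) # shows the number of transactions, the number of items, the most frequent items, etc.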
transactions as itemMatrix in sparse format with
9835 rows (elements/itemsets/transactions) and
169 columns (items) and a density of 0.02609146
most frequent items:
whole milk other vegetables
2513 1903
rolls/buns soda
1809 1715
yogurt (Other)
1372 34055
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9
2159 1643 1299 1005 855 645 545 438 350
10 11 12 13 14 15 16 17 18
246 182 117 78 77 55 46 29 14
19 20 21 22 23 24 26 27 28
14 9 11 4 6 1 1 1 1
29 32
3 1
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   3.000   4.409   6.000  32.000
includes extended item information - examples:
labels
1 abrasive cleaner
2 artif. sweetener
3 baby cosmetics
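summary() returns descriptive statistics for data: here, 9835 transactions over 169 distinct items, with a median basket size of 3 items.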
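The figures in this summary are mutually consistent, which is a useful sanity check after loading a transaction file. The sketch below uses base R only and re-enters the numbers printed above by hand (the variable names are illustrative, not part of arules). It assumes that density is the share of non-empty cells in the transaction-by-item matrix.

```r
# Figures copied from the summary(data) output above.
n_transactions <- 9835
n_items <- 169

# Counts of the five most frequent items plus the "(Other)" bucket.
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)

# Basket-size distribution: basket size and number of transactions of that size.
sizes  <- c(1:24, 26:29, 32)
counts <- c(2159, 1643, 1299, 1005, 855, 645, 545, 438, 350,
            246, 182, 117, 78, 77, 55, 46, 29, 14,
            14, 9, 11, 4, 6, 1, 1, 1, 1,
            3, 1)

# Total number of purchased items, computed two independent ways.
total_from_items <- sum(item_counts)
total_from_sizes <- sum(sizes * counts)

# Density: filled cells divided by all cells of the sparse matrix.
density <- total_from_items / (n_transactions * n_items)

# Mean basket size: total items divided by number of transactions.
mean_size <- total_from_sizes / n_transactions

print(c(total_from_items, total_from_sizes))
print(density)
print(mean_size)
```

Both totals agree, and the derived density and mean basket size match the values that summary() reports.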
> # inspect() shows the individual transactions
> inspect(data[1:5])
items
[1] {citrus fruit,
margarine,
ready soups,
semi-finished bread}
[2] {coffee,
tropical fruit,
yogurt}
[3] {whole milk}
[4] {cream cheese,
meat spreads,
pip fruit,
yogurt}
[5] {condensed milk,
long life bakery product,
other vegetables,
whole milk}
> # To examine specific items (columns of the sparse matrix), use itemFrequency()
>
> itemFrequency(data[,1:3]) # support of the first three items
abrasive cleaner artif. sweetener
0.0035587189 0.0032536858
baby cosmetics
0.0006100661
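Support is simply the fraction of transactions that contain an item, so the values above can be turned back into raw counts. The following base R sketch re-enters the three printed supports by hand; the variable names are illustrative only.

```r
# Values copied from the itemFrequency() output above.
n_transactions <- 9835
support <- c("abrasive cleaner" = 0.0035587189,
             "artif. sweetener" = 0.0032536858,
             "baby cosmetics"   = 0.0006100661)

# support = count / n_transactions, so count = support * n_transactions.
item_count <- round(support * n_transactions)
print(item_count)

# Number of transactions an item needs in order to reach a support of 0.1.
min_count_for_0.1 <- ceiling(0.1 * n_transactions)
print(min_count_for_0.1)
```

None of these three items comes close to a support of 0.1, which is why a support filter of that size keeps only a handful of very common items.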
# To visualize item support, use itemFrequencyPlot()
itemFrequencyPlot(data,support=0.1) # only items with a support of at least 0.1

itemFrequencyPlot(data,topN=20) # the 20 items with the highest support, in descending order

> # Visualize the transaction data by plotting the sparse matrix with image()
>
> image(data[1:100])

> #==-------- Train the model -----------------------------------------
> myrules<-apriori(data = data,parameter = list(support=0.006,confidence=0.25,minlen=2))
Apriori
Parameter specification:
confidence minval smax arem aval
0.25 0.1 1 none FALSE
originalSupport maxtime support minlen
TRUE 5 0.006 2
maxlen target ext
10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 59
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.01s].
sorting and recoding items ... [109 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.01s].
writing ... [463 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
> summary(myrules)
set of 463 rules
rule length distribution (lhs + rhs):sizes
2 3 4
150 297 16
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   2.000   3.000   2.711   3.000   4.000
summary of quality measures:
support confidence
Min. :0.006101 Min. :0.2500
1st Qu.:0.007117 1st Qu.:0.2971
Median :0.008744 Median :0.3554
Mean :0.011539 Mean :0.3786
3rd Qu.:0.012303 3rd Qu.:0.4495
Max. :0.074835 Max. :0.6600
coverage lift
Min. :0.009964 Min. :0.9932
1st Qu.:0.018709 1st Qu.:1.6229
Median :0.024809 Median :1.9332
Mean :0.032608 Mean :2.0351
3rd Qu.:0.035892 3rd Qu.:2.3565
Max. :0.255516 Max. :3.9565
count
Min. : 60.0
1st Qu.: 70.0
Median : 86.0
Mean :113.5
3rd Qu.:121.0
Max. :736.0
mining info:
data ntransactions support confidence
data 9835 0.006 0.25
call
apriori(data = data, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
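The relative threshold support=0.006 can be read as a number of transactions. The apriori() log prints "Absolute minimum support count: 59", while the smallest count in summary(myrules) is 60. A minimal base R sketch of the arithmetic (hand-entered values, illustrative names; it makes no claim about how arules rounds internally):

```r
n_transactions <- 9835
min_support <- 0.006

# The relative threshold expressed as a (fractional) number of transactions.
threshold <- min_support * n_transactions
print(threshold)

# Smallest whole number of transactions that reaches the threshold.
min_whole_count <- ceiling(threshold)
print(min_whole_count)

# Relative support of an itemset seen in 59 and in 60 transactions.
support_59 <- 59 / n_transactions
support_60 <- 60 / n_transactions
print(c(support_59, support_60))
```

An itemset seen in 59 transactions falls just short of 0.006, whereas 60 transactions gives about 0.006101, the minimum support shown in the summary.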
> inspect(myrules[1:3]) # inspect the first three rules
    lhs             rhs               support     confidence coverage   lift     count
[1] {pot plants} => {whole milk}      0.006914082 0.4000000  0.01728521 1.565460 68
[2] {pasta}      => {whole milk}      0.006100661 0.4054054  0.01504830 1.586614 60
[3] {herbs}      => {root vegetables} 0.007015760 0.4312500  0.01626843 3.956477 69
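To make the quality measures concrete, the first rule, {pot plants} => {whole milk}, can be recomputed by hand from counts. The base R sketch below takes the rule count (68) and the whole milk count (2513) from the outputs above. The count of 170 for pot plants is an assumption derived from the printed coverage (0.01728521 * 9835). Variable names are illustrative.

```r
n_transactions <- 9835
count_both <- 68    # transactions containing pot plants AND whole milk
count_lhs  <- 170   # transactions containing pot plants (derived from coverage)
count_rhs  <- 2513  # transactions containing whole milk (from summary(data))

# support: share of all transactions that contain both sides of the rule
support <- count_both / n_transactions

# coverage: share of all transactions that contain the left-hand side
coverage <- count_lhs / n_transactions

# confidence: among transactions with the lhs, the share that also has the rhs
confidence <- count_both / count_lhs

# lift: confidence relative to the baseline frequency of the rhs
lift <- confidence / (count_rhs / n_transactions)

print(c(support = support, confidence = confidence,
        coverage = coverage, lift = lift))
```

A lift above 1 means the right-hand item appears more often in baskets containing the left-hand item than in baskets overall. Here, whole milk is roughly 1.57 times as likely when pot plants are present.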
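> #==---------- Improve model performance -------------------------------------
>
> # Sort the rules by different criteria, extract subsets, and export them
> #-- 1. Sort the set of association rules
> #------- Sort by lift; the by argument also accepts 'support' or 'confidence'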
> inspect(sort(myrules,by='lift')[1:5])
    lhs                   rhs                  support     confidence coverage   lift     count
[1] {herbs}            => {root vegetables}    0.007015760 0.4312500  0.01626843 3.956477 69
[2] {berries}          => {whipped/sour cream} 0.009049314 0.2721713  0.03324860 3.796886 89
[3] {other vegetables,
     tropical fruit,
     whole milk}       => {root vegetables}    0.007015760 0.4107143  0.01708185 3.768074 69
[4] {beef,
     other vegetables} => {root vegetables}    0.007930859 0.4020619  0.01972547 3.688692 78
[5] {other vegetables,
     tropical fruit}   => {pip fruit}          0.009456024 0.2634561  0.03589222 3.482649 93
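> #-- 2. Extract subsets of the rules with subset()
> # Suppose we want to know whether a particular item is bought together with other items; berries is used as the example here
>
> berriesRules<-subset(myrules,items %in% 'berries') # items can be replaced by lhs or rhs to require the item on the left-hand or right-hand side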
> inspect(berriesRules)
    lhs          rhs                  support     confidence coverage  lift     count
[1] {berries} => {whipped/sour cream} 0.009049314 0.2721713  0.0332486 3.796886 89
[2] {berries} => {yogurt}             0.010574479 0.3180428  0.0332486 2.279848 104
[3] {berries} => {other vegetables}   0.010269446 0.3088685  0.0332486 1.596280 101
[4] {berries} => {whole milk}         0.011794611 0.3547401  0.0332486 1.388328 116
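> berriesRules<-subset(myrules,lhs %in% 'berries' & lift>2) # conditions can be combined with & (and), | (or) and ! (not)
> # When extracting rules: items matches an item on either side of the rule; %pin% matches item names partially, %ain% requires all listed items to be present; conditions on support, confidence, etc. can be added as well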
>
> #-- 3. Write the rules to a file or convert them to a data frame
> write(myrules,file = 'rules.csv',sep=',',quote=TRUE,row.names=FALSE) # write to a CSV file
> dat<-as(myrules,'data.frame') # convert to a data frame
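Once the rules are in a data frame, ordinary base R filtering and sorting apply. The sketch below does not call arules. It rebuilds a small data frame by hand from the four berries rules printed earlier. It assumes the converted data frame has a `rules` text column followed by the quality measures. The name `berries_df` is hypothetical.

```r
# Hand-entered copy of the inspect(berriesRules) output shown earlier.
berries_df <- data.frame(
  rules = c("{berries} => {whipped/sour cream}",
            "{berries} => {yogurt}",
            "{berries} => {other vegetables}",
            "{berries} => {whole milk}"),
  support    = c(0.009049314, 0.010574479, 0.010269446, 0.011794611),
  confidence = c(0.2721713, 0.3180428, 0.3088685, 0.3547401),
  lift       = c(3.796886, 2.279848, 1.596280, 1.388328),
  count      = c(89, 104, 101, 116),
  stringsAsFactors = FALSE
)

# Keep rules with lift above 2, mirroring the lift>2 condition used with subset().
strong <- berries_df[berries_df$lift > 2, ]

# Order the remaining rules by confidence, highest first.
strong <- strong[order(strong$confidence, decreasing = TRUE), ]

print(strong$rules)
```

With these numbers, only the whipped/sour cream and yogurt rules pass the lift filter, and the yogurt rule ranks first by confidence.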

