Association Rule Mining

  • Uploaded by: Allison Collier
  • 0
  • 0
  • July 2020
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Association Rule Mining as PDF for free.

More details

  • Words: 1,857
  • Pages: 34
Association rule mining Lecture 30/ 06-10-09

L30/ 06-10-09

1

Fk-1 X F1 Method

– Extend each frequent (k-1) itemset with other frequent items. – E.g. {beer, diapers} can be merged with {bread}, that will result in 3-itemset. – This procedure is complete b’coz every frequent k-itemset is generated with the help of frequent (k-1)-itemset and a frequent 1itemset. L30/ 06-10-09

2

One drawback of this method is that it doesn’t prevent the same candidate itemset from being generated more than once. {bread, diapers, milk} can be obtained in no. of ways.

L30/ 06-10-09

3

• If the items are kept sorted in lexicographic order in each frequent itemset, then this problem can be avoided. • E.g. {bread, diapers} can be merged with {milk} but {diapers, milk} with {bread} must not be possible. Or {bread, milk} with {diapers} is not possible.

L30/ 06-10-09

4

Fk-1 X Fk-1 Method • Merges a pair of frequent (k-1) itemsets only if their first k-2 items are identical. • Let A={a1,a2,…..ak-1} and B={b1,b2,…bk-1} be a pair of frequent (k-1) itemsets. • A and B will be merged if ai=bi (for i=1,2,….k-2) and ak-1=/bk-1.

L30/ 06-10-09

5

This example shows the completeness of candidate generation procedure and advantage of using lexicographic ordering to avoid duplicate generation of the candidate itemset.

L30/ 06-10-09

6



Frequent itemset generation of Apriori Algorithm 1: k=1



2: Fk=

• •

3: repeat 4: k=k+1 5: Ck=ap-gen(Fk-1).

• •

{i | i ∈ I ∧ σ ({i} ≥ N × min sup}. {find all frequent 1-itemsets}



6: 7:



8:

• • • •

9: 10: 11: 12:



13: until Fk=



14: Result =



{generates candidate itemsets}

for each transaction t T do Ct=subset(Ck,t). {Identify all candidates that belong to t} for each candidate itemset c

σ (c ) = σ ( c ) + 1



Ct do {increment support count}

endfor endfor {c | c ∈ C k ∧ σ (c) ≥ N × min sup} F k= {find all frequent k-itemsets}

φ

 Fk

Step 6 to 11 are computing support count L30/ 06-10-09

7

Support Counting • Support counting- determining the frequency of occurrence of each candidate itemset after candidate generation as well as candidate pruning step of “ap-gen” function. • Fig 6.2 explains one way of counting the support of each candidate itemset.

This approach is computationally expensive.

L30/ 06-10-09

8

Alternate approach • Enumerate the itemsets contained in each transaction. • Then determine whether each enumerated 3itemset corresponds to an existing candidate itemset. • If there is a match with a candidate then support count of that candidate is incremented. • The operation of matching the candidate itemset with enumerated 3-itemset can be done using hash tree structure. L30/ 06-10-09

9

Each itemset keeps its items in increasing lexicographic order

The prefix structure shown in this figure explains how systematic enumeration of the itemsets contained in a transaction can be done. L30/ 06-10-09

10

Support counting using hash tree • Candidate itemsets are divided into different buckets and stored in a hash tree. • For support counting, itemsets contained in a transaction are hashed into their appropriate buckets. • In this way, each itemset is matched only with candidate itemsets that belong to the same bucket.

L30/ 06-10-09

11

L30/ 06-10-09

12

Maximal Frequent Itemset An itemset is maximal frequent if none of its immediate supersets is frequent null

Maximal Itemsets

A

B

C

D

E

AB

AC

AD

AE

BC

BD

BE

CD

CE

ABC

ABD

ABE

ACD

ACE

ADE

BCD

BCE

BDE

Infrequent Itemsets

ABCD

ABCE

L30/ 06-10-09

ABDE

ACDE

Border BCDE 13

Closed Itemset • An itemset is closed if none of its immediate supersets has the same support count as the itemset TID 1 2 3 4 5

Items {A,B} {B,C,D} {A,B,C,D} {A,B,D} {A,B,C,D}

Itemset {A} {B} {C} {D} {A,B} {A,C} {A,D} {B,C} {B,D} {C,D} L30/ 06-10-09

Support 4 5 3 4 4 2 3 3 4 3

Itemset {A,B,C} {A,B,D} {A,C,D} {B,C,D} {A,B,C,D}

Support 2 3 2 3 2

14

Maximal vs Closed Itemsets Transaction Ids

null

TID

Item s

1

ABC

2

ABCD

3

BCE

4

ACDE

5

DE

124

123

A

12

124

AB

12

24

AC

ABC

B

AE

24 ABD

ABE

2

345 D

2 BC

3

BD

4 ACD

245

C

123

4

AD

2

1234

2

4 ACE

BE

ADE

BCD

E

24

3

CD

BCE

34

CE

BDE

45

4

4 ABCD

ABCE

Not supported by any transactions

ABDE

ACDE

BCDE

ABCDE

L30/ 06-10-09

15

DE

CDE

Maximal vs Closed Frequent Itemsets Closed but not maximal

Minimum support = 2

null

124

123

A

12

124

AB

12 ABC

24

AC

B

AE

24 ABD

ABE

345 D

2 BC

3

BD

4 ACD

245

C

123

4

AD

2

1234

ACE

BE

2

4 ADE

Closed and maximal

BCD

E

24

3

CD

BCE

34

CE

BDE

45

4

DE

CDE

# Closed = 9 2

# Maximal = 4

4 ABCD

ABCE

ABDE

ACDE

L30/ 06-10-09 ABCDE

BCDE

16

Maximal vs Closed Itemsets Frequent Itemsets Closed Frequent Itemsets Maximal Frequent Itemsets

L30/ 06-10-09

17

Hash func. H(p)=p mod 3

L30/ 06-10-09

18

L30/ 06-10-09

19

Reducing Number of Comparisons

• Candidate counting:

– Scan the database of transactions to determine the support of each candidate itemset – To reduce the number of comparisons, store the candidates in a hash structure • Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets

Transactions

N

TID 1 2 3 4 5

Items Bread, Milk Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke L30/ 06-10-09 Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke

Hash Structure

k 20

Generate Hash Tree Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} You need: • Hash function • Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)

Hash function 3,6,9 1,4,7

234 567 345 136

145

2,5,8 124 457

125 458

L30/ 06-10-09

159

356 357 689

367 368

21

Association Rule Discovery: Hash tree Candidate Hash Tree

Hash Function

1,4,7

3,6,9

2,5,8

234 567 145

136 345

Hash on 1, 4 or 7 124 457

125 458

159

L30/ 06-10-09

356 357 689

367 368

22

Association Rule Discovery: Hash tree Candidate Hash Tree

Hash Function

1,4,7

3,6,9

2,5,8

234 567 145

136 345

Hash on 2, 5 or 8 124 457

125 458

159

L30/ 06-10-09

356 357 689

367 368

23

Association Rule Discovery: Hash tree Candidate Hash Tree

Hash Function

1,4,7

3,6,9

2,5,8

234 567 145

136 345

Hash on 3, 6 or 9 124 457

125 458

159

L30/ 06-10-09

356 357 689

367 368

24

Subset Operation Given a transaction t, what are the possible subsets of size 3?

Transaction, t 1 2 3 5 6

Level 1 1 2 3 5 6

2 3 5 6

Level 2 12 3 5 6

13 5 6

123 125 126

135 136

15

L30/ 06-10-09

6

156

23 5 6

25 6

235 236

256 25

3

Subset Operation Using Hash Tree 1 2 3 5 6 transaction Hash Function

1+ 2356

2+ 356

1,4,7

3+ 56

3,6,9

2,5,8

234 567 145

136 345

124 457

125 458

159

356 357 689

L30/ 06-10-09

367 368

26

Subset Operation Using Hash Tree

Hash Function

1 2 3 5 6 transaction

1+ 2356

2+ 356

12+ 356

1,4,7

3+ 56

3,6,9

2,5,8

13+ 56 234 567

15+ 6 145

136 345

124 457

125 458

159

L30/ 06-10-09

356 357 689

367 368

27

Subset Operation Using Hash Tree

Hash Function

1 2 3 5 6 transaction

1+ 2356

2+ 356

12+ 356

1,4,7

3+ 56

3,6,9

2,5,8

13+ 56 234 567

15+ 6 145

136 345

124 457

125 458

159

356 357 689

367 368

Match transaction against 11 out of 15 candidates L30/ 06-10-09

28

Factors Affecting Complexity

• Choice of minimum support threshold

– lowering support threshold results in more frequent itemsets – this may increase number of candidates and max length of frequent itemsets

• Dimensionality (number of items) of the data set – more space is needed to store support count of each item – if number of frequent items also increases, both computation and I/O costs may also increase

• Size of database – since Apriori makes multiple passes, run time of algorithm may increase with number of transactions

• Average transaction width – transaction width increases with denser data sets – This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)

L30/ 06-10-09

29



Compact Representation of Frequent Itemsets Some itemsets are redundant because they have identical support as their supersets

TID A1 A2 A3 A4 1 1 1 1 1 1 1 1 1 2 10  = 3×   k  1 1 1 1 3

• Number of frequent itemsets

• Need a compact representation

10

∑ k =1

L30/ 06-10-09

30

Maximal Frequent Itemset An itemset is maximal frequent if none of its immediate supersets is frequent null Maximal Itemsets

A

B

C

D

E

AB

AC

AD

AE

BC

BD

BE

CD

CE

ABC

ABD

ABE

ACD

ACE

ADE

BCD

BCE

BDE

Infrequent Itemsets ABCD

ABCE

L30/ 06-10-09

ABDE

ACDE

Border

BCDE

31

Closed Itemset • An itemset is closed if none of its immediate supersets has the same support as the itemset

TID 1 2 3 4 5

Items {A,B} {B,C,D} {A,B,C,D} {A,B,D} {A,B,C,D}

Itemset {A} {B} {C} {D} {A,B} {A,C} {A,D} {B,C} {B,D} L30/ 06-10-09 {C,D}

Support 4 5 3 4 4 2 3 3 4 3

Itemset {A,B,C} {A,B,D} {A,C,D} {B,C,D} {A,B,C,D}

Support 2 3 2 3 2

32

Maximal vs. Closed Itemsets TID

Item s

1

ABC

2

ABCD

3

BCE

4

ACDE

5

DE

Transaction Ids

null

124

123

A

12

124

AB

12

24

AC

ABC

B

AE

24 ABD

ABE

2

345 D

2 BC

3

BD

4 ACD

245

C

123

4

AD

2

1234

2

4 ACE

BE

ADE

BCD

E

24

3

CD

BCE

34

CE

BDE

45

4

4 ABCD

ABCE

Not supported by any transactions

ABDE

ACDE

BCDE

ABCDE

L30/ 06-10-09

33

DE

CDE

Maximal vs. Closed Frequent Itemsets Minimum support = 2 124

123

A

12

124

AB

12 ABC

24

AC

AD

ABD

ABE

2

1234

B

AE

345

D

2 BC

3

BD

4 ACD

245

C

123

4

24

2

Closed but not maximal

null

2

4 ACE

BE

ADE

BCD

E

24

3

CD

BCE

Closed and maximal 34

CE

BDE

45

4

DE

CDE

4 ABCD

ABCE

ABDE

ACDE

BCDE

# Closed = 9 # Maximal = 4

ABCDE

L30/ 06-10-09

34

Related Documents


More Documents from "Allison Collier"