Association rule mining Lecture 30/ 06-10-09
L30/ 06-10-09
1
Fk-1 X F1 Method
– Extend each frequent (k-1) itemset with other frequent items. – E.g. {beer, diapers} can be merged with {bread}, that will result in 3-itemset. – This procedure is complete b’coz every frequent k-itemset is generated with the help of frequent (k-1)-itemset and a frequent 1itemset. L30/ 06-10-09
2
One drawback of this method is that it doesn’t prevent the same candidate itemset from being generated more than once. {bread, diapers, milk} can be obtained in no. of ways.
L30/ 06-10-09
3
• If the items are kept sorted in lexicographic order in each frequent itemset, then this problem can be avoided. • E.g. {bread, diapers} can be merged with {milk} but {diapers, milk} with {bread} must not be possible. Or {bread, milk} with {diapers} is not possible.
L30/ 06-10-09
4
Fk-1 X Fk-1 Method • Merges a pair of frequent (k-1) itemsets only if their first k-2 items are identical. • Let A={a1,a2,…..ak-1} and B={b1,b2,…bk-1} be a pair of frequent (k-1) itemsets. • A and B will be merged if ai=bi (for i=1,2,….k-2) and ak-1=/bk-1.
L30/ 06-10-09
5
This example shows the completeness of candidate generation procedure and advantage of using lexicographic ordering to avoid duplicate generation of the candidate itemset.
L30/ 06-10-09
6
•
Frequent itemset generation of Apriori Algorithm 1: k=1
•
2: Fk=
• •
3: repeat 4: k=k+1 5: Ck=ap-gen(Fk-1).
• •
{i | i ∈ I ∧ σ ({i} ≥ N × min sup}. {find all frequent 1-itemsets}
•
6: 7:
•
8:
• • • •
9: 10: 11: 12:
•
13: until Fk=
•
14: Result =
∈
{generates candidate itemsets}
for each transaction t T do Ct=subset(Ck,t). {Identify all candidates that belong to t} for each candidate itemset c
σ (c ) = σ ( c ) + 1
∈
Ct do {increment support count}
endfor endfor {c | c ∈ C k ∧ σ (c) ≥ N × min sup} F k= {find all frequent k-itemsets}
φ
Fk
Step 6 to 11 are computing support count L30/ 06-10-09
7
Support Counting • Support counting- determining the frequency of occurrence of each candidate itemset after candidate generation as well as candidate pruning step of “ap-gen” function. • Fig 6.2 explains one way of counting the support of each candidate itemset.
This approach is computationally expensive.
L30/ 06-10-09
8
Alternate approach • Enumerate the itemsets contained in each transaction. • Then determine whether each enumerated 3itemset corresponds to an existing candidate itemset. • If there is a match with a candidate then support count of that candidate is incremented. • The operation of matching the candidate itemset with enumerated 3-itemset can be done using hash tree structure. L30/ 06-10-09
9
Each itemset keeps its items in increasing lexicographic order
The prefix structure shown in this figure explains how systematic enumeration of the itemsets contained in a transaction can be done. L30/ 06-10-09
10
Support counting using hash tree • Candidate itemsets are divided into different buckets and stored in a hash tree. • For support counting, itemsets contained in a transaction are hashed into their appropriate buckets. • In this way, each itemset is matched only with candidate itemsets that belong to the same bucket.
L30/ 06-10-09
11
L30/ 06-10-09
12
Maximal Frequent Itemset An itemset is maximal frequent if none of its immediate supersets is frequent null
Maximal Itemsets
A
B
C
D
E
AB
AC
AD
AE
BC
BD
BE
CD
CE
ABC
ABD
ABE
ACD
ACE
ADE
BCD
BCE
BDE
Infrequent Itemsets
ABCD
ABCE
L30/ 06-10-09
ABDE
ACDE
Border BCDE 13
Closed Itemset • An itemset is closed if none of its immediate supersets has the same support count as the itemset TID 1 2 3 4 5
Items {A,B} {B,C,D} {A,B,C,D} {A,B,D} {A,B,C,D}
Itemset {A} {B} {C} {D} {A,B} {A,C} {A,D} {B,C} {B,D} {C,D} L30/ 06-10-09
Support 4 5 3 4 4 2 3 3 4 3
Itemset {A,B,C} {A,B,D} {A,C,D} {B,C,D} {A,B,C,D}
Support 2 3 2 3 2
14
Maximal vs Closed Itemsets Transaction Ids
null
TID
Item s
1
ABC
2
ABCD
3
BCE
4
ACDE
5
DE
124
123
A
12
124
AB
12
24
AC
ABC
B
AE
24 ABD
ABE
2
345 D
2 BC
3
BD
4 ACD
245
C
123
4
AD
2
1234
2
4 ACE
BE
ADE
BCD
E
24
3
CD
BCE
34
CE
BDE
45
4
4 ABCD
ABCE
Not supported by any transactions
ABDE
ACDE
BCDE
ABCDE
L30/ 06-10-09
15
DE
CDE
Maximal vs Closed Frequent Itemsets Closed but not maximal
Minimum support = 2
null
124
123
A
12
124
AB
12 ABC
24
AC
B
AE
24 ABD
ABE
345 D
2 BC
3
BD
4 ACD
245
C
123
4
AD
2
1234
ACE
BE
2
4 ADE
Closed and maximal
BCD
E
24
3
CD
BCE
34
CE
BDE
45
4
DE
CDE
# Closed = 9 2
# Maximal = 4
4 ABCD
ABCE
ABDE
ACDE
L30/ 06-10-09 ABCDE
BCDE
16
Maximal vs Closed Itemsets Frequent Itemsets Closed Frequent Itemsets Maximal Frequent Itemsets
L30/ 06-10-09
17
Hash func. H(p)=p mod 3
L30/ 06-10-09
18
L30/ 06-10-09
19
Reducing Number of Comparisons
• Candidate counting:
– Scan the database of transactions to determine the support of each candidate itemset – To reduce the number of comparisons, store the candidates in a hash structure • Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets
Transactions
N
TID 1 2 3 4 5
Items Bread, Milk Bread, Diaper, Beer, Eggs Milk, Diaper, Beer, Coke L30/ 06-10-09 Bread, Milk, Diaper, Beer Bread, Milk, Diaper, Coke
Hash Structure
k 20
Generate Hash Tree Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} You need: • Hash function • Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)
Hash function 3,6,9 1,4,7
234 567 345 136
145
2,5,8 124 457
125 458
L30/ 06-10-09
159
356 357 689
367 368
21
Association Rule Discovery: Hash tree Candidate Hash Tree
Hash Function
1,4,7
3,6,9
2,5,8
234 567 145
136 345
Hash on 1, 4 or 7 124 457
125 458
159
L30/ 06-10-09
356 357 689
367 368
22
Association Rule Discovery: Hash tree Candidate Hash Tree
Hash Function
1,4,7
3,6,9
2,5,8
234 567 145
136 345
Hash on 2, 5 or 8 124 457
125 458
159
L30/ 06-10-09
356 357 689
367 368
23
Association Rule Discovery: Hash tree Candidate Hash Tree
Hash Function
1,4,7
3,6,9
2,5,8
234 567 145
136 345
Hash on 3, 6 or 9 124 457
125 458
159
L30/ 06-10-09
356 357 689
367 368
24
Subset Operation Given a transaction t, what are the possible subsets of size 3?
Transaction, t 1 2 3 5 6
Level 1 1 2 3 5 6
2 3 5 6
Level 2 12 3 5 6
13 5 6
123 125 126
135 136
15
L30/ 06-10-09
6
156
23 5 6
25 6
235 236
256 25
3
Subset Operation Using Hash Tree 1 2 3 5 6 transaction Hash Function
1+ 2356
2+ 356
1,4,7
3+ 56
3,6,9
2,5,8
234 567 145
136 345
124 457
125 458
159
356 357 689
L30/ 06-10-09
367 368
26
Subset Operation Using Hash Tree
Hash Function
1 2 3 5 6 transaction
1+ 2356
2+ 356
12+ 356
1,4,7
3+ 56
3,6,9
2,5,8
13+ 56 234 567
15+ 6 145
136 345
124 457
125 458
159
L30/ 06-10-09
356 357 689
367 368
27
Subset Operation Using Hash Tree
Hash Function
1 2 3 5 6 transaction
1+ 2356
2+ 356
12+ 356
1,4,7
3+ 56
3,6,9
2,5,8
13+ 56 234 567
15+ 6 145
136 345
124 457
125 458
159
356 357 689
367 368
Match transaction against 11 out of 15 candidates L30/ 06-10-09
28
Factors Affecting Complexity
• Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets – this may increase number of candidates and max length of frequent itemsets
• Dimensionality (number of items) of the data set – more space is needed to store support count of each item – if number of frequent items also increases, both computation and I/O costs may also increase
• Size of database – since Apriori makes multiple passes, run time of algorithm may increase with number of transactions
• Average transaction width – transaction width increases with denser data sets – This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)
L30/ 06-10-09
29
•
Compact Representation of Frequent Itemsets Some itemsets are redundant because they have identical support as their supersets
TID A1 A2 A3 A4 1 1 1 1 1 1 1 1 1 2 10 = 3× k 1 1 1 1 3
• Number of frequent itemsets
• Need a compact representation
10
∑ k =1
L30/ 06-10-09
30
Maximal Frequent Itemset An itemset is maximal frequent if none of its immediate supersets is frequent null Maximal Itemsets
A
B
C
D
E
AB
AC
AD
AE
BC
BD
BE
CD
CE
ABC
ABD
ABE
ACD
ACE
ADE
BCD
BCE
BDE
Infrequent Itemsets ABCD
ABCE
L30/ 06-10-09
ABDE
ACDE
Border
BCDE
31
Closed Itemset • An itemset is closed if none of its immediate supersets has the same support as the itemset
TID 1 2 3 4 5
Items {A,B} {B,C,D} {A,B,C,D} {A,B,D} {A,B,C,D}
Itemset {A} {B} {C} {D} {A,B} {A,C} {A,D} {B,C} {B,D} L30/ 06-10-09 {C,D}
Support 4 5 3 4 4 2 3 3 4 3
Itemset {A,B,C} {A,B,D} {A,C,D} {B,C,D} {A,B,C,D}
Support 2 3 2 3 2
32
Maximal vs. Closed Itemsets TID
Item s
1
ABC
2
ABCD
3
BCE
4
ACDE
5
DE
Transaction Ids
null
124
123
A
12
124
AB
12
24
AC
ABC
B
AE
24 ABD
ABE
2
345 D
2 BC
3
BD
4 ACD
245
C
123
4
AD
2
1234
2
4 ACE
BE
ADE
BCD
E
24
3
CD
BCE
34
CE
BDE
45
4
4 ABCD
ABCE
Not supported by any transactions
ABDE
ACDE
BCDE
ABCDE
L30/ 06-10-09
33
DE
CDE
Maximal vs. Closed Frequent Itemsets Minimum support = 2 124
123
A
12
124
AB
12 ABC
24
AC
AD
ABD
ABE
2
1234
B
AE
345
D
2 BC
3
BD
4 ACD
245
C
123
4
24
2
Closed but not maximal
null
2
4 ACE
BE
ADE
BCD
E
24
3
CD
BCE
Closed and maximal 34
CE
BDE
45
4
DE
CDE
4 ABCD
ABCE
ABDE
ACDE
BCDE
# Closed = 9 # Maximal = 4
ABCDE
L30/ 06-10-09
34