Chapter 4: Association Pattern Mining

This chapter proposes different algorithms for mining association patterns and their extensions. Section 4.2 describes the proposed algorithms MBA (Matrix Based Association pattern mining algorithm), CBA (Cluster Based Association pattern mining algorithm), and Hybrid (Hybrid association pattern mining algorithm) for mining association patterns. Like most conventional association pattern mining algorithms, MBA and CBA are based on the level-wise bottom-up approach, while Hybrid is based on a novel approach; it is observed that algorithm Hybrid is efficient for mining dense databases. Section 4.3 discusses how these algorithms can be extended by adding new modules to achieve generalized association pattern mining, incremental association pattern mining, multi-dimensional association pattern mining (for categorical as well as quantitative attributes), and maxmaxpattern mining. These algorithms have been implemented and used in the association rule mining system (AR-Miner) described in Chapter 6 of this thesis.

4.1 A Critical Look at Currently Used Algorithms

A critical analysis of the existing literature has led the author to identify the following limitations/shortcomings in the available algorithms for frequent itemset generation:



• Recently proposed graph-based approaches [38,48] for mining association patterns require the construction of the 'Association Graph' [38] and the 'Frequent Pattern tree' [48], respectively, for every new value of minimum support, by scanning the entire database. Even when the database is incremented or mining is performed at some higher concept level, the existing 'Association Graph' or 'Frequent Pattern tree' has to be reconstructed from scratch. Construction of the 'Association Graph' or 'Frequent Pattern tree' is a time-consuming activity. Further, these approaches do not offer flexibility and reusability of computation during the mining process.



• Existing association pattern mining algorithms [1,3,4,38,40,50,113] may generate and test unnecessary, insignificant 2-candidate itemsets due to the presence of false frequent items (refer to Section 3.3.4). In [50] it is observed that the execution time of the first two passes of the Apriori algorithm [4] is about 62 percent of the total execution time. This is a strong motivation for filtering false frequent items at an early stage, which reduces the generation and testing of 2-candidate itemsets, C2.



• AIS [1], SETM [40], Apriori [4] and its variants, and the graph-based association pattern mining algorithms [33,38,48] have no advance information about the maximum size of the frequent itemsets present in the database, at the given minimum support, prior to actual mining. Hence, they continue to generate and test candidate itemsets until the set of candidate itemsets in a pass becomes null. Due to this, many insignificant candidate itemsets are generated and tested. Testing of such insignificant candidate itemsets may also demand one more database scan for the algorithms of [1,4,38,40] and one more 'Frequent Pattern tree' scan for the algorithm of [48].



• In the literature, algorithms are available for maintaining association rules when transactions are added to or deleted from the database. However, algorithms are not available for mining incremental rules due to the addition of more items/attributes.

• Algorithms [1,2,3,4,48,50,113] are not suitable or efficient for verification-driven association rule mining. Further, the problem worsens if the data changes frequently.

4.2 Proposed Algorithms

4.2.1 Matrix Based Association Pattern Mining Algorithm (MBA)

To address the above-mentioned limitations/shortcomings, a new algorithm (MBA) is proposed in this section for mining association patterns. MBA uses a complete, level-wise, bottom-up search with a vertical (encoded) data layout and enumerates all frequent itemsets. It is an iterative algorithm that counts k-itemsets in pass k by using Property 4.1. In each pass it scans an encoded database of the same size. The frequent itemsets of a pass are retained for the next higher pass, and the process is repeated until all frequent itemsets of various lengths have been enumerated. It uses the property and lemmas given below, together with Lemma 3.1 of Chapter 3. The proposed algorithm is based on the Flexible Uniform Framework (FUF) described in Chapter 3; Phase 1 of FUF, which corresponds to encoding and counting, has not been shown in the algorithm listing, for better readability.

Property 4.1: The support for the itemset (i1, i2, i3, …, ik) is the number of 1s in β1 ∧ β2 ∧ β3 ∧ … ∧ βk, where '∧' is the logical AND operator.
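As an illustration of Property 4.1, the following minimal Python sketch (hypothetical code, not the thesis implementation, which operates on the encoded Concentrator) represents each item's column as an integer bit vector and obtains the support of an itemset by ANDing the vectors and counting the 1s:

    # Minimal sketch of Property 4.1 over a four-transaction example.
    # Each item's bit vector has one bit per transaction (assumed encoding).
    beta = {
        "A": 0b1010,   # item A occurs in transactions 1 and 3
        "B": 0b0111,   # item B occurs in transactions 2, 3, and 4
        "C": 0b1011,
        "D": 0b0111,
    }
    N_TRANSACTIONS = 4

    def support(itemset):
        """Support = number of 1s in beta_1 AND beta_2 AND ... AND beta_k."""
        acc = (1 << N_TRANSACTIONS) - 1        # all-ones accumulator
        for item in itemset:
            acc &= beta[item]                  # the logical AND of Property 4.1
        return bin(acc).count("1")             # count of the surviving 1s

    print(support(("B", "C", "D")))            # -> 2

Because the AND runs over whole machine words, support counting of this form stays cheap even for long itemsets.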

Lemma 4.1: If an itemset is not a frequent itemset, then any superset of this itemset cannot be a frequent itemset.

Lemma 4.2: A frequent itemset (i1, i2, i3, …, ik) cannot be extended to form a frequent itemset (i1, i2, i3, …, ik, x) if the pair (ik, x) ∉ L2.

The formal description of the algorithm is given below.

Algorithm: Matrix Based Association pattern mining algorithm (MBA)

Input:   Data file, D; Minimum support, min_supp
Output:  Frequent itemsets, L

Process:
begin
    L1 = φ;                                     // L1 is the set of 1-frequent items
    L = φ;
    L1 = identify_frequent_item(count_table, min_supp);
    if |L1| < 2 then
        print "No association pattern exists at given support";
        exit(0);
    else
        ξ = make_concentrator(D, L1);           // Construction of the Concentrator
        forall l ∈ L1 do
            if l.count < min_supp then          // Detection of a false 1-frequent item
                L1 = L1 \ l;                    // Filtering of the false 1-frequent item from L1
            endif;
        endfor;
        if |L1| < 2 then
            print "No association pattern exists at given support";
            exit(0);
        else
            γ = predict_size(ξ, min_supp);      // Predicts the size of the maximum-length frequent itemset
                                                // which may be present in the database at the given support
            L = L1;
            for (k = 2; Lk-1 ≠ φ AND k ≤ γ; k++) do
                Ck = gen_pattern(Lk-1, ξ);      // Generates new candidate itemsets of size k, 2 ≤ k ≤ γ
                forall c ∈ Ck do                // Tests whether a candidate itemset is frequent
                    if c.count ≥ min_supp then
                        Lk = Lk ∪ c;
                    endif;
                endfor;
                L = L ∪ Lk;
            endfor;
        endif;
    endif;
end;

function gen_pattern(Lk-1, ξ)
// Generates the k-candidate itemsets from the given (k-1)-frequent itemsets and
// calculates the actual support of the generated candidate itemsets.
    Ck = φ;
    for (i = 1; i ≤ |Lk-1| - 1; i++) do
        for (j = i + 1; j ≤ |Lk-1|; j++) do
            if left(ti, |ti| - 1) == left(tj, |tj| - 1) then    // ti, tj: ith and jth rows of Lk-1
                if left(ti, |ti| - 1) = φ then
                    c.item = ti ∪ tj;
                else
                    c.item = left(ti, |ti| - 1) + right(tj, 1);
                endif;
                c.count = supp(c.item, ξ);
                Ck = Ck ∪ c;
            else
                exitfor;
            endif;
        endfor;
    endfor;
    return Ck;

A brief description of the algorithm is given below.

L1 (the set of frequent items of size 1) and L (the set of frequent itemsets) are initialized to null. The function identify_frequent_item() is then called; it identifies the 1-frequent items and assigns them to L1 (this step corresponds to Phase 2 of FUF). The input parameters to this function are count_table (constructed during Phase 1 of FUF and not shown in the algorithm) and the user-defined minimum support, min_supp. A description of this function has been given in Section 3.3.2 of Chapter 3. If the number of 1-frequent items in L1 is less than 2, i.e. |L1| < 2, then no association pattern exists in the database at the given minimum support and the algorithm terminates. Otherwise, the function make_concentrator() is invoked (Phase 3 of FUF). This function constructs the Concentrator for the given database at the user-defined support. The input parameters of this function are the database (D) and the 1-frequent items (L1) obtained during Phase 2. Constructing the Concentrator requires one full scan of the database; a detailed description of this function is given in Section 3.3.3 of Chapter 3. After constructing the Concentrator, false frequent items are detected and filtered out from L1 (Phase 4 of FUF), as explained in Section 3.3.4 of Chapter 3. After filtering the false frequent items from L1, the size of L1 is tested again. If the size of L1 becomes less than two, then the algorithm terminates, since no association patterns can exist in the data at the given minimum support. Otherwise the function predict_size() is called (Phase 5 of FUF). The input parameters to this function are the Concentrator ξ and the minimum support min_supp, and it returns γ, an upper limit on the size of the largest frequent itemset which may be present in the database at the user-defined minimum support. This function has been explained in Section 3.3.5 of Chapter 3. The function gen_pattern() is then called to generate the k-candidate itemsets from the (k-1)-frequent itemsets; it uses Lemmas 4.1 and 4.2 during this process. Generated candidate itemsets not all of whose (k-1)-subsets are frequent are filtered out, and the remaining itemsets constitute the k-candidate itemsets, Ck. This function also finds the actual support of the generated candidate itemsets by using Property 4.1. The approach used for generating candidate itemsets is similar to [4]; however, it differs in the support calculation mechanism.

Each candidate itemset c ∈ Ck is then tested for frequency. If c.count ≥ min_supp, then c is a frequent itemset and is appended to Lk, the set of k-frequent itemsets. At the end of pass k, Lk is appended to L. This process is repeated while Lk ≠ φ AND k ≤ γ. At the end, L gives the frequent itemsets.
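The prefix join performed by gen_pattern() can be sketched in Python as follows (a simplified, hypothetical illustration: itemsets are sorted tuples, and the Lemma 4.2 prune and the Property 4.1 support counting are omitted):

    def gen_candidates(prev_frequent):
        """Join two (k-1)-frequent itemsets sharing the same (k-2)-prefix
        to form a k-candidate, mirroring the structure of gen_pattern()."""
        prev = sorted(prev_frequent)           # lexicographically ordered tuples
        candidates = []
        for i in range(len(prev) - 1):
            for j in range(i + 1, len(prev)):
                # left(t, |t|-1): every item of the tuple except the last
                if prev[i][:-1] == prev[j][:-1]:
                    candidates.append(prev[i] + prev[j][-1:])
                else:
                    break                      # prefixes diverge: stop the inner loop
        return candidates

    print(gen_candidates([("A", "B"), ("A", "C"), ("B", "C")]))
    # -> [('A', 'B', 'C')]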

4.2.1.1 Performance Evaluation for MBA

To assess the performance of the Matrix Based Association pattern mining algorithm developed in this research, experiments have been performed on a Pentium-4 based PC with a CPU clock rate of 1.6 GHz and 128 MB of main memory, running Windows 98. The experiments have been performed on real-life datasets, obtained from a retail departmental store (M/s NEW CITY SUPERMARKET, Bangalore, India), which consisted of the details of sales transactions. A transaction contained data about the items purchased by a customer in one visit to the store. A snapshot of such a dataset, used for the experiments in this research, is shown in Appendix-D. The following tables show the execution times taken by MBA for different datasets.

Table 4.1: Execution times in seconds for different datasets at different values of minimum support

                        Minimum Support
    Dataset           0.01 %    0.51 %    1.01 %
    T12.I8.D70K          859       389       378
    T12.I11.D70K        3091       820       836

Table 4.2: Execution times in seconds for data having 8 frequent items at minimum support of 0.01% (T12.I8.D70K)

    Number of Transactions (K):   15    20    25    30    35    40    45    50    55    60    65    70
    MBA                          154   168   189   211   232   251   269   289   310   324   347   859

Table 4.3: Execution times in seconds for data having 5 frequent items at minimum support of 0.01% (T12.I5.D70K)

    Number of Transactions (K):   15   20   25   30   35   40   45   50   55   60   65   70
    MBA                           68   74   77   83   90   96   98  106  109  115  119  126

Table 4.4: Execution times in seconds at minimum support of 0.01% (70K transactions)

    Number of Frequent Items:     3    4    5    6    7    8    11    17
    MBA                          49   76  125  313  620  859  3091  4085

Table 4.5: Execution times in seconds for MBA (Dense dataset T7.I10.D5K)

    Minimum Support:    30%   40%   50%   60%   70%   80%
    MBA                1039   547   262   129    63    35

Table 4.1 gives the results obtained during experiments conducted to see the effect of minimum support on the execution time for a given dataset. The results indicate that execution time increases with decreasing minimum support, because at the reduced minimum support there are more itemsets to test. Table 4.2 shows that, for a given number of frequent items (eight) and a specified minimum support of 0.01%, the execution time increases continuously as the size of the dataset (number of transactions) increases. Similar trends at a lower number of frequent items (five) and the same minimum support of 0.01% are also noticed in Table 4.3. However, the rate of increase is more rapid for the case given in Table 4.2, which may be attributed to the increase in the number of frequent items. Table 4.4 depicts the effect of the number of frequent items on the execution time at a minimum support of 0.01%. It is noticed that the execution time increases drastically as the number of frequent items grows beyond five, and it continues to behave the same way as the number of frequent items increases further. This therefore limits the use of MBA to those datasets in which the number of frequent items is around five at the given minimum support.

80

Table 4.5 gives the results obtained during experiments conducted on dense data to see the effect of minimum support on the execution time. The results indicate that execution time increases with decreasing minimum support. Further, the execution time required, even for data having only a few thousand transactions, is very large compared to sparse data. This may be attributed to the combinatorial explosion of itemsets, making the algorithm impracticable for dense data. MBA as presented above scans all the tuples (transactions) of the Concentrator during each pass. In pass k it also scans those tuples of the Concentrator containing fewer than k 1-frequent items. Statistically, scanning all such tuples is insignificant. MBA is thus inefficient when the data to be mined is large and has many frequent items at the given minimum support: the unnecessary scanning of tuples requires more disk I/O operations and degrades performance. This calls for a modification of MBA to handle such cases.

4.2.2 Cluster Based Association Pattern Mining Algorithm (CBA)

The Cluster Based Association pattern mining algorithm (CBA) uses a novel approach to reduce disk I/O operations. It divides the tuples of the Concentrator into γ-1 clusters at the given minimum support. It may be recalled that γ is the size of the maximum-size frequent itemset which may be present in the data at the given minimum support; the value of γ can be predicted as in the case of MBA. During a particular pass only those tuples that are statistically useful are scanned, as illustrated by the following example:

• Suppose that, for a given database at an assumed minimum support of 15%, six is predicted to be the maximum size of any frequent itemset which may be present in the data in the worst case. Then the tuples of the Concentrator are divided into 5 clusters (i.e. γ-1). The first four clusters ω2, ω3, ω4, and ω5 contain all those tuples containing 2, 3, 4, and 5 frequent items respectively, and the last cluster ω6 contains those tuples in which the number of frequent items is greater than or equal to 6. When frequent itemsets of size 5 are searched, these itemsets may only be present in clusters ω5 and ω6. Thus, for this particular case, the transactions available in clusters ω2, ω3, and ω4 are statistically insignificant and should not be scanned. Therefore, during each higher pass the number of transactions scanned reduces compared to the previous pass.

This scan-reduction approach reduces the amount of disk I/O required and makes CBA more efficient than MBA, especially when the database is large and the number of 1-frequent items is also high. Like MBA, the Cluster Based Association pattern mining algorithm uses a complete, level-wise, bottom-up search with a vertical (encoded) data layout and enumerates all frequent itemsets. It is an iterative algorithm that counts itemsets of a specific length in a given database pass by using Property 4.1. However, in contrast to MBA, in each pass it scans only those transactions which seem statistically significant. The frequent itemsets of a pass are retained for the next higher pass, and the process is repeated until all frequent itemsets of various lengths have been enumerated. CBA is based on the following lemma, in addition to those used by MBA.

Lemma 4.4: A frequent itemset of size k cannot be present in any cluster ωx, where x < (k-1) and ωx is a cluster containing tuples having x 1-frequent items.

The formal description of the proposed algorithm is given below.

Algorithm: Cluster Based Association pattern mining algorithm (CBA)

Input:   Data file, D; Minimum support, min_supp
Output:  Frequent itemsets, L

Process:
begin
    L1 = φ;                                     // L1 is the set of 1-frequent items
    L = φ;
    L1 = identify_frequent_item(count_table, min_supp);    // Identifies 1-frequent items - Phase 2 of FUF
    if |L1| < 2 then
        print "No association pattern exists at given support";
        exit(0);
    else
        ξ = make_concentrator(D, L1);           // Constructs the Concentrator - Phase 3 of FUF
        //Loop 1: detection and filtering of false frequent items - Phase 4 of FUF
        forall l ∈ L1 do
            if l.count < min_supp then          // Detection of a false 1-frequent item
                L1 = L1 \ l;                    // Filtering of the false 1-frequent item from L1
            endif;
        endfor;
        if |L1| < 2 then
            print "No association pattern exists at given support";
            exit(0);
        else
            //Perform activities corresponding to Phase 5 of FUF
            altertable ξ add (hcount) attribute;
            k = no_of_attribute(ξ);
            //Loop 2: count the 1-frequent items (1s) in each tuple
            for (i = 1; i ≤ |ξ|; i++) do
                for (j = 2; j ≤ k; j++) do
                    ti.hcount = ti.hcount + ti,j;
                endfor;
            endfor;
            position = (|ξ| * min_supp) / 100;
            ξ' = select all from ξ order by hcount desc;
            γ = ξ'position.hcount;
            ξ = ξ';
            // ******* Association Pattern Generation
            //Loop 3: making of the clusters
            for (i = 1; i ≤ γ; i++) do
                ξ(i) = select all from ξ where ξ.hcount ≥ i;
            endfor;
            //Loop 4: candidate itemset generation
            for (k = 2; Lk-1 ≠ φ AND k ≤ γ; k++) do
                Ck = gen_pattern_cluster(Lk-1, ξ(k));   // Generates new candidate itemsets of size k, 2 ≤ k ≤ γ
                forall c ∈ Ck do                // Tests whether a candidate itemset is frequent
                    if c.count ≥ min_supp then
                        Lk = Lk ∪ c;
                    endif;
                endfor;
                L = L ∪ Lk;
            endfor;
        endif;
    endif;
end;

function gen_pattern_cluster(Lk-1, ξ(k))
// Generates the k-candidate itemsets from the given (k-1)-frequent itemsets
// and calculates the actual support of the generated candidates.
    Ck = φ;
    for (i = 1; i ≤ |Lk-1| - 1; i++) do         //Loop 5
        for (j = i + 1; j ≤ |Lk-1|; j++) do
            if left(ti, |ti| - 1) == left(tj, |tj| - 1) then    // ti, tj: ith and jth rows of Lk-1
                if left(ti, |ti| - 1) = φ then
                    c.item = ti ∪ tj;
                else
                    c.item = left(ti, |ti| - 1) + right(tj, 1);
                endif;
                c.count = supp(c.item, ξ(k));
                Ck = Ck ∪ c;
            else
                exitfor;
            endif;
        endfor;
    endfor;
    return Ck;

A brief description of the algorithm is given below. L1 (the set of frequent items of size 1) and L (the set of frequent itemsets) are initialized to null. The function identify_frequent_item() is called, which identifies the 1-frequent items and assigns them to L1 (Phase 2 of FUF). The input parameters to this function are count_table, constructed during Phase 1 of FUF (not shown in the algorithm), and the user-defined minimum support, min_supp. A description of this function has been given in Section 3.3.2 of Chapter 3. If |L1| < 2, then no association pattern exists in the database at the given minimum support and the algorithm terminates. Otherwise the function make_concentrator() is invoked (Phase 3 of FUF), which constructs the Concentrator for the given database at the user-defined support, as explained in Section 3.3.3 of Chapter 3. After constructing the Concentrator, false frequent items are detected and filtered out from L1 (Phase 4 of FUF), as explained in Section 3.3.4 of Chapter 3. Next, the activities corresponding to Phase 5 of FUF start and the value of γ is computed. The tuples of the Concentrator are then divided into γ-1 clusters (Loop 3) in such a way that each cluster i, where 2 ≤ i < γ, contains only those tuples in which the number of 1s is i (i.e. containing i 1-frequent items), except the last cluster, which contains all those tuples for which the number of 1s is greater than or equal to γ. The function gen_pattern_cluster(), which is similar to gen_pattern() used in algorithm MBA, is then called (Loop 4) to generate the candidate itemsets of size k, i.e. Ck, by using the frequent itemsets of the previous pass, i.e. (k-1). Each c ∈ Ck is then tested (by using Property 4.1) whether it is frequent or not, by searching for it in all those clusters ωi for which i ≥ k-1. In contrast to gen_pattern(), this function scans only those transactions which are statistically significant (Lemma 4.4). Thus, during each higher pass the number of scanned transactions reduces drastically. For each itemset, occurrences are counted, and an itemset for which item.count ≥ min_supp is a frequent itemset and is appended to Lk, the set of k-frequent itemsets. At the end of pass k, Lk is appended to L. This process is repeated till Lk ≠ φ AND k ≤ γ. At the end, L gives the frequent itemsets.
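The cluster construction and scan reduction described above can be sketched as follows (a hypothetical Python illustration of the idea; the thesis implementation keeps the clusters inside the Concentrator via the hcount attribute):

    from collections import defaultdict

    def make_clusters(rows, gamma):
        """Bucket each Concentrator row by its number of 1s: cluster i holds
        rows with exactly i 1-frequent items (2 <= i < gamma), and cluster
        gamma holds rows with gamma or more."""
        clusters = defaultdict(list)
        for row in rows:                       # row: 0/1 list over the 1-frequent items
            ones = sum(row)
            if ones >= 2:                      # rows with < 2 items support no pattern
                clusters[min(ones, gamma)].append(row)
        return clusters

    def rows_for_pass(clusters, k, gamma):
        """Per Lemma 4.4, a k-itemset can occur only in clusters w_i with
        i >= k-1, so each higher pass scans fewer rows."""
        for i in range(max(2, k - 1), gamma + 1):
            yield from clusters.get(i, [])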

4.2.2.1 Performance Evaluation for CBA

For testing the performance of CBA (Cluster Based Association pattern mining algorithm), experiments similar to those for MBA were performed. The following tables show the execution times taken by the algorithm for different datasets.

Table 4.6: Execution times in seconds for CBA

                        Minimum Support
    Dataset           0.01 %    0.51 %    1.01 %
    T12.I8.D70K          265       248       242
    T12.I11.D70K         519       296       283

Table 4.7: Execution times in seconds for data having 8 frequent items at minimum support of 0.01%

    Number of Transactions (K):   15    20    25    30    35    40    45    50    55    60    65    70
    CBA                          122   138   147   162   169   179   193   202   215   229   242   267

Table 4.8: Execution times in seconds for data having 5 frequent items at minimum support of 0.01%

    Number of Transactions (K):   15   20   25   30   35   40   45   50   55   60   65   70
    CBA                           70   79   88   99  102  111  123  133  143  156  174  193

Table 4.9: Execution times in seconds at minimum support of 0.01% (70K transactions)

    Number of Frequent Items:     3    4    5    6    7    8    11    17
    CBA                          93  124  161  176  217  267   519   713
    MBA                          49   76  125  313  620  859  3091  4085

Table 4.10: Execution times in seconds for CBA (Dense dataset T7.I10.D5K)

    Minimum Support:    30%   40%   50%   60%   70%   80%
    CBA                1156   587   323   153    80    71

Table 4.6 gives the results obtained during experiments conducted to see the effect of minimum support on the execution time for a given dataset. The results indicate that the execution time of CBA, like that of MBA, increases with decreasing minimum support; this was as per expectation, because at reduced minimum support there are more candidate itemsets to test. However, in this case the rate of increase with decreasing support is lower than for MBA. Table 4.6 also indicates that the execution times at supports of 0.51% and 1.01% were approximately the same, because the numbers of candidate itemsets generated and tested at these supports were approximately the same. Table 4.7 shows that, for a given number of frequent items (eight) and a specified minimum support of 0.01%, the execution time increases continuously as the size of the dataset (number of transactions) increases. Similar trends at a lower number of frequent items (five) and the same minimum support of 0.01% are also noticed in Table 4.8. However, the rate of increase is more rapid for the case given in Table 4.7, which may be attributed to the increase in the number of frequent items.

Table 4.9 depicts the effect of the number of frequent items on the execution time at the given minimum support. It is noticed that the execution time increases as the number of frequent items increases; however, the rate of increase is much lower than for MBA. Further, it is to be noted (Table 4.9) that when there are more than 5 frequent items, CBA becomes more efficient than MBA, and the performance gap increases with the number of frequent items in the data at the given minimum support. As both MBA and CBA use the same candidate itemset generation and testing (counting) techniques, the improvement in the execution times for CBA is mainly due to the reduction in the scanned database size during the higher passes. It is noticed that MBA scanned the same size of database during each successive pass, while CBA reduced the scanned database size progressively during each higher pass. However, during pass 1 both MBA and CBA scanned the same size of database, and the time required by MBA is less than that of CBA, as shown in Figure 4.1. This was expected, because during this pass CBA also performed some computational effort to make the clusters. The performance gap between MBA and CBA increases with higher passes. Table 4.10 gives the results obtained during experiments conducted on dense data to see the effect of minimum support on the execution time. The results indicate that execution time increases with decreasing minimum support. Further, the execution time required, even for data having only a few thousand transactions, is very large compared to sparse data at the reduced minimum support. This may be attributed to the combinatorial explosion of itemsets, making the algorithm impracticable for dense data.

[Figure 4.1: Time taken by MBA and CBA in different passes (min_supp = 0.01%); (a) T12.I8.D70K, (b) T12.I11.D70K. x-axis: Pass Number; y-axis: Time (Second).]

4.2.3 Hybrid Association Pattern Mining Algorithm (Hybrid)

Traditional association pattern mining algorithms have adopted an Apriori-like candidate-set generation-and-test approach (the level-wise bottom-up approach). These algorithms always start by searching for frequent items of size one (known as 1-frequent items) and sequentially reach the frequent itemsets of the largest size present in the database. This approach works well when the database is not dense. To mine dense data efficiently, a novel algorithm called Hybrid is presented below. The Hybrid algorithm is complete and based on a vertical data layout. It uses a hybrid approach (a combination of the bottom-up and top-down approaches) for searching frequent itemsets. The initial steps of the algorithm are the same as in the proposed algorithms MBA and CBA. After computing the value of γ at the given minimum support, the algorithm starts searching for a frequent itemset of this maximum size, and the search continues until one frequent itemset of the largest size present in the data is encountered. This frequent pattern of maximum size is a maxpattern and is called the maxmaxpattern (Pm^c) in this work. Mining of the maxmaxpattern uses a level-wise top-down approach. The remaining frequent itemsets of the database (all of which contain at least one item from L1 - Pm^c) are then mined using the level-wise bottom-up approach. The formal description of the algorithm that incorporates the above ideas is given below.

Algorithm: Hybrid association pattern mining algorithm (Hybrid)

Input:   Data file, D; Minimum support, min_supp
Output:  Maxpatterns, L

Process:
begin
    //Step 1
    L1 = φ;                                     // L1 is the set of 1-frequent items
    L = φ;
    //Step 2
    L1 = identify_frequent_item(count_table, min_supp);
    //Step 3
    if |L1| < 2 then
        print "No association pattern exists at given support";
        exit(0);
    else
        ξ = make_concentrator(D, L1);           // Construction of the Concentrator
        //Step 4
        forall l ∈ L1 do
            if l.count < min_supp then          // Detection of false 1-frequent items
                L1 = L1 \ l;                    // Filtering of false 1-frequent items from L1
            endif;
        endfor;
        //Step 5
        if |L1| < 2 then
            print "No association pattern exists at given support";
            exit(0);
        else
            γ = predict_size(ξ, min_supp);      // Predicts the size of the maximum frequent pattern(s)
                                                // which may be present in the database at the given support
            //Step 6
            Pm^c = combined_pattern(ξ, L1, γ);
            //Step 7
            forall e ∈ Pm^c do
                Z = L1 \ e;
            endfor;
            L = L ∪ Pm^c ∪ L1;
            forall e ∈ Pm^c do
                L1 = L1 \ e;
            endfor;
            first = true;
            //Step 8
            for (k = 2; Lk-1 ≠ φ AND k ≤ γ; k++) do
                if first = true then
                    Ck = incr_gen_pattern1(Z, ξ, L1);
                    first = false;
                else
                    Ck = incr_gen_pattern(Lk-1, ξ);
                endif;
                forall c ∈ Ck do                // Tests whether a candidate itemset is frequent
                    if c.count ≥ min_supp then
                        Lk = Lk ∪ c;
                    endif;
                endfor;
                L = L ∪ Lk;
            endfor;
            //Step 9
            L = maxpattern(L);
        endif;
    endif;
end;

function combined_pattern(ξ, L1, γ)
    for (; γ ≥ 2; γ--) do
        forall s ∈ subset(L1, γ) do
            if supp(s, ξ) ≥ min_supp then
                return s;                       // the first frequent itemset of maximum size
            endif;
        endfor;
    endfor;

function maxpattern(L)
// Generates maxpatterns from the given frequent itemsets
    Lmax = φ;                                   // Lmax: the set of maxpatterns
    L = order L by pattern size, descending;
    forall l ∈ L do                             // Loop 1
        flag = false;
        forall l' ∈ Lmax do                     // Loop 2
            if l ⊆ l' then
                flag = true;
                exitfor;
            endif;
        endfor;
        if flag = false then
            Lmax = Lmax ∪ l;
        endif;
    endfor;
    return Lmax;

It is to be noted that incr_gen_pattern1() and incr_gen_pattern() are similar to gen_pattern() described earlier.
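The maxpattern() filter of Step 9 can be sketched in Python as follows (an illustrative version, under the assumption that frequent itemsets are given as tuples):

    def maxpatterns(frequent_itemsets):
        """Keep only itemsets that are not a subset of any other frequent
        itemset: scan in decreasing size order, as maxpattern() does."""
        lmax = []
        for l in sorted(frequent_itemsets, key=len, reverse=True):
            if not any(set(l) <= set(m) for m in lmax):
                lmax.append(l)
        return lmax

    print(maxpatterns([("B", "C", "D"), ("A", "C"), ("B", "C"), ("A",)]))
    # -> [('B', 'C', 'D'), ('A', 'C')]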


A brief description of each step of the algorithm is given below.

Steps 1-5: These steps of the algorithm are the same as in MBA and CBA and correspond to Phases 2 to 5 of the proposed Flexible Uniform Framework.

Step 6: During this step the function combined_pattern() is called. The input parameters to the function are the Concentrator (ξ), the 1-frequent item set (L1), and the predicted size of the largest frequent itemset (γ). The function searches for any one frequent itemset of maximum size (having greater or equal support amongst the frequent itemsets of this size) present in the Concentrator at the given min_supp and assigns it to Pm^c. The function works as follows. First, it searches whether any itemset of the predicted size γ is frequent. For this, a sub-function subset() is called, which generates the patterns of size γ one by one (from the items of L1); each generated pattern is then tested for frequency. subset() generates first those itemsets which have a higher potential to become frequent. As soon as a frequent pattern of this size is encountered the function terminates; otherwise the current value of γ is decremented by 1, i.e. γ = γ - 1, and the above steps are repeated for the new value of γ, provided the current value of γ is greater than 2.

Step 7: During this step the difference of L1 and Pm^c is obtained and stored in Z. The maxmaxpattern (Pm^c) and the members of L1 are appended to L. Then the items which constitute Pm^c are deleted from L1.

Step 8: This step generates all the frequent itemsets containing at least one item from Z (obtained in Step 7) and the remaining items from Pm^c. For each item i ∈ Z, 2-candidate itemsets are generated by calling the function incr_gen_pattern1(). For instance, if z1 ∈ Z, then the 2-candidate itemsets due to z1 can be obtained by making pairs (z1, e), where 'e' is a 1-frequent item and e ∈ Pm^c. Then z1 is appended to Pm^c and deleted from Z, and the same procedure is repeated for the remaining items of Z. Each 2-candidate itemset is then tested for frequency; if it is frequent, it is appended to L2. At the end, L2 is appended to L. Afterwards, frequent itemsets of size (k-1), where k ≥ 3, are used for making candidate itemsets of size k by calling the function incr_gen_pattern(), whose working is similar to that of gen_pattern() described earlier. Each candidate itemset (of size k) is tested; if it is frequent, it is appended to Lk. At the end, Lk is appended to L. This process is repeated till |Lk| ≥ 2.

Step 9: During this step maxpatterns are computed by calling maxpattern(). A brief description of the function follows. The input parameter to the function is L, and it returns Lmax. First, Lmax is initialized to null. L is arranged in descending order on the basis of the size of the frequent patterns, i.e. the number of items in them. Each element l ∈ L is picked up (Loop 1) and tested whether it is a subset of any pattern l' ∈ Lmax (Loop 2). If it passes the test, then flag is set to true and Loop 2 is exited. If flag is false, then l is appended to Lmax. At the end, the function returns Lmax, the set of maxpatterns.

To demonstrate the working of the algorithm, an example is outlined below. Let the item universe be I = {A, B, C, D, E} and the transaction universe be Tid = {100, 200, 300, 400}. In Table 4.11, 100, 200, 300, and 400 are the unique identifiers of the four transactions. Let min_supp = 50% (to be frequent, an itemset must occur in at least 2 transactions).


Table 4.11: A transaction database

    Tid    Items
    100    A, C, E
    200    B, D
    300    A, B, C, D
    400    B, C, D

Each row in the above table can be considered as a transaction. The occurrence of each item in the database is counted. Table 4.12 shows the count value of each item present in the database.

Table 4.12: Frequency counts of items

    Item    Count
    A       2
    B       3
    C       3
    D       3
    E       1

Now trace the steps of algorithm Hybrid as follows:

1. L1 and L are initialized to null.


2. Function identify_frequent_item() is called. It returns L1 = {A, B, C, D} as the 1-frequent item set at the given minimum support of 50%.

3. The size of L1 is tested. Since |L1| ≥ 2, the Concentrator ξ is constructed as shown in Table 4.13. Note that the item 'E' is not present in the Concentrator because it is not frequent.

Table 4.13: Concentrator, ξ

    Tid    A  B  C  D
    100    1  0  1  0
    200    0  1  0  1
    300    1  1  1  1
    400    0  1  1  1

4. Now false frequent items are searched for by using Lemma 3.1. Note that no false frequent item exists in the Concentrator at the given minimum support.

5. Now function predict_size() is called. This function returns γ, the maximum size of the frequent itemset(s) which may be present in the database at the given minimum support. For the example database the value of γ is three (3) at the given min_supp of 50%.

6. During this step, one frequent itemset of maximum size present in the database, for the given value of min_supp, is searched for by calling function combined_pattern(). This function calls a sub-function subset(), which generates the subsets of size γ of the 1-frequent set L1, i.e. subsets of size 3 from {A, B, C, D}. These subsets of L1 are generated and tested one by one. Let the first pattern generated by subset() be {B, C, D}. This pattern is now tested for frequency: the count value of {B, C, D} is 2, which is equal to the minimum support count. Thus {B, C, D} is a maxmaxpattern (Pm^c).


7. During this step L1 \ Pm^c (i.e. {A, B, C, D} \ {B, C, D}) is computed and the result is stored in Z, i.e. Z = {A}. Now {B, C, D} (the maxmaxpattern) and {A}, {B}, {C}, and {D} (the members of L1) are appended to L. Then the items B, C, and D (the constituents of Pm^c) are deleted from L1.

8. This step generates all the frequent itemsets having at least one item from Z (obtained in step 7) and the remaining items from Pm^c. For the given case, 'A' is the only element present in Z. The candidate itemsets of size 2 are generated by calling function incr_gen_pattern1(), i.e. C2 = {{AB}, {AC}, {AD}}. These itemsets are then tested; note that {AC} (having a support of 50%) is the only frequent itemset. This itemset is assigned to L2. Since the size of L2 is one, the loop terminates and the contents of L2 are appended to L.

9. During this step maxpatterns are generated from all the frequent itemsets available in L, and the result is stored in L. Thus, L = {BCD, AC}.
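The top-down search of Step 6 can be sketched in Python as follows (a hypothetical illustration; the real combined_pattern() generates the most promising subsets first and counts support against the Concentrator):

    from itertools import combinations

    def combined_pattern(items, gamma, support, min_count):
        """Try all subsets of size gamma, then gamma-1, and so on, returning
        the first frequent subset found: the maxmaxpattern Pm^c."""
        for size in range(gamma, 1, -1):
            for subset in combinations(items, size):
                if support(subset) >= min_count:
                    return subset
        return None

    # Support counts of the size-3 subsets in the example database (Table 4.11):
    counts = {("A", "B", "C"): 1, ("A", "B", "D"): 1,
              ("A", "C", "D"): 1, ("B", "C", "D"): 2}
    print(combined_pattern(("A", "B", "C", "D"), 3,
                           lambda s: counts.get(tuple(sorted(s)), 0), 2))
    # -> ('B', 'C', 'D')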

4.2.3.1 Performance Evaluation for Hybrid

Table 4.14 shows the execution times taken by the Hybrid algorithm when tested on dense data (T7.I10.D5K) at different values of minimum support. It is noticed from the table that the execution time of the Hybrid algorithm decreased with decreasing minimum support, while the execution times increased with decreasing minimum support in the case of MBA (Table 4.5) and CBA (Table 4.10); the results of MBA and CBA have been reproduced in Table 4.14 for comparison. The reduction in execution time for Hybrid with decreasing minimum support was as per expectation, because at reduced minimum support there are better chances of getting a maxmaxpattern (Pm^c) of larger size, so fewer candidate itemsets are generated and tested. This is the reason the Hybrid algorithm took the lead up to a minimum support value of 50%; beyond that the performance of Hybrid degraded, for the same reason, while MBA and CBA improved. It is also observed that the Hybrid algorithm was most efficient when, at a given minimum support, the predicted size of the maximum-size frequent itemset, the size of L1, and the average number of frequent items per transaction were close to each other.

Table 4.14: Execution times in seconds for MBA, CBA, and Hybrid

    Dataset: T7.I10.D5K (Dense dataset)

                       Minimum Support
    Algorithm    30%   40%   50%   60%   70%   80%
    MBA         1039   547   262   129    63    35
    CBA         1156   587   323   153    80    71
    Hybrid        19    76   111   185   270   314

4.3 Extensions of Association Pattern Mining

The algorithms presented in Section 4.2 generate all association patterns satisfying the given confidence and support requirements. This section presents algorithms, based on the proposed FUF, which either generate association patterns under other requirements or extend the basic definition of what an association pattern is.

4.3.1 Mining of Generalized Association Patterns

Items present in the database can be generalized to the required concept level according to the approach discussed in Section 3.4.2. Mining can then be performed on these generalized items by using any appropriate algorithm presented in Section 4.2 of this chapter.

4.3.2 Mining Multi-dimensional Databases

Business and scientific domains may have multi-dimensional databases with quantitative and categorical attributes. This sub-section discusses how the algorithms proposed in Section 4.2 can be extended for mining association patterns in such databases. The quantitative association rule problem can be mapped to the Boolean association rule problem. If all attributes are categorical or the quantitative attributes have only a few values, this mapping is straightforward: conceptually, instead of having just one field in the table for each attribute, there are as many fields as the number of attribute values. For instance, the categorical attribute (X, {1, 2, 3, 4}) can be transformed into the following set of Boolean attributes: {(X_1, {0, 1}), (X_2, {0, 1}), (X_3, {0, 1}), (X_4, {0, 1})} such that X_i = 0 if X ≠ i and X_i = 1 if X = i. If the domain of values of a quantitative attribute is large, then a static discretization approach is adopted, as given below. Prior to actual mining, all quantitative attributes of interest present in the database are identified and discretized using predefined concept hierarchies, where numeric values are replaced by ranges. The discretized numeric attributes, with their values, can then be treated as categorical attributes, where each range is considered a category. The proposed algorithms for finding association patterns can then be applied to the resultant data to find all frequent predicate sets (rather than frequent itemsets). The problem of mining a multi-dimensional database containing categorical and/or quantitative attributes has been decomposed into the steps given below; a short illustrative sketch of step 2 follows the list.

1. Identify the quantitative and categorical attributes of interest present in the database to be mined.

2(a). For each quantitative attribute obtained in step 1, determine the number of partitions to be made for the domain of the attribute. For quantitative attributes that are not partitioned into intervals, the values are mapped to consecutive integers such that the order of the values is preserved. If a quantitative attribute is partitioned into intervals, the intervals are mapped to consecutive integers such that the order of the intervals is preserved.

2(b). For each categorical attribute obtained in step 1, map the values of the attribute to a set of consecutive integers. In this step categorical attributes may also be generalized to higher conceptual levels if desired.

3. Find the association patterns by using an algorithm proposed in Section 4.2 (select the algorithm according to the database characteristics).

4. Generate the association rules from the association patterns obtained in step 3.
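As an illustration of step 2, the following Python sketch (with hypothetical attribute names) maps a categorical attribute to Boolean fields and a quantitative value to the consecutive integer of the interval it falls in:

    def booleanize_categorical(name, value, domain):
        """Map (X, {1,2,3,4}) to Boolean attributes X_1..X_4 with
        X_i = 1 if X = i and X_i = 0 otherwise, as described above."""
        return {f"{name}_{v}": int(value == v) for v in domain}

    def discretize_quantitative(name, value, intervals):
        """Map a numeric value to the consecutive integer of its interval,
        preserving the order of the intervals."""
        for idx, (low, high) in enumerate(intervals, start=1):
            if low <= value < high:
                return {name: idx}
        raise ValueError(f"{value!r} lies outside every interval of {name}")

    print(booleanize_categorical("X", 3, [1, 2, 3, 4]))
    # -> {'X_1': 0, 'X_2': 0, 'X_3': 1, 'X_4': 0}
    print(discretize_quantitative("age", 34, [(0, 18), (18, 40), (40, 120)]))
    # -> {'age': 2}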

4.3.3 Mining of the Maxmaxpattern

The maxmaxpattern is a maximum-size association pattern present in the database, having maximum support amongst the association patterns of the same size at the given minimum support. Mining such a pattern is an important data mining task in the business domain. For instance, a mobile service provider offering 20 different paid services to its valued customers may be interested in knowing which group of services the customers like most at different values of minimum support. Such databases are generally dense. The mining of the maxmaxpattern in such a database can be done by modifying algorithm Hybrid (by deleting steps 7 onwards).

4.3.4 Incremental Association Pattern Mining

Since the mining of association patterns is computationally expensive, an efficient approach to maintain the frequent itemsets is needed when the database is updated. A database update occurs on account of the insertion/deletion of transactions or attributes in the original database. Several algorithms [20,27,28,97] are available in the literature addressing the first issue. However, the second issue, i.e. the addition of attributes (deletion of an attribute is straightforward), has not been addressed to the best of the author's knowledge, and it has therefore been taken up in this work. One simple but inefficient solution to the problem is to rerun the mining algorithm on the updated database. The proposed approach, in contrast, generates the association patterns that appear due to the incremental database alone, without needing to know the association patterns of the original database. A formal presentation of the proposed algorithm, along with a description of each step, is given below.

Algorithm: Algorithm for mining Incremental Association Patterns (IAP)

Input:   Concentrator of the original database (D), ξ; Incremental data, ∆d; Minimum support, min_supp
Output:  Frequent itemsets due to the incremental attributes, L∆d

Process:
begin
    //Step 1
    L1∆d = φ;                                   // L1∆d is the set of 1-frequent items due to ∆d
    L∆d = φ;
    //Step 2
    L1∆d = identify_frequent_item(incr_count_table, min_supp);
    //Step 3
    if |L1∆d| < 1 then
        print "No association pattern exists due to the incremental database";
        exit(0);
    else
        ξ∆d = make_concentrator_incr(∆d, L1∆d); // Construction of the Concentrator for ∆d
        //Step 4
        forall l ∈ L1∆d do
            if l.count < min_supp then          // Detection of false 1-frequent items
                L1∆d = L1∆d \ l;                // Filtering of false 1-frequent items from L1∆d
            endif;
        endfor;
        //Step 5
        if |L1∆d| < 1 then
            print "No association pattern exists at given support";
            exit(0);
        else
            ξ'∆d = join(ξ, ξ∆d);
            L∆d = L1∆d;
            first = true;
            //Step 6
            for (k = 2; Lk-1∆d ≠ φ; k++) do
                if first = true then
                    Ck∆d = incr_gen_pattern1(Lk-1∆d, ξ'∆d, L1);
                    first = false;
                else
                    Ck∆d = incr_gen_pattern(Lk-1∆d, ξ'∆d);
                endif;
                forall c ∈ Ck∆d do              // Tests whether a candidate itemset is frequent
                    if c.count ≥ min_supp then
                        Lk∆d = Lk∆d ∪ c;
                    endif;
                endfor;
                L∆d = L∆d ∪ Lk∆d;
            endfor;
        endif;
    endif;
end;

function incr_gen_pattern1(Lk-1∆d, ξ'∆d, L1)
    Ck∆d = φ;
    for (i = 1; i ≤ |Lk-1∆d|; i++) do
        for (j = 1; j ≤ |L1|; j++) do
            c.item = ti + tj;                   // ti: ith row of Lk-1∆d; tj: jth row of L1
            c.count = supp(c.item, ξ'∆d);
            Ck∆d = Ck∆d ∪ c;
        endfor;
        L1 = L1 ∪ ti;                           // ti now pairs with the later incremental items
    endfor;
    return Ck∆d;

function incr_gen_pattern(Lk-1∆d, ξ'∆d)
    Ck∆d = φ;
    for (i = 1; i ≤ |Lk-1∆d| - 1; i++) do
        for (j = i + 1; j ≤ |Lk-1∆d|; j++) do
            if left(ti, |ti| - 1) == left(tj, |tj| - 1) then    // ti, tj: ith and jth rows of Lk-1∆d
                c.item = left(ti, |ti| - 1) + right(tj, 1);
                c.count = supp(c.item, ξ'∆d);
                Ck∆d = Ck∆d ∪ c;
            else
                exitfor;
            endif;
        endfor;
    endfor;
    return Ck∆d;

A brief description of the algorithm is given below.

Step 1: This is the initialization step, where L1∆d (the set of 1-frequent items due to ∆d) and L∆d are initialized to null.

Step 2: During this step the function identify_frequent_item() is called to find the 1-frequent items (L1∆d) present in the incremental database (∆d) at the user-defined minimum support. The input parameters to this function are min_supp and incr_count_table (the structure of this table is the same as that of count_table described earlier in Section 3.3.1). The description of this function is given in Section 3.3.2 of Chapter 3.


Step 3: This step finds the size of L1∆d, i.e. the number of 1-frequent items present in the incremental database. If the size of L1∆d is less than one, then no association pattern exists due to the incremental database at the given support and the algorithm terminates; otherwise the function make_concentrator_incr() is called. The input parameters to this function are ∆d (the incremental database) and L1∆d (the 1-frequent items of the incremental database). This function is a slightly modified form of the function make_concentrator() described in Section 3.3.3 of Chapter 3. The function reads all the transactions of the incremental database, and for each transaction containing frequent items from L1∆d it puts a 1 in location (ti, j) of ξ∆d for each element j ∈ L1∆d (ti is the ith transaction, or row, of the Concentrator ξ∆d). Thus the Concentrator ξ∆d contains |L1∆d| columns, and each column is a bit vector corresponding to a particular element of L1∆d. The bit vector associated with item i is denoted by βi, and the number of 1s in a bit vector βi by βi(1).

Step 4: During this step false frequent items are detected and filtered from L1∆d. The approach used for the detection and filtering of false frequent items is the same as that used in the algorithms MBA and CBA. A description of this step is given in Section 3.4.4 of Chapter 3.

Step 5: After detecting and filtering the false frequent items from L1∆d, the size of L1∆d is tested again. If the size is less than one, then the algorithm terminates, since no frequent patterns are present in the data due to the incremental database. Otherwise the Concentrators ξ and ξ∆d are joined and the resultant Concentrator is designated ξ'∆d.

Step 6: This step generates all the frequent itemsets containing at least one item from the incremental attributes, i.e. from L1∆d, and the remaining attributes from L1. For each element i ∈ L1∆d, 2-candidate itemsets are generated by calling the function incr_gen_pattern1(). For instance, if z1 ∈ L1∆d, then the 2-candidate itemsets due to z1 can be obtained by making pairs (z1, e), where 'e' is a 1-frequent item and e ∈ L1. Then z1 is appended to L1 and the same procedure is repeated for the remaining items of L1∆d. Each 2-candidate itemset is then tested for frequency; if it is frequent, it is appended to L2∆d. At the end, L2∆d is appended to L∆d. Afterwards, frequent itemsets of size (k-1), where k ≥ 3, are used for making candidate itemsets of size k by calling the function incr_gen_pattern(), whose working is similar to that of gen_pattern() described earlier. Each candidate itemset (of size k) is tested; if it is frequent, it is appended to Lk∆d. At the end, Lk∆d is appended to L∆d. This process is repeated till |Lk∆d| ≥ 2. At the end, L∆d contains all the frequent itemsets due to the incremental attributes.
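The seeding of 2-candidates in Step 6 can be sketched as follows (hypothetical Python; support counting against the joined Concentrator ξ'∆d is omitted):

    def incremental_2candidates(new_items, old_items):
        """Pair every incremental 1-frequent item with every original
        1-frequent item (and with the new items already processed),
        mirroring incr_gen_pattern1(): every candidate contains at
        least one incremental item."""
        pool = list(old_items)
        candidates = []
        for z in new_items:                    # z in L1_delta_d
            candidates.extend((z, e) for e in pool)
            pool.append(z)                     # z now pairs with later new items
        return candidates

    print(incremental_2candidates(["X", "Y"], ["A", "B"]))
    # -> [('X', 'A'), ('X', 'B'), ('Y', 'A'), ('Y', 'B'), ('Y', 'X')]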

4.3.4.1 Performance Evaluation for IAP

For testing the performance of the Incremental Association Pattern mining algorithm (IAP), seven frequent items of dataset T12.I11.D70K were selected to make the original data (D). The remaining (four) frequent items of this data were used to make the incremental data (∆d). The execution time required due to the incremental attributes was recorded, as shown in Table 4.15.

Table 4.15: Execution times in seconds at minimum support of 0.01%

                                       Number of Incremental Attributes
    Dataset          MBA    CBA        1      2      3      4
    T12.I11.D70K    3091    519       35     66    109    181

The table indicates that the time required for mining frequent itemsets increased with the increase in the number of frequent items in the incremental data. It is also noted that the time required for four incremental attributes was only 181 seconds, while MBA and CBA took 3091 and 519 seconds respectively when executed on the entire data of T12.I11.D70K. A similar experiment was also performed on dataset T12.I17.D70K; the result of this experiment is summarized in Table 4.16.

Table 4.16: Execution times in seconds at minimum support of 0.01%

                                       Number of Incremental Attributes
    Dataset          MBA    CBA        1      2      3      4
    T12.I17.D70K    4085    713      158    375    637    782

It was also observed that the performance gap increased with the increase in the number of attributes in the original data.

