VIGNAN INSTITUTE OF INFORMATION TECHNOLOGY
DUVVADA, VISAKHAPATNAM

TITLE: CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets

PRESENTED BY:
P.BHAKTHAVATHSALANAIDU
K.K.S.SWAROOP
[email protected]
[email protected]

CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets

ABSTRACT

Association mining may often derive an undesirably large set of frequent itemsets and association rules. Recent studies have proposed an interesting alternative: mining frequent closed itemsets and their corresponding rules, which has the same power as association mining but substantially reduces the number of rules to be presented. CLOSET is an efficient algorithm for mining frequent closed itemsets, built around the development of three techniques: (1) applying a compressed frequent-pattern tree (FP-tree) structure for mining closed itemsets without candidate generation, (2) developing a single prefix-path compression technique to identify frequent closed itemsets quickly, and (3) exploring a partition-based projection mechanism for scalable mining in large databases. CLOSET is efficient and scalable over large databases and is faster than previously proposed methods.

INTRODUCTION

It has been well recognized that frequent pattern mining plays an essential role in many important data mining tasks, e.g. associations, sequential patterns, episodes, partial periodicity, etc. However, it is also well known that frequent pattern mining often generates a very large number of frequent itemsets and rules, which reduces not only the efficiency but also the effectiveness of mining, since users have to sift through a large number of mined rules to find the useful ones.

There is an interesting alternative: instead of mining the complete set of frequent itemsets and their associations, association mining only needs to find frequent closed itemsets and their corresponding rules. An important implication is that mining frequent closed itemsets has the same power as mining the complete set of frequent itemsets, but it will substantially reduce the number of redundant rules to be generated and increase both the efficiency and the effectiveness of mining.

Definition: Association rule mining searches for interesting relationships among items in a given data set.

Interestingness Measures:

Certainty: Each discovered pattern should have a measure of certainty associated with it that assesses the validity or "trustworthiness" of the pattern. A certainty measure for association rules of the form "A⇒B", where A and B are sets of items, is confidence. Given a set of task-relevant data tuples, the confidence of "A⇒B" is defined as

Confidence(A⇒B) = (# tuples containing both A and B) / (# tuples containing A)

Utility: The potential usefulness of a pattern is a factor defining its interestingness. It can be estimated by a utility function, such as support. The support of an association pattern refers to the percentage of task-relevant data tuples for which the pattern is true. For association rules of the form "A⇒B", where A and B are sets of items, it is defined as

Support(A⇒B) = (# tuples containing both A and B) / (total # of tuples)

Association rules that satisfy both a user-specified minimum confidence and a user-specified minimum support threshold are referred to as strong association rules.

Association rule mining is a two-step process:

1. Find all frequent itemsets: each of these itemsets will occur at least as frequently as a pre-determined minimum support count.

2. Generate strong association rules from the frequent itemsets: these rules must satisfy minimum support and minimum confidence.
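Both measures are simple counting operations. As a minimal illustration (a sketch, not part of the original paper), the following Python fragment computes support and confidence directly from the definitions above, using the nine-transaction example database that appears later in this paper:

    # Example transactions (the T100-T900 database used later in the text).
    transactions = [
        {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
        {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
        {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
    ]

    def support(itemset):
        """Fraction of transactions containing every item of `itemset`."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(a, b):
        """Confidence of the rule a => b, i.e. support(a | b) / support(a)."""
        return support(a | b) / support(a)

    print(support({"I1", "I2", "I5"}))        # 2/9, about 0.22
    print(confidence({"I1", "I5"}, {"I2"}))   # 2/2 = 1.0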

The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation

Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that it uses prior knowledge of frequent itemset properties. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. The finding of each Lk requires one full scan of the database.

Finding the frequent itemsets is a two-step process:

The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li. By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. The join Lk-1 ⋈ Lk-1 is performed, where members of Lk-1 are joinable if their first (k-2) items are in common. That is, members l1 and l2 are joined if

(l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]).

The condition l1[k-1] < l2[k-1] ensures that no duplicates are generated. The resulting itemset formed by joining l1 and l2 is l1[1] l1[2] … l1[k-1] l2[k-1].

The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk. Ck, however, can be huge, and so this could involve heavy computation. To reduce the size of Ck, the Apriori property is used as follows: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, and so can be removed from Ck. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.

Example: Consider the following transaction database D:

TID    List of Item_IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

1) In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.

2) Suppose that the minimum transaction support count required is 2. The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support.

3) To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2.

In this way, candidate sets are found until a candidate set is empty.
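The join and prune steps translate almost directly into code. The following Python sketch of the candidate-generation routine is an illustration, not the paper's implementation; it assumes itemsets are kept as lexicographically sorted tuples, and uses 0-based indices where the text's notation is 1-based:

    from itertools import combinations

    def apriori_gen(L_prev, k):
        """Generate candidate k-itemsets Ck from the frequent (k-1)-itemsets."""
        L_prev = sorted(L_prev)
        prev = set(L_prev)
        Ck = []
        for i, l1 in enumerate(L_prev):
            for l2 in L_prev[i + 1:]:
                # Join step: first k-2 items equal, and l1 < l2 on the last
                # item, which ensures no duplicate candidates are generated.
                if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                    cand = l1 + (l2[k - 2],)
                    # Prune step (Apriori property): every (k-1)-subset of a
                    # frequent k-itemset must itself be frequent.
                    if all(s in prev for s in combinations(cand, k - 1)):
                        Ck.append(cand)
        return Ck

    # L2 for the example database above (minimum support count 2):
    L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
          ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
    print(apriori_gen(L2, 3))   # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]

Note how pruning discards candidates such as (I1, I3, I5), whose subset (I3, I5) is not in L2, without ever scanning the database.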



Generating Association Rules from Frequent Itemsets

Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them. This can be done using the following equation for confidence, where the conditional probability is expressed in terms of itemset support count:

Confidence(A⇒B) = support_count(A∪B) / support_count(A)

where support_count(A∪B) is the number of transactions containing the itemset A∪B, and support_count(A) is the number of transactions containing the itemset A. Based on this equation, association rules can be generated as follows:

• For each frequent itemset l, generate all nonempty subsets of l.

• For every nonempty subset s of l, output the rule "s⇒(l−s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.

Since the rules are generated from frequent itemsets, each one automatically satisfies minimum support. Frequent itemsets can be stored ahead of time in hash tables along with their counts so that they can be accessed quickly.

Eg: Suppose the data contain the frequent itemset l = {I1, I2, I5}. What are the association rules that can be generated from l? The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2} and {I5}. The resulting association rules are as shown below, each listed with its confidence:

I1∧I2⇒I5, confidence = 2/4 = 50%
I1∧I5⇒I2, confidence = 2/2 = 100%
I2∧I5⇒I1, confidence = 2/2 = 100%
I1⇒I2∧I5, confidence = 2/6 = 33%
I2⇒I1∧I5, confidence = 2/7 = 29%
I5⇒I1∧I2, confidence = 2/2 = 100%

If the minimum confidence threshold is, say, 70%, then only the second, third and last rules above are output, since these are the only ones generated that are strong.
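This procedure takes only a few lines of code once the support counts are available. A minimal Python sketch (illustrative; the counts below are the ones from the worked example):

    from itertools import combinations

    # Support counts taken from the example database for l = {I1, I2, I5}.
    support_count = {
        frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
        frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
        frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
    }

    def rules_from(l, min_conf):
        """Yield every rule s => (l - s) whose confidence reaches min_conf."""
        l = frozenset(l)
        for r in range(1, len(l)):
            for s in map(frozenset, combinations(sorted(l), r)):
                conf = support_count[l] / support_count[s]
                if conf >= min_conf:
                    yield sorted(s), sorted(l - s), conf

    for lhs, rhs, conf in rules_from({"I1", "I2", "I5"}, 0.70):
        print(lhs, "=>", rhs, "confidence = {:.0%}".format(conf))
    # ['I5'] => ['I1', 'I2'] confidence = 100%
    # ['I1', 'I5'] => ['I2'] confidence = 100%
    # ['I2', 'I5'] => ['I1'] confidence = 100%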

Mining Frequent Itemsets without Candidate Generation

The Apriori algorithm suffers from two non-trivial costs:

1) It may need to generate a huge number of candidate sets.

2) It may need to repeatedly scan the database and check a large set of candidates by pattern matching.

An interesting method that mines the complete set of frequent itemsets without candidate generation is called frequent-pattern growth, or simply FP-growth. It adopts a divide-and-conquer strategy as follows: compress the database representing frequent items into a set of conditional databases, each associated with one frequent item, and mine each such database separately.

Reexamine the mining of the transaction database D above using the frequent-pattern growth approach. The first scan of the database is the same as in Apriori: it derives the set of frequent items (1-itemsets) and their support counts. Let the minimum support count be 2. The set of frequent items is sorted in the order of descending support count. This resulting set or list is denoted L. Thus, we have L = [I2:7, I1:6, I3:6, I4:2, I5:2].

An FP-tree is then constructed as follows. First, create the root of the tree, labeled with "null". Scan database D a second time. The items in each transaction are processed in L order, and a branch is created for each transaction. For example, the scan of the first transaction, "T100: I1, I2, I5", which contains three items (I2, I1, I5 in L order), leads to the construction of the first branch of the tree with three nodes, <(I2:1), (I1:1), (I5:1)>, where I2 is linked as a child of the root, I1 is linked to I2, and I5 is linked to I1. The second transaction, T200, contains the items I2 and I4 in L order, which would result in a branch where I2 is linked to the root and I4 is linked to I2. However, this branch would share a common prefix, I2, with the existing path for T100. Therefore, we instead increment the count of the I2 node by 1. In general, when considering the branch to be added for a transaction, the count of each node along a common prefix is incremented by 1, and nodes for the items following the prefix are created and linked accordingly.

To facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links. [Figure: the FP-tree obtained after scanning all of the transactions.]
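The construction just described is short to code. Below is a minimal Python sketch (an illustration, not the paper's implementation); the header table is simplified to a plain list of nodes per item rather than a chain of node-links:

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.count, self.parent = item, 1, parent
            self.children = {}          # item -> FPNode

    def build_fp_tree(transactions, min_count=2):
        # First scan: support counts, keeping only frequent items.
        counts = {}
        for t in transactions:
            for item in t:
                counts[item] = counts.get(item, 0) + 1
        freq = {i: c for i, c in counts.items() if c >= min_count}
        root, header = FPNode(None, None), {}
        # Second scan: insert each transaction in L order (descending
        # support, ties broken lexicographically), sharing common prefixes.
        for t in transactions:
            ordered = sorted((i for i in t if i in freq),
                             key=lambda i: (-freq[i], i))
            node = root
            for item in ordered:
                if item in node.children:
                    node.children[item].count += 1   # shared prefix
                else:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header.setdefault(item, []).append(child)
                node = node.children[item]
        return root, header

    transactions = [
        {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
        {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
        {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
    ]
    root, header = build_fp_tree(transactions)
    # Total support per item, recovered through the header table:
    print({i: sum(n.count for n in header[i]) for i in header})
    # I2: 7, I1: 6, I3: 6, I4: 2, I5: 2 (printed order may differ)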

The mining of the FP-tree proceeds as follows. Start from each frequent length-1 pattern, construct its conditional pattern base (a sub-database which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern), then construct its (conditional) FP-tree, and perform mining recursively on such a tree. The pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from its conditional FP-tree.

Let us first consider I5, which is the last item in L, rather than the first. The reasoning behind this will become apparent as we explain the FP-tree mining process. I5 occurs in two branches of the FP-tree. The paths formed by these branches are <(I2, I1, I5:1)> and <(I2, I1, I3, I5:1)>. Therefore, considering I5 as a suffix, its corresponding two prefix paths are <(I2 I1:1)> and <(I2 I1 I3:1)>, which form its conditional pattern base. Its conditional FP-tree contains only a single path, <I2:2, I1:2>; I3 is not included because its support count of 1 is less than the minimum support count. The single path generates all the combinations of frequent patterns: I2 I5:2, I1 I5:2, I2 I1 I5:2. In the same way, the frequent itemsets are found for all the other items, for example:

Item | Conditional pattern base   | Conditional FP-tree | Frequent patterns generated
I5   | {(I2 I1:1), (I2 I1 I3:1)}  | <I2:2, I1:2>        | I2 I5:2, I1 I5:2, I2 I1 I5:2
I4   | {(I2 I1:1), (I2:1)}        | <I2:2>              | I2 I4:2

The FP-growth method transforms the problem of finding long frequent patterns into looking for shorter ones recursively and then concatenating the suffix. It uses the least frequent items as suffixes, offering good selectivity. The method substantially reduces the search costs.
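The recursive mining step can be sketched compactly if, for brevity, each conditional pattern base is kept as a plain list of (prefix, count) pairs instead of a physical FP-tree; the recursion below mirrors the process just described (an illustration, not the paper's code):

    def fp_growth(cond_db, suffix, min_count, results):
        """cond_db: list of (items_in_L_order, count) pairs."""
        counts = {}
        for items, c in cond_db:
            for i in items:
                counts[i] = counts.get(i, 0) + c
        for item, n in counts.items():
            if n < min_count:
                continue                      # infrequent in this base
            pattern = (item,) + suffix
            results[pattern] = n
            # Conditional pattern base of `pattern`: the prefixes that
            # precede `item`, each carrying its path count.
            new_db = [(items[:items.index(item)], c)
                      for items, c in cond_db if item in items]
            fp_growth([(p, c) for p, c in new_db if p],
                      pattern, min_count, results)

    # Transactions in L order, as in the FP-tree example above.
    db = [(("I2", "I1", "I5"), 1), (("I2", "I4"), 1), (("I2", "I3"), 1),
          (("I2", "I1", "I4"), 1), (("I1", "I3"), 1), (("I2", "I3"), 1),
          (("I1", "I3"), 1), (("I2", "I1", "I3", "I5"), 1),
          (("I2", "I1", "I3"), 1)]
    results = {}
    fp_growth(db, (), 2, results)
    print(results[("I2", "I1", "I5")])   # 2, matching I2 I1 I5:2 above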

CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets

PROBLEM DEFINITION: An itemset X is contained in a transaction with itemset Y if X ⊆ Y. Given a transaction database TDB, the support of an itemset X, denoted as sup(X), is the number of transactions in TDB which contain X. An association rule R: X⇒Y is an implication between two itemsets X and Y, where X, Y ⊂ I and X∩Y = ∅. The support of the rule, denoted as sup(X⇒Y), is defined as sup(X∪Y). The confidence of the rule, denoted as conf(X⇒Y), is defined as sup(X∪Y)/sup(X).

The requirement of mining the complete set of association rules leads to two problems:

1) There may exist a large number of frequent itemsets in a transaction database, especially when the support threshold is low.

2) There may exist a huge number of association rules. It is hard for users to comprehend and manipulate a huge number of rules.

An interesting alternative to this problem is the mining of frequent closed itemsets and their corresponding association rules.

Frequent closed itemset: An itemset X is a closed itemset if there exists no itemset X' such that (1) X' is a proper superset of X, and (2) every transaction containing X also contains X'. A closed itemset X is frequent if its support passes the given support threshold. Finding the complete set of frequent closed itemsets efficiently from a large database is called the frequent closed itemset mining problem.
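The definition can be checked by brute force on a small database. The sketch below uses a five-transaction TDB consistent with the example used in the rest of this section (supports c:4, e:4, f:4, a:3, d:2, with b infrequent; treat the exact transactions as illustrative) and enumerates the frequent closed itemsets:

    from itertools import combinations

    TDB = [set("acdef"), set("abe"), set("cef"), set("acdf"), set("cef")]
    MIN_SUP = 2

    def sup(X):
        return sum(X <= T for T in TDB)

    def is_closed(X):
        # X is closed iff intersecting all transactions containing X
        # yields X itself (no proper superset with the same support).
        return set.intersection(*[T for T in TDB if X <= T]) == X

    items = sorted(set().union(*TDB))
    closed = {"".join(sorted(X)): sup(X)
              for r in range(1, len(items) + 1)
              for X in map(set, combinations(items, r))
              if sup(X) >= MIN_SUP and is_closed(X)}
    print(closed)   # {'a': 3, 'e': 4, 'ae': 2, 'cf': 4, 'cef': 3, 'acdf': 2}

This is the same set found by the divide-and-conquer method below.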

For the transaction database TDB in Table 1 with min_sup = 2, the divide-and-conquer method mines the frequent closed itemsets as follows.

1) Find frequent items. Scan TDB to find the set of frequent items and derive a global frequent item list, called f_list: f_list = {c:4, e:4, f:4, a:3, d:2}, where the items are sorted in support-descending order and any infrequent item, such as b, is omitted.

2) Divide the search space. All the frequent closed itemsets can be divided into 5 non-overlapping subsets based on the f_list: (1) the ones containing item d, (2) the ones containing item a but not d, (3) the ones containing item f but neither a nor d, (4) the ones containing e but none of f, a, or d, and (5) the one containing only c. Once all the subsets are found, the complete set of frequent closed itemsets is obtained.

3) Find the subsets of frequent closed itemsets. The subsets of frequent closed itemsets can be mined by constructing the corresponding conditional databases and mining each recursively.

For example, find the frequent closed itemsets containing d. Only the transactions containing d are needed. The d-conditional database, denoted as TDB|d, contains all the transactions having d, which is {cefa, cfa}. Notice that item d is omitted in each transaction, since it appears in every transaction in the d-conditional database. The support of d is 2. Items c, f and a each appear twice in TDB|d. Therefore, cfad:2 is a frequent closed itemset. Since this itemset covers every frequent item in TDB|d, the mining of TDB|d finishes. In the same way, find the frequent closed itemsets for a, f, e, and c.

4) The set of frequent closed itemsets found is {acdf:2, a:3, ae:2, cf:4, cef:3, e:4}.
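The projection in step 3 is easy to express in code. A minimal sketch (illustrative; the actual CLOSET adds FP-tree compression and the pruning optimizations listed below), using the same five-transaction TDB as before:

    TDB = [set("acdef"), set("abe"), set("cef"), set("acdf"), set("cef")]

    def conditional_db(tdb, item):
        """TDB|item: transactions containing `item`, with `item` removed."""
        return [t - {item} for t in tdb if item in t]

    tdb_d = conditional_db(TDB, "d")
    print(["".join(sorted(t)) for t in tdb_d])    # ['acef', 'acf']

    # Items occurring in every transaction of TDB|d can be merged with d
    # into one closed itemset: here c, f, a, giving acdf with support 2.
    common = set.intersection(*tdb_d)             # {'a', 'c', 'f'}
    print("".join(sorted(common | {"d"})), ":", len(tdb_d))   # acdf : 2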

CLOSET also applies several optimizations:

Optimization 1: Compress transactional and conditional databases using FP-tree structures. FP-tree compresses databases for frequent itemset mining, and conditional databases can be derived from an FP-tree efficiently.

Optimization 2: Extract the items appearing in every transaction of a conditional database.

Optimization 3: Directly extract frequent closed itemsets from the FP-tree.

Optimization 4: Prune search branches.

PERFORMANCE STUDY

In a comparison of A-close, CHARM, and CLOSET, CLOSET outperforms both CHARM and A-close. CLOSET is efficient and scalable in mining frequent closed itemsets in large databases. It is much faster than A-close, and also faster than CHARM.

CONCLUSION

CLOSET leads to fewer and more interesting association rules than the other previously proposed methods.
