Association Rule Mining

• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

A transaction data set can also be represented in binary form (one row per transaction, one binary column per item).

Issues regarding Association Rule Mining

1. Discovering patterns from a large transaction data set can be computationally expensive.
2. Some of the discovered patterns are not meaningful, because they occur just by chance.

Definition: Frequent Itemset

• Itemset
  – A collection of one or more items
  – Example: {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items

• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2

• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5

• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold
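To make the definitions concrete, here is a minimal Python sketch (not part of the original slides; the helper names support_count and support are illustrative) that computes the support count and support of an itemset over the market-basket transactions above:

```python
# Minimal sketch: support count and support over the example transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```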

Definition: Association Rule

• Association Rule
  – An implication expression of the form X → Y, where X and Y are disjoint itemsets, i.e. X ∩ Y = Ø
  – Example: {Milk, Diaper} → {Beer}

• Rule Evaluation Metrics
  – Support (s)
    • Fraction of transactions that contain both X and Y
  – Confidence (c)
    • Measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4

c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
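The rule metrics follow directly from the support counts. Continuing the sketch above (illustrative only; it reuses support_count and transactions from the previous block):

```python
def rule_support(X, Y, transactions):
    """s(X -> Y) = sigma(X u Y) / |T|."""
    return support_count(X | Y, transactions) / len(transactions)

def rule_confidence(X, Y, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(X | Y, transactions) / support_count(X, transactions)

print(rule_support({"Milk", "Diaper"}, {"Beer"}, transactions))     # 0.4
print(rule_confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...
```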

Why use Support and Confidence?

• A rule having low support may occur simply by chance, and it can be uninteresting as far as business interests are concerned. Support is therefore used to eliminate such rules.
• Confidence measures the reliability of the inference made by a rule X → Y: the higher the confidence, the more likely Y is to be present in transactions that contain X.


Association Rule Mining Task

• Given a set of transactions T, the goal of association rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold

• Brute-force approach (see the sketch below):
  – List all possible association rules
  – Compute the support and confidence for each rule
  – Prune rules that fail the minsup and minconf thresholds
  ⇒ Computationally prohibitive!
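A minimal sketch of this brute-force approach, reusing support, support_count and transactions from the earlier sketches (all names are illustrative; items is assumed to be the set of distinct items in the database):

```python
from itertools import combinations

def brute_force_rules(transactions, items, minsup, minconf):
    # Enumerate every itemset of size >= 2, then every way to split it into
    # a non-empty antecedent X and consequent Y = itemset \ X.
    rules = []
    for k in range(2, len(items) + 1):
        for itemset in combinations(sorted(items), k):
            itemset = frozenset(itemset)
            s = support(itemset, transactions)
            if s < minsup:          # rule support equals the itemset's support
                continue
            for r in range(1, len(itemset)):
                for X in combinations(itemset, r):
                    X = frozenset(X)
                    Y = itemset - X
                    c = support_count(itemset, transactions) / support_count(X, transactions)
                    if c >= minconf:
                        rules.append((set(X), set(Y), s, c))
    return rules
```

For the example transactions with items = {Bread, Milk, Diaper, Beer, Coke, Eggs}, minsup = 0.4 and minconf = 0.6, this returns rules such as {Milk, Diaper} → {Beer} with s = 0.4 and c ≈ 0.67; the point of the slide is that the number of candidate rules explodes as d grows.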

Mining Association Rules

Example of rules derived from the itemset {Milk, Diaper, Beer}:

{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All of the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements

Mining Association Rules

• Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ minsup
  2. Rule Generation
     – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

• Frequent itemset generation is still computationally expensive

Frequent Itemset Generation

[Itemset lattice over five items A, B, C, D, E: from the empty (null) itemset at the top, through the 1-itemsets A–E, the 2-itemsets (AB, AC, ..., DE), the 3-itemsets and 4-itemsets, down to ABCDE at the bottom.]

Given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation

• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database of N transactions
  – Match each candidate against every transaction

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
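A minimal sketch of this brute-force counting, reusing the transactions list from the first sketch (the function name is illustrative): every candidate itemset is matched against every transaction, roughly M × N comparisons in total.

```python
from itertools import combinations

def brute_force_frequent_itemsets(transactions, minsup):
    # All distinct items in the database.
    items = sorted(set().union(*transactions))
    frequent = {}
    # Generate every non-empty candidate itemset (2^d - 1 of them) ...
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            candidate = frozenset(candidate)
            # ... and count its support with a full scan of the N transactions.
            count = sum(1 for t in transactions if candidate <= t)
            if count / len(transactions) >= minsup:
                frequent[candidate] = count
    return frequent
```

With minsup = 0.6 (a support count of 3 out of 5), this reproduces the frequent itemsets shown in the Apriori illustration further below.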

Computational Complexity

• Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:

    R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

    If d = 6, R = 602 rules
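As a quick sanity check (not in the original slides), the closed form can be verified by direct enumeration using Python's math.comb:

```python
from math import comb

d = 6
# Count rules by choosing k items for the antecedent and j of the
# remaining d - k items for the consequent.
R = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d))
print(R, 3**d - 2**(d + 1) + 1)  # both print 602
```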

Frequent Itemset Generation Strategies

• Reduce the number of candidates (M)
  – The Apriori principle is used to eliminate some candidate itemsets without counting their support.
• Reduce the number of transactions (N)
• Reduce the number of comparisons (NM)
  – Use efficient data structures to store the candidates or transactions
  – No need to match every candidate against every transaction.

Reducing Number of Candidates

• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent.
  – Suppose {c, d, e} is a frequent itemset. Any transaction that contains {c, d, e} must also contain its subsets {c, d}, {c, e}, {d, e}, {c}, {d}, {e}.
  – So, if {c, d, e} is frequent, then all of its subsets must also be frequent.

• Conversely, if an itemset, say {a, b}, is infrequent, then all of its supersets will be infrequent.

• This strategy of trimming the search space based on the support measure is called support-based pruning.

• The Apriori principle holds due to the following property of the support measure:

  ∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  – The support of an itemset never exceeds the support of its subsets.
  – This is known as the anti-monotone property of support.
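A tiny check of this property on the example data, reusing the support helper and transactions list from the first sketch (illustrative only):

```python
X = {"Milk", "Diaper"}
Y = {"Milk", "Diaper", "Beer"}
assert X <= Y
# Adding items to an itemset can only lower its support.
assert support(X, transactions) >= support(Y, transactions)  # 0.6 >= 0.4
```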

Illustrating Apriori Principle (frequent itemset generation, minimum support count = 3)

Items (candidate 1-itemsets):

Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Pairs (candidate 2-itemsets; no need to generate candidates involving Coke or Eggs):

Itemset            Count
{Bread, Milk}      3
{Bread, Beer}      2
{Bread, Diaper}    3
{Milk, Beer}       2
{Milk, Diaper}     3
{Beer, Diaper}     3

Triplets (candidate 3-itemsets):

Itemset                  Count
{Bread, Milk, Diaper}    3

If every subset were considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.

Apriori Algorithm

• Method:
  – Let k = 1
  – Generate frequent itemsets of length 1
  – Repeat until no new frequent itemsets are identified:
    • Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
    • Prune candidate itemsets that contain infrequent subsets of length k
    • Count the support of each candidate by scanning the DB
    • Eliminate candidates that are infrequent, leaving only those that are frequent
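Below is a minimal, self-contained Python sketch of this method, written for the small example dataset (the function apriori and all variable names are illustrative, not from the slides):

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {itemset: support count} for all frequent itemsets (minimal sketch)."""
    items = sorted(set().union(*transactions))

    def count(candidates):
        # Count support of each candidate with one scan of the database.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # k = 1: frequent 1-itemsets.
    counts = count([frozenset([i]) for i in items])
    frequent_k = {c: n for c, n in counts.items() if n >= minsup_count}
    all_frequent = dict(frequent_k)
    k = 1

    while frequent_k:
        # Generate length-(k+1) candidates by joining frequent k-itemsets.
        prev = list(frequent_k)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1:
                    candidates.add(union)
        # Prune candidates that contain an infrequent subset of length k.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent_k for s in combinations(c, k))}
        # Count support and eliminate infrequent candidates.
        counts = count(candidates)
        frequent_k = {c: n for c, n in counts.items() if n >= minsup_count}
        all_frequent.update(frequent_k)
        k += 1

    return all_frequent

# Example: the five market-basket transactions with minimum support count 3.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
for itemset, n in sorted(apriori(transactions, 3).items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Run on the example transactions with a minimum support count of 3, this prints the same nine frequent itemsets as in the Apriori illustration above (four 1-itemsets, four 2-itemsets, and {Bread, Milk, Diaper}).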
