Data Mining Introduction

Why Mine Data? Commercial Viewpoint

 Lots of data is being collected and warehoused:
◦ Web data, e-commerce
◦ Purchases at department/grocery stores
◦ Bank/credit card transactions

 Computers have become cheaper and more powerful

 Competitive pressure is strong:
◦ Provide better, customized services for an edge (e.g., in customer relationship management)

Mining Large Data Sets - Motivation

 There is often information "hidden" in the data that is not readily evident
 Human analysts may take weeks to discover useful information
 Much of the data is never analyzed at all

[Figure: "The Data Gap" - total new disk capacity (TB) since 1995 grows toward 3.5 million TB between 1995 and 1999, while the number of analysts stays essentially flat]

What is Data Mining?

 Many definitions:
◦ Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
◦ Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Data Mining is Not ...

 What data mining can't do: automatically find relationships without user intervention when no relationships exist
 Data mining is not:
◦ Data warehousing
◦ SQL queries / reporting
◦ Software agents
◦ Online Analytical Processing (OLAP)
◦ Data visualization

Origins of Data Mining

 Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
 Traditional techniques may be unsuitable due to:
◦ Enormity of data
◦ High dimensionality of data
◦ Heterogeneous, distributed nature of data

[Diagram: data mining at the intersection of statistics, machine learning/AI, pattern recognition, and database systems]

A Problem...

 You are a marketing manager for a brokerage company
◦ Problem: churn is too high; turnover after the six-month introductory period ends is 40%
◦ Customers receive incentives (average cost: $160) when an account is opened
◦ Giving new incentives to everyone who might leave is very expensive (as well as wasteful)
◦ Bringing back a customer after they leave is both difficult and costly

… A Solution

 One month before the introductory period ends, predict which customers will leave
◦ If you want to keep a customer who is predicted to churn, offer them something based on their predicted value; customers who are not predicted to churn need no attention
◦ If you don't want to keep the customer, do nothing

 How can you predict future behavior?
◦ Solution: data mining
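As a rough illustration of what that prediction step might look like, here is a minimal sketch using scikit-learn's logistic regression; the file name and the feature columns (trades_per_month, account_balance, num_complaints, churned) are hypothetical, and any classifier could stand in for the one used here.

# Minimal churn-prediction sketch (illustrative; column names are made up).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

customers = pd.read_csv("customers.csv")          # hypothetical historical data
features = customers[["trades_per_month", "account_balance", "num_complaints"]]
labels = customers["churned"]                     # 1 = left after intro period

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score customers one month before their introductory period ends;
# offer incentives only to those with a high predicted churn probability.
churn_probability = model.predict_proba(X_test)[:, 1]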


Data Mining: Why Now?

 Changes in the business environment:
◦ Customers are becoming more demanding
◦ Markets are saturated

 Replacing the statistician (better models, less grunge work):
◦ Many different data mining algorithms/tools are available
◦ Statistical expertise is required to compare different techniques
◦ Build intelligence into the software

 Drivers:
◦ Focus on the customer, competition, and data assets

 Enablers:
◦ Increased data understanding
◦ Cheaper and faster hardware

Growing Base of Data

• Data in the world is doubling every 20 months
• Businesses feel there is value in historical data
• Automated knowledge discovery is the only way to explore this data

[Timeline: data volume growth, 1970-2000]

Convergence of Three Key Technologies

Data mining sits at the convergence of:
 Increased computing power
 Statistical and learning algorithms
 Improved data collection and management

Applications

 Fraud detection
 Loan and credit approval
 Market basket analysis
 Customer segmentation
 Financial applications
 E-commerce and decision support
 Web and text mining

Market Analysis and Management (1)

 Where are the data sources for analysis?
◦ Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

 Target marketing
◦ Find clusters of "model" customers who share the same characteristics: interests, income level, spending habits, etc.

 Determine customer purchasing patterns over time
◦ Conversion of a single to a joint bank account: marriage, etc.

 Cross-market analysis

Market Analysis and Management (2)

 Customer profiling
◦ Data mining can tell you what types of customers buy what products (using clustering or classification)

 Identifying customer requirements
◦ Identify the best products for different customers
◦ Use prediction to find what factors will attract new customers

 Providing summary information
◦ Various multidimensional summary reports
◦ Statistical summary information

Fraud Detection and Management (1)

 Applications
◦ Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.

 Approach
◦ Use historical data to build models of fraudulent behavior, then use data mining to help identify similar instances

 Examples
◦ Auto insurance: detect groups of people who stage accidents to collect on insurance
◦ Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
◦ Medical insurance: detect "professional patients" and rings of doctors and rings of referrals

Fraud Detection and Management (2)

 Detecting telephone fraud
◦ Telephone call model: destination of the call, duration, time of day or week; analyze patterns that deviate from an expected norm
◦ British Telecom identified discrete groups of callers with frequent intra-group calls, especially on mobile phones, and broke a multimillion-dollar fraud (source: Gartner, 2006)

 Retail
◦ Analysts estimate that 38% of retail shrink is due to dishonest employees (Business Today, 2006)

Other Applications

 Sports
◦ IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat

 Astronomy
◦ JPL and the Palomar Observatory discovered 22 quasars with the help of data mining

 Internet Web Surf-Aid
◦ IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Data Mining: On What Kind of Data?

 Relational databases
 Data warehouses
 Transactional databases
 Object-oriented databases
 Time-series data
 Text databases and multimedia databases
 Heterogeneous databases

Data Mining Techniques

 Association rules and sequential patterns
 Classification
 Clustering
 Similar images
 Text/Web mining
 Outlier analysis (using statistics)

Data mining - Analysis


Examples of Data Mining

 Conditional logic
◦ If profession = athlete, then age < 30 in 90% of cases

 Associations
◦ When paint is sold, paint brushes are also sold 85% of the time

 Trends and variations
◦ Golf ball sales are seasonal, with a summer peak and a winter low

Examples of Data Mining (contd.)

 Outcome prediction: How many people can be expected to respond to a mailer campaign?
 Forecasting: What will be the total sales of this product range in the next quarter, taking into account seasonal and long-term trends?
 Deviation detection: Is this insurance claim likely to be a fraud?
 Link analysis: When a person is fired, he is likely to default on credit card payments.

Query Example

 Suppose you want to see all the data in a table called SALES. There are three columns (fields) in the table: NAME, DEPT, and SALES.

 To see all the data, enter (the * indicates all columns):

SELECT * FROM SALES

 Now suppose you want the total sales per department, sorted in descending order. The statement becomes a bit more complex:

SELECT DEPT, SUM(SALES)
FROM SALES
GROUP BY DEPT      -- aggregating with SUM requires a GROUP BY
ORDER BY 2 DESC    -- sort by the second column, in descending order

Consider This Complex, Tricky Query

 A sales executive wishes to see all the sales for the past three years where profitability has been greater than xx percent, broken down by month. Where the percentages have been greater than yy percent, he wants to see whether the sales team has been in place during this period or whether there has been personnel turnover; he is looking for territorial versus personnel factors in sales success. He also wishes to see trends in profitability, so where all sales by year have steadily increased by at least zz percent for at least two years in a row, he wishes to see the top five products ranked by profitability.

This query requires:

 Sums
 Percentages
 Grouping
 Trends
 Time-based analysis
 Comparisons

Data Mining - Users

 Executives need top-level insights and spend far less time with computers than the other groups.
 Analysts may be financial analysts, statisticians, consultants, or database designers.
 End users are salespeople, scientists, market researchers, engineers, physicians, etc.

The Mining Market

 Around 20 to 30 mining tool vendors
 Major tool players:
◦ Clementine
◦ IBM's Intelligent Miner
◦ SGI's MineSet
◦ SAS's Enterprise Miner
 All offer pretty much the same set of tools
 Many embedded products:
◦ Fraud detection
◦ Electronic commerce applications
◦ Health care
◦ Customer relationship management: Epiphany

Vertical Integration: Mining on the Web

 Web log analysis for site design:
◦ What are the popular pages?
◦ Which links are hard to find?

 Electronic store sales enhancements:
◦ Recommendations and advertisement
◦ Collaborative filtering: Net Perception, Wisewire
◦ Inventory control: what was a shopper looking for but could not find?

Data Mining

 Through a variety of techniques, data mining identifies nuggets of information in bodies of data.

 Data mining extracts information in such a way that it can be used in areas such as decision support, prediction, forecasting, and estimation.

 Data is often voluminous but of low value, with little direct usefulness in its raw form. It is the hidden information in the data that has value.

 In data mining, success comes from combining your (or your expert's) knowledge of the data with advanced, active analysis techniques in which the computer identifies the underlying relationships and features in the data.

 The process of data mining generates models from historical data that are later used for predictions, pattern detection, and more. The technique for building these models is called machine learning, or modeling.

Data Mining Tasks

 Prediction methods
◦ Use some variables to predict unknown or future values of other variables.

 Description methods
◦ Find human-interpretable patterns that describe the data.

From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks...

 Classification [Predictive]
 Clustering [Descriptive]
 Association rule discovery [Descriptive]
 Sequential pattern discovery [Descriptive]
 Regression [Predictive]
 Deviation detection [Predictive]

Modeling Techniques

 Predictive modeling methods include decision trees, neural networks, and statistical models.

 Clustering models focus on identifying groups of similar records and labeling the records according to the group to which they belong. Clustering methods include Kohonen, k-means, and TwoStep (see the sketch after this list).

 Association rules associate a particular conclusion (such as the purchase of a particular product) with a set of conditions (the purchase of several other products).

 Screening models can be used to screen data to locate the fields and records that are most likely to be of interest in modeling and to identify outliers that may not fit known patterns. Available methods include feature selection and anomaly detection.
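As a minimal, product-neutral sketch of the clustering idea, here is k-means with scikit-learn; the income/spend features and the choice of three clusters are made up for illustration.

# Group similar customer records with k-means (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans

# Each row: [annual_income, monthly_spend] -- hypothetical features.
records = np.array([[48000, 310], [52000, 290], [91000, 880],
                    [87000, 940], [23000, 120], [25000, 150]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(records)
print(kmeans.labels_)  # cluster label assigned to each record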

Typical Applications

Typical applications of data mining techniques include the following:

 Direct mail. Determine which demographic groups have the highest response rate. Use this information to maximize the response to future mailings.

 Credit scoring. Use an individual's credit history to make credit decisions.

 Human resources. Understand past hiring practices and create decision rules to streamline the hiring process.

 Medical research. Create decision rules that suggest appropriate procedures based on medical evidence.

Typical Applications

 Market analysis. Determine which variables, such as geography, price, and customer characteristics, are associated with sales.

 Quality control. Analyze data from product manufacturing and identify the variables that determine product defects.

 Policy studies. Use survey data to formulate policy by applying decision rules to select the most important variables.

 Health care. Combine user surveys and clinical data to discover variables that contribute to health.

A Strategy for Data Mining

 As with most business endeavors, data mining is much more effective if done in a planned, systematic way.
 Even with cutting-edge data mining tools such as Clementine or SAS, the majority of the work in data mining requires a knowledgeable business analyst to keep the process on track.

A Strategy for Data Mining

To guide your planning, answer the following questions:

 What substantive problem do you want to solve?
 What data sources are available, and what parts of the data are relevant to the current problem?
 What kind of preprocessing and data cleaning do you need to do before you start mining the data?
 What data mining technique(s) will you use?
 How will you evaluate the results of the data mining analysis?
 How will you get the most out of the information you obtained from data mining?

The CRISP-DM Process Model

 CRISP-DM: the Cross Industry Standard Process Model for Data Mining
 The general CRISP-DM process model includes six phases that address the main issues in data mining.
 The six phases fit together in a cyclical process designed to incorporate data mining into your larger business practices.

The six phases are:

 Business understanding. This is perhaps the most important phase of data mining. Business understanding includes determining business objectives, assessing the situation, determining data mining goals, and producing a project plan.

 Data understanding. Data provides the "raw materials" of data mining. This phase addresses the need to understand what your data resources are and the characteristics of those resources. It includes collecting initial data, describing data, exploring data, and verifying data quality.

 Data preparation. After cataloging your data resources, you will need to prepare your data for mining. Preparations include selecting, cleaning, constructing, integrating, and formatting data.

 Modeling. Sophisticated analysis methods are used to extract information from the data. This phase involves selecting modeling techniques, generating test designs, and building and assessing models.

 Evaluation. Once you have chosen your models, you are ready to evaluate how the data mining results can help you achieve your business objectives. Elements of this phase include evaluating results, reviewing the data mining process, and determining the next steps.

 Deployment. This phase focuses on integrating your new knowledge into your everyday business processes to solve your original business problem. It includes planning deployment, monitoring and maintenance, producing a final report, and reviewing the project.

Data Mining Association Analysis: Basic Concepts and Algorithms

Association Rule Mining

 Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of association rules:

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Definition: Frequent Itemset

 Itemset
◦ A collection of one or more items
 Example: {Milk, Bread, Diaper}
◦ k-itemset: an itemset that contains k items

 Support count (σ)
◦ Frequency of occurrence of an itemset
◦ E.g., σ({Milk, Bread, Diaper}) = 2

 Support (s)
◦ Fraction of transactions that contain an itemset
◦ E.g., s({Milk, Bread, Diaper}) = 2/5

 Frequent itemset
◦ An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule

 Association rule
◦ An implication expression of the form X → Y, where X and Y are itemsets
◦ Example: {Milk, Diaper} → {Beer}

 Rule evaluation metrics
◦ Support (s): the fraction of transactions that contain both X and Y
◦ Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4

c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
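These two metrics are straightforward to compute directly; here is a minimal Python sketch over the market-basket transactions above.

# Compute support and confidence for {Milk, Diaper} -> {Beer} (sketch).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

X, Y = {"Milk", "Diaper"}, {"Beer"}
count_X  = sum(X <= t for t in transactions)      # transactions containing X
count_XY = sum(X | Y <= t for t in transactions)  # transactions containing X and Y

support    = count_XY / len(transactions)         # 2/5 = 0.4
confidence = count_XY / count_X                   # 2/3 = 0.67
print(support, confidence)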

Association Rule Mining Task

 Given a set of transactions T, the goal of association rule mining is to find all rules having:
◦ support ≥ minsup threshold
◦ confidence ≥ minconf threshold

 Brute-force approach:
◦ List all possible association rules
◦ Compute the support and confidence for each rule
◦ Prune rules that fail the minsup and minconf thresholds
 Computationally prohibitive!

Mining Association Rules

Example rules from the itemset {Milk, Diaper, Beer} (using the market-basket transactions above):

{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements

Mining Association Rules

 Two-step approach:
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

 Frequent itemset generation is still computationally expensive

Frequent Itemset Generation Brute-force approach:

◦ Each itemset in the lattice is a candidate frequent itemset ◦ Count the support of each candidate by scanning the database List of Transactions ◦ Candidate ◦ TID Items ◦ 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs ◦ Milk, Diaper, Beer, Coke ◦ N 3 4 Bread, Milk, Diaper, Beer ◦ 5 Bread, Milk, Diaper, Coke ◦ Match each transaction against every candidate w => Expensive since M = 2d !!! ◦ Complexity ~ O(NMw)

A Naïve Algorithm

Transactions:

TID  Items
100  Bread, Cheese
200  Bread, Cheese, Juice
300  Bread, Milk
400  Cheese, Juice, Milk

Find all possible combinations:

Itemset                        Frequency
Bread                          3
Cheese                         3
Juice                          2
Milk                           2
(Bread, Cheese)                2
(Bread, Juice)                 1
(Bread, Milk)                  1
(Cheese, Juice)                2
(Cheese, Milk)                 1
(Juice, Milk)                  1
(Bread, Cheese, Juice)         1
(Bread, Cheese, Milk)          0
(Bread, Juice, Milk)           0
(Cheese, Juice, Milk)          1
(Bread, Cheese, Juice, Milk)   0

With minimum support = 50% and minimum confidence = 75%, the frequent itemsets are:

Itemset          Frequency
Bread            3
Cheese           3
Juice            2
Milk             2
(Bread, Cheese)  2
(Cheese, Juice)  2

Candidate rules:
 Bread → Cheese with confidence 2/3 = 67%
 Cheese → Bread with confidence 2/3 = 67%
 Cheese → Juice with confidence 2/3 = 67%
 Juice → Cheese with confidence 2/2 = 100%
 Rules that meet the user-specified minimum confidence are called confident

Improved Naïve Algorithm

Find only the combinations that actually occur in each transaction:

TID  Items                 Combinations
100  Bread, Cheese         {Bread, Cheese}
200  Bread, Cheese, Juice  {Bread, Cheese}, {Bread, Juice}, {Cheese, Juice}, {Bread, Cheese, Juice}
300  Bread, Milk           {Bread, Milk}
400  Cheese, Juice, Milk   {Cheese, Juice}, {Cheese, Milk}, {Juice, Milk}, {Cheese, Juice, Milk}

Counting only the combinations with non-zero frequency:

Itemset                 Frequency
Bread                   3
Cheese                  3
Juice                   2
Milk                    2
(Bread, Cheese)         2
(Bread, Juice)          1
(Bread, Milk)           1
(Cheese, Juice)         2
(Cheese, Milk)          1
(Juice, Milk)           1
(Bread, Cheese, Juice)  1
(Cheese, Juice, Milk)   1
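This improved enumeration maps directly onto a few lines of Python; the sketch below counts only the itemsets that actually occur in some transaction.

# Count every itemset that occurs in at least one transaction (sketch).
from itertools import combinations
from collections import Counter

transactions = [
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk"},
]

counts = Counter()
for t in transactions:
    for k in range(1, len(t) + 1):
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1

print(counts[("Bread", "Cheese")])  # 2
print(counts[("Cheese", "Juice")])  # 2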

Apriori Algorithm

 Method:
◦ Let k = 1
◦ Generate frequent itemsets of length 1
◦ Repeat until no new frequent itemsets are identified:
 Generate length-(k+1) candidate itemsets from length-k frequent itemsets
 Prune candidate itemsets containing subsets of length k that are infrequent
 Count the support of each candidate by scanning the DB
 Eliminate candidates that are infrequent, leaving only those that are frequent

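A compact, illustrative sketch of this level-wise loop (not an optimized implementation; it rescans the transaction list to count each candidate):

# Apriori level-wise search (illustrative sketch).
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset: support_count} for all frequent itemsets."""
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count support of each candidate by scanning the DB.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(level)
        # Generate (k+1)-candidates from frequent k-itemsets, pruning
        # any candidate that has an infrequent k-subset.
        keys = list(level)
        candidates = set()
        for a in keys:
            for b in keys:
                union = a | b
                if len(union) == k + 1 and all(
                        frozenset(s) in level for s in combinations(union, k)):
                    candidates.add(union)
        current = candidates
        k += 1
    return frequent

transactions = [frozenset(t) for t in (
    {"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"})]
print(apriori(transactions, minsup=2))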
Reducing the Number of Comparisons

 Candidate counting:
◦ Scan the database of transactions to determine the support of each candidate itemset
◦ To reduce the number of comparisons, store the candidates in a hash structure
◦ Instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets

[Diagram: N transactions and a hash structure of k buckets; each transaction is matched only against the buckets its k-item subsets hash to]
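A much-simplified stand-in for the hash tree used in practice: the sketch below hashes each candidate itemset into one of a few buckets, so each k-subset of a transaction is probed against a single bucket rather than the full candidate list.

# Simplified bucket-based candidate counting (a stand-in for a real hash tree).
from itertools import combinations
from collections import defaultdict

NUM_BUCKETS = 8
candidates = [frozenset(c) for c in (
    {"Bread", "Milk"}, {"Bread", "Diaper"}, {"Milk", "Diaper"},
    {"Diaper", "Beer"}, {"Milk", "Beer"})]

buckets = defaultdict(list)
for c in candidates:
    buckets[hash(c) % NUM_BUCKETS].append(c)

counts = defaultdict(int)
transactions = [{"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
                {"Milk", "Diaper", "Beer", "Coke"},
                {"Bread", "Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper", "Coke"}]
k = 2
for t in transactions:
    # Hash each k-subset of the transaction and probe only that bucket.
    for subset in combinations(sorted(t), k):
        s = frozenset(subset)
        if s in buckets[hash(s) % NUM_BUCKETS]:
            counts[s] += 1

print(counts[frozenset({"Milk", "Diaper"})])  # 3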

APRIORI

Transactions (50% minimum support):

TID  Items
100  Bread, Cheese, Eggs, Juice
200  Bread, Cheese, Juice
300  Bread, Milk, Yogurt
400  Cheese, Juice, Milk
500  Cheese, Juice, Milk

Frequent items (L1):

Item    Frequency
Bread   4
Cheese  3
Juice   4
Milk    3

Candidate item pairs (C2):

Itemset          Frequency
(Bread, Cheese)  2
(Bread, Juice)   3
(Bread, Milk)    2
(Cheese, Juice)  3
(Cheese, Milk)   1
(Juice, Milk)    2

Frequent item pairs (L2):

Itemset          Frequency
(Bread, Juice)   3
(Cheese, Juice)  3

Confident rules:
 Bread → Juice with confidence 3/4 = 75%
 Juice → Bread with confidence 3/4 = 75%
 Cheese → Juice with confidence 3/3 = 100%
 Juice → Cheese with confidence 3/4 = 75%

A Larger APRIORI Example

Item catalog:

Item Number  Item Name
1            Biscuits
2            Bread
3            Cereal
4            Cheese
5            Chocolate
6            Coffee
7            Donuts
8            Eggs
9            Juice
10           Milk
11           Newspaper
12           Pastry
13           Rolls
14           Sugar
15           Tea
16           Yogurt

Transactions (25% minimum support, i.e., a support count of at least 7 out of 25):

TID  Items
1    Biscuits, Bread, Cheese, Coffee, Yogurt
2    Bread, Cereal, Cheese, Coffee
3    Cheese, Chocolate, Donuts, Juice, Milk
4    Bread, Cheese, Coffee, Cereal, Juice
5    Bread, Cereal, Chocolate, Donuts, Juice
6    Milk, Tea
7    Biscuits, Bread, Cheese, Coffee, Milk
8    Eggs, Milk, Tea
9    Bread, Cereal, Cheese, Chocolate, Coffee
10   Bread, Cereal, Chocolate, Donuts, Juice
11   Bread, Cheese, Juice
12   Bread, Cheese, Coffee, Donuts, Juice
13   Biscuits, Bread, Cereal
14   Cereal, Cheese, Chocolate, Donuts, Juice
15   Chocolate, Coffee
16   Donuts
17   Donuts, Eggs, Juice
18   Biscuits, Bread, Cheese, Coffee
19   Bread, Cereal, Chocolate, Donuts, Juice
20   Cheese, Chocolate, Donuts, Juice
21   Milk, Tea, Yogurt
22   Bread, Cereal, Cheese, Coffee
23   Chocolate, Donuts, Juice, Milk, Newspaper
24   Newspaper, Pastry, Rolls
25   Rolls, Sugar, Tea

Frequency count for all items:

Item       Frequency
Biscuits   4
Bread      13
Cereal     10
Cheese     11
Chocolate  9
Coffee     9
Donuts     10
Eggs       2
Juice      11
Milk       6
Newspaper  2
Pastry     1
Rolls      2
Sugar      1
Tea        4
Yogurt     2

The frequent 1-itemsets (L1):

Item       Frequency
Bread      13
Cereal     10
Cheese     11
Chocolate  9
Coffee     9
Donuts     10
Juice      11

The 21 candidate 2-itemsets (C2):

{Bread, Cereal}, {Bread, Cheese}, {Bread, Chocolate}, {Bread, Coffee}, {Bread, Donuts}, {Bread, Juice}, {Cereal, Cheese}, {Cereal, Chocolate}, {Cereal, Coffee}, {Cereal, Donuts}, {Cereal, Juice}, {Cheese, Chocolate}, {Cheese, Coffee}, {Cheese, Donuts}, {Cheese, Juice}, {Chocolate, Coffee}, {Chocolate, Donuts}, {Chocolate, Juice}, {Coffee, Donuts}, {Coffee, Juice}, {Donuts, Juice}

Frequency count of the candidate 2-itemsets (C2):

Itemset              Frequency
{Bread, Cereal}      9
{Bread, Cheese}      8
{Bread, Chocolate}   4
{Bread, Coffee}      8
{Bread, Donuts}      4
{Bread, Juice}       6
{Cereal, Cheese}     5
{Cereal, Chocolate}  4
{Cereal, Coffee}     5
{Cereal, Donuts}     4
{Cereal, Juice}      6
{Cheese, Chocolate}  4
{Cheese, Coffee}     9
{Cheese, Donuts}     3
{Cheese, Juice}      4
{Chocolate, Coffee}  1
{Chocolate, Donuts}  7
{Chocolate, Juice}   7
{Coffee, Donuts}     1
{Coffee, Juice}      2
{Donuts, Juice}      9

The frequent 2-itemsets (L2):

Itemset              Frequency
{Bread, Cereal}      9
{Bread, Cheese}      8
{Bread, Coffee}      8
{Cheese, Coffee}     9
{Chocolate, Donuts}  7
{Chocolate, Juice}   7
{Donuts, Juice}      9

Candidate 3-itemsets (C3):

Itemset                     Frequency
{Bread, Cereal, Cheese}     4
{Bread, Cereal, Coffee}     4
{Bread, Cheese, Coffee}     8
{Chocolate, Donuts, Juice}  7

Frequent 3-itemsets (L3):

Itemset                     Frequency
{Bread, Cheese, Coffee}     8
{Chocolate, Donuts, Juice}  7

Confidence of association rules from {Chocolate, Donuts, Juice}, writing N = Chocolate, M = Donuts, P = Juice:

Rule    Support of NMP  Frequency of LHS  Confidence
N → MP  7               9                 0.78
M → NP  7               10                0.70
P → NM  7               11                0.64
MP → N  7               9                 0.78
NP → M  7               7                 1.0
NM → P  7               7                 1.0

Confidence of association rules from {Bread, Cheese, Coffee}, writing B = Bread, C = Cheese, D = Coffee:

Rule    Support of BCD  Frequency of LHS  Confidence
B → CD  8               13                0.61
C → BD  8               11                0.72
D → BC  8               9                 0.89
CD → B  8               9                 0.89
BD → C  8               8                 1.0
BC → D  8               8                 1.0

All association rules:

 Cheese → Bread
 Cheese → Coffee
 Coffee → Bread
 Coffee → Cheese
 Cheese, Coffee → Bread
 Bread, Coffee → Cheese
 Bread, Cheese → Coffee
 Chocolate → Donuts
 Chocolate → Juice
 Donuts → Chocolate
 Donuts → Juice
 Donuts, Juice → Chocolate
 Chocolate, Juice → Donuts
 Chocolate, Donuts → Juice
 Bread → Cereal
 Cereal → Bread

Closed and Maximal Itemsets

 Closed itemset: a frequent itemset X such that there exists no superset of X with the same support count as X.
 Maximal itemset: a frequent itemset Y is maximal if it is not a proper subset of any other frequent itemset.
 A maximal itemset is a closed itemset, but a closed itemset is not necessarily a maximal itemset.

[Diagram: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets]

A transaction database to illustrate closed and maximal itemsets:

TID  Items
100  Bread, Cheese, Juice
200  Bread, Cheese, Juice, Milk
300  Cheese, Juice, Egg
400  Bread, Juice, Milk, Egg
500  Milk, Egg

Frequent itemsets for the database:

Itemset                 Support  Closed?  Maximal?
{Bread}                 3        No       No
{Cheese}                3        No       No
{Juice}                 4        Yes      No
{Milk}                  3        Yes      No
{Egg}                   3        Yes      No
{Bread, Cheese}         2        No       No
{Bread, Juice}          3        Yes      No
{Bread, Milk}           2        No       No
{Cheese, Juice}         3        Yes      No
{Juice, Milk}           2        No       No
{Juice, Egg}            2        Yes      Yes
{Milk, Egg}             2        Yes      Yes
{Bread, Cheese, Juice}  2        Yes      Yes
{Bread, Juice, Milk}    2        Yes      Yes

The last four itemsets are both closed and maximal.
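Given a table of frequent itemsets and their supports like the one above, the closed and maximal flags can be checked mechanically; a short sketch:

# Flag closed and maximal itemsets from an {itemset: support} table (sketch).
frequent = {
    frozenset({"Bread"}): 3, frozenset({"Cheese"}): 3,
    frozenset({"Juice"}): 4, frozenset({"Milk"}): 3,
    frozenset({"Egg"}): 3,
    frozenset({"Bread", "Cheese"}): 2, frozenset({"Bread", "Juice"}): 3,
    frozenset({"Bread", "Milk"}): 2, frozenset({"Cheese", "Juice"}): 3,
    frozenset({"Juice", "Milk"}): 2, frozenset({"Juice", "Egg"}): 2,
    frozenset({"Milk", "Egg"}): 2,
    frozenset({"Bread", "Cheese", "Juice"}): 2,
    frozenset({"Bread", "Juice", "Milk"}): 2,
}

for itemset, support in frequent.items():
    supersets = [s for s in frequent if itemset < s]      # frequent proper supersets
    closed = all(frequent[s] != support for s in supersets)
    maximal = not supersets
    print(sorted(itemset), support,
          "closed" if closed else "", "maximal" if maximal else "")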

Association Models

 Association models associate a particular conclusion (such as a decision to buy something) with a set of conditions.
 The Generalized Rule Induction (GRI) node discovers association rules in the data. For example, customers who purchase razors and aftershave lotion are also likely to purchase shaving cream.
 GRI extracts rules with the highest information content based on an index that takes both the generality (support) and accuracy (confidence) of rules into account.
 GRI can handle numeric and categorical inputs, but the target must be categorical.

Association Models

 The Apriori node extracts a set of rules from the data, pulling out the rules with the highest information content.
 Apriori offers five different methods of selecting rules and uses a sophisticated indexing scheme to process large datasets efficiently.
 For large problems, Apriori is generally faster to train than GRI; it has no arbitrary limit on the number of rules that can be retained, and it can handle rules with up to 32 preconditions.
 Apriori requires that input and output fields all be categorical, but it delivers better performance because it is optimized for this type of data.

 At the end of processing, a table of the best rules is presented.
 Unlike a decision tree, this set of association rules cannot be used directly to make predictions the way a standard model (such as a decision tree or a neural network) can, because of the many different possible conclusions for the rules.
 Another level of transformation is required to turn the association rules into a classification rule set; hence, the association rules produced by association algorithms are known as unrefined models.
 Although the user can browse these unrefined models, they cannot be used explicitly as classification models unless the user tells the system to generate a classification model from the unrefined model. This is done from the browser through a Generate menu option.

Association Rule Mining

 Suppose X and Y appear together in only 10% of the transactions, but whenever X appears there is an 80% chance that Y also appears.
 The 10% presence is called the support (or prevalence) of the rule.
 The 80% chance is called the confidence (or predictability) of the rule.
 A high level of support means the rule is frequent enough for the business to be interested in it.
 A high level of confidence means the rule is true often enough to justify a decision based on it.

Association Rule Mining

 Let N be the total number of transactions. Then:

Support(X) = (number of transactions in which X appears) / N = P(X)

Support(XY) = (number of transactions in which X and Y appear together) / N = P(X ∩ Y)

Confidence(X → Y) = Support(XY) / Support(X) = P(X ∩ Y) / P(X) = P(Y∣X)

 P(Y∣X), the probability of Y once X has taken place, is also called the conditional probability of Y given X.

Association Rule Mining

 Lift is used to measure the strength of the association between items that are purchased together.
 Lift measures how much more likely an item Y is to be purchased when the customer has bought an item X that has been identified as having an association with Y, compared to the likelihood of Y being purchased without X:

Lift(X → Y) = P(Y∣X) / P(Y)
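A small worked example using the numbers from the earlier slide (10% support for XY and 80% confidence for X → Y), plus an assumed baseline of 40% for Y alone, a made-up figure for illustration:

# Lift from support and confidence (illustrative numbers).
support_xy = 0.10   # X and Y together in 10% of transactions
confidence = 0.80   # P(Y|X): 80% of X-transactions also contain Y
support_y  = 0.40   # assumed: Y alone appears in 40% of transactions

lift = confidence / support_y   # P(Y|X) / P(Y) = 2.0
print(lift)  # > 1 means X's presence makes Y more likely than average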
