Overview: Data Mining
Chapter: Introduction to Data Warehousing and Data Mining

Introduction to Data Mining (many slides in this section are used courtesy of Carrig Emerging Technologies, Ph: 410-553-6760, www.carriget.com)

1

Introduction to Data Mining

• What is data mining
• Need for a data warehouse
• Need to clean up data for mining
• Lots of algorithms

2

© 1999 Carrig Emerging Technology, Inc. All rights reserved. v1.0


Data Mining Introduction

• Data mining examines a database and looks for patterns in the data
• A data warehouse by itself will respond to queries from users
  – It will not tell users about patterns in the data that users may not have thought about
  – To find patterns in the data, data mining is used to try to mine key information from a data warehouse

3

Typical Data Architecture

[Diagram: Finance, Inventory, and Sales OLTP applications feed their OLTP data stores through an Enterprise Information System into the Data Warehouse, which is organized into Finance, Inventory, and Sales subject areas and feeds regional data marts (e.g. New York, California).]

4


Advantages of Data Mining

• Data mining allows companies to exploit information and use it to obtain competitive advantage.
• Data mining helps identify trends such as:
  – why customers buy certain products
  – ideas for very direct marketing
  – ideas for shelf placement
  – training of employees vs. employee retention
  – employee benefits vs. employee retention

5

Implementing Data Mining

• Apply data mining tools to run data mining algorithms against data
• There are two approaches:
  – Copy data from the Data Warehouse and mine it
  – Mine the data within the Data Warehouse
• Popular tools use a variety of different data mining algorithms:
  – regression
  – association rules
  – genetic algorithms
  – decision trees
  – neural networks
  – clustering
  – classification

6


Data Mining Using Separate Data

• You can move data from the data warehouse to data mining tools
  – Advantage: data mining tools may organize the data so they can run faster
  – Disadvantage: could be very expensive to move large amounts of data

[Diagram: the Data Mining Tool keeps its own copy of data taken from the Data Warehouse.]

7

Data Mining Against the Data Warehouse

• Data mining tools can access data directly in the Data Warehouse
  – Advantage: no copy of data is needed for data mining
  – Disadvantage: data may not be organized in a way that is efficient for the tool

[Diagram: the Data Mining Tool reads directly from the Data Warehouse.]

8


Application Areas

• Database Marketing
• Time-series prediction
• Detection
• Probability Estimation
• Information compression
• Sensitivity Analysis

9

Database Marketing

• Response Modeling
  – Improve customer response by targeting prospects that are predicted as most likely to respond to a particular advertisement or promotion (customer relationship management)
• Benefits
  – Improved Return on Investment (ROI)
  – Improved inventory control and supply-side management
  – Improved customer relationships and retention
• Lift Chart
  – A lift chart is often used to visualize and measure response modeling performance.
  – Shows how careful selection of customers ensures a good response

10


Lift Chart

[Chart: cumulative response (%) vs. percentage of customers contacted, comparing "random" selection against "careful" (model-based) selection; the careful curve rises much faster.]

11
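To make the lift idea concrete, here is a minimal Python sketch (the scores and response labels are invented illustrative data, not from the slides): customers are ranked by a model score, and the share of all responders captured when contacting the top x% is compared with the random-selection baseline.

  # Minimal lift computation: rank customers by model score, then measure the
  # cumulative share of responders captured at each contact depth.
  def lift_points(scores, responded, steps=10):
      ranked = [r for _, r in sorted(zip(scores, responded), reverse=True)]
      total = sum(ranked)
      points = []
      for i in range(1, steps + 1):
          depth = len(ranked) * i // steps                 # contact the top i/steps of customers
          careful = 100.0 * sum(ranked[:depth]) / total    # % of all responders reached
          random_baseline = 100.0 * i / steps              # expected % under random selection
          points.append((100 * i // steps, careful, random_baseline))
      return points

  scores    = [0.9, 0.8, 0.75, 0.6, 0.5, 0.4, 0.35, 0.2, 0.1, 0.05]
  responded = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
  for pct, careful, rand in lift_points(scores, responded):
      print(f"top {pct:3d}%: careful {careful:5.1f}%  random {rand:5.1f}%")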

Cross Selling

• Leverage the existing customer base by selling them additional products or services
• Similar to response modeling, but only target existing customers
• Combine historical purchase data with demographic data
  – Lifestyle
  – Credit reports
  – Purchases of other products
• Targeted marketing
  – Only contact those customers that look promising
  – Create specialized mailing lists

12


Cross Selling

• Input
  – Purchase history
    – Purchase date
    – Amount of purchase
    – Number of purchases
    – Products purchased
  – Demographics
    – Age
    – Gender
    – Household size
  – Other
    – Did the customer request information
    – Number of solicitations sent
    – How the customer heard of the company
• Output
  – Measure of how “approachable” the customer would be for a given product

13

Time-Series Prediction

• Time series
  – Sequence of numerical vectors collected over time
  – Trajectory of a baseball
  – Stock prices
• Estimate one or more values appearing later in the sequence
• An important aspect is that you have to discard a lot of useless data.

14


Detection

• Identify the existence or the occurrence of a condition
• Fraud detection
  – May have only a few cases of fraud; have to find out what is different about these transactions
  – Need to compute the cost of missing a fraud and the cost of falsely accusing someone of fraud
• Intrusion detection
  – Find patterns that indicate when an attack is made on a network
• Can define classes of a fault. For baking cookies you might have:
  – Good (no defect)
  – Broken
  – Underbaked
  – Overbaked
  – Too small
  – Too few chips

15

Others

• Probability Estimation
  – Approximate the likelihood of an event given an observation.
• Information Compression
  – Can be viewed as a special type of estimation problem. For a given set of data, estimate the key components that can be used to reconstruct the data.
• Sensitivity Analysis
  – Understand how changes in one variable affect others.
  – Identify the sensitivity of one variable to another (find out if dependencies exist).

16


Training Data vs. Test Data

• Training data is known results that are used to make predictions on test data (unknown results).
• Steps:
  – Step 1: Split training data into two halves (A & B)
  – Step 2: Build a model against A
  – Step 3: Run the model on B
  – Step 4: See how accurate the model is
  – Step 5: If accurate, use it; if not, return to Step 1

AGE   SALARY   BUYS MAGAZINE
20    30,000   Y
25    35,000   Y
30    40,000   N
35    45,000   Y
40    50,000   N
45    55,000   N

17
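A minimal Python sketch of this train/test loop (the simple threshold "model" and the helper names are illustrative assumptions, not the method on the slides):

  import random

  # Toy dataset from the slide: (age, salary, buys_magazine)
  data = [(20, 30000, 'Y'), (25, 35000, 'Y'), (30, 40000, 'N'),
          (35, 45000, 'Y'), (40, 50000, 'N'), (45, 55000, 'N')]

  random.seed(1)
  random.shuffle(data)
  half = len(data) // 2
  train_a, test_b = data[:half], data[half:]          # Step 1: split into A and B

  def build_model(rows):
      # Step 2: a deliberately naive "model": predict Y at or below the mean age of buyers
      buyer_ages = [age for age, _, buys in rows if buys == 'Y']
      cutoff = sum(buyer_ages) / len(buyer_ages) if buyer_ages else 0
      return lambda age, salary: 'Y' if age <= cutoff else 'N'

  model = build_model(train_a)                        # Step 3: run the model on B
  correct = sum(model(age, sal) == buys for age, sal, buys in test_b)
  print(f"accuracy on B: {correct}/{len(test_b)}")    # Step 4: see how accurate it is
  # Step 5: if the accuracy is acceptable, keep the model; otherwise re-split and retry.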

Typical Use of Training Data

Step 1: Split the data into halves A and B.

A:
AGE   SALARY   BUYS MAGAZINE
20    30,000   Y
40    50,000   N
35    45,000   Y

B:
AGE   SALARY   BUYS MAGAZINE
45    55,000   N
25    35,000   Y
30    40,000   N

Step 2: Make a model from “A”:
  IF AGE >= 20 AND AGE <= 40 AND SALARY >= 30,000 AND SALARY <= 45,000 THEN BUYS MAGAZINE

Steps 3-4: Apply it to “B”: it works 2 out of 3 times — 33 percent error.

Step 5: Back to Step 1!

18


Decision Trees

• Used to represent a sequence of decisions to arrive at a predicted result
• More general than matrices or flat tables

[Diagram: a decision tree with a ROOT node testing SALARY > 50,000, interior nodes testing AGE conditions (e.g. AGE > 20, AGE < 40, AGE > 30), and LEAVES such as BUYS MAGAZINE.]

19

Prune the Tree

• Pruning the tree means to analyze a portion of the decision tree.

[Diagram: a subtree of the decision tree above, rooted at the AGE > 20 test.]

20


Pros and Cons of Decision Trees

• Pro
  – Easy to understand why they make decisions
• Con
  – Can spend your life tuning the decision trees
  – Can take a long time to build

21

Neural Networks

[Diagram: a feed-forward neural network with an input layer, hidden layers, and an output layer; each circle is a node.]

22


How a Node Works

[Diagram: a node receives incoming activations a0, a1, a2 on connections with weights w0, w1, w2 and produces an outgoing activation.]

• Weighted sum: WT = (a0)(w0) + (a1)(w1) + (a2)(w2)
• Threshold: if WT > 0, output = 1; otherwise, output = 0

23
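A minimal Python sketch of this threshold node (the function name is illustrative); the two calls reproduce the worked example on the next slide:

  def node_output(activations, weights):
      # Weighted sum of the incoming activations: WT = a0*w0 + a1*w1 + a2*w2 + ...
      wt = sum(a * w for a, w in zip(activations, weights))
      # Hard threshold: the node fires (outputs 1) only if WT is positive
      return 1 if wt > 0 else 0

  # Age = 24, Hair = Brown (1), Eyes = Blue (2) with weights -0.6, 0.1, -0.7
  print(node_output([24, 1, 2], [-0.6, 0.1, -0.7]))   # WT = -15.7 -> 0
  # Age = 30, Hair = Brown (1), Eyes = Blue (2) with weights 0.5, 0.2, 0.4
  print(node_output([30, 1, 2], [0.5, 0.2, 0.4]))     # WT = 16.0 -> 1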

How a Node Works (cont.)

Legend: Brown = 1, Blue = 2

Example 1: inputs a0 = Age = 24, a1 = Hair = Brown (1), a2 = Eyes = Blue (2)
           weights w0 = -0.6, w1 = 0.1, w2 = -0.7
           WT = (24)(-0.6) + (1)(0.1) + (2)(-0.7) = -15.7  =>  output 0

Example 2: inputs a0 = Age = 30, a1 = Hair = Brown (1), a2 = Eyes = Blue (2)
           weights w0 = 0.5, w1 = 0.2, w2 = 0.4
           WT = (30)(0.5) + (1)(0.2) + (2)(0.4) = 16.0  =>  output 1

Threshold: if WT > 0, output = 1; otherwise, output = 0

24


Neural Networks

• Back Propagation

[Diagram: the same input/hidden/output layer network, with errors at the output layer propagated backwards through the hidden layers to adjust the weights.]

25

Neural Networks

• Pros
  – There are lots of neural network algorithms and some are very accurate
  – OCR (Optical Character Recognition) is a big success
• Cons
  – Does not explain why it makes a decision
  – Back propagation is a very computationally expensive process

26


K-Nearest Neighbor

• Idea is that related items are very similar in values.

[Scatter plot: points plotted by AGE (x-axis) vs. SALARY (y-axis); nearby points form clusters of similar items.]

27
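A minimal k-nearest-neighbor sketch in Python (the data and function names are illustrative; in practice the inputs would first be normalized onto a common scale, as discussed later in this section):

  import math
  from collections import Counter

  def knn_predict(train, query, k=3):
      """train: list of (feature_tuple, label); query: feature tuple."""
      # Rank the training points by distance to the query point
      by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
      # Majority vote among the k closest neighbours
      votes = Counter(label for _, label in by_distance[:k])
      return votes.most_common(1)[0][0]

  # Illustrative (age, salary in thousands) points labelled buys / does-not-buy
  train = [((20, 15), 'Y'), ((22, 18), 'Y'), ((25, 20), 'Y'),
           ((40, 35), 'N'), ((45, 38), 'N'), ((50, 40), 'N')]
  print(knn_predict(train, (23, 17)))   # -> 'Y'
  print(knn_predict(train, (48, 37)))   # -> 'N'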

Research in Data Mining

• Research is on-going
  – Knowledge and Data Discovery
  – VLDB (conference on very large databases)
  – SIGMOD (special interest group on management of data)
  – CIKM (conference on information and knowledge management)
  – Conference on Data and Knowledge Engineering

28


Products

• Large Scale
  – A few big ones, very expensive
  – Oracle (Darwin)
  – IBM (Intelligent Miner)
  – SAS (Enterprise Miner)
  – SPSS (Clementine)
  – Silicon Graphics (MineSet), more on visualization
• Small Scale
  – Lots of these (desktop data mining)
  – Pattern Recognition Workbench
  – CART, etc.

29

Data Mining: Summary

• Data mining attempts to find patterns in data that we did not know about
• Often data mining is just a new buzzword for statistics
• Data mining differs from statistics in that large volumes of data are used
• Many different data mining algorithms exist and we will discuss them in the course
• Examples
  – Identify users who are most likely to commit credit card fraud
  – Identify what attributes about a person most result in them buying product X

30


Overview: Data Preparation for Data Mining

31

Overview

• Development Process
• Defining the Problem
• Collecting Data
• Preparing Data

32


Overview of the Process

[Flowchart: Define the Problem → Choose Algorithm → Develop Model with Training Data → Evaluate the Model → Repeat the Whole Process.]

33

Define the Problem

• What problems are suitable
• Evaluating the results
• What are the inputs and the outputs

34


Suitability of the Problem

• Don’t use
  – If there is a nice formula to answer the question (how long will it take light to travel from Mars to Pluto?)
• Use
  – No good existing solution (nothing to lose)
  – Lots of historical data is lying around
  – Problem is not easily understood
  – Problem can be characterized as an input-to-output relationship
  – Existing models have lots of problems

35

Modeling the Error

• Sometimes a known technique results in a certain amount of error.
• If data mining can be used to estimate the error, this can be useful.
• You can then use data mining in conjunction with a known technique.

36


Evaluate Results

• Level of accuracy needed
• What level of accuracy defines success
• How will we benchmark performance
• What other techniques will be tested
• What kind of data will be used to evaluate the models

37

Classification or Estimation

• Classification says we need to categorize data (good customer, bad customer, etc.)
• Estimation: we want to produce an estimate of how good a customer will be.
• There is a very fine line between the two.

38


Examples of Classification

• Classification
  – Predicting the likelihood of a prospect responding to a marketing offer
  – Automatically classifying bitmaps of handwritten characters (A-Z)
  – Identifying cookies as “tasty”, “burnt”, “hopeless”
  – Classifying a bank mortgage as “good risk” or “bad risk”

39

Examples of Estimation

• Predict the amount of money a customer will spend in the next six months.
• Estimate the market price for an option based on time to expiration.
• Estimate the highest temperature for Chicago in the winter of 2008.
• Estimate who will win the Super Bowl.

40


Overlap of Classification and Estimation

• We could either classify cookies as “excellent”, “bad”, “tasty”, “hopeless”, or we could estimate their “quality value” on a scale of 1-10, with 1 being “hopeless” and 10 being “excellent”.
• In general, classification is better if there is a natural grouping of classes.

41

Rules to Choose Between Classification and Estimation

• Are there a finite number of classes?
• Classes should be independent, such that a “class 1” is not any closer to a “class 2”. The difference between “young”, “teenage”, and “middle-aged” is really not a good classification problem (too easy to miss and land in the wrong class).
• Good classes might be to predict someone's occupation (e.g. farmer, baseball player, janitor, educator).
• Discrete or continuous output? If the output is continuous, estimation is probably better.

42


Input and Output

• For modeling, the output is the dependent variable
• For estimation, the output is the classification of the input
• Start with a small number of inputs and a small number of outputs
• Most data mining is time consuming because key data is often missing or inaccurate
• Inputs should be causally related to the output
  – E.g. not like the “Super Bowl effect”, where if the NFC wins, stock prices rise

43

Collecting Data

• What data to collect
  – Collect example patterns
  – Use recent data if the system changes
  – Start with a small set of data

44


How to Collect Data

• Choose an appropriate sample rate for time series
• Make sure data measurements are consistent (can’t use some in meters and some in yards)
• Leave out non-essential input variables
• Make sure variables that vary have different values in your training set
• Make sure no major physical changes have occurred since you last collected data (e.g. a new baseball stadium would affect baseball predictions)

45

How Much Data Is Enough?

• Test by training on a small subset and compare with a large subset.
• If the model does not change, you might be done.
• No good answer; it turns out to be very hard to compute how much data is needed for most data mining algorithms.
• Note that sampling is often an effective technique, but if a large dataset is available, data mining algorithms are becoming scalable enough that sampling is not a necessity.

46


Preparing Data

• Very few organizations have “clean” data.
• Data entry systems often lack integrity rules, or many different systems are used to populate a data mining input set.

47

Preparing Data

• Handling missing data
• Choosing numerical encodings (transforming categorical data)
• Discarding erroneous data

48


Handling Missing Data

• Manually examine missing data and enter reasonable values
• Use automated methods to fill in the missing values
  – Average
  – Mode
• Encode the fact that the value is missing: instead of just A, B, C you have A, B, C along with A-missing, B-missing, and C-missing indicator columns

49

Example

Initial Input:
A     B   C
100   0   120
?     1   140
97    ?   200
96    2   ?
98    1   160

Preprocessed Input:
A'    AM  B'  BM  C'    CM
100   0   0   0   120   0
0     1   1   0   140   0
97    0   0   1   200   0
96    0   2   0   0     1
98    0   1   0   160   0

AM indicates whether or not A is a missing value.

50
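A small Python sketch of this missing-value indicator encoding (the function name is illustrative; missing values are assumed to be represented as None):

  def encode_missing(rows):
      """Each value becomes a (filled value, missing-indicator) pair, as in the table above."""
      encoded = []
      for row in rows:
          out = []
          for value in row:
              out.append(0 if value is None else value)   # A', B', C' (missing filled with 0)
              out.append(1 if value is None else 0)       # AM, BM, CM (1 = value was missing)
          encoded.append(out)
      return encoded

  initial = [[100, 0, 120], [None, 1, 140], [97, None, 200], [96, 2, None], [98, 1, 160]]
  for row in encode_missing(initial):
      print(row)   # e.g. [100, 0, 0, 0, 120, 0] -> A', AM, B', BM, C', CM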


Transforming Data

• Possible to transform categorical data into numerical input values
  – Manual
  – PCA (Principal Component Analysis) reduction
    – Uses eigenvectors to “represent” a vector space

51

Inconsistent Data

• Display input data in a graph to visually check for outliers
• Threshold to ensure that data falls within certain bounds
• Calculate the mean and standard deviation to look for outliers
• Winsorize the data
  – Set all values below a threshold to the min value
  – Set all values above the threshold to a max value

52


Details on Preprocessing

53

Overview

• Need for preprocessing
• Averaging
• Thresholding
• Reducing the input space
• Normalizing
• Modifying prior probabilities

54


Need for Preprocessing

• Input space is large
  – If you have 10 variables with only 10 possible values each, you have 10^10 = 10,000,000,000 possible different values. You need lots of training data to find meaningful patterns.
• Often data mining can benefit from feature extraction
  – Instead of just having words, perhaps identify key features such as Person, Organization, etc.
• Data often has missing values; need to fill them in
• Normalize data (many sources might use different units of measure)

55

Averaging

• If the data is primarily time series data, a good method is to average several data points at a time.
• For every 5 points, simply use as input the average of those 5.

Time   Value
T1     100
T2     200
T3     300
T4     400
T5     500

• Preprocessed: T1-T5: 300

56
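A minimal sketch of this block averaging in Python (the names are illustrative):

  def block_average(values, block=5):
      """Replace each run of `block` consecutive points by their average."""
      return [sum(values[i:i + block]) / len(values[i:i + block])
              for i in range(0, len(values), block)]

  series = [100, 200, 300, 400, 500, 150, 250, 350, 450, 550]
  print(block_average(series))   # [300.0, 350.0] -- T1-T5 average to 300, as above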


Thresholding

• For an image you might change all values of “dark gray” to black and all values of “very light gray” to white.

57

Reducing the Input Space

• Principal Component Analysis
  – Factor analysis that reduces the number of input variables
  – Attempts to identify an m-dimensional subspace of the n-dimensional input space that seems most “significant”
  – PCA works by computing n vectors known as the principal components, which provide a basis for the input space. The vectors are ordered in sequence from most to least significant.
  – PCA requires all input variables to be normalized within the same range of data.
  – The highest principal components will exhibit the highest variance.
  – The input space can be reduced by throwing out the components with the lowest variance.

58
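A minimal PCA sketch with NumPy (assuming numpy is available; the data and function name are illustrative), keeping only the m most significant components:

  import numpy as np

  def pca_reduce(X, m):
      """Project the rows of X (samples x features) onto the top m principal components."""
      Xc = X - X.mean(axis=0)                   # center each input variable
      cov = np.cov(Xc, rowvar=False)            # covariance of the (normalized) inputs
      eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvectors are the principal components
      order = np.argsort(eigvals)[::-1]         # largest variance (most significant) first
      return Xc @ eigvecs[:, order[:m]]         # drop the low-variance components

  X = np.array([[2.5, 2.4, 0.1], [0.5, 0.7, 0.0], [2.2, 2.9, 0.2],
                [1.9, 2.2, 0.1], [3.1, 3.0, 0.3]])
  print(pca_reduce(X, m=2).shape)               # (5, 2): three inputs reduced to two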


More on Reducing the Input Space

• Sensitivity Analysis
  – Studies how changes in an input variable affect the output variable.
  – If you do a regression and find one of the coefficients is zero, that variable may be viewed as having no real bearing on the production of the output.
  – Different from other data reduction in that we are looking at the output.
  – The idea is that if a variable changes and the output changes, it’s a useful variable. Otherwise, it may not be so important. Here, we can see A and B are more important than C: C changes when the output Y doesn’t change, and vice versa.

Y     A    B    C
100   10   20   -50
100   12   14   300
150   35   40   -50
200   20   30   -50

59

Normalizing Data

• Many algorithms require normalized data – all data lies on a common scale (e.g. between zero and one)
• Three flavors
  – Min-max
  – Zscore
  – Sigmoidal

60


Min-Max Normalization

• Linear transformation of the original input range into a newly specified data range (typically 0-1).
• The old min value is mapped to the new min; the old max is mapped to the new max.
• Let y be the original value and y' be the new value.
• min1, max1 are the original min and max; min2, max2 are the new min and max.

  y' = ((y - min1) / (max1 - min1)) * (max2 - min2) + min2

61

Min-Max Examples

  y' = ((y - min1) / (max1 - min1)) * (max2 - min2) + min2

• Consider old data that ranged from 0-100; we now obtain an equation to map it to 5-10:

  y' = (y / 20) + 5

• Old = 0,  New = 5
• Old = 10, New = 5.5
• Old = 90, New = 9.5
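A minimal Python sketch of min-max normalization (the function name is illustrative), reproducing the example above:

  def min_max(y, min1, max1, min2=0.0, max2=1.0):
      # Linearly map y from the old range [min1, max1] onto the new range [min2, max2]
      return (y - min1) / (max1 - min1) * (max2 - min2) + min2

  for old in (0, 10, 90):
      print(old, '->', min_max(old, min1=0, max1=100, min2=5, max2=10))
  # 0 -> 5.0, 10 -> 5.5, 90 -> 9.5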


62


Zscore

• Translate the input variable so that the mean is zero and the variance is one.
• This attempts to center the data around the origin. The goal is that most of the data will lie within one standard deviation of the origin.
• Works well when the min and max are unknown.
• If the majority of the data falls within 50 and 100, but you have a few data points outside of that range, zscore will compress most of the data into a small range.

63

Zscore Continued

  y' = (y - mean) / std

Example 1 (mean = 0.80, std-dev = 1.02):
  Y:  0.23 0.79 0.52 0.25 0.80 0.65 0.92 0.21 0.64 0.17 0.63 0.65 0.67 0.71 0.98 0.84 0.10 0.77 0.50 5.00
  Y': -0.56 -0.01 -0.28 -0.54 0.00 -0.15 0.11 -0.58 -0.15 -0.62 -0.17 -0.15 -0.13 -0.09 0.17 0.04 -0.69 -0.03 -0.30 4.11

Example 2 (mean = 27.90, std-dev = 72.08):
  Y:  45 35 70 69 22 10 5 15 30 300 21 11 14 27 -50 -75 2 2 2 3
  Y': 0.24 0.10 0.58 0.57 -0.08 -0.25 -0.32 -0.18 0.03 3.78 -0.10 -0.23 -0.19 -0.01 -1.08 -1.43 -0.36 -0.36 -0.36 -0.35

64


Sigmoidal Normalization

• Transforms input data non-linearly into the range -1 to 1.
• Uses the mean and standard deviation.
• Outliers are compressed along the tails of the sigmoidal function.
• Good for data where it is desirable to keep the outliers.

65

Sigmoidal Normalization Example 1

  alpha = (y - mean) / std
  y' = (1 - e^(-alpha)) / (1 + e^(-alpha))
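A minimal Python sketch of zscore and sigmoidal normalization (the function names are illustrative), checked against the first value of Example 1 (y = 0.23 with mean 0.80 and std-dev 1.02):

  import math

  def zscore(y, mean, std):
      return (y - mean) / std

  def sigmoidal(y, mean, std):
      alpha = zscore(y, mean, std)
      return (1 - math.exp(-alpha)) / (1 + math.exp(-alpha))

  mean, std = 0.80, 1.02
  print(round(zscore(0.23, mean, std), 2))      # -0.56, as in the Zscore table
  print(round(sigmoidal(0.23, mean, std), 2))   # -0.27, as in the table below
  print(round(sigmoidal(5.00, mean, std), 2))   #  0.97: the outlier is squeezed into (-1, 1)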

(mean = 0.80, std-dev = 1.02)
  Y:      0.23 0.79 0.52 0.25 0.80 0.65 0.92 0.21 0.64 0.17 0.63 0.65 0.67 0.71 0.98 0.84 0.10 0.77 0.50 5.00
  Alpha:  -0.56 -0.01 -0.28 -0.54 0.00 -0.15 0.11 -0.58 -0.15 -0.62 -0.17 -0.15 -0.13 -0.09 0.17 0.04 -0.69 -0.03 -0.29 4.11
  Sig Y': -0.27 -0.01 -0.14 -0.26 0.00 -0.08 0.06 -0.28 -0.08 -0.30 -0.08 -0.07 -0.06 -0.04 0.09 0.02 -0.33 -0.02 -0.15 0.97

66


Sigmoidal Normalization Example 2

(mean = 27.90, std-dev = 72.08)
  Y:      45 35 70 69 22 10 5 15 30 300 21 11 14 27 -50 -75 2 2 2 3
  Alpha:  0.24 0.10 0.58 0.57 -0.08 -0.25 -0.32 -0.18 0.03 3.78 -0.10 -0.23 -0.19 -0.01 -1.08 -1.43 -0.36 -0.36 -0.36 -0.35
  Sig Y': 0.12 0.05 0.28 0.28 -0.04 -0.12 -0.16 -0.09 0.01 0.96 -0.05 -0.12 -0.10 -0.01 -0.49 -0.61 -0.18 -0.18 -0.18 -0.17

67

Sigmoidal and Zscore (Examples 1 and 2)

[Table: the same Example 1 and Example 2 data as above with the Zscore Y' and Sigmoidal Y' columns side by side; the zscore of each outlier stays large (4.11 for 5.00, 3.78 for 300) while the sigmoidal value is compressed below 1 (0.97, 0.96).]

68


Summary

• Many large data mining projects spend most of their time on data cleanup and preparation (often cited as 80 percent of the work)
• Various normalization techniques
  – Min-max
  – Zscore
  – Sigmoidal

69

Linear Regression

70


Overview

• Goal is to fit points to a line
• Once the equation for a line is obtained, new values can be estimated (predicted) from inputs

71

Example

• Consider the following points for y and x. Given a value of x, we would like a model that allows us to predict y.

X    Y
5    15
10   22
20   35
25   60
30   70
40   80
50   90

[Chart: scatter plot of the sample data, Y vs. X.]

72


Deriving the Algorithm

• The equation for a line (one variable) is y = mx + b
  – where x is the input variable, m is the slope, and b is the intercept. The goal is to compute the slope m and the intercept b.
• For multiple input variables we need the equation for a plane in multidimensional space. Here we use w:

  y = w0 + w1*x1 + w2*x2 + ... + wN*xN

73

Input

• We might have several records of training data, where each record indicates the values of each input variable and the output variable.
• The input file would look like:

  Record 1: y1  x11  x12
  Record 2: y2  x21  x22
  Record 3: y3  x31  x32

74


Solving for Multiple Variables

  Record 1: y1  x11  x12
  Record 2: y2  x21  x22
  Record 3: y3  x31  x32

• We now need one set of weights that will come close to solving the following equations:

  y1 = w0 + w1*x11 + w2*x12
  y2 = w0 + w1*x21 + w2*x22
  y3 = w0 + w1*x31 + w2*x32

75

Using Matrices for Regression

• Start with:

  y1 = w0 + w1*x11 + w2*x12
  y2 = w0 + w1*x21 + w2*x22
  y3 = w0 + w1*x31 + w2*x32

• Cast this as a matrix multiplication XW = D:

      X               W        D
  [ 1  x11  x12 ]   [ w0 ]   [ y1 ]
  [ 1  x21  x22 ] * [ w1 ] = [ y2 ]
  [ 1  x31  x32 ]   [ w2 ]   [ y3 ]

• Now we just need to solve XW = D


76


Solving the Equation

• Start with: XW = D
• We want to solve for W. Pre-multiply each side by X^T (the transpose of X). We obtain:

  (X^T X) W = X^T D

• Now pre-multiply each side by the inverse of (X^T X), i.e. (X^T X)^-1:

  (X^T X)^-1 (X^T X) W = (X^T X)^-1 (X^T D)

• Since (X^T X)^-1 (X^T X) = I (the identity matrix), we obtain:

  W = (X^T X)^-1 (X^T D)

77

Important Points

• W is just an estimate – it is not always possible to find an exact matrix inverse
• Matrix inversion can be a computationally expensive process

78


Algorithm

• Read the input variables from a file
• Compute the X-transpose matrix
• Multiply X-transpose with X, store in temp
• Take the matrix inverse of temp and store the result in left
• Multiply X-transpose with D and store in right
• Multiply left and right

79
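A minimal NumPy sketch of this algorithm (assuming numpy is available; for an ill-conditioned X^T X a pseudo-inverse or np.linalg.lstsq would be preferable), using the sample data from the earlier slide:

  import numpy as np

  x = np.array([5, 10, 20, 25, 30, 40, 50], dtype=float)
  d = np.array([15, 22, 35, 60, 70, 80, 90], dtype=float)

  X = np.column_stack([np.ones_like(x), x])   # prepend a column of 1s for the intercept w0
  left = np.linalg.inv(X.T @ X)               # (X^T X)^-1
  right = X.T @ d                             # X^T D
  w = left @ right                            # W = (X^T X)^-1 (X^T D)
  print(w)                                    # approx [7.014, 1.794], as on the next slides
  print(X @ w)                                # fitted values XW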

Example (compute left)

• Start with our two-dimensional example and compute X^T, X^T X, and its inverse:

  X^T = [ 1   1   1   1   1   1   1 ]
        [ 5  10  20  25  30  40  50 ]

  X = [ 1   5 ]
      [ 1  10 ]
      [ 1  20 ]
      [ 1  25 ]
      [ 1  30 ]
      [ 1  40 ]
      [ 1  50 ]

  X^T X = [   7   180 ]        (X^T X)^-1 = [  0.577  -0.017 ]
          [ 180  6150 ]                     [ -0.017   0     ]

80


Example (compute right)

• Start with the matrix and compute X^T D:

  D = [ 15  22  35  60  70  80  90 ]^T

  X^T D = [   372 ]
          [ 12295 ]

• Combining with (X^T X)^-1 from the previous slide:

  W = (X^T X)^-1 (X^T D) = [ 7.014 ]
                           [ 1.794 ]

81

Graphing the Solution

• Using the estimate for W, we obtain the fitted values XW:

  W = [ 7.014  1.794 ]^T

  XW = [ 15.984  24.953  42.892  51.862  60.831  78.770  96.709 ]^T

82


Predicted vs. Actual

  D (actual)     = [ 15      22      35      60      70      80      90     ]
  XW (predicted) = [ 15.984  24.953  42.892  51.862  60.831  78.770  96.709 ]

[Chart: the fitted line XW plotted against the actual sample data points over X.]

83

Implementation Issues

• Lots of good matrix class libraries
• Many matrix multiplication and matrix inversion algorithms
• Many libraries lack support for cases where the matrix does not fit into main memory

84


Decision Trees

85

Algorithm

• Assume we have a data set D and a list of variables to use to build the tree.
• A split criterion is an object that consists of a variable and a value (e.g. x > 10).

Algorithm BuildTree(D, vars)
  // stop when D is pure or no variables remain
  currSplit := bestSplit(D, vars)
  vars := vars – currSplit.variable
  leftD := Partition(D, currSplit, left)
  rightD := Partition(D, currSplit, right)
  BuildTree(leftD, vars)
  BuildTree(rightD, vars)
End

86


Computing a Good Split

  Φ(s', t) = max over i of Φ_i(s_i', t)

  Φ(s, t) = 2 * P_L * P_R * Σ (over the numClasses classes j) | P(j, t_L) - P(j, t_R) |

  t_L = left split criterion (e.g. x < 10)
  t_R = right split criterion (e.g. x >= 10)
  v_L = number of values for this variable (e.g. x) that match criterion t_L
  v_R = number of values for this variable that match criterion t_R
  c_L = number of class j patterns for t_L
  c_R = number of class j patterns for t_R
  n = size of the training set
  p = number of patterns at t (size of the training set at this node)

  P_L = v_L / n    P_R = v_R / n    P(j, t_L) = c_L / p    P(j, t_R) = c_R / p

87

Details of Splitting

  Φ(s', t) = max over i of Φ_i(s_i', t)

• Takes the maximum value of “goodness” of splits over each possible split criterion. For a variable with a range of 1 to 5, we must take the maximum over:
  X < 1, X >= 1
  X < 2, X >= 2
  X < 3, X >= 3
  X < 4, X >= 4
  X < 5, X >= 5

88


Computing Goodness of Split

  Φ(s, t) = 2 * P_L * P_R * Σ (over classes j) | P(j, t_L) - P(j, t_R) |

For a single split value, we compute the “goodness” by testing how well it separates all the classes. Consider a split of X < 5 for the following X, Y combination, where we are using X to predict Y.

X: 1 2 3 4 5 6 7 8
Y: 0 0 0 0 1 1 1 1

  P_L = 4/8 = 0.5            P_R = 4/8 = 0.5
  2 * P_L * P_R = (2)(0.5)(0.5) = 0.5

  For class 0: P(j, t_L) = 4/4 = 1    P(j, t_R) = 0/4 = 0
  For class 1: P(j, t_L) = 0/4 = 0    P(j, t_R) = 4/4 = 1

  Total for X < 5: (0.5)(2) = 1.0

89

Split Example

X: 1 2 3 4 5 6 7 8      Y: 0 0 0 0 1 1 1 1      NumRows: 8

Split  patL  patR  PL     PR     P(0,tL)  P(0,tR)  P(1,tL)  P(1,tR)  2*PL*PR  |dP0|  |dP1|  Goodness
X<2    1     7     0.125  0.875  1.00     0.43     0.00     0.57     0.22     0.57   0.57   0.25
X<3    2     6     0.25   0.75   1.00     0.33     0.00     0.67     0.38     0.67   0.67   0.50
X<4    3     5     0.375  0.625  1.00     0.20     0.00     0.80     0.47     0.80   0.80   0.75
X<5    4     4     0.5    0.5    1.00     0.00     0.00     1.00     0.50     1.00   1.00   1.00
X<6    5     3     0.625  0.375  0.80     0.00     0.20     1.00     0.47     0.80   0.80   0.75
X<7    6     2     0.75   0.25   0.67     0.00     0.33     1.00     0.38     0.67   0.67   0.50
X<8    7     1     0.875  0.125  0.57     0.00     0.43     1.00     0.22     0.57   0.57   0.25

90
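A minimal Python sketch of this split-goodness computation (the function name is illustrative), which reproduces the table above:

  def goodness_of_split(xs, ys, split_value):
      """Phi(s,t) = 2*PL*PR * sum over classes j of |P(j,tL) - P(j,tR)| for the split x < split_value."""
      left  = [y for x, y in zip(xs, ys) if x < split_value]
      right = [y for x, y in zip(xs, ys) if x >= split_value]
      if not left or not right:
          return 0.0
      pl, pr = len(left) / len(xs), len(right) / len(xs)
      diff = sum(abs(left.count(j) / len(left) - right.count(j) / len(right))
                 for j in set(ys))             # class proportions going left vs. right
      return 2 * pl * pr * diff

  xs = [1, 2, 3, 4, 5, 6, 7, 8]
  ys = [0, 0, 0, 0, 1, 1, 1, 1]
  scores = {v: round(goodness_of_split(xs, ys, v), 2) for v in range(2, 9)}
  print(scores)   # {2: 0.25, 3: 0.5, 4: 0.75, 5: 1.0, 6: 0.75, 7: 0.5, 8: 0.25}, so X < 5 wins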


Summary

• For each variable, obtain all distinct values
  – For each value i
    – For each possible output class
      – Compute the split as applied to the class
    – Sum over all the classes
• Take the max of all “goodness of split” values

91

Scalability

• A decision tree is usually built such that it fits entirely in memory
• Can use a memory-management scheme in which each split is computed incrementally
• Also, can build several decision trees and attempt to merge them

92


Bootstrapped Optimistic Algorithm for Tree Construction (BOAT)

• BOAT
  – Goal is to improve the scalability of decision tree algorithms
  – Most algorithms require the training data to fit into main memory

93

Details of BOAT

• Basic steps of BOAT
  – Build an initial estimate of a good tree by taking several samples of the training data (each sample fits into main memory)
  – Build a decision tree for each sample
  – Using all the samples, build an estimated tree. This is done by scanning each sampled tree in a top-down fashion. If the splitting variable matches at a given node there are two cases:
    – Categorical variable (e.g. gender = male): if the variable and value match perfectly in all the sampled trees, use this criterion as the estimated criterion.
    – Continuous variable: if the variable matches, build a wide estimated split range that includes all of the ranges. For example, if one tree chooses X < 10 and another uses X < 15, use X < 15 in the estimated tree.
  – Refine the estimates

94


BOAT (continued)

• Refining the estimates
  – For a wide confidence range, obtain all training data that fits into that range (this is a much smaller set than the whole training data). This either fits into main memory or can be periodically written to temp files.
  – If it’s a categorical variable there is no range, so there is no refinement to do.
  – If it’s a continuous variable (10 < x < 30), try all splits within the range and test the purity of each split. Use the split with the best purity.
  – Now, with the refined split, we can move on down the tree.
• More details are in the paper
  – J. Gehrke, V. Ganti, R. Ramakrishnan, W. Loh, “BOAT — Optimistic Decision Tree Construction”, In Proceedings of the 1999 SIGMOD Conference, Philadelphia, Pennsylvania, 1999.
  – (paper is at: http://www.cs.cornell.edu/johannes/)

95

Overview

• Statement of the Problem
• Simple Examples
• Algorithm
• Multilevel associations

96


Problem

• The basic idea is that we are given a set of transactions, such as a person bought bread, butter, milk, and eggs, and we would like to identify rules such that:
  – BREAD, BUTTER => EGGS
• The idea is that we need to count the various combinations of items that co-occur in a transaction and then filter them for the ones that are most common.

97

Queries

• Some common queries that association rules might help answer:
  – Find all rules that have Apple Juice in the consequent. Might give some idea of how to boost juice sales.
  – Find all rules that have “shoe polish” in the antecedent. Might give some idea of what would happen to sales of other products if “shoe polish” were discontinued.
  – Find all rules that have “jelly” in the antecedent and “peanut butter” in the consequent. Tells us what else might be sold with the peanut butter and jelly combination.
  – Find all rules relating items located on shelf A. Determine if co-locating items on a shelf boosts sales.
    – Note that some large installations have said that they are not that inclined to re-arrange their shelves on a frequent basis.
  – Which parts of a shelf result in the most sales.

98


Formal Statement of the Problem

• Literals (e.g. bread, milk, etc.): I = {i1, i2, ..., in}
• D is a set of transactions that contain items in I. Note that the quantity of items bought is not considered. Each transaction has a unique id, TID.
• Let X be a set of items from I. A transaction T is said to contain X if and only if X is a subset of T.
• An association rule is an implication of the form X => Y, where X is a subset of items, Y is a subset of items, and X and Y have no items in common (that is, we don’t say Bread => Bread).
• The rule X => Y holds in transaction set D with confidence c if c% of the transactions in D that contain X also contain Y.
• X => Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y.
• A strong rule has high confidence and high support.

99

Example

• Consider the following set of transactions:

TID   Items
100   A, C, D
200   B, C, E
300   A, B, C, E
400   B, E

• Consider the rule A => C
• Confidence is 2/2 = 100%: both of the transactions that contain A also contain C.
• Support is 2/4 = 50%: half of the transactions contain both A and C.

100
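A minimal Python sketch that computes support and confidence for a candidate rule over these transactions (the function name is illustrative):

  def support_confidence(transactions, lhs, rhs):
      lhs, rhs = set(lhs), set(rhs)
      has_lhs  = [t for t in transactions if lhs <= t]          # transactions containing X
      has_both = [t for t in has_lhs if rhs <= t]               # ... that also contain Y
      support = len(has_both) / len(transactions)
      confidence = len(has_both) / len(has_lhs) if has_lhs else 0.0
      return support, confidence

  transactions = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
  print(support_confidence(transactions, {'A'}, {'C'}))   # (0.5, 1.0): 50% support, 100% confidence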


General Approach

• To find association rules, the worst case is exponential.
• Consider 3 items: a, b, and c. We have the following rules to consider:
  a => b    a => c    a => b,c
  b => a    b => c    b => a,c
  c => a    c => b    c => a,b
• The general approach is to identify the itemsets with support above a threshold s.
• For the large itemsets, generate the association rules for the database.

101

Overview of the Algorithm

Find the initial significant candidate items. This is done by counting the number of transactions for all single items. Keep the big ones. These are the candidate itemsets of size one — let's say {{bread},{butter}}.

Do until there are no more candidates:
  • k = k + 1
  • Generate all itemsets of size k from the candidate set of size k-1. The 2-itemset would be {bread, butter}.
  • Update the count of all the candidate itemsets of size k
  • Eliminate any candidates with support that does not exceed the threshold
End

102


Simple Algorithm for Itemsets of Size 3

• Step 1: Count all itemsets of size 1.
• Step 2: Remove all itemsets below support s.
• Step 3: Use the remaining itemsets to identify all candidates of size 2.
• Step 4: Obtain the transaction count of all candidates of size 2 (have to scan the database).
• Step 5: Remove all itemsets below s.
• Step 6: Scan all candidates of size 2; if any of the candidates have a common beginning (say AB and AC), check whether the ending (BC) is also a candidate. If the ending is in the set of candidates, form a new candidate of size 3 (ABC).

103

Example of the Simple Algorithm

• Step 1: Count all itemsets of size 1

Itemset  Count
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

• Step 2: Remove all itemsets whose count does not exceed the support s (let's say s = 1)

Itemset  Count
{A}      2
{B}      3
{C}      3
{E}      3

104


Example (continued)

• Steps 3 and 4: Generate and count the remaining itemsets of size 2 (use the size-1 sets combined with themselves)

Itemset  Count
{A B}    1
{A C}    2
{A E}    1
{B C}    2
{B E}    3
{C E}    2

105

Example (continued)

• Step 5: Remove all 2-itemsets without enough support

Itemset  Count
{A C}    2
{B C}    2
{B E}    3
{C E}    2

• Step 6: Now we need a 3-itemset. Scan all candidates of size 2; {B C} and {B E} have a common beginning, so check the ending {C E}. {C E} is in our list, which means all size-2 subsets of {B C E} are candidates. Hence, {B C E} is a candidate 3-itemset.
• Step 7: Scan the database to obtain the count for {B C E}. It is two. The algorithm ends, as no 4-itemsets can be generated from a single 3-itemset.

106


Overview of the Details

• Two key steps have been glossed over at this point:
  – Generating all pairs of itemsets that should be considered for the next stage.
  – Scanning the database to update the counts for the candidate itemsets. This is not so easy, because the candidate itemsets may be {{A B},{C D}} and a transaction containing them may be {A B C D} or {A B C D E}. Thus a single transaction may update many candidates in an itemset.

107

Generate and Prune

• Generate
  – We need to take an existing candidate itemset (say itemsets of size k) and generate all itemsets of size k+1.
  – First we find all itemsets with a common prefix (all entries except for the last item) and then make sure their suffix (last item) is in the candidate set as well.
  – Consider A = {{123},{124},{134},{135},{234}}
  – Match all of these with themselves (A x A)
  – We obtain: {{123 124}, {123 134}, {123 135}, {123 234}, ...}
  – We find that {123 124} has a common prefix {12}, so we have a new candidate {1234}; similarly {134 135} will generate {1345}
• Prune
  – Now we prune the candidates by checking to see if all subsets of size k of each candidate are indeed candidates.
  – For {1234}, we have subsets {{123},{134},{234},{124}}, so it is fine.
  – For {1345}, we have subsets {{134},{135},{145},{345}}; {134} and {135} are in the candidates, but {145} and for that matter {345} are NOT, so this one is pruned.

108


Update Itemset Counts

• Now we have a set of itemsets and a set of transactions. The items in an itemset and the entries of a transaction are listed in sorted order (e.g. {123} instead of {231}).
• Consider the 2-itemsets {{13}, {15}, {35}} with transactions:
  – T1: 245, 1345, 3487
  – T2: 1245
• Note that T1 requires an update to the count of itemsets {13}, {15}, and {35}. T2 requires an update to {15}.
• A crude algorithm would be: for each transaction, scan all items in the transaction, break each one up into all subsets, and then check each subset against the list of itemsets. If we end up with a LOT of candidate itemsets or big transactions, this can be expensive.



109

scanCount with HashTree

• scanCount can be improved with a HashTree.
• A HashTree has at its leaves a set of itemsets. Any non-leaf entry simply contains a hash table that points to descendants. The pointer to a descendant is obtained by a hash function.
• All leaves share a common prefix.
• Leaves split when they contain more than a threshold number of nodes. At this point a hash table is built with an entry for each of the distinct suffixes in the leaf node.

[Diagram: a HashTree whose internal nodes hash on successive items (H(1), H(2), H(3), H(4)); one leaf holds the itemsets 238, 239, and 145. Note that a transaction of 124 or 1234, etc. would all land in this leaf.]

110


Algorithm

• Definitions
  – Lk – set of large k-itemsets (those with minimum support)
  – Ck – set of candidate k-itemsets (potentially large itemsets); each has two attributes, itemset and count

L1 = {large 1-itemsets}
For (k = 2; Lk-1 <> 0; k++) do begin
    Ck = generate(Lk-1)
    // now we have to scan all transactions and update the counts for the candidate itemsets
    Forall transactions t in D do begin
        Ct = subset(Ck, t)
        Forall candidates c in Ct do
            c.count++
    End
    Lk = {c in Ck | c.count >= minimumSupport}
End

Answer is the union of all large k-itemsets

111

Improvements

• The basic algorithm can be improved by:
  – Reducing the size of the database that needs to be searched
  – Using parallel processors
  – Developing better filters that restrict the number of candidate itemsets
  – Incremental update of association rules so that you do not have to rescan to identify them
  – On-line generation of rules, so users can change thresholds as they are being computed
  – Various things can be done for scalability
    – Use a database system for access to transactions
    – Run the algorithm on as many transactions as will fit into main memory, then sum the results, etc.

112


Generalized Association Rules

• Often products are associated into categories, such as Bread: wheat bread, white bread, nine-grain bread.
• Rules may only be meaningful when used at multiple levels.

[Diagram: a product hierarchy — Food splits into milk (2%, chocolate) and bread (white, wheat).]

113

Summary

• Association rules can be helpful in making business decisions
• Expensive to compute; most algorithms developed today are based on the one we have described
• Scalability to large volumes of data remains a key issue

114

