Decision Tree Induction (Lecture 12, 14-08-09)

Example: classifying vertebrates. Each record has the attributes Name, Body Temperature, Skin Cover, Gives Birth, Aquatic Creature, Legs, Hibernates, and a class label (mammal / non-mammal).

A decision tree for this problem:
  • Root node: Body Temperature. Cold-blooded → Non-mammals (leaf node); warm-blooded → internal node Gives Birth.
  • Internal node: Gives Birth. Yes → Mammals (leaf node); no → Non-mammals (leaf node).

Example: will a person buy a computer or not?

  age     income   student   credit_rating   buys_computer
  <=30    high     no        fair            no
  <=30    high     no        excellent       no
  31…40   high     no        fair            yes
  >40     medium   no        fair            yes
  >40     low      yes       fair            yes
  >40     low      yes       excellent       no
  31…40   low      yes       excellent       yes
  <=30    medium   no        fair            no
  <=30    low      yes       fair            yes
  >40     medium   yes       fair            yes
  <=30    medium   yes       excellent       yes
  31…40   medium   no        excellent       yes
  31…40   high     yes       fair            yes
  >40     medium   no        excellent       no

Output: a decision tree for "buys_computer"

To classify an unknown sample, the attribute values of the sample are tested against the decision tree. A path is traced from the root to a leaf node, and the leaf holds the class prediction for that sample.

The tree:
  • Root node: age?
  • age <=30 → student?: no → buys_computer = no; yes → buys_computer = yes
  • age 31…40 → buys_computer = yes
  • age >40 → credit_rating?: excellent → buys_computer = no; fair → buys_computer = yes

Decision trees can be converted easily to classification rules, as illustrated below.
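Reading each root-to-leaf path of the tree above as a conjunction of test outcomes gives one rule per leaf:

    IF age <= 30 AND student = no             THEN buys_computer = no
    IF age <= 30 AND student = yes            THEN buys_computer = yes
    IF age IN 31…40                           THEN buys_computer = yes
    IF age > 40 AND credit_rating = excellent THEN buys_computer = no
    IF age > 40 AND credit_rating = fair      THEN buys_computer = yes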

Apply Model to Test Data

Start from the root of the tree. The test record is:

  Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

The decision tree being applied:
  • Refund: Yes → leaf NO; No → MarSt
  • MarSt: Married → leaf NO; Single, Divorced → TaxInc
  • TaxInc: < 80K → leaf NO; > 80K → leaf YES


Tracing the record down the tree: Refund = No leads to the MarSt node, and Marital Status = Married leads to the leaf NO. Assign Cheat = "No" to the test record. (A Python sketch of this traversal is given below.)
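A minimal sketch of this traversal in Python. The nested-dictionary encoding of the tree, the attribute names, and the handling of the income threshold (<= 80K vs. > 80K) are illustrative assumptions, not part of the original slides:

    # Hypothetical encoding: internal nodes name the attribute to test and map
    # each outcome to a subtree; leaves carry only a class label.
    taxinc = {"attr": "Taxable Income",
              "test": lambda v: "< 80K" if v < 80 else ">= 80K",
              "branches": {"< 80K": {"label": "No"}, ">= 80K": {"label": "Yes"}}}
    tree = {"attr": "Refund",
            "branches": {"Yes": {"label": "No"},
                         "No": {"attr": "Marital Status",
                                "branches": {"Married": {"label": "No"},
                                             "Single": taxinc,
                                             "Divorced": taxinc}}}}

    def classify(record, node):
        """Follow one root-to-leaf path for the record and return the leaf's label."""
        while "label" not in node:
            value = record[node["attr"]]
            outcome = node["test"](value) if "test" in node else value
            node = node["branches"][outcome]
        return node["label"]

    record = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80}
    print(classify(record, tree))   # -> No, i.e. assign Cheat = "No"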

How to build a decision tree

  • Exponentially many trees can be built from a given set of attributes.
  • Finding the optimal tree is computationally infeasible.
  • However, efficient algorithms have been designed that construct reasonably accurate decision trees in a reasonable amount of time.
  • These algorithms usually employ a greedy strategy that grows the tree by making a series of locally optimum decisions about which attribute to use for partitioning the data.

Decision Tree Induction: many algorithms exist
  • Hunt's Algorithm (one of the earliest)
  • CART
  • ID3, C4.5
  • SLIQ, SPRINT

Decision Tree (continued), Lecture 13, 17-08-09

General Structure of Hunt's Algorithm

  • Let Dt be the set of training records that are associated with a node t.
  • General procedure:
    – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
    – If Dt is an empty set, then t is a leaf node labeled by the default class yd.
    – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset (see the sketch below).
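A minimal Python sketch of this recursion, under simplifying assumptions (categorical attributes only, a caller-supplied function that picks the attribute test, and a majority-class default). It mirrors the three cases above rather than any particular production implementation:

    from collections import Counter

    def hunt(records, labels, choose_split, default_class):
        """Recursively build a tree following Hunt's three cases.

        records       -- list of dicts mapping attribute name -> value
        labels        -- class label of each record
        choose_split  -- function (records, labels) -> attribute name, or None
        default_class -- label used for empty nodes
        """
        # Case: empty set -> leaf labeled with the default class.
        if not records:
            return {"label": default_class}
        # Case: all records belong to the same class -> leaf with that class.
        if len(set(labels)) == 1:
            return {"label": labels[0]}
        # Case: mixed classes -> pick an attribute test and split.
        attr = choose_split(records, labels)
        majority = Counter(labels).most_common(1)[0][0]
        if attr is None:                      # no useful split left
            return {"label": majority}
        node = {"attr": attr, "branches": {}}
        for value in set(r[attr] for r in records):
            subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
            node["branches"][value] = hunt([r for r, _ in subset],
                                           [y for _, y in subset],
                                           choose_split, majority)
        return node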

Hunt's Algorithm on the tax-cheating data, step by step:
  (a) A single leaf node labeled Cheat = no.
  (b) Split on Home Owner: Yes → Cheat = no; No → Cheat = no (this child is still impure).
  (c) Under Home Owner = No, split on Marital Status: Single, Divorced → Cheat = yes; Married → Cheat = no.
  (d) Under Marital Status = Single, Divorced, split on Taxable Income: < 80K → Cheat = no; >= 80K → Cheat = yes.

Design Issues of Decision Tree Induction

  • Determine how to split the records
    – Each recursive step of the tree-growing process must select an attribute test condition to divide the records into smaller subsets.
    – For this, the algorithm must provide a method for specifying the test condition for different attribute types, as well as an objective measure for evaluating the goodness of each test condition.
  • Determine when to stop splitting
    – A stopping condition is needed to terminate the tree-growing process.
    – A simple strategy is to continue expanding a node until either all the records belong to the same class or all records have identical attribute values.

How to Specify a Test Condition?

  • Depends on the attribute type: nominal, ordinal, or continuous.
  • Depends on the number of ways to split: 2-way split or multi-way split.

Methods for Expressing Attribute Test Conditions

  • Binary attributes: the test condition for a binary attribute generates two potential outcomes. Example: Body Temperature → warm-blooded or cold-blooded.

Splitting Based on Nominal Attributes

  • Multi-way split: use as many partitions as there are distinct values. Examples: CarType → {Family, Sports, Luxury}; Marital Status → {Single, Divorced, Married}.
  • Binary split: divides the values into two subsets; the algorithm needs to find the optimal partitioning. Examples for CarType: {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.

Splitting Based on Ordinal Attributes

  • Multi-way split: use as many partitions as there are distinct values. Example: Size → {Small, Medium, Large}.
  • Binary split: divides the values into two subsets; the algorithm needs to find the optimal partitioning. Examples: {Small, Medium} vs. {Large}, or {Small} vs. {Medium, Large}.
  • What about the split {Small, Large} vs. {Medium}? It groups non-adjacent values together and therefore violates the order of the attribute, so it is not a valid ordinal split.

Splitting Based on Continuous Attributes

  • Different ways of handling:
    – Binary decision: (A < v) or (A >= v). Consider all possible split points v and find the best cut; this can be computationally intensive.
    – Multi-way split: the algorithm must consider all possible ranges of continuous values, i.e. outcomes of the form v_i <= A < v_{i+1}, for i = 1, …, k.

Examples of test conditions on Taxable Income:
  (i) Binary split: Taxable Income > 80K? → Yes / No.
  (ii) Multi-way split: Taxable Income? → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.

How to Determine the Best Split

Before splitting: 20 training records, 10 of class C0 and 10 of class C1. Three candidate test conditions:
  • Gender?  M → C0: 6, C1: 4;  F → C0: 4, C1: 6
  • Car Type?  Family → C0: 1, C1: 3;  Sports → C0: 8, C1: 0;  Luxury → C0: 1, C1: 7
  • Customer ID?  c1, …, c20 → each child contains exactly one record (C0: 1, C1: 0 or C0: 0, C1: 1)

Which test condition is the best?

How to Determine the Best Split

  • Greedy approach: nodes with a homogeneous class distribution are preferred.
  • We therefore need a measure of node impurity. For example, a node with class distribution C0: 5, C1: 5 is non-homogeneous and has a high degree of impurity, while a node with C0: 9, C1: 1 is nearly homogeneous and has a low degree of impurity.

Measures for Selecting the Best Split

  • These measures are defined in terms of the class distribution before and after splitting.
  • Let p_i be the fraction of records belonging to class i at a given node t.
  • For the previous dataset, the class distribution before splitting is (0.5, 0.5), because there are equal numbers of records for both classes.
  • Consider a split on "Gender": the class distributions of the child nodes are (0.6, 0.4) and (0.4, 0.6). Similar distributions can be computed for "Car Type".
  • A node with class distribution (0, 1) has zero impurity, whereas a node with class distribution (0.5, 0.5) has the highest impurity.
  • The commonly used impurity measures are listed below.

Measures of Node Impurity

  • Gini index:  $\mathrm{Gini}(t) = 1 - \sum_{i=0}^{c-1} p_i^2$
  • Entropy:  $\mathrm{Entropy}(t) = -\sum_{i=0}^{c-1} p_i \log_2 p_i$
  • (Mis)classification error:  $\mathrm{Error}(t) = 1 - \max_i \, p_i$

where p_i is the fraction of records belonging to class i at node t, c is the number of classes, and 0 log 0 is defined to be 0 in the entropy calculations. A short code sketch of these measures follows.
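A small Python sketch of the three impurity measures (function names are illustrative); the sample calls reproduce the worked examples on the following slides:

    from math import log2

    def gini(counts):
        """Gini index of a node, given its per-class record counts."""
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def entropy(counts):
        """Entropy of a node; 0 log 0 is treated as 0."""
        n = sum(counts)
        return -sum((c / n) * log2(c / n) for c in counts if c > 0)

    def classification_error(counts):
        """Misclassification error of a node."""
        n = sum(counts)
        return 1.0 - max(c / n for c in counts)

    # Nodes with class counts (C1, C2) = (0, 6), (1, 5), (2, 4):
    for counts in [(0, 6), (1, 5), (2, 4)]:
        print(counts, round(gini(counts), 3),
              round(entropy(counts), 2), round(classification_error(counts), 3))
    # (0, 6) 0.0   0.0   0.0
    # (1, 5) 0.278 0.65  0.167
    # (2, 4) 0.444 0.92  0.333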

How to Find the Best Split

  • Before splitting, compute the impurity of the parent node, M0.
  • For each candidate test condition (say A? and B?, each producing two child nodes), compute the impurity of every child node (M1 and M2 for A; M3 and M4 for B) and combine them into the weighted impurity of the split (M12 for A, M34 for B).
  • Compare Gain = M0 − M12 with Gain = M0 − M34 and choose the test condition with the larger gain, i.e. the larger reduction in impurity.

Measure of Impurity: Gini Index

  • Gini index for a given node t:  $\mathrm{GINI}(t) = 1 - \sum_j [\, p(j \mid t) \,]^2$,  where p(j | t) is the relative frequency of class j at node t.
  • Maximum (1 − 1/n_c) when records are equally distributed among all n_c classes, implying the least interesting information.
  • Minimum (0.0) when all records belong to one class, implying the most interesting information.

  Class counts   C1: 0, C2: 6   C1: 1, C2: 5   C1: 2, C2: 4   C1: 3, C2: 3
  Gini           0.000          0.278          0.444          0.500

Examples of computing the Gini index,  $\mathrm{GINI}(t) = 1 - \sum_j [\, p(j \mid t) \,]^2$:

  • C1: 0, C2: 6.  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1.  Gini = 1 − 0^2 − 1^2 = 0.
  • C1: 1, C2: 5.  P(C1) = 1/6, P(C2) = 5/6.  Gini = 1 − (1/6)^2 − (5/6)^2 = 0.278.
  • C1: 2, C2: 4.  P(C1) = 2/6, P(C2) = 4/6.  Gini = 1 − (2/6)^2 − (4/6)^2 = 0.444.

Examples of computing entropy,  $\mathrm{Entropy}(t) = -\sum_{i=0}^{c-1} p_i \log_2 p_i$:

  • C1: 0, C2: 6.  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1.  Entropy = −0 log 0 − 1 log 1 = 0.
  • C1: 1, C2: 5.  P(C1) = 1/6, P(C2) = 5/6.  Entropy = −(1/6) log2 (1/6) − (5/6) log2 (5/6) = 0.65.
  • C1: 2, C2: 4.  P(C1) = 2/6, P(C2) = 4/6.  Entropy = −(2/6) log2 (2/6) − (4/6) log2 (4/6) = 0.92.

Examples of computing classification error,  $\mathrm{Error}(t) = 1 - \max_i p_i$:

  • C1: 0, C2: 6.  P(C1) = 0, P(C2) = 1.  Error = 1 − max(0, 1) = 1 − 1 = 0.
  • C1: 1, C2: 5.  P(C1) = 1/6, P(C2) = 5/6.  Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6.
  • C1: 2, C2: 4.  P(C1) = 2/6, P(C2) = 4/6.  Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3.

Comparison Among the Splitting Criteria

  • For a 2-class problem, the figure on this slide compares the values of the three impurity measures as a function of p, the fraction of records that belong to one of the two classes.
  • All measures attain their maximum value when the class distribution is uniform (p = 0.5) and their minimum value when all records belong to the same class (p = 0 or p = 1).

Training examples for a binary classification problem:

  Customer ID   Gender   Car Type   Shirt Size   Class
  1             M        Family     S            C0
  2             M        Sports     M            C0
  3             M        Sports     M            C0
  4             M        Sports     L            C0
  5             M        Sports     XL           C0
  6             M        Sports     XL           C0
  7             F        Sports     S            C0
  8             F        Sports     S            C0
  9             F        Sports     M            C0
  10            F        Luxury     L            C0
  11            M        Family     L            C1
  12            M        Family     XL           C1
  13            M        Family     M            C1
  14            M        Luxury     XL           C1
  15            F        Luxury     S            C1
  16            F        Luxury     S            C1
  17            F        Luxury     M            C1
  18            F        Luxury     M            C1
  19            F        Luxury     M            C1
  20            F        Luxury     L            C1

  • Q1: Consider the training examples shown in the previous table for a binary classification problem.
    (a) Compute the Gini index for the Customer ID attribute.
    (b) Compute the Gini index for the Gender attribute.
    (c) Compute the Gini index for the Car Type attribute using a multi-way split.

Splitting Based on the Gini Index

  • Used in CART, SLIQ, and SPRINT.
  • When a node p is split into k partitions (children), the quality of the split is computed as

    $\mathrm{GINI}_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, \mathrm{GINI}(i)$

    where n_i is the number of records at child i and n is the number of records at node p. A short code sketch of this computation follows.
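A short Python sketch of this weighted-average computation (function names are illustrative; the gini helper from the earlier sketch is repeated here so the snippet stands alone). As a check, it reproduces the Gini of 0.393 for the multi-way CarType split shown on the "Categorical Attributes" slide below:

    def gini(counts):
        """Gini index of a node from its per-class record counts."""
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def gini_split(children):
        """Weighted Gini of a split; children holds one per-class count tuple per child."""
        n = sum(sum(child) for child in children)
        return sum(sum(child) / n * gini(child) for child in children)

    # Multi-way CarType split with (C1, C2) counts per child, matching the
    # count matrix on the "Categorical Attributes" slide below:
    print(round(gini_split([(1, 4), (2, 1), (1, 1)]), 3))   # -> 0.393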

Binary Attributes: Computing the Gini Index

  • The split produces two partitions.
  • Effect of weighting the partitions: larger and purer partitions are sought.

  Parent node: C1: 6, C2: 6, Gini = 0.500.
  Candidate split B? produces node N1 (C1: 5, C2: 2) and node N2 (C1: 1, C2: 4).

  Gini(N1) = 1 − (5/7)^2 − (2/7)^2 = 0.408
  Gini(N2) = 1 − (1/5)^2 − (4/5)^2 = 0.320
  Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

Categorical Attributes: Computing the Gini Index

  • For each distinct value, gather the counts for each class in the dataset.
  • Use the resulting count matrix to make decisions.

  Multi-way split on CarType:
            Family   Sports   Luxury
    C1      1        2        1
    C2      4        1        1
    Gini = 0.393

  Two-way splits (find the best partition of values):
    {Sports, Luxury} vs. {Family}:  C1: 3 / 1,  C2: 2 / 4,  Gini = 0.400
    {Family, Luxury} vs. {Sports}:  C1: 2 / 2,  C2: 1 / 5,  Gini = 0.419

  A code sketch of this enumeration follows.
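A sketch of how an implementation might enumerate the candidate two-way groupings of a nominal attribute and score each with the weighted Gini. The enumeration strategy and names are assumptions, not prescribed by the slides; the per-value counts are taken from the count matrix above:

    from itertools import combinations

    # Per-value (C1, C2) counts for CarType, from the count matrix above.
    counts = {"Family": (1, 4), "Sports": (2, 1), "Luxury": (1, 1)}
    total = sum(sum(c) for c in counts.values())

    def gini(c):
        n = sum(c)
        return 1.0 - sum((x / n) ** 2 for x in c)

    def merged(group):
        """Sum the per-class counts over a group of attribute values."""
        return tuple(sum(counts[v][k] for v in group) for k in range(2))

    values = sorted(counts)
    for size in range(1, len(values)):
        for left in combinations(values, size):
            right = tuple(v for v in values if v not in left)
            if left > right:                 # skip mirrored duplicates
                continue
            parts = [merged(left), merged(right)]
            score = sum(sum(p) / total * gini(p) for p in parts)
            print(left, "vs", right, round(score, 3))
    # ('Family',) vs ('Luxury', 'Sports')  0.4
    # ('Family', 'Luxury') vs ('Sports',)  0.419
    # ('Family', 'Sports') vs ('Luxury',)  0.475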

Continuous Attributes: Computing the Gini Index

  • Use a binary decision based on a single value, e.g. Taxable Income > 80K? (Yes / No).
  • There are several choices for the splitting value: the number of possible splitting values equals the number of distinct values.
  • Each splitting value v has a count matrix associated with it, holding the class counts in each of the two partitions, A < v and A >= v.
  • Simple method to choose the best v: for each candidate v, scan the database to gather the count matrix and compute its Gini index. This is computationally inefficient because it repeats work.



Continuous Attributes: Computing the Gini Index (continued)

For efficient computation, for each continuous attribute:
  • Sort the records on the attribute values.
  • Linearly scan these values, each time updating the count matrix incrementally and computing the Gini index.
  • Choose the split position that has the least Gini index.

For the Taxable Income attribute of the tax-cheating data:

  Sorted values (Cheat label):  60 (No), 70 (No), 75 (No), 85 (Yes), 90 (Yes), 95 (Yes), 100 (No), 120 (No), 125 (No), 220 (No)

  Candidate split position:  55     65     72     80     87     92     97     110    122    172    230
  Gini of the split:         0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420

The best split is Taxable Income <= 97, with Gini = 0.300; its count matrix is (<=: Yes 3, No 3; >: Yes 0, No 4). A code sketch of this scan follows.
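A compact Python sketch of the sorted linear scan (variable names are illustrative). It reproduces the best split with Gini 0.300 at the midpoint 97.5, which the slide rounds to 97:

    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

    def best_split(values, labels):
        """Sort once, then scan candidate midpoints, updating counts incrementally."""
        data = sorted(zip(values, labels))
        classes = sorted(set(labels))
        below = {c: 0 for c in classes}                 # counts with value <= cut
        above = {c: labels.count(c) for c in classes}   # counts with value  > cut
        n = len(labels)
        best = (float("inf"), None)
        for i in range(len(data) - 1):
            _, y = data[i]
            below[y] += 1
            above[y] -= 1
            if data[i][0] == data[i + 1][0]:
                continue                                # not a valid cut point
            cut = (data[i][0] + data[i + 1][0]) / 2
            score = (sum(below.values()) / n * gini(tuple(below.values()))
                     + sum(above.values()) / n * gini(tuple(above.values())))
            best = min(best, (score, cut))
        return best

    income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
    cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
    print(best_split(income, cheat))    # -> (0.3, 97.5)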



Alternative Splitting Criteria Based on Information

  • Entropy at a given node t:  $\mathrm{Entropy}(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)$,  where p(j | t) is the relative frequency of class j at node t.
  • Entropy measures the homogeneity of a node.
    – Maximum (log2 n_c) when records are equally distributed among all n_c classes, implying the least information.
    – Minimum (0.0) when all records belong to one class, implying the most information.
  • Entropy-based computations are similar to the Gini index computations.

Splitting Based on Information: Information Gain

  $\mathrm{GAIN}_{split} = \mathrm{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n} \, \mathrm{Entropy}(i)$

  where the parent node p is split into k partitions and n_i is the number of records in partition i.

  • Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN); a worked example follows.
  • Used in ID3 and C4.5.
  • Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
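As a worked example, consider the 20-record table above and the Gender split with child class distributions (0.6, 0.4) and (0.4, 0.6) mentioned earlier:

  $\mathrm{Entropy}(parent) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$
  $\mathrm{Entropy}(\text{each Gender child}) = -0.6 \log_2 0.6 - 0.4 \log_2 0.4 \approx 0.971$
  $\mathrm{GAIN}_{split} = 1 - \left( \tfrac{10}{20} \cdot 0.971 + \tfrac{10}{20} \cdot 0.971 \right) \approx 0.029$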

Splitting Based on Information: Gain Ratio

  $\mathrm{GainRATIO}_{split} = \dfrac{\mathrm{GAIN}_{split}}{\mathrm{SplitINFO}}$,  where  $\mathrm{SplitINFO} = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$

  and the parent node p is split into k partitions with n_i records in partition i.

  • Adjusts the information gain by the entropy of the partitioning (SplitINFO): a higher-entropy partitioning (a large number of small partitions) is penalized, as the numeric illustration below shows.
  • Used in C4.5.
  • Designed to overcome the disadvantage of information gain.
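For instance, continuing the 20-record example (an illustration, not from the slides): a balanced two-way Gender split has $\mathrm{SplitINFO} = -2 \cdot \tfrac{10}{20} \log_2 \tfrac{10}{20} = 1$, whereas a 20-way Customer ID split has $\mathrm{SplitINFO} = -20 \cdot \tfrac{1}{20} \log_2 \tfrac{1}{20} = \log_2 20 \approx 4.32$, so the gain of the many-way split is divided by a much larger quantity.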

Splitting Criterion Based on Classification Error

  • Classification error at a node t:  $\mathrm{Error}(t) = 1 - \max_i P(i \mid t)$
  • Measures the misclassification error made by a node.
    – Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying the least interesting information.
    – Minimum (0.0) when all records belong to one class, implying the most interesting information.
