Decision Tree Induction (Lecture 12, 14-08-09)
[Figure: example decision tree for the vertebrate dataset (attributes: Name, Body temperature, Skin cover, Gives birth, Aquatic creature, Legs, Hibernates; class label Mammal / Non-mammal). The root node tests Body temperature: cold-blooded animals go to the leaf node Non-mammals; warm-blooded animals go to the internal node Gives birth, whose "yes" branch leads to the leaf node Mammals and whose "no" branch leads to the leaf node Non-mammals.]
Example: will a person buy a computer or not?

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Output: A Decision Tree for "buys_computer"

To classify an unknown sample, the attribute values of the sample are tested against the decision tree. A path is traced from the root to a leaf node, and the leaf holds the class prediction for that sample. Decision trees can be converted easily to classification rules.

[Figure: the decision tree learned from the table above. The root node tests age: for age <= 30 the tree tests student (no -> buys_computer = no, yes -> buys_computer = yes); for age 31..40 the leaf is buys_computer = yes; for age > 40 the tree tests credit_rating (excellent -> buys_computer = no, fair -> buys_computer = yes).]
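For instance, reading one rule off each root-to-leaf path of the tree above gives the following rules (an informal sketch; the wording of the conditions is ours, not from the slides):

IF age <= 30 AND student = no THEN buys_computer = no
IF age <= 30 AND student = yes THEN buys_computer = yes
IF age = 31..40 THEN buys_computer = yes
IF age > 40 AND credit_rating = excellent THEN buys_computer = no
IF age > 40 AND credit_rating = fair THEN buys_computer = yes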
Apply Model to Test Data

Start from the root of the tree.

Test data record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

[Figure: decision tree for the "Cheat" data. The root tests Refund: Yes -> leaf NO; No -> test MarSt: Married -> leaf NO; Single or Divorced -> test TaxInc: < 80K -> leaf NO, > 80K -> leaf YES.]
Apply Model to Test Data (continued)

[The following slides repeat the same tree, highlighting the traversal one step at a time: Refund = No sends the record down the No branch, Marital Status = Married then leads to the leaf NO.]

Assign Cheat to "No".
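A minimal sketch of this traversal in Python. The node layout, the key names (attribute, threshold, children) and the classify helper are illustrative assumptions for this example, not part of any particular library:

```python
# A hand-built version of the tree above: internal nodes are dicts, leaves are
# class labels for "Cheat". Key names and the 80K threshold are illustrative.

def classify(node, record):
    """Trace a path from the root to a leaf and return the leaf's class label."""
    while isinstance(node, dict):                 # leaves are plain strings
        value = record[node["attribute"]]
        if "threshold" in node:                   # continuous test, e.g. TaxInc
            branch = "lt" if value < node["threshold"] else "ge"
        else:                                     # categorical test, e.g. Refund
            branch = value
        node = node["children"][branch]
    return node

taxinc = {"attribute": "TaxInc", "threshold": 80_000,
          "children": {"lt": "No", "ge": "Yes"}}
tree = {"attribute": "Refund",
        "children": {"Yes": "No",
                     "No": {"attribute": "MarSt",
                            "children": {"Married": "No",
                                         "Single": taxinc,
                                         "Divorced": taxinc}}}}

record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80_000}
print(classify(tree, record))                     # -> "No": Cheat is assigned "No"
```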
How to build a decision tree
• Exponentially many trees can be built from a given set of attributes.
• Finding the optimal tree is computationally infeasible.
• However, efficient algorithms have been designed that obtain reasonably accurate decision trees in a reasonable amount of time.
• These algorithms usually employ a greedy strategy that grows the tree by making a series of locally optimal decisions about which attribute to use for partitioning the data.
Decision Tree Induction
• Many algorithms:
– Hunt's Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
Decision Tree (continued): Lecture 13, 17-08-09
General Structure of Hunt's Algorithm
• Let Dt be the set of training records associated with a node t.
• General procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
– If Dt is an empty set, then t is a leaf node labeled with the default class yd.
– If Dt contains records that belong to more than one class, use an attribute test condition to split the data into smaller subsets, and recursively apply the procedure to each subset.
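A compact sketch of this recursion in Python, assuming records are (feature-dict, label) pairs. The attribute-selection step is deliberately simplified here; a full implementation would pick the test attribute with one of the impurity measures introduced later in this lecture:

```python
from collections import Counter

def hunt(records, attributes):
    """Grow a decision tree with Hunt's recursive procedure.
    records: non-empty list of (feature_dict, class_label) pairs."""
    labels = [y for _, y in records]
    if len(set(labels)) == 1:          # D_t contains a single class y_t -> leaf labelled y_t
        return labels[0]
    if not attributes:                 # no attribute tests left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]

    # A full implementation would choose the test attribute with an impurity
    # measure (Gini, entropy, ...); here we simply take the first remaining one.
    attr, remaining = attributes[0], attributes[1:]
    children = {}
    for value in sorted({x[attr] for x, _ in records}):
        subset = [(x, y) for x, y in records if x[attr] == value]
        # We branch only on values observed in D_t, so subsets are never empty;
        # an empty D_t would otherwise become a leaf with the default (majority) class.
        children[value] = hunt(subset, remaining)
    return {"attribute": attr, "children": children}

toy = [({"HomeOwner": "Yes", "MarSt": "Single"},  "No"),
       ({"HomeOwner": "No",  "MarSt": "Married"}, "No"),
       ({"HomeOwner": "No",  "MarSt": "Single"},  "Yes")]
print(hunt(toy, ["HomeOwner", "MarSt"]))
```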
Hunt's Algorithm (example)

[Figure: growing the tree on the "Cheat" training data in four stages. (a) A single leaf labeled Cheat = no, the majority class. (b) Split on Home Owner: the Yes branch becomes a leaf Cheat = no; the No branch is a leaf Cheat = no that is refined further. (c) Under Home Owner = No, split on Marital Status: Married becomes a leaf Cheat = no, while Single or Divorced becomes a leaf Cheat = yes that is refined further. (d) Under Marital Status = Single or Divorced, split on Taxable Income: < 80K becomes a leaf Cheat = no, >= 80K becomes a leaf Cheat = yes.]
Design Issues of Tree Induction
• Determine how to split the records
– Each recursive step of the tree-growing process must select an attribute test condition to divide the records into smaller subsets.
– For this, the algorithm must provide a method for specifying the test condition for different attribute types, as well as an objective measure for evaluating the goodness of each test condition.
• Determine when to stop splitting
– A stopping condition is needed to terminate the tree-growing process.
– For this, one can continue expanding a node until either all the records belong to the same class or all records have identical attribute values.
How to Specify the Test Condition?
• Depends on attribute type:
– Nominal
– Ordinal
– Continuous
• Depends on the number of ways to split:
– 2-way split
– Multi-way split
Methods for Expressing Attribute Test Conditions
• Binary attributes: the test condition for a binary attribute generates two potential outcomes.
[Figure: the test "Body temperature" with two outcomes, warm-blooded and cold-blooded.]
Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as there are distinct values.
Examples: CarType -> {Family, Sports, Luxury}; Marital status -> {Single, Divorced, Married}.
• Binary split: divides the values into two subsets; need to find the optimal partitioning.
Examples: CarType -> {Sports, Luxury} vs. {Family}, or CarType -> {Family, Luxury} vs. {Sports}.
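A nominal attribute with k distinct values admits 2^(k-1) - 1 candidate two-way groupings, so when searching for the optimal partitioning they can simply be enumerated. A small sketch (the binary_partitions helper is our own name; the CarType values are taken from the example above):

```python
from itertools import combinations

def binary_partitions(values):
    """Yield every way of splitting a set of nominal values into two non-empty groups
    (2**(k-1) - 1 of them for k values); fixing the first value on the left side
    avoids listing the same partition twice."""
    values = sorted(values)
    first, rest = values[0], values[1:]
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            left = {first, *combo}
            yield left, set(values) - left

for left, right in binary_partitions({"Family", "Sports", "Luxury"}):
    print(sorted(left), "vs", sorted(right))
# ['Family'] vs ['Luxury', 'Sports']
# ['Family', 'Luxury'] vs ['Sports']
# ['Family', 'Sports'] vs ['Luxury']
```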
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as there are distinct values.
Example: Size -> {Small, Medium, Large}.
• Binary split: divides the values into two subsets; need to find the optimal partitioning.
Examples: Size -> {Small, Medium} vs. {Large}, or Size -> {Medium, Large} vs. {Small}.
• What about the split {Small, Large} vs. {Medium}? It violates the order property of the ordinal attribute, because it groups Small and Large together while separating out Medium.
Splitting Based on Continuous Attributes
• Different ways of handling:
– Binary decision: (A < v) or (A >= v)
• consider all possible split points and find the best cut
• can be computationally intensive
– Multi-way split:
• the algorithm must consider all possible ranges of continuous values, i.e. outcomes of the form v_i <= A < v_(i+1), for i = 1, ..., k.
Splitting Based on Continuous Attributes (examples)
[Figure: (i) a binary split, "Taxable Income > 80K?" with outcomes Yes / No; (ii) a multi-way split, "Taxable Income?" with ranges < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.]
How to Determine the Best Split
Before splitting: 10 records of class C0 and 10 records of class C1.
[Figure: three candidate test conditions with the class counts of their child nodes.
Gender: M -> C0: 6, C1: 4; F -> C0: 4, C1: 6.
Car Type: Family -> C0: 1, C1: 3; Sports -> C0: 8, C1: 0; Luxury -> C0: 1, C1: 7.
Customer ID: c1, c2, ..., c20 -> one child per ID, each holding a single record (C0: 1, C1: 0 or C0: 0, C1: 1).]
Which test condition is the best?
How to Determine the Best Split
• Greedy approach: nodes with a homogeneous class distribution are preferred.
• Need a measure of node impurity:
[Figure: a node with counts C0: 5, C1: 5 is non-homogeneous and has a high degree of impurity; a node with counts C0: 9, C1: 1 is nearly homogeneous and has a low degree of impurity.]
Measures for Selecting the Best Split
• These measures are defined in terms of the class distribution of the records before and after splitting.
• Let pi be the fraction of records belonging to class i at a given node t.
• For the previous dataset, the class distribution before splitting is (0.5, 0.5), because there are equal numbers of records for both classes.
• Consider a split on "Gender": the class distributions of the child nodes are (0.6, 0.4) and (0.4, 0.6). Similar distributions can be computed for "Car Type".
• A node with class distribution (0, 1) has zero impurity, while a node with class distribution (0.5, 0.5) has the highest impurity.
• The commonly used impurity measures are given on the next slide.
Measures of Node Impurity
• Gini index: Gini(t) = 1 − Σ_{i=0}^{c−1} [p_i]^2
• Entropy: Entropy(t) = − Σ_{i=0}^{c−1} p_i log2(p_i)
• (Mis)classification error: Error(t) = 1 − max_i [p_i]
where p_i is the fraction of records belonging to class i at the given node t, c is the number of classes, and 0 log 0 = 0 in the entropy calculations.
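A small sketch of the three measures computed from the class counts at a node (plain Python; the function names are ours). The printed values match the worked examples on the following slides:

```python
from math import log2

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)   # 0*log(0) taken as 0

def classification_error(counts):
    return 1.0 - max(counts) / sum(counts)

# Class counts (C1, C2) used in the worked examples on the next slides:
for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3), round(entropy(counts), 3),
          round(classification_error(counts), 3))
# [0, 6] -> 0.0    0.0    0.0
# [1, 5] -> 0.278  0.65   0.167
# [2, 4] -> 0.444  0.918  0.333
# [3, 3] -> 0.5    1.0    0.5
```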
How to Find the Best Split
[Figure: a parent node with impurity M0 is split in two candidate ways. Test A produces nodes N1 and N2 with impurities M1 and M2, which combine (weighted by node size) into M12. Test B produces nodes N3 and N4 with impurities M3 and M4, which combine into M34. The candidate gains Gain = M0 − M12 and Gain = M0 − M34 are compared, and the split with the larger gain is chosen.]
Measure of Impurity: GINI
• Gini index for a given node t: GINI(t) = 1 − Σ_j [p(j | t)]^2
(NOTE: p(j | t) is the relative frequency of class j at node t.)
– Maximum (1 − 1/nc) when records are equally distributed among all classes, implying the least interesting information.
– Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples (class counts C1, C2):
C1 = 0, C2 = 6: Gini = 0.000
C1 = 1, C2 = 5: Gini = 0.278
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500
Examples for Computing GINI
GINI(t) = 1 − Σ_j [p(j | t)]^2

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 − P(C1)^2 − P(C2)^2 = 1 − 0 − 1 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Gini = 1 − (1/6)^2 − (5/6)^2 = 0.278

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Gini = 1 − (2/6)^2 − (4/6)^2 = 0.444
Examples for Computing Entropy
Entropy(t) = − Σ_{i=0}^{c−1} p_i log2(p_i)

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = − 0 log2(0) − 1 log2(1) = − 0 − 0 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Entropy = − (1/6) log2(1/6) − (5/6) log2(5/6) = 0.65

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Entropy = − (2/6) log2(2/6) − (4/6) log2(4/6) = 0.92
Examples for Computing Error
Error(t) = 1 − max_i [p_i]

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 − max(0, 1) = 1 − 1 = 0

C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6

C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3
Comparison Among Splitting Criteria
• For a 2-class problem, the figure compares the values of the three impurity measures.
• p refers to the fraction of records that belong to one of the two classes.
• All measures attain their maximum value when the class distribution is uniform (p = 0.5) and their minimum value when all records belong to the same class (p = 0 or p = 1).
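The comparison can be reproduced by evaluating the three measures over p in [0, 1]; a minimal sketch (pure Python, no plotting library assumed):

```python
from math import log2

def impurities(p):
    """Gini, entropy and misclassification error of a 2-class node
    whose class fractions are (p, 1 - p)."""
    gini = 1 - p ** 2 - (1 - p) ** 2
    ent = 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))
    err = 1 - max(p, 1 - p)
    return gini, ent, err

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    g, e, m = impurities(p)
    print(f"p={p:.2f}  gini={g:.3f}  entropy={e:.3f}  error={m:.3f}")
# All three measures are 0 at p = 0 or p = 1 and reach their maximum at p = 0.5
# (gini 0.5, entropy 1.0, error 0.5), which is what the figure shows.
```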
Customer ID  Gender  Car Type  Shirt Size  Class
1            M       Family    S           C0
2            M       Sports    M           C0
3            M       Sports    M           C0
4            M       Sports    L           C0
5            M       Sports    XL          C0
6            M       Sports    XL          C0
7            F       Sports    S           C0
8            F       Sports    S           C0
9            F       Sports    M           C0
10           F       Luxury    L           C0
11           M       Family    L           C1
12           M       Family    XL          C1
13           M       Family    M           C1
14           M       Luxury    XL          C1
15           F       Luxury    S           C1
16           F       Luxury    S           C1
17           F       Luxury    M           C1
18           F       Luxury    M           C1
19           F       Luxury    M           C1
20           F       Luxury    L           C1
• Q1: Consider the training examples shown in the previous slide for a binary classification problem.
(a) Compute the Gini index for the Customer ID attribute.
(b) Compute the Gini index for the Gender attribute.
(c) Compute the Gini index for the Car Type attribute using a multi-way split.
Splitting Based on GINI
• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as
GINI_split = Σ_{i=1}^{k} (n_i / n) GINI(i)
where n_i = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing the GINI Index
• The split produces two partitions.
• Effect of weighting the partitions: larger and purer partitions are sought.
[Figure: parent node with C1 = 6, C2 = 6, Gini = 0.500, split by attribute B into node N1 (C1 = 5, C2 = 2) and node N2 (C1 = 1, C2 = 4).]
Gini(N1) = 1 − (5/7)^2 − (2/7)^2 = 0.408
Gini(N2) = 1 − (1/5)^2 − (4/5)^2 = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
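A small sketch of the weighted GINI_split computation, using the count matrix above (the function names are ours):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """GINI_split = sum_i (n_i / n) * GINI(i); children is a list of class-count lists."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

n1, n2 = [5, 2], [1, 4]                    # (C1, C2) counts at nodes N1 and N2
print(round(gini(n1), 3))                  # 0.408
print(round(gini(n2), 3))                  # 0.32
print(round(gini_split([n1, n2]), 3))      # 7/12 * 0.408 + 5/12 * 0.320 = 0.371
```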
Categorical Attributes: Computing the Gini Index
• For each distinct value, gather the counts for each class in the dataset.
• Use the count matrix to make decisions.

Multi-way split (CarType):
Family: C1 = 1, C2 = 4;  Sports: C1 = 2, C2 = 1;  Luxury: C1 = 1, C2 = 1;  Gini = 0.393

Two-way splits (find the best partition of values):
{Sports, Luxury}: C1 = 3, C2 = 2  vs.  {Family}: C1 = 1, C2 = 4;  Gini = 0.400
{Family, Luxury}: C1 = 2, C2 = 5  vs.  {Sports}: C1 = 2, C2 = 1;  Gini = 0.419
Continuous Attributes: Computing the Gini Index
• Use binary decisions based on one value (e.g. "Taxable Income > 80K?", Yes / No).
• Several choices for the splitting value:
– the number of possible splitting values equals the number of distinct values.
• Each splitting value v has a count matrix associated with it:
– class counts in each of the two partitions, A < v and A >= v.
• Simple method to choose the best v:
– for each v, scan the database to gather the count matrix and compute its Gini index;
– computationally inefficient! Repetition of work.
Continuous Attributes: Computing the Gini Index (contd.)
• For efficient computation, for each attribute:
– sort the attribute on its values;
– linearly scan these values, each time updating the count matrix and computing the Gini index;
– choose the split position that has the least Gini index.

Sorted values (Taxable Income) with class labels (Cheat):
60 (No), 70 (No), 75 (No), 85 (Yes), 90 (Yes), 95 (Yes), 100 (No), 120 (No), 125 (No), 220 (No)

Candidate split positions, count matrices (<=, >) and Gini index:
Split position:  55     65     72     80     87     92     97     110    122    172    230
Yes (<=, >):     0,3    0,3    0,3    0,3    1,2    2,1    3,0    3,0    3,0    3,0    3,0
No  (<=, >):     0,7    1,6    2,5    3,4    3,4    3,4    3,4    4,3    5,2    6,1    7,0
Gini:            0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420

The best split position is 97, with Gini = 0.300.
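A minimal sketch of this single-pass computation in Python (the function names are ours; candidate thresholds are taken as midpoints between consecutive distinct values, so the best split here is reported as 97.5 rather than the 97 shown in the table above):

```python
def gini(counts):
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_split(values, labels):
    """One sorted pass over the candidate thresholds (midpoints between consecutive
    distinct values), updating the count matrix incrementally instead of rescanning."""
    data = sorted(zip(values, labels))
    classes = sorted(set(labels))
    below = {c: 0 for c in classes}                   # counts for records with value <= t
    above = {c: labels.count(c) for c in classes}     # counts for records with value > t
    best = (float("inf"), None)
    for i, (v, y) in enumerate(data[:-1]):
        below[y] += 1
        above[y] -= 1
        if data[i + 1][0] == v:                       # only split between distinct values
            continue
        t = (v + data[i + 1][0]) / 2
        n_lo, n_hi = i + 1, len(data) - i - 1
        weighted = (n_lo * gini(list(below.values()))
                    + n_hi * gini(list(above.values()))) / len(data)
        best = min(best, (weighted, t))
    return best

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_split(income, cheat))    # (0.3, 97.5): the split around 97 with Gini 0.300
```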
Alternative Splitting Criteria Based on INFO
• Entropy at a given node t: Entropy(t) = − Σ_j p(j | t) log2 p(j | t)
(NOTE: p(j | t) is the relative frequency of class j at node t.)
– Measures the homogeneity of a node.
• Maximum (log nc) when records are equally distributed among all classes, implying the least information.
• Minimum (0.0) when all records belong to one class, implying the most information.
– Entropy-based computations are similar to the GINI index computations.
Splitting Based on INFO...
• Information Gain:
GAIN_split = Entropy(p) − Σ_{i=1}^{k} (n_i / n) Entropy(i)
where the parent node p is split into k partitions and n_i is the number of records in partition i.
– Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
– Used in ID3 and C4.5.
– Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
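A small sketch of the GAIN_split computation in plain Python (the class counts in the example are the Gender split from the earlier "best split" slide; function names are ours):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i),
    where parent and each child are lists of class counts."""
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

# Parent node with 10 records per class, split on Gender: M -> (6, 4), F -> (4, 6).
print(round(information_gain([10, 10], [[6, 4], [4, 6]]), 3))   # 0.029
```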
Splitting Based on INFO...
• Gain Ratio:
GainRATIO_split = GAIN_split / SplitINFO,   where SplitINFO = − Σ_{i=1}^{k} (n_i / n) log2(n_i / n)
The parent node p is split into k partitions and n_i is the number of records in partition i.
– Adjusts the information gain by the entropy of the partitioning (SplitINFO). A higher-entropy partitioning (a large number of small partitions) is penalized.
– Used in C4.5.
– Designed to overcome the disadvantage of Information Gain.
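A small sketch of the gain-ratio computation, illustrating how SplitINFO penalizes a split into many small partitions (plain Python; the count matrices are illustrative):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gain_ratio(parent, children):
    """GainRATIO_split = GAIN_split / SplitINFO,
    SplitINFO = -sum_i (n_i / n) * log2(n_i / n)."""
    n = sum(parent)
    sizes = [sum(c) for c in children]
    gain = entropy(parent) - sum(s / n * entropy(c) for s, c in zip(sizes, children))
    split_info = -sum(s / n * log2(s / n) for s in sizes)
    return gain / split_info

# A parent with 4 records per class, split into 8 pure single-record children:
# the raw gain is maximal (1.0), but SplitINFO = 3.0 pulls the gain ratio down to 0.333.
pure_singletons = [[1, 0]] * 4 + [[0, 1]] * 4
print(round(gain_ratio([4, 4], pure_singletons), 3))            # 0.333

# The 2-way Gender split from the previous slide has SplitINFO = 1, so its
# (much smaller) gain is left unchanged:
print(round(gain_ratio([10, 10], [[6, 4], [4, 6]]), 3))         # 0.029
```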
Splitting Criteria Based on Classification Error
• Classification error at a node t: Error(t) = 1 − max_i P(i | t)
• Measures the misclassification error made by a node.
• Maximum (1 − 1/nc) when records are equally distributed among all classes, implying the least interesting information.
• Minimum (0.0) when all records belong to one class, implying the most interesting information.