Rule-Based Classification
Lecture 21/10-09-09 (no class because of placements), Lecture 22/11-09-09 (no class), Lecture 23/12-09-09 (no class), Lecture 24/14-09-09
Building Classification Rules
• Direct method: extract rules directly from the data. Examples: RIPPER, CN2, Holte's 1R.
• Indirect method: extract rules from other classification models (e.g., decision trees, neural networks). Example: C4.5rules.
Direct Method: Sequential Covering Algorithm
• Extracts rules directly from the data.
• When there are multiple classes, rules are extracted one class at a time.
• The criterion for deciding which class to consider first depends on factors such as class prevalence.
Algorithm (sequential covering):
1: Let E be the set of training records and A the set of attribute-value pairs {(Aj, vj)}.
2: Let Yo be an ordered set of classes {y1, y2, y3, ..., yk}.
3: Let R = { } be the initial rule (decision) list.
4: for each class y ∈ Yo − {yk} do
5:   while stopping condition is not met do
6:     r ← Learn-One-Rule(E, A, y).
7:     Remove training records from E that are covered by r.
8:     Add r to the bottom of the rule list: R ← R ∨ r.
9:   end while
10: end for
11: Insert the default rule, { } → yk, at the bottom of the rule list R.
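A minimal sketch of this loop in Python. Assumptions beyond the slide: records are dicts mapping attribute names to values (the class stored under a "class" key), a rule is a tuple of (attribute, value) conjuncts, and learn_one_rule is passed in as the Learn-One-Rule function described next, returning None when no acceptable rule remains.

```python
def covers(rule, record):
    """A record is covered when it satisfies every conjunct of the rule."""
    return all(record.get(attr) == val for attr, val in rule)

def sequential_covering(records, attrs, classes, learn_one_rule):
    """Sequential covering: grow rules one class at a time (lines 4-10),
    ending with the default rule {} -> yk (line 11)."""
    rule_list = []
    for y in classes[:-1]:                       # every class except yk
        while True:                              # until no good rule is found
            rule = learn_one_rule(records, attrs, y)
            if rule is None:                     # stopping condition met
                break
            # Remove training records covered by the new rule (line 7).
            records = [r for r in records if not covers(rule, r)]
            rule_list.append((rule, y))          # add r to the list R (line 8)
    rule_list.append(((), classes[-1]))          # default rule {} -> yk
    return rule_list
```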
Sequential Covering Algorithm (in short, for quick reference)
1. Start from an empty rule set.
2. Extract a rule using the Learn-One-Rule function.
3. Remove training records covered by the rule.
4. Repeat steps (2) and (3) until the stopping criterion is met.
Learn-One-Rule Function
• Objective: extract a rule that covers as many positive examples as possible and none or very few of the negative examples in the training dataset.
• Computationally expensive because of the exponential size of the search space.
• Generates an initial rule and keeps growing (refining) it until a stopping criterion is met.
Example of Sequential Covering
(Figure: (i) original data; (ii) Step 1: rule R1 is grown; (iii) Step 2: examples covered by R1 are removed and R2 is grown; (iv) Step 3.)
Aspects of Sequential Covering
• Rule growing strategy
• Instance elimination
• Rule evaluation
• Stopping criterion
• Rule pruning
Rule Growing
• Two common strategies:
  1. General-to-specific
  2. Specific-to-general
General-to-specific
The initial rule is r: { } ⇒ y. This rule has poor quality because it covers all the examples in the training set. Conjuncts are subsequently added to improve the quality of the rule; a sketch of this greedy growing step follows.
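A minimal sketch of general-to-specific rule growing, reusing covers() from the sequential covering sketch above. Plain accuracy is used as the quality measure here purely for brevity (an assumption; the Laplace, m-estimate, or FOIL's gain measures discussed below would slot in the same way), and attrs is the set of candidate attribute-value pairs (Aj, vj).

```python
def quality(rule, records, y):
    """Accuracy of the rule on the records it covers (0 if it covers none)."""
    covered = [r for r in records if covers(rule, r)]
    return sum(r["class"] == y for r in covered) / len(covered) if covered else 0.0

def learn_one_rule(records, attrs, y):
    """Grow {} => y by greedily adding the best conjunct, one per pass."""
    rule, best = (), quality((), records, y)
    while True:
        cands = [rule + ((a, v),) for a, v in attrs if (a, v) not in rule]
        if not cands:
            return rule if rule else None
        top = max(cands, key=lambda c: quality(c, records, y))
        if quality(top, records, y) <= best:     # no conjunct improves quality
            return rule if rule else None        # None: no acceptable rule
        rule, best = top, quality(top, records, y)
```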
Specific-to-general (example)
Refund=No, Status=Single, Income=85K ⇒ (Class=Yes)
Refund=No, Status=Single, Income=90K ⇒ (Class=Yes)
generalize to: Refund=No, Status=Single ⇒ (Class=Yes)
Specific-to-general
One of the positive examples is chosen at random as the initial seed:
Body temp=warm-blooded, Skin cover=hair, Gives birth=yes, Aquatic creature=no, Aerial creature=no, Has legs=yes, Hibernates=no ⇒ Mammals
One of the conjuncts is then removed so that the rule covers more positive examples:
Body temp=warm-blooded, Skin cover=hair, Gives birth=yes, Aquatic creature=no, Aerial creature=no, Has legs=yes ⇒ Mammals
Skin cover=hair, Gives birth=yes, Aquatic creature=no, Aerial creature=no, Has legs=yes, Hibernates=no ⇒ Mammals
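A minimal sketch of the complementary specific-to-general step, under the same assumptions (dict records, with quality() and covers() as above): turn a randomly chosen positive seed into a maximally specific rule, then drop the conjunct whose removal best preserves quality, so the rule covers more positive examples.

```python
import random

def specific_to_general_seed(records, y, class_attr="class"):
    """Pick a random positive example and turn it into a maximally
    specific rule (one conjunct per attribute)."""
    seed = random.choice([r for r in records if r[class_attr] == y])
    return tuple((a, v) for a, v in seed.items() if a != class_attr)

def generalize_once(rule, records, y):
    """Drop the single conjunct whose removal best preserves quality."""
    cands = [rule[:i] + rule[i + 1:] for i in range(len(rule))]
    return max(cands, key=lambda c: quality(c, records, y))
```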
Rule Evaluation
• Suppose a training set contains 60 positive and 100 negative examples.
• Consider two rules:
  – R1: covers 50 positive examples and 5 negative examples
  – R2: covers 2 positive examples and 0 negative examples
• The accuracy of R1 is 90.9% and that of R2 is 100%. R1 is nevertheless the better rule because of its far greater coverage; the measures below make this explicit.
Rule Evaluation: Metrics
• Accuracy = nc / n
• Likelihood ratio statistic: R = 2 Σ_{i=1..k} fi log(fi / ei), where fi is the observed frequency of class i among the examples covered by the rule and ei is its expected frequency.
• Laplace = (nc + 1) / (n + k)
• m-estimate = (nc + kp) / (n + k), or in its general form (nc + pm) / (n + m)
where n is the number of instances covered by the rule, nc is the number of covered instances belonging to the rule's class, k is the number of classes, and p is the prior probability of the class.
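A quick numeric check of these measures on the R1/R2 example above (a sketch; k = 2 classes, and the prior p = 60/160 of the positive class comes from the 60-positive/100-negative training set):

```python
def accuracy(nc, n):
    return nc / n

def laplace(nc, n, k):
    return (nc + 1) / (n + k)

def m_estimate(nc, n, p, m):
    return (nc + p * m) / (n + m)

k, p = 2, 60 / 160                          # two classes; prior of class '+'
for name, nc, n in [("R1", 50, 55), ("R2", 2, 2)]:
    print(name,
          round(accuracy(nc, n), 3),        # R1: 0.909  R2: 1.0
          round(laplace(nc, n, k), 3),      # R1: 0.895  R2: 0.75
          round(m_estimate(nc, n, p, k), 3))  # R1: 0.89  R2: 0.688
```

Both the Laplace and m-estimate measures rank R1 above R2, reflecting its far greater coverage, even though R2 has higher raw accuracy.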
FOIL's Information Gain (rule evaluation, contd.)
• Suppose rule r: A ⇒ + covers p0 positive and n0 negative examples.
• Adding a new conjunct B gives the extended rule r': A ∧ B ⇒ +, which covers p1 positive and n1 negative examples.
• FOIL's information gain = p1 × ( log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) )
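A direct transcription of the formula as a sketch (the call shown uses the numbers from Q2 below):

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain from extending a rule covering p0/n0
    positive/negative examples to one covering p1/n1."""
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Q2: R1 covers 350 positive / 150 negative, R2 covers 300 / 50.
print(round(foil_gain(p0=350, n0=150, p1=300, n1=50), 2))   # approx. 87.65
```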
• Q2. Consider two rules:
  – R1: A ⇒ C
  – R2: A ∧ B ⇒ C
  Suppose R1 covers 350 positive examples and 150 negative examples, while R2 covers 300 positive examples and 50 negative examples. Compute FOIL's information gain for rule R2 with respect to R1.
• Q4. Page no. 317
Stopping Criterion and Rule Pruning
• Stopping criterion
  – Compute the gain.
  – If the gain is not significant, discard the new rule.
• Rule pruning (a sketch follows)
  – Remove one of the conjuncts in the rule.
  – Compare the error rate on a validation set before and after pruning.
  – If the error improves, prune the conjunct.
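A minimal sketch of this pruning step, assuming a hypothetical error_rate(rule, records, y) helper measured on the validation set (e.g. the fraction of covered validation records whose class is not y):

```python
def prune_rule(rule, val_records, y, error_rate):
    """Repeatedly drop the conjunct whose removal most reduces the
    validation error; stop when no removal improves it."""
    err = error_rate(rule, val_records, y)
    while len(rule) > 1:
        cands = [rule[:i] + rule[i + 1:] for i in range(len(rule))]
        best = min(cands, key=lambda c: error_rate(c, val_records, y))
        if error_rate(best, val_records, y) >= err:
            break                                # pruning no longer helps
        rule, err = best, error_rate(best, val_records, y)
    return rule
```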
Instance Elimination
• Why do we need to eliminate instances? Otherwise, the next rule extracted would be identical to the previous one.

(Figure: rules R1, R2 and R3 covering overlapping regions of positive (+) and negative (−) instances.)

• Why do we remove positive and negative instances?
  – Removing positive instances ensures that the next rule is different.
  – Removing negative instances prevents underestimating the accuracy of the rule.
Indirect Methods for Rule-Based Classifiers and Instance-Based Classifiers (Lecture 26/17-09-09)
Indirect Methods (Generating a Rule Set from a Decision Tree)

(Figure: a decision tree that splits on P; its No branch splits on Q, and its Yes branch splits on R, whose Yes branch splits on Q again. Leaves are labelled + or −.)

Rule set read off the tree:
r1: (P=No, Q=No) ⇒ −
r2: (P=No, Q=Yes) ⇒ +
r3: (P=Yes, R=No) ⇒ +
r4: (P=Yes, R=Yes, Q=No) ⇒ −
r5: (P=Yes, R=Yes, Q=Yes) ⇒ +

Consider r2, r3 and r5: the class label is always predicted as + when Q=Yes. So the rules can be simplified to
r2': (Q=Yes) ⇒ +
r3: (P=Yes) ∧ (R=No) ⇒ +
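A minimal sketch of this path-to-rule extraction, assuming a hypothetical nested-dict tree representation (internal nodes carry "attr" and per-value "children"; leaves carry "label"); every root-to-leaf path becomes one rule:

```python
def tree_to_rules(node, path=()):
    """Walk every root-to-leaf path; each path is one classification rule."""
    if "label" in node:                      # leaf: emit (conjuncts, class)
        return [(path, node["label"])]
    rules = []
    for value, child in node["children"].items():
        rules += tree_to_rules(child, path + ((node["attr"], value),))
    return rules

# The P/Q/R tree from the figure above.
tree = {"attr": "P", "children": {
    "No":  {"attr": "Q", "children": {"No": {"label": "-"},
                                      "Yes": {"label": "+"}}},
    "Yes": {"attr": "R", "children": {
        "No":  {"label": "+"},
        "Yes": {"attr": "Q", "children": {"No": {"label": "-"},
                                          "Yes": {"label": "+"}}}}}}}

for conjuncts, label in tree_to_rules(tree):      # prints r1 through r5
    print(" AND ".join(f"{a}={v}" for a, v in conjuncts), "=>", label)
```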
Classification Rules Extracted from a Decision Tree

(Figure: a decision tree splitting on Give Birth?; Yes leads to Mammals, No leads to Live in Water?, whose Yes branch is Fishes, Sometimes branch is Amphibians, and No branch splits on Can Fly?, with Yes giving Birds and No giving Reptiles.)

C4.5rules extracted from the tree:
(Give Birth=Yes) → Mammals
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=No, Live in Water=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=No, Can Fly=No) → Reptiles
( ) → Amphibians
Advantages of Rule-Based Classifiers
• As highly expressive as decision trees
• Easy to interpret
• Easy to generate
• Can classify new instances rapidly
• Performance comparable to decision trees
Instance-Based Classifiers
Eager Learners vs. Lazy Learners
• Eager learners
  – Decision-tree and rule-based classifiers are examples of eager learners.
  – They are designed to learn a model that maps the input attributes to the class label as soon as training data becomes available.
• Lazy learners
  – They delay modeling the training data until an unseen instance needs to be classified.
  – Instance-based classifiers belong to this class.
  – They memorize the entire training data and perform classification only when the attributes of a test instance match the stored data.
Instance-Based Classifiers

(Figure: a set of stored cases, each with attributes Atr1 ... AtrN and a class label (A, B, B, C, A, C, B), against which an unseen case with the same attributes is compared.)

• Store the training records.
• Use the training records to predict the class label of unseen cases.
Instance-Based Classifiers: Examples
• Rote learner (classifier)
  – Memorizes the entire training data and performs classification only if the attributes of a record exactly match one of the training examples.
  – Its drawback is that some test records may not be classified at all because they match none of the instances in the training data. Solution?
• Nearest neighbor
  – Uses the k "closest" points (nearest neighbors) to perform classification.
Nearest Neighbor Classifiers
• Basic idea, illustrated by the classic example: if it walks like a duck, quacks like a duck, and looks like a duck, then it's probably a duck.

(Figure: the distance from a test record to every training record is computed, and the k "nearest" records are chosen.)
• A nearest-neighbor classifier represents each instance as a point in a d-dimensional space, where d is the number of attributes.
Nearest-Neighbor Classifiers
• Require three things:
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
  – Compute its distance to the training records.
  – Identify the k nearest neighbors.
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote).
Definition of Nearest Neighbor

(Figure: panels (a) 1-nearest neighbor, (b) 2-nearest neighbor and (c) 3-nearest neighbor around a test point X.)

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
Nearest Neighbor Classification
• Compute the distance between two points, e.g. the Euclidean distance:
  d(p, q) = sqrt( Σi (pi − qi)² )
• Determine the class from the nearest-neighbor list: take the majority vote of the class labels among the k nearest neighbors.
Nearest Neighbor Classification (contd.)
• Choosing the value of k:
  – If k is too small, the classifier is sensitive to noise points.
  – If k is too large, the neighborhood may include points from other classes.

(Figure: k-nearest-neighbor classification with a large k.)
Nearest Neighbor Classification (contd.)
• Scaling issues: attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. Example (a normalization sketch follows):
  – height of a person may vary from 1.5 m to 1.8 m
  – weight of a person may vary from 90 lb to 300 lb
  – income of a person may vary from $10K to $1M
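One common fix is min-max normalization; a minimal sketch with NumPy (the three attributes and their ranges come from the example above), rescaling each attribute to [0, 1] so no single one dominates the Euclidean distance:

```python
import numpy as np

def min_max_scale(X):
    """Rescale each attribute (column) of X to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard constant columns
    return (X - lo) / span

# height (m), weight (lb), income ($): wildly different raw scales.
X = np.array([[1.5,  90.0,    10_000.0],
              [1.6, 150.0,    50_000.0],
              [1.8, 300.0, 1_000_000.0]])
print(min_max_scale(X))                      # every column now spans [0, 1]
```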
Nearest Neighbor Classification (contd.)
• k-NN classifiers are lazy learners:
  – They do not build models explicitly, unlike eager learners such as decision-tree induction and rule-based systems.
  – Classifying unknown records is therefore relatively expensive.
Algorithm (k-nearest neighbor classification)
1: Let k be the number of nearest neighbors and D the set of training examples.
2: for each test example z = (x', y') do
3:   Compute d(x', x), the distance between z and every example (x, y) ∈ D.
4:   Select Dz ⊆ D, the set of the k training examples closest to z.
5:   y' = arg max_v Σ_{(xi, yi) ∈ Dz} I(v = yi)
6: end for
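A minimal sketch of this algorithm in Python, assuming numeric attribute tuples and the Euclidean distance defined earlier:

```python
import math
from collections import Counter

def euclidean(p, q):
    """d(p, q) = sqrt( sum_i (p_i - q_i)^2 )."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(D, z, k):
    """Lines 3-5: rank training examples (x, y) in D by distance to the
    test point z, keep the k closest, and majority-vote their labels."""
    neighbors = sorted(D, key=lambda xy: euclidean(xy[0], z))[:k]
    votes = Counter(y for _, y in neighbors)
    return votes.most_common(1)[0][0]

# Tiny usage example with 2-D points labelled '+' or '-'.
D = [((1, 1), "+"), ((2, 1), "+"), ((8, 8), "-"), ((9, 7), "-")]
print(knn_classify(D, (1.5, 1.2), k=3))   # '+'
```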
• Once the nearest-neighbor list is obtained, the test example is classified according to the majority class of its nearest neighbors:
  Majority voting: y' = arg max_v Σ_{(xi, yi) ∈ Dz} I(v = yi)
  where v is a class label, yi is the class label of one of the nearest neighbors, and I(·) is an indicator function that returns 1 if its argument is true and 0 otherwise.
• In the majority-voting approach, every neighbor has the same impact on the classification (see the large-k figure above). This makes the classification algorithm sensitive to the choice of k.
• To reduce this sensitivity, each nearest neighbor xi can be weighted by its distance: wi = 1 / d(x', xi)².
• As a result of weighting, training examples located far away from z have a weaker impact on the classification. Using the distance-weighted voting scheme, the class label is determined by:
  Distance-weighted voting: y' = arg max_v Σ_{(xi, yi) ∈ Dz} wi × I(v = yi)
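The change to the k-NN sketch above is confined to the voting step (a sketch reusing euclidean(); the small eps guarding against a zero distance is an added assumption):

```python
def knn_classify_weighted(D, z, k, eps=1e-9):
    """Distance-weighted voting: each neighbor votes with w_i = 1/d(x', x_i)^2."""
    neighbors = sorted(D, key=lambda xy: euclidean(xy[0], z))[:k]
    votes = {}
    for x, y in neighbors:
        w = 1.0 / (euclidean(x, z) ** 2 + eps)   # w_i = 1 / d(x', x_i)^2
        votes[y] = votes.get(y, 0.0) + w
    return max(votes, key=votes.get)
```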
Characteristics
1. Nearest-neighbor classification is a form of instance-based learning.
2. Lazy learners such as nearest-neighbor classifiers do not require model building.
3. Nearest-neighbor classifiers make their predictions based on local information, whereas decision-tree and rule-based classifiers attempt to find a global model that fits the entire input space.
4. Appropriate proximity measures play a significant role in nearest-neighbor classifiers.