Implementation of ID3 – Decision Tree Algorithm
Sharad Verma*, Nikita Jain**
* [email protected], Truba College of Engineering & Science / Computer Engineering, Indore, India
** [email protected], Truba College of Engineering & Science / Computer Engineering, Indore, India

Abstract
In this paper we address decision tree learning, which has been used successfully in expert systems to capture knowledge. The main task performed by such systems is to apply inductive methods to the given attribute values of an unknown object in order to determine its classification according to decision tree rules. We focus on the problem of decision tree learning with the popular ID3 algorithm. Such algorithms have a wide range of applications, including churn prediction, fraud detection, artificial intelligence, and credit card rating. Although many classification algorithms are available in the literature, decision trees are the most commonly used because they are easy to implement and easier to understand than other classification algorithms.
Keywords: Data mining, Decision trees & ID3 Algorithm.
1. Introduction
A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision. Decision trees are commonly used for gaining information for the purpose of decision making. A decision tree starts with a root node at which users take action. From this node, users split each node recursively according to the decision tree learning algorithm. The final result is a decision tree in which each branch represents a possible scenario of a decision and its outcome. We demonstrate this on ID3, a well-known and influential algorithm for the task of decision tree learning; we note that extensions of ID3 are widely used in real market applications. ID3 is a simple decision tree learning algorithm developed by Ross Quinlan (1983). The basic idea of the ID3 algorithm is to construct the decision tree by employing a top-down, greedy search through the given sets to test each attribute at every tree node. In order to select the attribute that is most useful for classifying a given set, we introduce a metric: information gain. To find an optimal way to classify a learning set we need to minimize the number of questions asked (i.e., minimize the depth of the tree). Thus, we need a function that can measure which questions provide the most balanced splitting. The information gain metric is such a function.
1.1 Entropy
In information theory, entropy is a measure of the uncertainty about a source of messages. The more uncertain a receiver is about a source of messages, the more information that receiver will need in order to know what message has been sent. For a set S of transactions whose class attribute takes k distinct values, where p_i is the proportion of transactions in S belonging to class i,

Entropy(S) = -\sum_{i=1}^{k} p_i \log p_i
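As an illustration, the following Python sketch computes this entropy for a list of class labels (the function name entropy and the choice of base-2 logarithms are our own; the formula above leaves the base unspecified):

from collections import Counter
from math import log2

def entropy(class_labels):
    # Entropy of a collection of class labels, using base-2 logarithms.
    total = len(class_labels)
    counts = Counter(class_labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A perfectly mixed two-class set has entropy 1; a pure set has entropy 0.
print(entropy(["yes", "yes", "no", "no"]))    # 1.0
print(entropy(["yes", "yes", "yes", "yes"]))  # -0.0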
1.2 Information Gain: Measuring the Expected Reduction in Entropy
As mentioned above, to minimize the decision tree depth we need to select, when splitting a tree node, the optimal attribute; the attribute that yields the greatest reduction in entropy is the best choice. We define information gain as the expected reduction of entropy obtained by splitting a decision tree node on a specified attribute. If the split partitions S into subsets S_1, ..., S_r, then

Gain(S, S_1, ..., S_r) = Entropy(S) - \sum_{j=1}^{r} \frac{|S_j|}{|S|} Entropy(S_j)
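A minimal Python sketch of this computation, reusing the entropy helper above and assuming transactions are represented as dictionaries mapping attribute names to values (our own representation, not prescribed by the paper):

def information_gain(transactions, attribute, class_attr):
    # Expected reduction in class entropy obtained by splitting on `attribute`.
    base = entropy([t[class_attr] for t in transactions])
    total = len(transactions)
    gain = base
    for value in set(t[attribute] for t in transactions):
        subset = [t for t in transactions if t[attribute] == value]
        gain -= (len(subset) / total) * entropy([t[class_attr] for t in subset])
    return gain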
For inductive learning, decision tree learning is attractive for three reasons:
1. A decision tree is a good generalization for unobserved instances, provided the instances are described in terms of features that are correlated with the target concept.
2. The methods are computationally efficient, with cost proportional to the number of observed training instances.
3. The resulting decision tree provides a representation of the concept that appeals to humans because it renders the classification process self-evident.
1.3 Related Work
In this paper we have focused on the problem of minimizing test cost while maximizing accuracy. In some settings it is more appropriate to minimize misclassification costs instead of maximizing accuracy. For the two-class problem, Elkan gives a method to minimize misclassification costs given classification probability estimates. Bradford et al. compare pruning algorithms that minimize misclassification costs. As both of these methods act independently of the decision tree growing process, they can be incorporated into our algorithms (although we leave this as future work). Ling et al. propose a cost-sensitive decision tree algorithm that optimizes both accuracy and cost. However, the cost-insensitive version of their algorithm (i.e., the algorithm run when all feature costs are zero) reduces to a splitting criterion that maximizes accuracy, which is well known to be inferior to the information gain and gain ratio criteria. Integrating machine learning with program understanding is an active area of current research. Systems that analyze root-cause errors in distributed systems and systems that find bugs using dynamic predicates may both benefit from cost-sensitive learning to reduce monitoring overhead.
2. Classification by Decision Tree Learning
This section briefly describes the machine learning and data mining problem of classification and ID3, a well-known algorithm for it. The presentation here is brief and rather simplified, and we refer the reader to Mitchell [1] for an in-depth treatment of the subject. The ID3 algorithm for generating decision trees was first introduced by Quinlan [7] and has since become a very popular learning tool.
2.1 The Classification Problem
The aim of a classification problem is to classify transactions into one of a discrete set of possible categories. The input is a structured database comprised of attribute-value pairs. Each row of the database is a transaction and each column is an attribute taking on different values. One of the attributes in the database is designated as the class attribute; the set of possible values for this attribute are the classes. We wish to predict the class of a transaction by viewing only the non-class attributes. This can then be used to predict the class of new transactions for which the class is unknown. For example, the weather problem is a toy data set which we will use to understand how a decision tree is built. It is reproduced with slight modifications in Witten and Frank (1999), and concerns the conditions under which some hypothetical outdoor game may be played. In this dataset there are five categorical attributes: outlook, temperature, humidity, windy, and play. We are interested in building a system which will enable us to decide whether or not to play the game on the basis of the weather conditions, i.e. we wish to predict the value of play using outlook, temperature, humidity, and windy. We can think of the attribute we wish to predict, i.e. play, as the output attribute, and the other attributes as input attributes.
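For illustration, a few rows of such a weather dataset could be written in Python as a list of attribute-value dictionaries; the specific rows below are invented for illustration and are not the exact Witten and Frank data:

# Each transaction maps attribute names to nominal values;
# "play" is the class (output) attribute we wish to predict.
weather = [
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": "false", "play": "no"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "high",   "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "normal", "windy": "true",  "play": "no"},
    {"outlook": "sunny",    "temperature": "cool", "humidity": "normal", "windy": "false", "play": "yes"},
]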
2.2 Decision Trees and the ID3 Algorithm
The main ideas behind the ID3 algorithm are:
1. Each non-leaf node of a decision tree corresponds to an input attribute, and each arc to a possible value of that attribute. A leaf node corresponds to the expected value of the output attribute when the input attributes are described by the path from the root node to that leaf node.
2. In a "good" decision tree, each non-leaf node should correspond to the input attribute which is the most informative about the output attribute amongst all the input attributes not yet considered in the path from the root node to that node. This is because we would like to predict the output attribute using the smallest possible number of questions on average.
The ID3 algorithm assumes that each attribute is categorical, that is, it contains discrete data only, in contrast to continuous data such as age, height, etc. The principle of the ID3 algorithm is as follows. The tree is constructed top-down in a recursive fashion. At the root, each attribute is tested to determine how well it alone classifies the transactions. The "best" attribute (to be discussed below) is then chosen and the remaining transactions are partitioned by it. ID3 is then called recursively on each partition (which is a smaller database containing only the appropriate transactions and without the splitting attribute).
2.2.1 The ID3 algorithm is best suited for:
1. Instances are represented as attribute-value pairs.
2. The target function has discrete output values.
3. Attribute values should be nominal.

Figure 1: The ID3 Algorithm for Decision Tree Learning

ID3(R, C, T), where R is the set of non-class (input) attributes, C is the class attribute, and T is the set of transactions:
1. If R is empty, return a leaf node with the class value assigned to the most transactions in T.
2. If T consists of transactions which all have the same value c for the class attribute, return a leaf node with the value c (finished classification path).
3. Otherwise,
(a) Determine the attribute that best classifies the transactions in T; let it be A.
(b) Let a1, ..., am be the values of attribute A and let T(a1), ..., T(am) be a partition of T such that every transaction in T(ai) has the attribute value ai.
(c) Return a tree whose root is labeled A (this is the test attribute) and which has edges labeled a1, ..., am, such that for every i the edge ai goes to the tree ID3(R − {A}, C, T(ai)).
What remains is to explain how the best predicting attribute is chosen. This is the central principle of ID3 and is based on information theory. The entropy of the class attribute clearly expresses the difficulty of prediction: we know the class of a set of transactions when the class entropy for them equals zero. The idea is therefore to check which attribute reduces the entropy of the class attribute to the greatest degree. This results in a greedy algorithm which searches for a small decision tree consistent with the database. The bias favoring short hypotheses is based on Occam's razor. As a result, decision trees are usually relatively small, even for large databases.
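Putting the pieces together, a compact Python sketch of the procedure in Figure 1 might look as follows; it reuses the entropy and information_gain helpers and the dictionary-based transactions sketched earlier, and the nested-dictionary tree representation is our own choice rather than something prescribed by ID3:

from collections import Counter

def majority_class(transactions, class_attr):
    # Most frequent class value among the given transactions (step 1 of Figure 1).
    return Counter(t[class_attr] for t in transactions).most_common(1)[0][0]

def id3(attributes, class_attr, transactions):
    # Returns either a leaf (a class value) or a nested dict {attribute: {value: subtree}}.
    classes = set(t[class_attr] for t in transactions)
    if not attributes:                 # step 1: no non-class attributes remain
        return majority_class(transactions, class_attr)
    if len(classes) == 1:              # step 2: all transactions share one class
        return classes.pop()
    # Step 3(a): choose the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(transactions, a, class_attr))
    tree = {best: {}}
    # Steps 3(b) and 3(c): partition on the chosen attribute and recurse without it.
    for value in set(t[best] for t in transactions):
        subset = [t for t in transactions if t[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(remaining, class_attr, subset)
    return tree

# Example call on the weather data sketched earlier:
# print(id3(["outlook", "temperature", "humidity", "windy"], "play", weather))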
2.3 Advantages of Using ID3
- Understandable prediction rules are created from the training data.
- Builds the fastest tree, and builds a short tree.
- Only needs to test enough attributes until all data is classified.
- Finding leaf nodes enables test data to be pruned, reducing the number of tests.
- The whole dataset is searched to create the tree.

3. Conclusion
This paper concludes that ID3 works fairly well on classification problems whose datasets have nominal attribute values. It also works well in the presence of missing attribute values, but the way missing attributes are handled governs the performance of the algorithm: neglecting instances with missing values for an attribute leads to a higher error rate than treating the missing value as a separate attribute value. Decision tree induction is one of the classification techniques used in decision support systems and machine learning. With the decision tree technique, the training data set is recursively partitioned using a depth-first (Hunt's method) or breadth-first greedy technique (Shafer et al., 1996) until each partition is pure, i.e. all its instances belong to the same class/leaf node (Hunt et al., 1966; Shafer et al., 1996). The decision tree model is preferred among other classification algorithms because it is an eager learning algorithm and easy to implement.
4. References
[1] Tom M. Mitchell, Machine Learning, Singapore: McGraw-Hill, 1997.
[2] Usama Fayyad et al., "On the Handling of Continuous-Valued Attributes in Decision Tree Generation", University of Michigan, Ann Arbor.
[3] R. Chmielewski et al., "Global Discretization of Continuous Attributes as Preprocessing for Machine Learning", International Journal of Approximate Reasoning, 1996.
[4] Dan Ventura et al., "An Empirical Comparison of Discretization Methods", Proceedings of the Tenth International Symposium on Computer and Information Sciences, pp. 443-450, 1995.
[5] Karmaker et al., "Incorporating an EM-Approach for Handling Missing Attribute-Values in Decision Tree Induction".
[6] Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, New Jersey: Prentice Hall, 1995.
[7] J. R. Quinlan, "Induction of Decision Trees", Machine Learning, Vol. 1, pp. 81-106, 1986.
[8] M. R. Civanlar and H. J. Trussell, "Constructing membership functions using statistical data", Fuzzy Sets and Systems, Vol. 18, pp. 1-14, 1986.