Multiclass classification with SVM
A paper by Vapnik
April 26, 2007

Here we are given the training data set (x_i, y_i), i = 1, ..., N, where y_i ∈ {1, 2, ..., n}, supposing there are n classes, and x_i ∈ R^p. We have to extend the two-class algorithm to such situations. There are broadly three methods available:
1. one-vs-all method
2. pairwise comparison
3. all-together method.
1 One-vs-all method
As the name suggests, this is a method of comparing one class against the rest of the classes, which are together treated as a single class. So there would be a total of n boundaries, obtained in the same way as in the two-class problem. Let v_i(x) be the boundary for the i-th class. Then, for the j-th class, the unbounded support vectors satisfy y_i * v_j(x_i) = 1 and the bounded support vectors satisfy y_i * v_j(x_i) <= 1. Hence we would classify x to class j if v_j(x) > 0. But the problem with this procedure is that there can be several j's for which v_j(x) > 0, and hence there arises the problem of non-classifiable points. This is the most serious problem posed by the one-vs-all method. It can be seen in the figure, where the shaded region is the non-classifiable region.
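As a minimal sketch of this rule (Python with made-up decision values, not taken from any real data set), classification by sign and the resulting non-classifiable points look like this:

```python
import numpy as np

# decision_values[i, j] is assumed to hold v_j(x_i), the value of the
# j-th one-vs-all boundary at observation x_i (illustrative numbers).
decision_values = np.array([
    [ 0.7, -1.2, -0.4],   # exactly one positive value: class 0
    [ 0.3,  0.5, -0.9],   # two positive values: non-classifiable by sign
    [-0.2, -0.8, -0.5],   # no positive value: also non-classifiable
])

positive = decision_values > 0        # sign of each v_j(x)
n_positive = positive.sum(axis=1)     # how many classes claim each point

for i, count in enumerate(n_positive):
    if count == 1:
        print(f"x_{i}: class {np.argmax(positive[i])}")
    else:
        print(f"x_{i}: non-classifiable ({count} positive boundaries)")
```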
Hence let us consider some improvements over this method:
1. continuous decision function
2. fuzzy support vector machine
3. decision tree structure.
1.1 Continuous decision function
So far we have considered only the signs of v_j(x); hence we are discretizing the decision function. With a continuous decision function, we instead consider the value of v_j(x), and we assign x to class k where k = argmax_j v_j(x). The logic behind taking the maximum is that the value of v_j(x) can be regarded as the membership value of x in the j-th class: the larger this membership value, the more plausible it is that x belongs to the j-th class.
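Continuing the earlier sketch (same illustrative values), taking the argmax resolves both kinds of non-classifiable points:

```python
import numpy as np

# The two previously non-classifiable rows from the sketch above.
decision_values = np.array([
    [ 0.3,  0.5, -0.9],   # two positive values
    [-0.2, -0.8, -0.5],   # no positive value
])

# k = argmax_j v_j(x): the largest value wins, so every observation
# now receives exactly one class label.
print(np.argmax(decision_values, axis=1))   # [1 0]
```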
1.2 Fuzzy support vector machine
Here we extend the concept of a membership value to what we call the fuzzy support vector machine. The main idea is to replace the membership value by a fuzzy membership function, defined in the following way. For i = j,

m_ii(x) = 1        if v_i(x) >= 1,
        = v_i(x)   otherwise,

and for i != j,

m_ij(x) = 1        if v_j(x) <= -1,
        = -v_j(x)  otherwise.

Here also, for an observation x, if the value of v_i(x) exceeds 1 then x is sure to belong to class i; otherwise it may belong to class i or to some other class. Hence we define this kind of membership function, and for i != j it is defined in a similar fashion. Now, having defined the membership functions in this way, let us define m_i(x) in two ways: firstly, m_i(x) = min_j m_ij(x), and secondly, m_i(x) = sum_j m_ij(x). We assign x to class k if k = argmax_i m_i(x).
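The following sketch (Python, with illustrative boundary values) implements these membership functions and both aggregation rules; `membership_matrix` is a hypothetical helper written for this note, not a library function:

```python
import numpy as np

def membership_matrix(v):
    """v[j] = v_j(x); returns m[i, j] = m_ij(x) for every class pair."""
    n = len(v)
    m = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                m[i, j] = 1.0 if v[i] >= 1 else v[i]      # m_ii(x)
            else:
                m[i, j] = 1.0 if v[j] <= -1 else -v[j]    # m_ij(x), i != j
    return m

v = np.array([0.3, 0.5, -0.9])      # illustrative values of v_j(x)
m = membership_matrix(v)

m_min = m.min(axis=1)               # m_i(x) = min_j m_ij(x)
m_sum = m.sum(axis=1)               # m_i(x) = sum_j m_ij(x)
print(np.argmax(m_min), np.argmax(m_sum))   # both assign class 1 here
```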
1.3 Decision tree structure
The third method is called the decision-tree-based support vector machine. Here what we do is the following: we first separate one class from the others; in the next step we separate another class from the remaining classes, and we repeat this until every observation is classified into a particular class. This method is easy to implement. However, the question arises: which class should be separated first? So we adopt two procedures, which are called
1. type 1 tree
2. type 2 tree.
1.3.1 Type 1 tree
Step 1: for each class in the training set, we calculate the class center. Let c_i be the i-th center; we define c_i = (1/|X_i|) * sum_{x in X_i} x, where X_i is the set of training observations of class i.
Step 2: let us now define d_ij = ||c_i - c_j||, so d_ij denotes the distance between the i-th and j-th class centers.
Step 3: now we calculate l_i = min_{j != i} d_ij, so l_i indicates the minimal distance of the i-th class from the other classes.
Step 4: next we compute argmax_i l_i. This value of i indicates the class that can be considered the farthest from the rest, so we separate this class from the others, and we repeat the same operation until every observation is classified into one particular class.
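A sketch of steps 1-4 (Python, with randomly generated toy classes) might look as follows:

```python
import numpy as np

# Three toy classes around illustrative centers.
rng = np.random.default_rng(0)
X = [rng.normal(c, 1.0, size=(20, 2)) for c in ([0, 0], [5, 0], [0, 8])]

centers = np.array([Xi.mean(axis=0) for Xi in X])   # c_i = mean of X_i
d = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)  # d_ij
np.fill_diagonal(d, np.inf)          # ignore d_ii when taking minima
l = d.min(axis=1)                    # l_i = min_{j != i} d_ij
print("separate class", np.argmax(l), "first")   # the farthest class
```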
1.3.2 Type 2 tree
Step 1: here also we compute c_i and d_ij for each class.
Step 2: now we consider l_i = min_{j != i} d_ij and let k = argmin_i l_i.
Step 3: then class k is merged with the class for which d_kj is minimum.
Step 4: we repeat this procedure up to (n - 2) times, until we get two clusters of classes which are widely separated and which we can separate easily.
Step 5: once we get two such clusters, we apply the same procedure within each of them (if a cluster consists of only one class, we do not have to do this on that cluster), until every observation is classified into exactly one class.
So through this method we are actually merging all the classes into two clusters, and then we are separating those clusters back into the parent classes.
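As a rough sketch of the merging loop (Python; the merged-center update is a simplification I am assuming, since the text does not specify how centers are recomputed after a merge):

```python
import numpy as np

# Illustrative class centers; each class starts as its own cluster.
centers = {0: np.array([0.0, 0.0]), 1: np.array([1.0, 0.0]),
           2: np.array([6.0, 5.0]), 3: np.array([7.0, 5.0])}
clusters = {i: [i] for i in centers}

while len(clusters) > 2:             # steps 2-4: merge until two clusters
    keys = list(clusters)
    dist = {(i, j): np.linalg.norm(centers[i] - centers[j])
            for i in keys for j in keys if i != j}
    k, j = min(dist, key=dist.get)   # the globally closest pair
    clusters[j] += clusters.pop(k)   # merge cluster k into cluster j
    centers[j] = (centers[j] + centers.pop(k)) / 2  # crude new center

print(clusters)   # two widely separated clusters of the original classes
```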
2 Pairwise comparison
Here we consider a total of n(n - 1)/2 different boundaries for n classes; these boundaries are obtained by pairwise comparison between each pair of classes. The usefulness of this method is that the training is not that computationally intensive: in the one-vs-all method we had to train each SVM on the entire data set, whereas here only the training observations of the two classes concerned are used to train each boundary. Let d_ij(x) be the boundary between the i-th and j-th classes; then we would assign x to class i if d_ij(x) >= 0 for all j != i. But again the problem of non-classifiable regions can come into the picture: for instance, there may be situations where no class i has all its d_ij's positive. This problem can be easily seen from the figure.
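A small sketch of this rule (Python, illustrative pairwise values for a single observation):

```python
import numpy as np

n = 3
d = np.zeros((n, n))                       # d[i, j] will hold d_ij(x)
d[0, 1], d[0, 2], d[1, 2] = 0.8, 0.4, -0.3
d = d - d.T                                # enforce d_ji(x) = -d_ij(x)

# Assign x to class i if d_ij(x) >= 0 for every j != i.
claims = [i for i in range(n)
          if all(d[i, j] >= 0 for j in range(n) if j != i)]
if len(claims) == 1:
    print("class", claims[0])
else:
    print("non-classifiable:", len(claims), "classes satisfy the rule")
```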
So in order to remove those drawbacks, we again adopt some new developments in this method.
2.1 New developments
There are mainly two new procedures suggested:
1. fuzzy support vector machine
2. decision directed acyclic graph support vector machine.
2.2 Fuzzy support vector machine
Here also we define a membership function for each observation x, based on the boundary between the i-th and j-th classes, as follows:

m_ij(x) = 1          if d_ij(x) >= 1,
        = d_ij(x)    otherwise.

Here also we define m_i(x) in the same fashion as in the previous situation, and we assign x to class k if k = argmax_i m_i(x).
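In code (Python, with the same illustrative pairwise values as before), the rule m_ij(x) = min(1, d_ij(x)) followed by the min aggregation is compact:

```python
import numpy as np

# d[i, j] holds illustrative values of d_ij(x); the diagonal is unused
# (np.inf clips to 1 and never affects the row minimum).
d = np.array([[np.inf,  0.8,  0.4],
              [-0.8, np.inf, -0.3],
              [-0.4,  0.3, np.inf]])

m = np.minimum(1.0, d).min(axis=1)   # m_i(x) = min_j m_ij(x)
print(np.argmax(m))                  # assign x to argmax_i m_i(x); here 0
```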
2.3 Decision directed acyclic graph support vector machine
Here we would again classify between pairs of classes, but the idea is a little different from what we saw in the previous situation. Here our rule is: if for some pair i, j we have d_ij(x) <= 0, then x is sure not to belong to class i; similarly, if d_ij(x) > 0, then x is sure not to belong to class j. Each pairwise comparison thus eliminates one candidate class. The structure is shown in the figure. We continue until every observation is classified into exactly one class; hence the problem of non-classifiable regions does not arise here.
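A sketch of this elimination scheme (Python; the list-based evaluation order is one common way to traverse the DAG, assumed here for illustration):

```python
def ddag_classify(d, n):
    """d(i, j) returns the pairwise boundary value d_ij(x)."""
    candidates = list(range(n))
    while len(candidates) > 1:
        i, j = candidates[0], candidates[-1]
        if d(i, j) <= 0:
            candidates.pop(0)     # d_ij(x) <= 0: x is not class i
        else:
            candidates.pop()      # d_ij(x) > 0: x is not class j
    return candidates[0]

# Toy antisymmetric d_ij(x) favouring class 1.
values = {(0, 1): -0.5, (0, 2): 0.2, (1, 2): 0.7}
d = lambda i, j: values[(i, j)] if (i, j) in values else -values[(j, i)]
print(ddag_classify(d, 3))   # prints 1
```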
2.4 Cluster SVM
So far we have seen how to do the pairwise comparison. However, it may happen that the number of observations in a class is too large to handle. In that case, what we do is divide the observations of each class into a number of clusters. Suppose there are N_i clusters for class i. Then we do the same pairwise comparison between each pair of clusters belonging to two different classes: if we consider class i and class j, there would be a total of N_i × N_j pairwise comparisons, that is, N_i × N_j boundaries in total, and we would select the boundary for which the misclassification error is minimum. However, it may happen that the boundaries obtained are not good at all, because each is trained on two clusters only and not on the entire data set; hence this method is not that popular.
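As a tiny worked example of the boundary count (illustrative cluster numbers, not from any real data set):

```python
# With N_i clusters in class i and N_j in class j, comparing every
# cluster pair yields N_i * N_j candidate boundaries.
cluster_counts = {0: 3, 1: 2, 2: 4}   # illustrative N_i per class

for i in cluster_counts:
    for j in cluster_counts:
        if i < j:
            print(f"classes {i} vs {j}: "
                  f"{cluster_counts[i] * cluster_counts[j]} boundaries")
```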
3 Some general points
First of all, we cannot use all these methods in all situations; which method to use is dictated by the problem at hand.
1. The one-vs-all method is easy to use, yet if the number of observations in the training set is too large it is not advisable; in such situations we can use pairwise comparison or cluster SVM.
2. We cannot use the classical two-class SVM directly to solve the multiclass problem, because regions of unclassifiable points may arise.
Note: here I have not discussed the all-together method; I will discuss it in the next presentation.