Summary: A Tutorial on Support Vector Machines for Pattern Recognition (a paper by J. C. Burges)
Anirban Sanyal
1 Introduction
The main topics of my discussion can be broadly classified into the following points:
1. a brief history of SVM;
2. separating hyperplane theory;
3. applications of SVM.
2 Brief history of SVM
The support vector machine was originally developed in the machine learning community; later on it became a standard tool for classification problems.
1. In 1979, the ideas behind SVM were first introduced by Vapnik.
2. However, it was only in 1992 that SVM was used by Vapnik and his co-workers for handwritten digit recognition. This was the starting point, and later, in 1995 and 1998, the usefulness of SVM was recognised in pattern classification and other classification problems.
3. Initially SVM was used for two-class classification problems; later, around 2002-03, it was extended to multiclass classification problems.
3 Objective
Since this is a supervised classification problem, we have a training set and a test set of observations. We have to fit a boundary using the training set, and this boundary is then applied to the test set to classify the test observations. Here the training set is (x_n, y_n), where y_n ∈ {−1, 1} and x_n ∈ R^p. We call the observations with y_n = 1 the positive observations and those with y_n = −1 the negative observations; a toy numerical version of this setup is sketched after the list below.
Let us now consider the following three situations:
1. separable classes;
2. non-separable classes;
3. non-linear boundary.
We will develop the SVM algorithm for situation 1 and then try to modify it for the subsequent cases.
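As a concrete illustration of this setup (my own toy example, not taken from the paper), the following Python sketch builds a small labelled sample with y_n ∈ {−1, +1} and splits it into a training and a test set; the class means, sample sizes and split proportion are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # two Gaussian clouds in R^2: positive class around (2, 2), negative around (-2, -2)
    x_pos = rng.normal(loc=2.0, scale=1.0, size=(50, 2))
    x_neg = rng.normal(loc=-2.0, scale=1.0, size=(50, 2))

    x = np.vstack([x_pos, x_neg])               # feature matrix, shape (100, 2)
    y = np.hstack([np.ones(50), -np.ones(50)])  # labels y_n in {-1, +1}

    # random split into a training set and a test set
    idx = rng.permutation(len(x))
    train_idx, test_idx = idx[:70], idx[70:]
    x_train, y_train = x[train_idx], y[train_idx]
    x_test, y_test = x[test_idx], y[test_idx]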
4 Completely separable classes
As we can see from the picture, the two classes are completely separable, and hence there can be a number of different boundaries (or hyperplanes) that separate these two classes, as shown in figure 2.
4.1 Separating hyperplane
By a separating hyperplane we mean a hyperplane that separates these two classes completely. As we can see from the figure, all the lines correspond to different separating hyperplanes. A general hyperplane can be represented by
w'x + a = 0. ......(*)
So for two completely separable classes there can be a number of choices for w and a. The question that now arises is: what is the optimal separating hyperplane for these two classes? Before deriving the optimal hyperplane, let me define what is meant by an optimal hyperplane.
Define d+ as the minimum distance of the positive points from the hyperplane in (*), and define d− similarly for the negative points. We define the optimal hyperplane as the hyperplane that keeps equal distance from the nearest points of either class, so for the optimal hyperplane d+ = d−.
Thus, with respect to the optimal hyperplane in (*), every point satisfies w'x + a ≥ d for y = +1 and w'x + a ≤ −d for y = −1, where d is the common value of d+ and d−. With a proper reparametrisation we have w'x + a ≥ 1 for the positive class and w'x + a ≤ −1 for the negative class, with equality holding for the points that are nearest to the hyperplane. Such points are called support vectors, and this is where the name "support vector machine" comes from. In later sections we will see that these are the vectors that help us derive the actual values of w and a.
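To make the reparametrised conditions concrete, the short sketch below (my own illustration, with w and a chosen by hand rather than fitted) evaluates y_n (w'x_n + a) on a tiny data set; the points for which this quantity equals 1 are exactly the support-vector candidates.

    import numpy as np

    # tiny illustrative data set: two points per class in R^2
    x = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
    y = np.array([+1, +1, -1, -1])

    # hypothetical hyperplane parameters (chosen by hand, not computed by SVM)
    w = np.array([0.25, 0.25])
    a = 0.0

    m = y * (x @ w + a)          # y_n * (w'x_n + a) for each point
    print(m)                     # [1.  1.5 1.  1.5]
    print(np.isclose(m, 1.0))    # True for the two nearest points: the support vectors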
5 Derivation of the optimal hyperplane
For the support vectors, the perpendicular distance from the hyperplane is |w'x + a| / ||w|| = 1 / ||w||. Since for the optimal hyperplane this margin should be as large as possible, we have to minimise ||w||. Thus our objective is to minimise ||w|| subject to the condition that y_n (w'x_n + a) ≥ 1 for every n.
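One way to see this optimisation concretely is to hand the primal problem (minimise 0.5 ||w||^2 subject to y_n (w'x_n + a) ≥ 1) to a generic constrained solver. The sketch below is only an illustration using scipy.optimize.minimize with the SLSQP method on a tiny separable data set; a practical SVM implementation would solve the dual of the next section instead.

    import numpy as np
    from scipy.optimize import minimize

    # tiny separable data set in R^2
    x = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])

    def objective(params):
        w = params[:2]
        return 0.5 * np.dot(w, w)          # 0.5 * ||w||^2

    def constraint(params):
        w, a = params[:2], params[2]
        return y * (x @ w + a) - 1.0       # must be >= 0 for every point

    res = minimize(objective,
                   x0=np.array([1.0, 1.0, 0.0]),   # a feasible starting point
                   method="SLSQP",
                   constraints=[{"type": "ineq", "fun": constraint}])

    w_opt, a_opt = res.x[:2], res.x[2]
    print("w =", w_opt, " a =", a_opt)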
6 Optimisation problem
We form the Lagrangian as
L = 0.5 ||w||^2 − Σ_i λ_i [y_i (w'x_i + a) − 1],
which we have to minimise with respect to w and a. Here the λ_i are Lagrange multipliers with λ_i ≥ 0, and λ_i > 0 exactly when x_i is a support vector.
Setting ∂L/∂w = 0 gives
w − Σ_i λ_i y_i x_i = 0  ⇒  w = Σ_i λ_i y_i x_i,
where the sum effectively runs only over the support vectors. (Setting ∂L/∂a = 0 also gives the constraint Σ_i λ_i y_i = 0.)
Substituting this value of w back into L, the dual of the problem reduces to
L_D = Σ_i λ_i − (1/2) Σ_i Σ_j λ_i λ_j y_i y_j x_i'x_j. ......(1)
This is a convex quadratic programming problem, since the objective function is itself convex, so we have to maximise L_D over the λ_i to find the optimal values of the Lagrange multipliers. The value of a is then obtained from the relation that holds for the support vectors: w'x + a = +1 or −1.
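In practice one rarely codes this quadratic programme by hand. The sketch below (an illustration using scikit-learn, not part of the paper) fits SVC with a linear kernel and a very large C so that it behaves like the hard-margin classifier above, then rebuilds w = Σ_i λ_i y_i x_i from the fitted support vectors; in scikit-learn, dual_coef_ stores the products λ_i y_i for the support vectors.

    import numpy as np
    from sklearn.svm import SVC

    x = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
    y = np.array([+1, +1, -1, -1])

    clf = SVC(kernel="linear", C=1e6).fit(x, y)    # very large C ~ hard margin

    # dual_coef_ holds lambda_i * y_i for each support vector
    w = clf.dual_coef_ @ clf.support_vectors_      # w = sum_i lambda_i y_i x_i
    a = clf.intercept_

    print("support vectors:", clf.support_vectors_)
    print("w =", w.ravel(), " a =", a)
    print("matches coef_:", np.allclose(w, clf.coef_))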
7 Overlapping classes
Now let us extend this concept to overlapping classes. Clearly we cannot carry the previous argument over to overlapping classes directly, so we make a small modification: we introduce slack variables η_i with η_i ≥ 0 and require
x_i'w + a ≥ 1 − η_i for y_i = +1, ......(1)
x_i'w + a ≤ −1 + η_i for y_i = −1. ......(2)
Thus for a misclassified point η_i ≥ 1, and hence Σ_i η_i gives an upper bound on the number of misclassified training points. So our objective is to minimise 0.5 ||w||^2 + C Σ_i η_i, for some constant C > 0, subject to conditions (1) and (2).
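The constant C controls how heavily slack is penalised. As a rough illustration of its effect (my own example; the data and the particular values of C are arbitrary), the sketch below fits scikit-learn's SVC on overlapping data with a small and a large C.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # overlapping classes: the two clouds are close enough to mix
    x = np.vstack([rng.normal(1.0, 1.5, size=(100, 2)),
                   rng.normal(-1.0, 1.5, size=(100, 2))])
    y = np.hstack([np.ones(100), -np.ones(100)])

    for C in (0.01, 100.0):
        clf = SVC(kernel="linear", C=C).fit(x, y)
        print(f"C={C}: {clf.n_support_.sum()} support vectors, "
              f"training accuracy {clf.score(x, y):.2f}")
    # small C tolerates more slack (wider margin, more support vectors);
    # large C penalises slack heavily (narrower margin, fewer support vectors)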
8 Optimisation problem
Here the Lagrangian is
L1 = 0.5 ||w||^2 + C Σ_i η_i − Σ_i λ_i [y_i (x_i'w + a) − 1 + η_i] − Σ_i µ_i η_i, ......(2)
where the λ_i and µ_i are non-negative Lagrange multipliers. We differentiate L1 with respect to w to get the optimal value of w:
∂L1/∂w = 0  ⇒  w = Σ_i λ_i y_i x_i,
and the value of a can again be found from the support vectors once the value of w is obtained.
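Differentiating L1 with respect to η_i also gives C − λ_i − µ_i = 0, so with µ_i ≥ 0 every multiplier satisfies 0 ≤ λ_i ≤ C. The small check below (my own illustration using scikit-learn) verifies this bound numerically: the entries of dual_coef_, which equal λ_i y_i, never exceed C in absolute value.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    x = np.vstack([rng.normal(1.0, 1.5, size=(100, 2)),
                   rng.normal(-1.0, 1.5, size=(100, 2))])
    y = np.hstack([np.ones(100), -np.ones(100)])

    C = 10.0
    clf = SVC(kernel="linear", C=C).fit(x, y)

    lambdas = np.abs(clf.dual_coef_).ravel()   # |lambda_i * y_i| = lambda_i
    print("0 <= lambda_i <= C for all i:",
          np.all((lambdas >= 0) & (lambdas <= C + 1e-9)))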
9 Non-linear boundaries
There can still be situations where the boundary cannot be approximated by a linear boundary. In those cases we transform the data into a new space in which the boundary is linear or approximately linear. So we transform x_n to z_n = φ(x_n), where φ is the required transformation, φ : R^p → H with p ≤ dim(H). Here H may be a Hilbert space or a Euclidean space; for simplicity, let us take H to be a Euclidean space.
In this space w = Σ_i λ_i y_i φ(x_i), and the optimal hyperplane is
Σ_i λ_i y_i φ(x_i)'φ(x) + a = 0. ......(3)
Hence the transformation enters the optimal hyperplane only through an inner product, so a kernel that computes this inner product would let us avoid finding the actual transformation. Suppose such a kernel exists, i.e. one that expresses the inner product between two transformed vectors; then we can use it directly in (3). Calling this kernel K(., .), the optimal hyperplane is given by
Σ_i λ_i y_i K(x_i, x) + a = 0.
Two questions now arise:
1. What guarantees the existence of such kernels?
2. What would be an appropriate kernel in classification problems?
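To illustrate how the hyperplane in (3) is evaluated purely through a kernel, the sketch below (an illustration with a Gaussian/RBF kernel, not prescribed by the paper) fits scikit-learn's SVC with kernel='rbf' and then recomputes its decision function by hand as Σ_i λ_i y_i K(x_i, x) + a, using only the stored support vectors.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 2))
    y = np.where(np.sum(x**2, axis=1) < 1.0, 1, -1)   # non-linear (circular) boundary

    gamma = 0.5
    clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(x, y)

    def rbf(u, v, gamma):
        """Gaussian kernel K(u, v) = exp(-gamma * ||u - v||^2), evaluated pairwise."""
        d2 = np.sum(u**2, axis=1)[:, None] + np.sum(v**2, axis=1)[None, :] - 2 * u @ v.T
        return np.exp(-gamma * d2)

    x_new = rng.normal(size=(5, 2))
    manual = clf.dual_coef_ @ rbf(clf.support_vectors_, x_new, gamma) + clf.intercept_
    print(np.allclose(manual.ravel(), clf.decision_function(x_new)))   # True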
10 Existence of a kernel
10.1 Mercer's condition
There exist a kernel K and a mapping φ such that K(x, y) = φ(x)'φ(y) if and only if, for every g(x) with ∫ g^2(x) dx < ∞, we have ∫∫ K(x, y) g(x) g(y) dx dy ≥ 0.
For example, kernels of the form K(x, y) = Σ_p c_p (x·y)^p, where the c_p are positive constants and the series is uniformly convergent, satisfy Mercer's condition.
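A practical consequence of Mercer's condition is that the Gram matrix K(x_i, x_j) built from any finite sample must be positive semidefinite. The sketch below (my own numerical illustration) checks this for the polynomial kernel (1 + x·y)^p by inspecting the eigenvalues of the Gram matrix; this is only a finite-sample sanity check, not a proof.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(30, 3))            # 30 random points in R^3

    p = 3
    gram = (1.0 + x @ x.T) ** p             # Gram matrix of the polynomial kernel

    eigvals = np.linalg.eigvalsh(gram)      # eigenvalues of the symmetric Gram matrix
    print("smallest eigenvalue:", eigvals.min())
    print("positive semidefinite (up to rounding):", eigvals.min() > -1e-8)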
11 Some appropriate kernels
Some kernels that are commonly used in classification problems are:
1. K(x, y) = exp(−||x − y||^2 / (2σ^2)) (Gaussian kernel);
2. K(x, y) = (1 + x·y)^p (polynomial kernel).
Both kernels satisfy Mercer's condition.
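In scikit-learn these two kernels correspond (roughly) to kernel='rbf' and kernel='poly': the library's RBF kernel is exp(−γ||x − y||^2), so γ plays the role of 1/(2σ^2), and its polynomial kernel is (γ x·y + r)^p, which reduces to (1 + x·y)^p for γ = 1 and r = 1. The sketch below simply fits both on toy data; the data and parameter values are arbitrary choices of mine.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 2))
    y = np.where(np.sum(x**2, axis=1) < 1.0, 1, -1)   # non-linear boundary

    sigma = 1.0
    rbf_clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma**2), C=1.0).fit(x, y)
    poly_clf = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(x, y)

    print("RBF training accuracy:       ", rbf_clf.score(x, y))
    print("Polynomial training accuracy:", poly_clf.score(x, y))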
12 Some important points about SVM
1. The optimum we obtain is a global optimum rather than a local one, because the dual is a convex quadratic programme. Hence there is no risk of getting stuck at a local optimum, which is often the case with neural networks.
2. In classification problems, polynomial kernels are found to be among the most useful.
3. Even when the boundary is non-linear, we do not have to think about the appropriate transformation explicitly: once we have a kernel, we can compute the optimal hyperplane directly.
13 Disadvantages of SVM
1. If the dimension of the data is very large, the computation required by SVM can be quite expensive.
2. In some situations an appropriate kernel cannot be found easily.
14 References
1. C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 2(2), 1998.
2. C. Cortes and V. Vapnik, Support-Vector Networks, Machine Learning, 20(3), 1995.