
Support Vector Machine (SVM)

M. Rupp, G. Schneider

The Support Vector Machine (SVM) belongs to a class of machine learning algorithms that are based on linear classifiers and the "kernel trick". The latter is a general method, based on the concepts of duality and kernels, which allows algorithms that can be formulated in terms of inner products to be systematically extended to non-linear cases. For brevity and simplicity, we restrict the discussion of SVMs to binary classification tasks.

Linear classification and margins: Consider $l$ samples $x_1, \ldots, x_l \in \mathbb{R}^d$ with labels $y_i \in \{-1, +1\}$ (see Fig. 1 for an example). If the samples are linearly separable, there exist (infinitely many) hyperplanes $f(x) = \langle w, x \rangle + b$ which correctly classify all samples, that is, $\operatorname{sgn}(f(x_i)) = y_i$. Here, a hyperplane is determined by a weight vector $w$ perpendicular to it and a translation $b$ from the origin. Assuming that both given and future samples (and their labels) are independent and from the same distribution, the hyperplane that will perform best on new samples is the one "furthest away" from both classes. To formalize this notion, we define the margin of $f$ on $(x_i, y_i)$ as $y_i f(x_i)$. The margin is positive for correctly classified examples and negative for misclassified ones. Geometrically, the margin is the distance between the sample and the class-separating hyperplane. The functional margin of $f$ with respect to the whole dataset is the smallest margin over all samples, $\min_{1 \le i \le l} y_i f(x_i)$. The optimal hyperplane is the one maximizing the functional margin. Fig. 1 shows the optimal hyperplane and its margin for an example dataset. Formally, the optimal hyperplane bisects the line between the two closest points on the convex hulls of both classes [1] (Fig. 2).
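
As an illustration of these definitions (a sketch not contained in the original text; the descriptor values, weight vector and offset below are made up), the functional margin can be computed directly in Python:

    import numpy as np

    # Hypothetical standardized 2D descriptors (rows) and labels in {-1, +1}
    X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([+1, +1, -1, -1])

    # Hypothetical hyperplane parameters: f(x) = <w, x> + b
    w = np.array([0.6, 0.8])   # unit-length weight vector
    b = -0.1

    f = X @ w + b              # decision values f(x_i)
    margins = y * f            # margin of f on each (x_i, y_i)

    print(margins)             # positive entries: correctly classified samples
    print(margins.min())       # functional margin of f on the whole dataset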

Figure 1. Linear separation of two classes (indoles ■ and piperidines □) in two dimensions (clogP and molar refractivity). The solid line shows the optimal hyperplane, together with its margin (dashed lines) and support vectors (circled). Note that the computation was done on standardized descriptors; otherwise, the difference in scale between the values of the two descriptors would have given more weight to molar refractivity, leading to a different hyperplane and different support vectors. This hyperplane would also have separated the two classes, but with worse generalization performance on new samples.


Figure 2. Relationship of the optimal hyperplane to the convex hull. The convex hull of a set of points $\{x_1, \ldots, x_l\}$ is the smallest convex set containing those points. It consists of all linear combinations with weights in $[0, 1]$, that is, all points $\sum_{i=1}^{l} \alpha_i x_i$ with $\sum_{i=1}^{l} \alpha_i = 1$ and $\alpha_i \ge 0$ (see Eq. (2)). For linearly separable datasets, as in the given example, the hyperplane with the largest margin bisects the line between the two closest points on the convex hulls of both classes. If the classes are not linearly separable, the optimal hyperplane bisects the reduced convex hulls of the classes (not shown). In the reduced convex hull, the weights $\alpha_i$ are bounded from above by a constant $D < 1$, which results in a "shrunk" convex hull.

Quadratic optimization and duality: The optimal hyperplane can be found by solving the following optimization problem:

$$\max_{w, b, \gamma} \; \gamma \quad \text{subject to} \quad \forall i: \gamma \le y_i f(x_i) \quad \text{and} \quad \|w\|_2 = 1 \qquad (1)$$

Note that while the target function $\operatorname{sgn} f$ is invariant to re-scaling of $w$, the functional margin is not. Therefore, in the above optimization problem the length of $w$ is fixed. For $\|w\| = 1$, the term $\langle w, x_i \rangle$ measures the length of the perpendicular projection of $x_i$ on $w$, that is, the distance of $x_i$ to the hyperplane. Eq. (1) is called the hard margin SVM as it requires the training points to lie outside the hyperplane's margin. It can be rewritten as a minimization problem by fixing the functional margin to unity and minimizing the norm of the weight vector instead. Using Lagrange multipliers yields the dual form of the problem (Eq. (2)). Lagrange multipliers are a tool for transforming an optimization problem over a differentiable function in $n$ variables with $k$ constraints into an equivalent problem with $n + k$ variables and no constraints [2,3].
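
To make the intermediate step explicit (a standard reformulation that the text only summarizes): fixing the functional margin to one and minimizing the weight norm gives the equivalent problem

$$\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad \forall i: y_i (\langle w, x_i \rangle + b) \ge 1,$$

whose Lagrangian $L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{l} \alpha_i \left( y_i (\langle w, x_i \rangle + b) - 1 \right)$ with multipliers $\alpha_i \ge 0$ leads, after setting the derivatives with respect to $w$ and $b$ to zero ($w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$), to the dual problem in Eq. (2).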

$$\min_{\alpha_i \ge 0} \; \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_{i=1}^{l} \alpha_i \quad \text{subject to} \quad \sum_{i=1}^{l} y_i \alpha_i = 0 \qquad (2)$$

Eq. (2) has several important properties: It belongs to a well-studied class of problems, namely convex quadratic programs. For such problems, efficient solvers exist, which find the (unique) global optimum, that is, the optimal hyperplane. According to the Karush-Kuhn-Tucker conditions [4,5], $\alpha_i \ne 0$ if and only if $x_i$ lies on the margin of the optimal hyperplane. Such samples are called the support vectors of the solution (Fig. 1). The solution to Eq. (2) is sparse in the sense that only the support vectors are necessary for its evaluation, which allows for fast classification of new samples. This is an important distinction between SVM and ANN (Box 4.4).
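
As a hedged illustration (not from the original text; the descriptor values are made up), a linear SVM can be fitted with scikit-learn and its support vectors inspected. A very large C approximates the hard margin case of Eq. (1):

    import numpy as np
    from sklearn.svm import SVC

    # Made-up, already standardized 2D descriptors and labels in {-1, +1}
    X = np.array([[1.2, 0.8], [2.0, 1.5], [1.5, 2.2],
                  [-1.0, -0.7], [-1.8, -1.2], [-0.9, -1.9]])
    y = np.array([1, 1, 1, -1, -1, -1])

    # Linear kernel; a very large C approximates the hard margin SVM
    clf = SVC(kernel="linear", C=1e6)
    clf.fit(X, y)

    print(clf.support_)                # indices of the support vectors
    print(clf.dual_coef_)              # y_i * alpha_i for the support vectors only
    print(clf.coef_, clf.intercept_)   # w and b of the separating hyperplane

    # New samples are classified using only the support vectors
    print(clf.predict([[0.5, 0.5]]))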

Non-linearity and kernels: The most important property of Eq. (2), however, is that the minimized function uses only inner products between the training samples. An inner product (also dot product, scalar product) is a real-valued symmetric bilinear form that is non-negative, $\langle x, x \rangle \ge 0$, with $\langle x, x \rangle = 0 \Leftrightarrow x = 0$. It encodes information about the lengths of and angle between its arguments (Eq. (3)).

$$\cos \angle(x, y) = \frac{\langle x, y \rangle}{\|x\| \, \|y\|} = \frac{\langle x, y \rangle}{\sqrt{\langle x, x \rangle} \sqrt{\langle y, y \rangle}} \qquad (3)$$
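
As a small numeric check (not part of the original text), Eq. (3) can be evaluated directly:

    import numpy as np

    x = np.array([1.0, 0.0])
    y = np.array([1.0, 1.0])

    cos_angle = np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))
    print(cos_angle)                         # 0.7071..., i.e. an angle of 45 degrees
    print(np.degrees(np.arccos(cos_angle)))  # 45.0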

A kernel is a function which corresponds to an inner product in a Hilbert space $H$, i.e. $\kappa: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, $\kappa(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ for some transformation $\phi: \mathbb{R}^d \to H$. By replacing the standard dot product $\langle x_i, x_j \rangle$ in Eq. (2) with a kernel $\kappa(x_i, x_j)$, the linear classification takes place in $H$ (which can be high-dimensional and non-linearly derived from $\mathbb{R}^d$), while the computation itself is still done in the original space $\mathbb{R}^d$. As the evaluation of $\kappa$ requires only $x_i, x_j$ and not $\phi(x_i), \phi(x_j)$, this makes the computation feasible even for very high-dimensional $H$. This procedure is known as the kernel trick. Fig. 3 shows an example of $\phi: \mathbb{R} \to \mathbb{R}^2$. Another example is the mapping of two circles in $\mathbb{R}^2$ (one enclosing the other, each circle representing a class) by $(x, y) \mapsto (x, y, x^2 + y^2)$. Although it can be helpful, it is in general not necessary to know $\phi$ or $H$ explicitly, as long as they exist.
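
To make the kernel trick concrete (an illustrative sketch, not part of the original text; the feature map shown is the standard one for a homogeneous degree-2 polynomial kernel, not a mapping used above), one can check numerically that evaluating a kernel in $\mathbb{R}^2$ agrees with an explicit inner product in the transformed space $\mathbb{R}^3$:

    import numpy as np

    def poly2_kernel(x, z):
        # Homogeneous polynomial kernel of degree 2: kappa(x, z) = <x, z>^2
        return np.dot(x, z) ** 2

    def phi(x):
        # Explicit feature map for this kernel: R^2 -> R^3
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    print(poly2_kernel(x, z))       # 1.0  (<x, z> = 1)
    print(np.dot(phi(x), phi(z)))   # 1.0 -- same value, computed in R^3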

Figure 3. Example of how linear separability can be achieved through non-linear transformation. In the original one-dimensional descriptor space (x-axis), classes A ■ and B □ are not linearly separable as there is no zero-dimensional hyperplane (point) with all samples from A on one side and all samples from B on the other side of it. However, if a second dimension is added (the y-axis) which is quadratic in the first one, e.g. $x \mapsto (x, ax^2 + b)$, there is a one-dimensional hyperplane (line) which separates the two classes.

To characterize kernels, consider a kernel matrix $G = (\kappa(x_i, x_j))_{i,j=1,\ldots,l}$ of kernel evaluations on a finite subset of input samples $\{x_i \in \mathbb{R}^d \mid i = 1, \ldots, l\}$. A square matrix $M \in \mathbb{R}^{l \times l}$ is called positive semidefinite if $\forall v \in \mathbb{R}^l: v' M v \ge 0$. A function $\kappa: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a kernel if and only if it is symmetric and all its kernel matrices are positive semidefinite [3]. Popular kernels include the polynomial kernel $\kappa(x, y) = (\langle x, y \rangle + a)^b$ and the radial basis function kernel $\kappa(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$. The sigmoid $\kappa(x, y) = \tanh(a \langle x, y \rangle + r)$, which corresponds to a two-layer neural network (Box 4.4), is not a kernel in general.
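
As a brief sketch (not from the original text; the samples are random and only serve to illustrate the definition), the RBF kernel matrix of a sample set can be computed and its positive semidefiniteness checked via its eigenvalues:

    import numpy as np

    def rbf_kernel(x, z, sigma=1.0):
        # Radial basis function kernel: exp(-||x - z||^2 / (2 sigma^2))
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 3))          # 10 random samples in R^3

    # Kernel (Gram) matrix G_ij = kappa(x_i, x_j)
    G = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

    eigenvalues = np.linalg.eigvalsh(G)   # G is symmetric
    print(np.all(eigenvalues >= -1e-10))  # True: eigenvalues are (numerically) non-negative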

Extensions: Real molecule datasets are often not linearly separable, either because the chosen descriptors do not provide enough information or because the data are corrupted by noise such as measurement errors. Enforcing linear separability in $H$ by choosing a sufficiently complex kernel risks severe overfitting, as the training data are learned by rote, degrading generalization capability. As a solution, one allows misclassified samples and tries to minimize the resulting violations of the margin: for each training sample $x_i$, define a slack variable $\xi_i = \max(\gamma - y_i f(x_i), 0)$. It assumes a value of zero for correctly classified samples and measures the violation of the margin of its class otherwise. Reformulating Eq. (1) by taking the slack variables into account yields Eq. (4).

$$\max_{w, b, \gamma} \; \gamma - C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad \forall i: \gamma - \xi_i \le y_i f(x_i) \quad \text{and} \quad \|w\|_2 = 1 \qquad (4)$$

The parameter $C$ controls the trade-off between model complexity (for $C \to \infty$ one obtains the hard margin SVM) and the frequency of training errors. The corresponding dual problem is given by Eq. (5).

$$\min_{0 \le \alpha_i \le C} \; \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_{i=1}^{l} \alpha_i \quad \text{subject to} \quad \sum_{i=1}^{l} y_i \alpha_i = 0 \qquad (5)$$

This is the most commonly used SVM formulation, the soft margin SVM. Finding the optimal soft margin hyperplane corresponds to finding the two closest points on the reduced convex hulls of the two classes (Fig. 2) [6]. The ideas underlying SVMs can be extended to other applications such as regression (estimation of real-valued instead of binary labels) and novelty detection (also one-class SVM; estimating whether a new sample belongs to the same distribution as the training samples).
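
As a hedged sketch (not from the original text; the two overlapping point clouds are made up), the effect of the soft margin parameter $C$ can be observed by fitting the same noisy data with different values of $C$ and counting support vectors and margin violations:

    import numpy as np
    from sklearn.svm import SVC

    # Made-up two-class data with overlap (not linearly separable)
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(50, 2)),
                   rng.normal(loc=+1.0, scale=1.0, size=(50, 2))])
    y = np.array([-1] * 50 + [+1] * 50)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        # Smaller C tolerates more margin violations and typically keeps more support vectors
        print(C, clf.n_support_, int((y * clf.decision_function(X) < 1).sum()))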

Summary: SVMs represent a popular class of machine learning algorithms based on the ideas of linear classification, duality, and kernels. By solving a convex quadratic optimization problem, a hyperplane is found which optimally separates the training samples in a (non-linearly) transformed space. The solution depends only on a subset of the training samples, the support vectors. SVMs have received considerable attention, mainly due to their solid theoretical foundations as well as good and robust performance in practice. The kernel trick underlying SVMs has also been applied to many other algorithms, e.g. nonlinear principal component analysis [7].

[1] K. Bennett, C. Campbell. Support vector machines: Hype or hallelujah? SIGKDD Explorations 2000, 2, 1–13.
[2] J.-L. Lagrange. Théorie des fonctions analytiques. Imprimerie de la République, Paris, 1797.
[3] J. Shawe-Taylor, N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, 2004.
[4] H. Kuhn, A. Tucker. Nonlinear programming. In: Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, 1951, pp. 481–492.
[5] B. Schölkopf, A. Smola. Learning with Kernels. MIT Press, Cambridge, 2002.
[6] K. Bennett, E. Bredensteiner. Duality and geometry in SVM classifiers. In: Proceedings of the 17th International Conference on Machine Learning (ICML 2000; P. Langley, Ed.), Stanford, California, U.S.A., 2000, pp. 57–64.
[7] B. Schölkopf, A. Smola, K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 1998, 10, 1299–1319.
