Introduction to Support Vector Machines

Jie Tang

25 July 2005

Introduction

• The Support Vector Machine (SVM) is a learning method based on Vapnik's statistical learning theory
  – Developed in the 1990s
  – Addresses the problems of traditional statistical learning (overfitting, capacity control, …)

• It has achieved the best performance in practical applications
  – Handwritten digit recognition
  – Text categorization
  – …

Classification Problem

• Given a training set $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_l, y_l)\}$ with $x_i \in X = \mathbb{R}^n$, $y_i \in Y = \{+1, -1\}$, $i = 1, 2, \dots, l$
• Learn a function $g(x)$ such that the decision function $f(x) = \operatorname{sgn}(g(x))$ can classify a new input $x$
• So this is a supervised batch learning method

Linear Classifier

$$g(x) = w^T x + b$$

$$\operatorname{sgn}(g(x)) = \begin{cases} +1, & g(x) \ge 0 \\ -1, & g(x) < 0 \end{cases}$$

$$f(x) = \operatorname{sgn}(g(x))$$
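To make the decision rule concrete, here is a minimal NumPy sketch of the linear classifier above; the weight vector and bias values are made up for illustration:

```python
import numpy as np

def linear_classifier(w, b, x):
    """f(x) = sgn(w^T x + b); returns +1 or -1."""
    g = np.dot(w, x) + b
    return 1 if g >= 0 else -1

# Toy example with made-up parameters.
w = np.array([2.0, -1.0])
b = -0.5
print(linear_classifier(w, b, np.array([1.0, 0.5])))   # +1
print(linear_classifier(w, b, np.array([-1.0, 2.0])))  # -1
```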

Maximum Margin Classifier

[Figure: the separating hyperplane $w^T x + b = 0$, a point $x^{(1)}$ on the margin hyperplane $w^T x + b = \hat{\gamma}$ and a point $x^{(2)}$ on $w^T x + b = -\hat{\gamma}$; the labels $\frac{\hat{\gamma} - b}{\|w\|}$ and $\frac{-\hat{\gamma} - b}{\|w\|}$ mark the distances from the two margin hyperplanes to the origin.]

• Functional margin: $\hat{\gamma}^{(i)} = y^{(i)}(w^T x^{(i)} + b)$, and $\hat{\gamma} = \min_i \hat{\gamma}^{(i)}$
• Let us select one point on each of the two margin hyperplanes:

$$w^T x^{(1)} + b = \hat{\gamma}, \qquad w^T x^{(2)} + b = -\hat{\gamma}$$

• Projecting each point onto the unit normal, $\frac{w^T(x - 0)}{\|w\|}$, gives the distance from each margin hyperplane to the origin:

$$\frac{w^T x^{(1)}}{\|w\|} = \frac{\hat{\gamma} - b}{\|w\|}, \qquad \frac{w^T x^{(2)}}{\|w\|} = \frac{-\hat{\gamma} - b}{\|w\|}$$

Maximum Margin Classifier

• The width of the margin is the distance between the two margin hyperplanes, obtained by projecting $x^{(1)} - x^{(2)}$ onto the unit normal $\frac{w}{\|w\|}$:

$$\frac{w^T (x^{(1)} - x^{(2)})}{\|w\|} = \frac{(\hat{\gamma} - b) - (-\hat{\gamma} - b)}{\|w\|} = \frac{2\hat{\gamma}}{\|w\|}$$

Then maximizing the margin is equivalent to a minimization:

$$\max_{\hat{\gamma}, w, b} \; \frac{2\hat{\gamma}}{\|w\|} \qquad \text{is equal to} \qquad \min_{\hat{\gamma}, w, b} \; \frac{\|w\|}{2\hat{\gamma}}$$

Note: we have the constraints

$$w^T x^{(i)} + b \ge \hat{\gamma} \ \text{ for positive examples}, \qquad w^T x^{(j)} + b \le -\hat{\gamma} \ \text{ for negative examples,}$$

which are equal to

$$y^{(i)}(w^T x^{(i)} + b) \ge \hat{\gamma}, \quad i = 1, \dots, m$$

By rescaling $w$ and $b$ we can set $\hat{\gamma} = 1$:

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 \qquad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge 1, \quad i = 1, \dots, m$$
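The primal problem above can be handed directly to an off-the-shelf convex solver. The sketch below assumes the cvxpy package (not part of the lecture) and toy separable data generated on the spot:

```python
import numpy as np
import cvxpy as cp  # assumed available; used only for illustration

# Toy linearly separable data: 2-D points, labels in {+1, -1}.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=0.5, size=(20, 2))
X_neg = rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(20), -np.ones(20)])

# Hard-margin primal: min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1.
w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```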

Lagrange Duality

For the problem

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 \qquad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge 1, \quad i = 1, \dots, m$$

we can write the Lagrangian form:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(w^T x^{(i)} + b) - 1 \right] \qquad \text{s.t.} \quad \alpha_i \ge 0, \quad i = 1, \dots, m$$

Let us review the generalized Lagrangian

$$\min_w \; f(w) \qquad \text{s.t.} \quad g_i(w) \le 0, \ i = 1, \dots, k; \qquad h_j(w) = 0, \ j = 1, \dots, l$$

The Lagrangian is

$$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{j=1}^{l} \beta_j h_j(w) \qquad \text{s.t.} \quad \alpha_i \ge 0$$

Let us consider

$$R_P(w) = \max_{\alpha, \beta} L(w, \alpha, \beta)$$

Note: the constraints must be satisfied; otherwise $\max L$ will be infinite.

Let us review the generalized Lagrangian

If the constraints are satisfied, then we must have

$$\max_{\alpha, \beta} L = f(w)$$

Now you can see that $\max L$ takes the same value as the objective of our problem, $f(w)$. Therefore, we can consider the minimization problem

$$\min_w R_P(w) = \min_w \max_{\alpha, \beta} L(w, \alpha, \beta)$$

Let us define the optimal value of this primal problem as $p^*$. Then let us define the dual problem:

$$R_D(\alpha, \beta) = \min_w L(w, \alpha, \beta), \qquad \max_{\alpha, \beta} R_D(\alpha, \beta) = \max_{\alpha, \beta} \min_w L(w, \alpha, \beta)$$

Now we define the optimal value of the dual problem as $d^*$. The two problems are similar: only the order of the max and min is exchanged.

Relationship between Primal and Dual Problems

$$d^* = \max_{\alpha, \beta} \min_w L(w, \alpha, \beta) \;\le\; \min_w \max_{\alpha, \beta} L(w, \alpha, \beta) = p^*$$

Why? Just remember it ("max min ≤ min max"). Then, under some conditions, $d^* = p^*$, and we can solve the dual problem in lieu of the primal problem.

What are these conditions?

The famous KKT conditions (Karush–Kuhn–Tucker conditions)

$$\frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i = 1, \dots, n$$

$$\frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) = 0, \quad i = 1, \dots, l$$

$$\alpha_i^* g_i(w^*) = 0, \quad i = 1, \dots, k$$

$$g_i(w^*) \le 0, \quad i = 1, \dots, k$$

$$\alpha_i^* \ge 0, \quad i = 1, \dots, k$$

What do they imply?

The famous KKT conditions (Karush–Kuhn–Tucker conditions)

$$\alpha_i^* g_i(w^*) = 0, \quad i = 1, \dots, k$$

$$g_i(w^*) < 0 \;\Rightarrow\; \alpha_i^* = 0, \qquad \alpha_i^* > 0 \;\Rightarrow\; g_i(w^*) = 0 \ (???)$$

Very important!

Return to our problem

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(w^T x^{(i)} + b) - 1 \right] \qquad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge 1, \quad i = 1, \dots, m$$

Let us first solve $\min_w L$ with respect to $w$ and $b$:

$$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0 \quad\Longrightarrow\quad w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$$

$$\frac{\partial}{\partial b} L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$

Substitute the two equations back into $L(w, b, \alpha)$.

We have

$$L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)}$$

Then we have the maximization problem with respect to $\alpha$:

$$\max_\alpha \; L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$$

$$\text{s.t.} \quad \alpha_i \ge 0, \ i = 1, \dots, m, \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$

Now we have only one set of parameters, $\alpha$. We can solve for $\alpha$, then solve for $w$, and then for $b$, because

$$b^* = -\frac{\max_{i: y^{(i)} = -1} w^{*T} x^{(i)} + \min_{i: y^{(i)} = 1} w^{*T} x^{(i)}}{2}$$
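Given a dual solution $\alpha$ (from any QP solver), $w$ and $b^*$ are recovered exactly as above. A minimal sketch, where `alpha`, `X`, `y` are assumed to come from such a solver:

```python
import numpy as np

def recover_w_b(alpha, X, y):
    """Recover (w, b) from the dual variables alpha (separable case)."""
    # w = sum_i alpha_i y_i x_i
    w = (alpha * y) @ X
    # b* = -( max_{i: y_i=-1} w^T x_i + min_{i: y_i=+1} w^T x_i ) / 2
    scores = X @ w
    b = -(scores[y == -1].max() + scores[y == 1].min()) / 2
    return w, b
```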

How to predict

For a new sample $x$, we can predict its label by

$$w^T x + b = \left( \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \right)^T x + b = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b$$

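The prediction can be computed directly from the dual expansion without forming $w$ explicitly. A sketch assuming `alpha`, `X`, `y`, `b` are already available:

```python
import numpy as np

def predict(alpha, X, y, b, x_new):
    """f(x) = sgn( sum_i alpha_i y_i <x_i, x> + b )."""
    g = np.sum(alpha * y * (X @ x_new)) + b
    return np.sign(g)
```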
Non-separable Case

What is the non-separable case? I will not give an example; I assume you already know it. The optimization problem then becomes:

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i$$

$$\text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m$$

Next, we form the Lagrangian:

$$L(w, b, \xi, \alpha, r) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}(w^T x^{(i)} + b) - 1 + \xi_i \right] - \sum_{i=1}^{m} r_i \xi_i$$

Dual Form

$$\max_\alpha \; L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$$

$$\text{s.t.} \quad 0 \le \alpha_i \le C, \ i = 1, \dots, m, \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$

Also note the following conditions:

$$\alpha_i = 0 \;\Rightarrow\; y^{(i)}(w^T x^{(i)} + b) \ge 1$$

$$\alpha_i = C \;\Rightarrow\; y^{(i)}(w^T x^{(i)} + b) \le 1$$

$$0 < \alpha_i < C \;\Rightarrow\; y^{(i)}(w^T x^{(i)} + b) = 1$$

What is the difference from the previous form??!!
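These conditions tell us how to read a dual solution: only examples with $\alpha_i > 0$ matter (the support vectors), and those with $0 < \alpha_i < C$ lie exactly on the margin. A small sketch, assuming `alpha` comes from a dual solver:

```python
import numpy as np

def categorize_alphas(alpha, C, tol=1e-6):
    """Split examples by their dual variables (see the conditions above)."""
    off_margin = np.where(alpha <= tol)[0]                        # alpha_i = 0: y*g(x) >= 1
    on_margin = np.where((alpha > tol) & (alpha < C - tol))[0]    # 0 < alpha_i < C: y*g(x) = 1
    inside_or_violating = np.where(alpha >= C - tol)[0]           # alpha_i = C: y*g(x) <= 1
    support_vectors = np.where(alpha > tol)[0]
    return off_margin, on_margin, inside_or_violating, support_vectors
```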

How to train an SVM

Training an SVM = solving the optimization problem. A standard choice is the sequential minimal optimization (SMO) algorithm, due to John Platt.

First, let us introduce the coordinate ascent algorithm (see the sketch below):

Loop until convergence: {
  For i = 1, …, m {
    $\alpha_i := \arg\max_{\alpha_i} L(\alpha_1, \dots, \alpha_{i-1}, \alpha_i, \alpha_{i+1}, \dots, \alpha_m)$
  }
}
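To make the idea concrete, here is coordinate ascent on a toy unconstrained concave quadratic, where each inner step maximizes over one coordinate in closed form; the objective and its data are made up purely for illustration:

```python
import numpy as np

# Maximize L(a) = -0.5 a^T Q a + p^T a  (Q positive definite, so L is concave).
Q = np.array([[2.0, 0.5], [0.5, 1.0]])   # made-up problem data
p = np.array([1.0, -1.0])
a = np.zeros(2)

for _ in range(50):                       # "loop until convergence"
    for i in range(len(a)):
        # argmax over a_i with the other coordinates fixed:
        # dL/da_i = -(Q a)_i + p_i = 0  =>  a_i = (p_i - sum_{j!=i} Q_ij a_j) / Q_ii
        a[i] = (p[i] - Q[i].dot(a) + Q[i, i] * a[i]) / Q[i, i]

print(a)  # approaches the maximizer Q^{-1} p
```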

Is coordinate ascent OK here?

$$\alpha_1 y^{(1)} = -\sum_{i=2}^{m} \alpha_i y^{(i)} \quad\Longrightarrow\quad \alpha_1 = -y^{(1)} \sum_{i=2}^{m} \alpha_i y^{(i)}$$

Is it OK? No: because of the equality constraint, $\alpha_1$ is completely determined by the other $\alpha_i$, so we cannot improve the objective by updating a single $\alpha_i$ while holding the others fixed.

SMO

Change the algorithm as follows; this is SMO:

Repeat until convergence {
  1. Select some pair $\alpha_i$ and $\alpha_j$ to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
  2. Re-optimize $L(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$, while holding all the other $\alpha_k$ fixed.
}

Consider updating $\alpha_1$ and $\alpha_2$. The equality constraint gives

$$\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = -\sum_{i=3}^{m} \alpha_i y^{(i)} = \zeta \ \text{(a constant)}$$

$$\alpha_1 = (\zeta - \alpha_2 y^{(2)}) y^{(1)}$$

$$L(\alpha) = L\big((\zeta - \alpha_2 y^{(2)}) y^{(1)}, \alpha_2, \dots, \alpha_m\big)$$

SMO (2)

$$L(\alpha) = L\big((\zeta - \alpha_2 y^{(2)}) y^{(1)}, \alpha_2, \dots, \alpha_m\big)$$

This is a quadratic function in $\alpha_2$, i.e. it can be written as

$$a \alpha_2^2 + b \alpha_2 + c$$

Solving for $\alpha_2$

$$a \alpha_2^2 + b \alpha_2 + c$$

For this quadratic function, we can simply solve by setting its derivative to zero. Let us use $\alpha_2^{new,unclipped}$ for the resulting value, and then clip it to the feasible interval $[L, H]$:

$$\alpha_2^{new} = \begin{cases} H & \text{if } \alpha_2^{new,unclipped} > H \\ \alpha_2^{new,unclipped} & \text{if } L \le \alpha_2^{new,unclipped} \le H \\ L & \text{if } \alpha_2^{new,unclipped} < L \end{cases}$$

Having found $\alpha_2$, we can go back and find the optimal $\alpha_1$. Please read Platt's paper if you want more details; a simplified sketch follows.
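Here is a highly simplified sketch of one SMO pair update, just to show the "solve a 1-D quadratic, then clip to $[L, H]$" structure; it uses a linear kernel, follows the standard update expressions from Platt's paper, and assumes `alpha`, `X`, `y`, `b`, `C` describe the current state:

```python
import numpy as np

def smo_pair_update(alpha, X, y, b, C, i, j):
    """One SMO step on the pair (alpha_i, alpha_j) with a linear kernel."""
    K = X @ X.T                                   # kernel matrix
    g = (alpha * y) @ K + b                       # current g(x_k) for all k
    E_i, E_j = g[i] - y[i], g[j] - y[j]           # prediction errors

    # Feasible interval [L, H] keeping 0 <= alpha <= C and the equality constraint.
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])

    eta = K[i, i] + K[j, j] - 2 * K[i, j]         # curvature of the 1-D quadratic
    if eta <= 0 or L == H:
        return alpha, b                           # skip degenerate pairs in this sketch

    alpha_j_old, alpha_i_old = alpha[j], alpha[i]
    alpha_j_unclipped = alpha_j_old + y[j] * (E_i - E_j) / eta
    alpha[j] = np.clip(alpha_j_unclipped, L, H)   # the clipping step from the slide
    alpha[i] = alpha_i_old + y[i] * y[j] * (alpha_j_old - alpha[j])

    # Recompute b from a point with 0 < alpha < C if one exists (simplified).
    margin_idx = np.where((alpha > 1e-8) & (alpha < C - 1e-8))[0]
    if len(margin_idx) > 0:
        k = margin_idx[0]
        b = y[k] - (alpha * y) @ K[:, k]
    return alpha, b
```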

Kernel

1. Why kernels?
2. What is a feature space mapping?

$$x \mapsto \phi(x), \qquad K(x, z) = \phi(x)^T \phi(z) \quad \text{(the kernel function)}$$

With kernels, what is more interesting to us?

We can compute the kernel without calculating the mapping. Replace every

$$\langle x^{(i)}, x \rangle \quad \text{by} \quad \langle \phi(x^{(i)}), \phi(x) \rangle$$

$$\max_\alpha \; L(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle \phi(x^{(i)}), \phi(x^{(j)}) \rangle$$

$$\text{s.t.} \quad \alpha_i \ge 0, \ i = 1, \dots, m, \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$

$$w^T x + b = \left( \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \right)^T x + b \quad\longrightarrow\quad \sum_{i=1}^{m} \alpha_i y^{(i)} \langle \phi(x^{(i)}), \phi(x) \rangle + b$$

Now, do we need to compute $\phi(x)$ first? That may be expensive. But with a kernel, we can skip that step. Why? Because both in training and in testing the data appear only through the inner products $\langle x, z \rangle$.

$$K(x, z) = \phi(x)^T \phi(z)$$

For example, the Gaussian kernel:

$$K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2\sigma^2} \right)$$
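Putting the kernel trick into code: the decision function only needs $K(x^{(i)}, x)$, never $\phi$ explicitly. A sketch with linear and Gaussian kernels, where `sigma` is a made-up hyperparameter:

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

def kernel_predict(alpha, X, y, b, x_new, kernel=gaussian_kernel):
    """f(x) = sgn( sum_i alpha_i y_i K(x_i, x) + b ); phi(x) is never computed."""
    k = np.array([kernel(x_i, x_new) for x_i in X])
    return np.sign(np.sum(alpha * y * k) + b)
```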

References

• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1998.
• Andrew Ng. CS229 Lecture Notes, Part V: Support Vector Machines. Lectures from 10/19/03 to 10/26/03.
• Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 121–167, 1998. Kluwer Academic Publishers, Boston.
• Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

People

• Vladimir Vapnik
• John Platt
• Nello Cristianini
• John Shawe-Taylor
• Christopher J. C. Burges
• Thorsten Joachims
• etc.
