Introduction to Support Vector Machines
Jie Tang
25 July 2005
Introduction
• Support Vector Machine (SVM) is a learning methodology based on Vapnik's statistical learning theory
– Developed in the 1990s
– To address problems in traditional statistical learning (overfitting, capacity control, …)
• It has achieved the best performance in practical applications
– Handwritten digit recognition
– Text categorization
– …
Classification Problem
• Given a training set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$, with $x_i \in X = \mathbb{R}^n$, $y_i \in Y = \{1, -1\}$, $i = 1, 2, \ldots, l$
• Learn a function $g(x)$ such that the decision function $f(x) = \mathrm{sgn}(g(x))$ can classify a new input $x$
• So this is a supervised batch learning method
Linear classifier
$$g(x) = w^T x + b$$
$$f(x) = \mathrm{sgn}(g(x)) = \begin{cases} +1, & g(x) \geq 0 \\ -1, & g(x) < 0 \end{cases}$$
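As a quick illustration, the decision rule above can be sketched in a few lines (the weights here are hypothetical, chosen only to show the sign convention):

```python
import numpy as np

def linear_classifier(w, b):
    """Return f(x) = sgn(g(x)) for g(x) = w^T x + b."""
    def f(x):
        g = np.dot(w, x) + b          # g(x) = w^T x + b
        return 1 if g >= 0 else -1    # sgn convention: g(x) >= 0 -> +1
    return f

# Hypothetical weights for illustration only.
f = linear_classifier(w=np.array([1.0, -1.0]), b=0.5)
print(f(np.array([2.0, 0.0])))   # g = 2.5  -> +1
print(f(np.array([0.0, 2.0])))   # g = -1.5 -> -1
```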
Maximum Marginal Classifier
(Figure: two classes separated by the hyperplane $w^T x + b = 0$; the closest points $x^{(1)}$ and $x^{(2)}$ lie on the two marginal hyperplanes on either side.)
Maximum Marginal Classifier
• Define the functional margin $\hat\gamma^{(i)} = y^{(i)}(w^T x^{(i)} + b)$ and $\hat\gamma = \min_i \hat\gamma^{(i)}$
• Let us select two points on the two marginal hyperplanes respectively:
$$w^T x^{(1)} + b = \hat\gamma, \qquad w^T x^{(2)} + b = -\hat\gamma$$
• Projecting onto the unit normal $w/\|w\|$, the signed distances of the two hyperplanes from the origin are
$$\frac{w^T x^{(1)}}{\|w\|} = \frac{\hat\gamma - b}{\|w\|}, \qquad \frac{w^T x^{(2)}}{\|w\|} = \frac{-\hat\gamma - b}{\|w\|}$$
Maximum Marginal Classifier
• Subtracting the two projections gives the margin, i.e. the distance between the two hyperplanes:
$$\frac{w^T(x^{(1)} - x^{(2)})}{\|w\|} = \frac{\hat\gamma - b}{\|w\|} - \frac{-\hat\gamma - b}{\|w\|} = \frac{2\hat\gamma}{\|w\|}$$
• Then
$$\max_{w,b} \; \frac{2\hat\gamma}{\|w\|} \qquad \text{is equal to} \qquad \min_{w,b} \; \frac{\|w\|}{2\hat\gamma}$$
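A small numeric check of the margin formula (the hyperplane and $\hat\gamma$ below are made-up illustrative values):

```python
import numpy as np

# Hypothetical hyperplane and functional margin (illustrative values).
w = np.array([3.0, 4.0])      # ||w|| = 5
b = -1.0
gamma = 2.0

# Points constructed to lie exactly on the two marginal hyperplanes.
x1 = (gamma - b) * w / np.dot(w, w)    # w^T x1 + b = +gamma
x2 = (-gamma - b) * w / np.dot(w, w)   # w^T x2 + b = -gamma

# Distance between the hyperplanes, measured along w/||w||,
# should equal 2*gamma/||w||.
dist = np.dot(w, x1 - x2) / np.linalg.norm(w)
print(dist, 2 * gamma / np.linalg.norm(w))  # both 0.8
```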
Note: we have constraints
$$w^T x^{(i)} + b \geq \hat\gamma \;\; (y^{(i)} = 1), \qquad w^T x^{(j)} + b \leq -\hat\gamma \;\; (y^{(j)} = -1)$$
which are equal to
$$y^{(i)}(w^T x^{(i)} + b) \geq \hat\gamma, \quad i = 1, \ldots, m$$
By scaling $w$ and $b$ we may set $\hat\gamma = 1$; minimizing $\|w\|$ is then equivalent to minimizing $\frac{\|w\|^2}{2}$, so the problem becomes
$$\min_{w,b} \; \frac{\|w\|^2}{2} \qquad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \geq 1, \quad i = 1, \ldots, m$$
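This scaled problem is a small quadratic program. As a sketch, it can be handed to a generic solver; the toy data below are made up, and `scipy.optimize.minimize` with SLSQP stands in for a proper QP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(theta):
    w = theta[:2]
    return 0.5 * np.dot(w, w)          # ||w||^2 / 2 (b is not penalized)

def margin_constraints(theta):
    w, b = theta[:2], theta[2]
    return y * (X @ w + b) - 1.0       # each entry must be >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:2], res.x[2]
print(w, b, y * (X @ w + b))           # every margin should be >= 1
```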
Lagrange duality
For the problem:
$$\min_{w,b} \; \frac{\|w\|^2}{2} \qquad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \geq 1, \quad i = 1, \ldots, m$$
We can write the Lagrangian form:
$$L(w, b, \alpha) = \frac{\|w\|^2}{2} - \sum_{i=1}^{m} \alpha_i [y^{(i)}(w^T x^{(i)} + b) - 1] \qquad \text{s.t.} \quad \alpha_i \geq 0, \; i = 1, \ldots, m$$
Let us review the generalized Lagrangian
$$\min_x f(x) \qquad \text{s.t.} \quad g_i(x) \leq 0, \; i = 1, \ldots, k; \qquad h_j(x) = 0, \; j = 1, \ldots, l$$
The Lagrangian:
$$L(x, \alpha, \beta) = f(x) + \sum_{i=1}^{k} \alpha_i g_i(x) + \sum_{j=1}^{l} \beta_j h_j(x) \qquad \text{s.t.} \quad \alpha_i \geq 0$$
Let us consider
$$\mathcal{R}_P(x) = \max_{\alpha, \beta:\, \alpha_i \geq 0} L(x, \alpha, \beta)$$
Note: the constraints must be satisfied; otherwise $\max L$ will be infinite (choose the corresponding $\alpha_i$ or $\beta_j$ arbitrarily large).
Let us review the generalized Lagrangian
If the constraints are satisfied, then we must have
$$\max_{\alpha, \beta} L = f(x)$$
Now you can see that $\max L$ takes the same value as the objective of our problem, $f(x)$. Therefore, we can consider the minimization problem
$$\min_x \mathcal{R}_P(x) = \min_x \max_{\alpha, \beta} L(x, \alpha, \beta)$$
Let us define the optimal value of the primal problem as $p^*$.
Then, let us define the dual problem:
$$\mathcal{R}_D(\alpha, \beta) = \min_x L(x, \alpha, \beta)$$
$$\max_{\alpha, \beta} \mathcal{R}_D(\alpha, \beta) = \max_{\alpha, \beta} \min_x L(x, \alpha, \beta)$$
Now we define the optimal value of the dual problem as $d^*$.
They are similar: only the order of min and max is exchanged.
Relationship between Primal and Dual problems
$$d^* = \max_{\alpha, \beta} \min_x L(x, \alpha, \beta) \;\leq\; \min_x \max_{\alpha, \beta} L(x, \alpha, \beta) = p^*$$
Why? This is the max-min inequality (weak duality): a "max of a min" never exceeds the corresponding "min of a max".
Then, if under some conditions $d^* = p^*$, we can solve the dual problem in lieu of the primal problem.
What are those conditions?
The famous KKT conditions (Karush-Kuhn-Tucker conditions)
$$\frac{\partial}{\partial x_i} L(x^*, \alpha^*, \beta^*) = 0, \; i \in [1, n]$$
$$\frac{\partial}{\partial \beta_j} L(x^*, \alpha^*, \beta^*) = 0, \; j \in [1, l]$$
$$\alpha_i^* g_i(x^*) = 0, \; i \in [1, k]$$
$$g_i(x^*) \leq 0, \; i \in [1, k]$$
$$\alpha_i^* \geq 0, \; i \in [1, k]$$
What does it imply?

The famous KKT conditions (Karush-Kuhn-Tucker conditions)
The complementary slackness condition $\alpha_i^* g_i(x^*) = 0$ implies:
$$g_i(x^*) < 0 \;\Rightarrow\; \alpha_i^* = 0, \qquad \alpha_i^* > 0 \;\Rightarrow\; g_i(x^*) = 0$$
Very important!
Return to our problem
$$L(w, b, \alpha) = \frac{\|w\|^2}{2} - \sum_{i=1}^{m} \alpha_i [y^{(i)}(w^T x^{(i)} + b) - 1] \qquad \text{s.t.} \quad \alpha_i \geq 0, \; y^{(i)}(w^T x^{(i)} + b) \geq 1$$
Let us first solve $\min_{w,b} L$ with respect to $w$ and $b$:
$$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$$
$$\frac{\partial}{\partial b} L(w, b, \alpha) = -\sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$
Substituting the two equations back into $L(w, b, \alpha)$, we have:
$$L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)}$$
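The substitution can be written out step by step, using $w = \sum_i \alpha_i y^{(i)} x^{(i)}$ and $\sum_i \alpha_i y^{(i)} = 0$ (a sketch of the algebra):

```latex
\begin{align*}
L(w,b,\alpha)
  &= \tfrac{1}{2}\,w^T w
     - \sum_{i=1}^{m} \alpha_i y^{(i)} w^T x^{(i)}
     - b \sum_{i=1}^{m} \alpha_i y^{(i)}
     + \sum_{i=1}^{m} \alpha_i \\
  &= \tfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)}
     - \sum_{i,j=1}^{m} \alpha_i \alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)}
     + \sum_{i=1}^{m} \alpha_i \\
  &= \sum_{i=1}^{m} \alpha_i
     - \tfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)}
\end{align*}
```

The $b$ term vanishes because $\sum_i \alpha_i y^{(i)} = 0$.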
Then we have the maximum optimization problem with respect to $\alpha$ (the dual objective $W(\alpha)$):
$$\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$$
$$\text{s.t.} \quad \alpha_i \geq 0, \; i \in [1, m], \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$
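As a sketch, this dual can also be solved numerically; the toy data are made up, and `scipy.optimize.minimize` with SLSQP stands in for a dedicated QP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)
G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j <x_i, x_j>

def neg_dual(alpha):
    # Minimize -W(alpha) = -sum(alpha) + 1/2 * alpha^T G alpha
    return -alpha.sum() + 0.5 * alpha @ G @ alpha

res = minimize(neg_dual, x0=np.zeros(m), method="SLSQP",
               bounds=[(0, None)] * m,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
print(alpha, w)
```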
Now we have only one set of parameters, $\alpha$. We can solve for it, then for $w$, and then for $b$, because:
$$b^* = -\frac{\max_{i:\, y^{(i)} = -1} w^{*T} x^{(i)} + \min_{i:\, y^{(i)} = 1} w^{*T} x^{(i)}}{2}$$
How to predict
For a new sample $x$, we can predict it by:
$$w^T x + b = \left( \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \right)^T x + b = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b$$
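The point of this identity is that prediction needs only inner products with the training points, never $w$ explicitly. A small check with hypothetical dual variables (all values below are illustrative, not from a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical training set and dual variables (illustrative values only).
X = rng.normal(size=(5, 3))
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
alpha = np.array([0.5, 0.7, 0.0, 0.2, 0.0])
b = 0.1
x_new = rng.normal(size=3)

# Prediction via the explicit weight vector w = sum_i alpha_i y_i x_i ...
w = (alpha * y) @ X
g_primal = w @ x_new + b

# ... and via inner products with the training points only.
g_dual = np.sum(alpha * y * (X @ x_new)) + b

print(np.isclose(g_primal, g_dual))   # the two forms agree
```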
Non-separable case
What is the non-separable case? No single hyperplane can separate the two classes, so we allow each point a slack $\xi_i$. Then the optimization problem becomes:
$$\min_{w,b} \; \frac{\|w\|^2}{2} + C \sum_{i=1}^{m} \xi_i$$
$$\text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \geq 1 - \xi_i, \qquad \xi_i \geq 0, \; i \in [1, m]$$
Next, by forming the Lagrangian:
$$L(w, b, \xi, \alpha, r) = \frac{\|w\|^2}{2} + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i [y^{(i)}(w^T x^{(i)} + b) - 1 + \xi_i] - \sum_{i=1}^{m} r_i \xi_i$$
Dual form
$$\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$$
$$\text{s.t.} \quad 0 \leq \alpha_i \leq C, \; i \in [1, m], \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$
Also note the following KKT conditions:
$$\alpha_i = 0 \;\Rightarrow\; y^{(i)}(w^T x^{(i)} + b) \geq 1$$
$$\alpha_i = C \;\Rightarrow\; y^{(i)}(w^T x^{(i)} + b) \leq 1$$
$$0 < \alpha_i < C \;\Rightarrow\; y^{(i)}(w^T x^{(i)} + b) = 1$$
What is the difference from the previous form? Only the upper bound $\alpha_i \leq C$ on the dual variables.
How to train an SVM = how to solve the optimization problem
• Sequential minimal optimization (SMO) algorithm, due to John Platt
First, let us introduce the coordinate ascent algorithm:

Loop until convergence: {
  For i = 1, …, m {
    α_i := argmax_{α_i} W(α_1, …, α_{i-1}, α_i, α_{i+1}, …, α_m)
  }
}
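A minimal sketch of coordinate ascent on a simple concave function (not the SVM dual; the function and its coordinate-wise maximizers are made up for illustration):

```python
import numpy as np

# Coordinate ascent on the concave function
#   W(a) = -(a1 - 1)^2 - 2*(a2 + 0.5)^2 - a1*a2,
# maximized one coordinate at a time.
def argmax_coord(a, i):
    # Closed-form maximizer of W along coordinate i, the other held fixed.
    if i == 0:
        return (2.0 - a[1]) / 2.0      # dW/da1 = -2(a1 - 1) - a2 = 0
    else:
        return (-2.0 - a[0]) / 4.0     # dW/da2 = -4(a2 + 0.5) - a1 = 0

a = np.zeros(2)
for _ in range(100):                   # loop until (effective) convergence
    for i in range(2):
        a[i] = argmax_coord(a, i)
print(a)                               # converges to the global maximizer
```

Here the iterates converge to the stationary point $(10/7, -6/7)$, where both partial derivatives vanish.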
Is coordinate ascent OK here? Note that the equality constraint implies
$$\alpha_1 y^{(1)} = -\sum_{i=2}^{m} \alpha_i y^{(i)} \;\Rightarrow\; \alpha_1 = -y^{(1)} \sum_{i=2}^{m} \alpha_i y^{(i)}$$
So $\alpha_1$ is exactly determined by the other $\alpha_i$: we cannot change any single $\alpha_i$ alone without violating the constraint. Is it OK? No: we must update at least two of them together.
SMO
Change the algorithm as follows (this is just SMO):

Repeat until convergence {
  1. Select some pair α_i and α_j to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
  2. Reoptimize the dual objective W(α) with respect to α_i and α_j, while holding all the other α_k fixed.
}

Say we pick $\alpha_1$ and $\alpha_2$. The equality constraint gives
$$\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = -\sum_{i=3}^{m} \alpha_i y^{(i)} = \zeta$$
$$\Rightarrow\; \alpha_1 = (\zeta - \alpha_2 y^{(2)})\, y^{(1)}$$
$$W(\alpha) = W((\zeta - \alpha_2 y^{(2)})\, y^{(1)}, \alpha_2, \ldots, \alpha_m)$$
SMO (2)
$$W(\alpha) = W((\zeta - \alpha_2 y^{(2)})\, y^{(1)}, \alpha_2, \ldots, \alpha_m)$$
This is a quadratic function in $\alpha_2$, i.e. it can be written as
$$a \alpha_2^2 + b \alpha_2 + c$$
for some constants $a$, $b$, $c$ (with $a \leq 0$, since the dual is concave).
Solving for $\alpha_2$
For the quadratic function $a \alpha_2^2 + b \alpha_2 + c$, we can simply solve it by setting its derivative to zero. Let us call the resulting value $\alpha_2^{new,unclipped}$. It must then be clipped to the feasible interval $[L, H]$:
$$\alpha_2^{new} = \begin{cases} H, & \alpha_2^{new,unclipped} > H \\ \alpha_2^{new,unclipped}, & L \leq \alpha_2^{new,unclipped} \leq H \\ L, & \alpha_2^{new,unclipped} < L \end{cases}$$
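The clipping step is straightforward; a minimal sketch (`L` and `H` stand for the bounds on $\alpha_2$ implied by $0 \leq \alpha_i \leq C$ and the equality constraint):

```python
def clip_alpha(alpha_unclipped, L, H):
    """Clip the unconstrained maximizer of the 1-D quadratic to [L, H].

    Because the objective is concave in alpha2, the constrained maximum
    is the unclipped solution if feasible, else the nearer bound.
    """
    if alpha_unclipped > H:
        return H
    if alpha_unclipped < L:
        return L
    return alpha_unclipped

print(clip_alpha(0.7, 0.0, 1.0))   # inside the box -> 0.7
print(clip_alpha(1.4, 0.0, 1.0))   # above H -> 1.0
print(clip_alpha(-0.2, 0.0, 1.0))  # below L -> 0.0
```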
Having found $\alpha_2$, we can go back and find the optimal $\alpha_1$. Please read Platt's paper for more details.
Kernel
1. Why kernel?
2. What is a feature space mapping?
$$x \mapsto \phi(x)$$
$$K(x, z) = \phi(x)^T \phi(z) \qquad \text{(kernel function)}$$
With the kernel, what's more interesting to us?
We can compute the kernel without calculating the mapping: replace all $\langle x^{(i)}, x \rangle$ by $\langle \phi(x^{(i)}), \phi(x) \rangle$.
Training:
$$\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle \phi(x^{(i)}), \phi(x^{(j)}) \rangle$$
$$\text{s.t.} \quad \alpha_i \geq 0, \; i \in [1, m], \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$
Prediction:
$$w^T \phi(x) + b = \sum_{i=1}^{m} \alpha_i y^{(i)} \langle \phi(x^{(i)}), \phi(x) \rangle + b$$
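To see that a kernel really equals an inner product in feature space, here is a check for the quadratic kernel $K(x,z) = \langle x, z \rangle^2$ in two dimensions, whose explicit feature map is known:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the 2-D quadratic kernel K(x,z) = <x,z>^2:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

def K(x, z):
    return np.dot(x, z) ** 2           # computed without forming phi at all

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(K(x, z), np.dot(phi(x), phi(z)))   # both equal 1.0
```

Computing $K$ directly costs an $n$-dimensional dot product, while $\phi$ lives in an $O(n^2)$-dimensional space; that gap is the whole point of the kernel trick.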
Now it seems we need to compute $\phi(x)$ first. That may be expensive. But with the kernel, we can skip that step. Why? Because both in training and in testing, the data appear only through inner products $\langle x, z \rangle$, which we replace by
$$K(x, z) = \phi(x)^T \phi(z)$$
For example, the Gaussian (RBF) kernel:
$$K(x, z) = \exp\left( -\frac{\|x - z\|^2}{2 \sigma^2} \right)$$
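A minimal sketch of the Gaussian kernel as code (the sample points are arbitrary):

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||X_i - Z_j||^2 / (2*sigma^2))."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X, X, sigma=1.0)
print(np.diag(K))        # all ones: K(x, x) = 1
print(K[0, 1])           # exp(-0.5), since ||X_0 - X_1||^2 = 1
```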
References
• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1998.
• Andrew Ng. CS229 Lecture Notes, Part V: Support Vector Machines. Lectures from 10/19/03 to 10/26/03.
• Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 121-167 (1998). Kluwer Academic Publishers, Boston.
• Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
People
• Vladimir Vapnik
• John Platt
• Nello Cristianini
• John Shawe-Taylor
• Christopher J. C. Burges
• Thorsten Joachims
• etc.