Soft Computing Assignment
BY D.Mohanraj 20052336
Genetic Algorithms
Introduction
Genetic Algorithms (GAs) are adaptive heuristic search algorithms premised on the evolutionary ideas of natural selection and genetics. The basic concept of GAs is to simulate the processes in natural systems that are necessary for evolution, specifically those that follow the principle of survival of the fittest first laid down by Charles Darwin. As such, they represent an intelligent exploitation of a random search within a defined search space to solve a problem. First pioneered by John Holland in the 1960s, GAs have been widely studied, experimented with and applied in many fields of engineering. Not only do GAs provide an alternative method of solving problems, they also often outperform traditional methods on many of the problems to which they are applied. Many real-world problems involve finding optimal parameters, which may prove difficult for traditional methods but ideal for GAs. However, because of their outstanding performance in optimisation, GAs have been wrongly regarded as mere function optimisers. In fact, there are many ways to view genetic algorithms. Perhaps most users come to GAs looking for a problem solver, but this is a restrictive view. Herein, we will examine GAs as a number of different things:
a. GAs as problem solvers
b. GAs as challenging technical puzzle
c. GAs as basis for competent machine learning
d. GAs as computational model of innovation and creativity
e. GAs as computational model of other innovating systems
f. GAs as guiding philosophy
However, due to various constraints, we will only look at GAs as problem solvers and as a basis for competent machine learning here. We will also examine how GAs are applied to completely different fields. Many scientists have tried to create living programs. These programs do not merely simulate life but try to exhibit the behaviours and characteristics of real organisms in an attempt to exist as a form of life. Suggestions have been made that artificial life (a-life) would eventually evolve into real life. Such a suggestion may sound absurd at the moment, but it is certainly not implausible if technology continues to progress at its present rate. Therefore it is worth, in our opinion, taking a paragraph to discuss how artificial life is connected with GAs and to see whether such a prediction is far-fetched and groundless.
Brief Overview
GAs were introduced as a computational analogy of adaptive systems. They are modelled loosely on the principles of evolution via natural selection, employing a population of individuals that undergo selection in the presence of variation-inducing operators such as mutation and recombination (crossover). A fitness function is used to evaluate individuals, and reproductive success varies with fitness.
The Algorithm
1. Randomly generate an initial population M(0).
2. Compute and save the fitness u(m) for each individual m in the current population M(t).
3. Define selection probabilities p(m) for each individual m in M(t) so that p(m) is proportional to u(m).
4. Generate M(t+1) by probabilistically selecting individuals from M(t) to produce offspring via genetic operators.
5. Repeat from step 2 until a satisfactory solution is obtained.
The paradigm described above is the one usually applied to most of the problems presented to GAs. It might not find the best solution; more often than not, it comes up with a good but only partially optimal solution.
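The five steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not a reference implementation: the bit-string encoding, the function name run_ga, the parameter values and the toy fitness function count_ones are all hypothetical choices made for the example.

```python
import random

def run_ga(fitness, n_bits=16, pop_size=30, generations=60,
           p_crossover=0.9, p_mutation=0.02, seed=0):
    """Minimal generational GA over fixed-length bit strings."""
    rng = random.Random(seed)
    # Step 1: randomly generate the initial population M(0).
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Step 2: compute the fitness u(m) of every individual in M(t).
        scores = [fitness(m) for m in population]
        # Step 3: selection probabilities proportional to u(m)
        # (a tiny constant keeps the weights from being all zero).
        weights = [s + 1e-9 for s in scores]
        # Step 4: build M(t+1) from selected parents via genetic operators.
        next_population = []
        while len(next_population) < pop_size:
            mum, dad = rng.choices(population, weights=weights, k=2)
            child = list(mum)
            if rng.random() < p_crossover:
                cut = rng.randint(1, n_bits - 1)   # one-point crossover
                child = mum[:cut] + dad[cut:]
            # Bit-flip mutation, applied independently to every gene.
            child = [bit ^ 1 if rng.random() < p_mutation else bit
                     for bit in child]
            next_population.append(child)
        population = next_population
        # Step 5: remember the best individual seen so far, then repeat.
        best = max(population + [best], key=fitness)
    return best

def count_ones(bits):
    """Toy fitness: the number of 1 bits in the chromosome."""
    return sum(bits)

best = run_ga(count_ones)
print(best, count_ones(best))
```

The loop stops after a fixed number of generations, which is one simple stand-in for "until a satisfactory solution is obtained"; a fitness threshold would work equally well.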
Who can benefit from GA
Nearly anyone can benefit from Genetic Algorithms, provided that solutions of the given problem can be encoded as chromosomes and that the relative performance (fitness) of solutions can be compared. An effective GA representation and a meaningful fitness evaluation are the keys to success in GA applications. The appeal of GAs comes from their simplicity and elegance as robust search algorithms, as well as from their power to discover good solutions rapidly for difficult high-dimensional problems. GAs are useful and efficient when:
a. The search space is large, complex or poorly understood.
b. Domain knowledge is scarce or expert knowledge is difficult to encode to narrow the search space.
c. No mathematical analysis is available.
d. Traditional search methods fail.
The advantage of the GA approach is the ease with which it can handle arbitrary kinds of constraints and objectives. All such things can be handled as weighted components of the fitness function, making it easy to adapt the GA to the particular requirements of a very wide range of possible overall objectives.
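To make the idea of "weighted components of the fitness function" concrete, here is a minimal sketch. The item-selection problem, the weights and all function names are hypothetical and chosen only for illustration; a real application would pick its own objectives, constraints and weights.

```python
def weighted_fitness(solution, objectives, constraints):
    """Combine objectives and constraint penalties into one score.

    objectives:  list of (weight, function) pairs; larger is better.
    constraints: list of (weight, function) pairs; each function returns
                 the amount of violation (0 when the constraint holds).
    """
    reward = sum(w * f(solution) for w, f in objectives)
    penalty = sum(w * g(solution) for w, g in constraints)
    return reward - penalty

# Hypothetical example: choose items (1 = taken) to maximise value
# while keeping the total weight within a capacity of 10.
values = [6, 5, 8, 3]
weights = [4, 3, 6, 2]
capacity = 10

def total_value(sol):
    return sum(v for v, bit in zip(values, sol) if bit)

def overweight(sol):
    load = sum(w for w, bit in zip(weights, sol) if bit)
    return max(0, load - capacity)

objectives = [(1.0, total_value)]
constraints = [(10.0, overweight)]

feasible = weighted_fitness([1, 1, 0, 1], objectives, constraints)
infeasible = weighted_fitness([1, 1, 1, 0], objectives, constraints)
print(feasible, infeasible)
```

Adding a new requirement only means appending another (weight, function) pair, which is why this style adapts easily to different overall objectives.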
GAs have been used for problem-solving and for modelling. They are applied to many scientific and engineering problems, and in business and entertainment, including:
Optimisation: GAs have been used in a wide variety of optimisation tasks, including numerical optimisation and combinatorial optimisation problems such as the travelling salesman problem (TSP), circuit design [Louis 1993], job shop scheduling [Goldstein 1991] and video and sound quality optimisation.
Automatic programming: GAs have been used to evolve computer programs for specific tasks, and to design other computational structures, for example cellular automata and sorting networks.
Machine and robot learning: GAs have been used for many machine-learning applications, including classification and prediction, and protein structure prediction. GAs have also been used to design neural networks, to evolve rules for learning classifier systems or symbolic production systems, and to design and control robots.
Economic models: GAs have been used to model processes of innovation, the development of bidding strategies, and the emergence of economic markets.
Immune system models: GAs have been used to model various aspects of the natural immune system, including somatic mutation during an individual's lifetime and the discovery of multi-gene families during evolutionary time.
Ecological models: GAs have been used to model ecological phenomena such as biological arms races, host-parasite co-evolution, symbiosis and resource flow in ecologies.
Population genetics models: GAs have been used to study questions in population genetics, such as "under what conditions will a gene for recombination be evolutionarily viable?"
Interactions between evolution and learning: GAs have been used to study how individual learning and species evolution affect one another.
Models of social systems: GAs have been used to study evolutionary aspects of social systems, such as the evolution of cooperation [Chughtai 1995], the evolution of communication, and trail-following behaviour in ants.
Applications of Genetic Algorithms
GA on optimisation and planning: Travelling Salesman Problem
The TSP is interesting not only from a theoretical point of view: many practical applications can be modelled as a travelling salesman problem or as a variant of it, for example the pen movement of a plotter, the drilling of printed circuit boards (PCBs), and the real-world routing of school buses, airlines, delivery trucks and postal carriers. Researchers have used TSPs to study biomolecular pathways, to route parallel processing in computer networks, to advance cryptography, to determine the order of thousands of exposures needed in X-ray crystallography and to determine routes when searching for forest fires (which is a multiple-salesman problem partitioned into single TSPs). There is therefore a tremendous need for efficient algorithms.
In the last two decades, enormous progress has been made in solving travelling salesman problems to optimality, which is the ultimate goal of every researcher. One of the landmarks in the search for optimal solutions is the solution of a 3038-city problem. This progress is only partly due to the increasing hardware power of computers. Above all, it was made possible by the development of mathematical theory and of efficient algorithms. Here, the GA approach is discussed. There are strong relations between the constraints of the problem, the representation adopted and the genetic operators that can be used with it.
The goal of the Travelling Salesman Problem is to devise a travel plan (a tour) which minimises the total distance travelled. The TSP is NP-hard (NP stands for nondeterministic polynomial time): it is generally believed that it cannot be solved exactly in polynomial time. The TSP is constrained:
a. The salesman can only be in one city at any time.
b. Each city has to be visited once and only once.
When GAs are applied to very large problems, they fail in two respects:
a. They scale rather poorly (in terms of time complexity) as the number of cities increases.
b. The solution quality degrades rapidly.
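As a small illustration of the objective and of the "once and only once" constraint, the following sketch computes the length of a closed tour and checks that a candidate is a valid permutation of the cities. The city coordinates and function names are assumptions made for the example.

```python
import math

def tour_length(tour, coords):
    """Total length of a closed tour; tour is a list of city indices."""
    total = 0.0
    for i, city in enumerate(tour):
        nxt = tour[(i + 1) % len(tour)]   # wrap around to the start
        total += math.dist(coords[city], coords[nxt])
    return total

def is_valid_tour(tour, n_cities):
    """True if every city appears once and only once."""
    return sorted(tour) == list(range(n_cities))

# Four hypothetical cities at the corners of a 3 x 4 rectangle.
coords = [(0, 0), (3, 0), (3, 4), (0, 4)]

perimeter = tour_length([0, 1, 2, 3], coords)   # goes around the edge
crossing = tour_length([0, 2, 1, 3], coords)    # uses both diagonals
print(perimeter, crossing, is_valid_tour([0, 1, 1, 3], 4))
```

In a GA, a fitness such as the negative tour length (or its reciprocal) would reward the shorter of the two tours.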
Failure of Standard Genetic Algorithm
To use a standard GA, the following problems have to be solved:
a. A binary representation for tours must be found such that it can be easily translated into a chromosome.
b. An appropriate fitness function must be designed, taking the constraints into account.
When tours are represented as binary matrices, non-permutation matrices represent unrealistic solutions; that is, the GA can generate chromosomes that do not represent valid tours. This happens:
a. In the random initialisation step of the GA.
b. As a result of the genetic operators (mutation and crossover).
Thus, permutation matrices are used. Two tours that include the same cities in the same order but with different starting points or different directions are represented by different matrices and hence by different chromosomes, for example:
Tour (23541) = tour (12354)
A proper fitness function is obtained by using the penalty-function method to enforce the constraints. However, the ordinary genetic operators generate too many invalid solutions, leading to poor results. Alternative solutions to the TSP require new representations (Position Dependent Representations) and new genetic operators.
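The sketch below illustrates both points with plain city lists instead of matrices. Ordinary one-point crossover produces a child with repeated cities, whereas a permutation-preserving operator does not. The operator shown is a simplified variant of order crossover, included only as one well-known example of such an operator; it is not necessarily the operator the sources cited here have in mind, and all function names are hypothetical.

```python
def one_point_crossover(a, b, cut):
    """Ordinary crossover: head of a followed by tail of b."""
    return a[:cut] + b[cut:]

def order_crossover(a, b, lo, hi):
    """Simplified order crossover: copy a[lo:hi] into the child, then fill
    the remaining slots, left to right, with the missing cities in the
    order in which they appear in b."""
    child = [None] * len(a)
    child[lo:hi] = a[lo:hi]
    kept = set(a[lo:hi])
    fill = [city for city in b if city not in kept]
    for i in range(len(child)):
        if child[i] is None:
            child[i] = fill.pop(0)
    return child

def same_tour(t1, t2):
    """True if two tours visit the cities in the same cyclic order,
    allowing a different starting point or direction."""
    if sorted(t1) != sorted(t2):
        return False
    n = len(t1)
    doubled = t1 + t1
    rotations = [doubled[i:i + n] for i in range(n)]
    return t2 in rotations or t2[::-1] in rotations

p1 = [1, 2, 3, 4, 5]
p2 = [3, 5, 1, 2, 4]
invalid_child = one_point_crossover(p1, p2, 2)   # cities 1 and 2 repeated
valid_child = order_crossover(p1, p2, 1, 3)      # still a permutation
print(invalid_child, valid_child)
print(same_tour([2, 3, 5, 4, 1], [1, 2, 3, 5, 4]))
```

The last line checks the example from the text: tours 23541 and 12354 are the same cycle written from different starting cities.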
Evolutionary Divide and Conquer (EDAC)
This approach, EDAC [Valenzuela 1995], has potential for any search problem in which knowledge of good solutions for sub-problems can be exploited to improve the solution of the problem itself. The idea is to use the Genetic Algorithm to explore the space of problem subdivisions rather than the space of solutions themselves, and thus to capitalise on the near-linear scaling qualities generally inherent in the divide and conquer approach. The basic mechanisms for dissecting a TSP into sub-problems, solving the sub-problems and then patching the sub-tours together to form a global tour have been taken from the cellular dissection algorithms of Richard Karp. Although their solution quality tends to be rather poor, Karp's algorithms possess an attractively simple geometrical approach to dissection and offer reasonable guarantees of performance. Moreover, the EDAC approach is intrinsically parallel. The EDAC approach has lifted the application of GAs to the TSP by an order of magnitude in terms of problem size compared with permutation representations. Experimental results demonstrate the success of EDAC on uniform random points and PCB problems in the range of 500 to 5000 cities.
GA in Business and Their Supportive Role in Decision Making
Genetic Algorithms have been used to solve many different types of business problems in functional areas such as finance, marketing, information systems, and production/operations. Within these functional areas, GAs have been applied to a variety of tasks such as tactical asset allocation, job scheduling, machine-part grouping, and computer network design.
Finance Applications
Models for tactical asset allocation and international equity strategies have been improved with the use of GAs. The reported results are an 82% improvement in cumulative portfolio value over a passive benchmark model and a 48% improvement over a non-GA model designed to improve over the passive benchmark. Genetic algorithms are particularly well suited to financial modelling applications for three reasons:
1. They are payoff driven. Payoffs can be improvements in predictive power or returns over a benchmark. There is an excellent match between the tool and the problems addressed.
2. They are inherently quantitative, and well suited to parameter optimisation (unlike most symbolic machine learning techniques).
3. They are robust, allowing a wide variety of extensions and constraints that cannot be accommodated in traditional methods.
Information Systems Applications
Distributed computer network topologies have been designed by a GA, using three different objective functions to optimise network reliability parameters, namely diameter, average distance, and computer network reliability. The GA has successfully designed networks with on the order of 100 nodes.
GAs have also been used to determine file allocation for a distributed system. The objective is to maximise the programs' ability to reference the files located on remote nodes. The problem is solved with the following three different constraint sets:
1. There is exactly one copy of each file to be distributed.
2. There may be any number of copies of each file subject to a finite memory constraint at each node.
3. The number of copies and the amount of memory are both limited.
Production/Operation Applications
A Genetic Algorithm has been used to schedule jobs in a sequence-dependent setup environment so as to minimise total tardiness. All jobs are scheduled on a single machine; each job has a processing time and a due date. The setup time of each job depends on the job which immediately precedes it. The GA is able to find good, but not necessarily optimal, schedules fairly quickly.
A GA has also been used to schedule jobs in a non-sequence-dependent setup environment. The jobs are scheduled on one machine with the objective of minimising the total weighted penalty for earliness or tardiness relative to the jobs' due dates. Again, there is no guarantee that the GA will generate an optimal schedule in every case.
A GA has also been developed for solving the machine-component grouping problem that arises in cellular manufacturing systems. The GA provides a collection of satisfactory solutions for a two-objective environment (minimising cell load variation and minimising the volume of inter-cell movement), allowing the decision maker to then select the best alternative.
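For the first scheduling problem, a GA needs a function that scores a job sequence. A minimal sketch of such an evaluation is given below; the data, the separate initial_setup list for the first job and the function name are assumptions made for the example, not details taken from the cited work.

```python
def total_tardiness(sequence, processing, due, setup, initial_setup):
    """Total tardiness of jobs run in the given order on one machine.

    setup[i][j] is the setup time for job j when it directly follows job i;
    initial_setup[j] is the setup time when job j is run first.
    """
    clock = 0
    tardiness = 0
    previous = None
    for job in sequence:
        if previous is None:
            clock += initial_setup[job]
        else:
            clock += setup[previous][job]   # sequence-dependent setup
        clock += processing[job]
        tardiness += max(0, clock - due[job])
        previous = job
    return tardiness

# Three hypothetical jobs.
processing = [3, 2, 4]
due = [4, 6, 12]
initial_setup = [1, 1, 1]
setup = [[0, 1, 3],
         [2, 0, 1],
         [1, 2, 0]]

good = total_tardiness([0, 1, 2], processing, due, setup, initial_setup)
bad = total_tardiness([2, 1, 0], processing, due, setup, initial_setup)
print(good, bad)
```

Because lower tardiness is better, a GA would typically maximise the negative of this value, with each chromosome being a permutation of the job indices.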
Role in Decision Making
Applying the well-established decision processing phase model of Simon (1960), Genetic Algorithms appear to be very well suited to supporting the design and choice phases of decision making.
a. In solving a single-objective problem, a GA designs many solutions until no further improvement (no increase in fitness) can be achieved, until some predetermined number of generations has evolved, or until the allotted processing time is used up. The fittest solution in the final generation is the one that maximises or minimises the objective (fitness) function; this solution can be thought of as the GA's recommended choice. Therefore, with single-objective problems, the user of a GA is assisted in the choice phase of decision processing.
b. When solving multi-objective problems, a GA produces many solutions that are satisfactory in terms of the objectives, and then allows the decision maker to select the best alternative. Therefore, with multi-objective problems, GAs assist with the design phase of decision processing.
GAs can be of great assistance in examining alternatives, since they are designed to evaluate existing potential solutions as well as to generate new (and better) solutions for evaluation. Thus GAs can improve the quality of decision making.
Learning Robot Behaviour using Genetic Algorithms
Robots have become such prominent tools that they play an increasingly important role in many different industries. As such, they have to operate with great efficiency and accuracy. This may not be very difficult if the environment in which the robot operates remains unchanged, since the behaviours of the robot can be pre-programmed. However, if the environment is ever-changing, it becomes extremely difficult, if not impossible, for programmers to anticipate every behaviour the robot may need. The use of robots in changing environments is not only inevitable in modern technology, but is also becoming more frequent. This has led to the development of learning robots. The approach described here for learning behaviours that lead the robot to its goal reflects a particular methodology: learning via a simulation model. The motivation is that making mistakes on a real system can be costly and dangerous.
[Diagram: ON-LINE SYSTEM (Target Environment, Rule Interpreter, Active Behaviour) and OFF-LINE SYSTEM (Simulation Environment, Rule Interpreter, Learning Module, Test Behaviour)]
In addition, time constraints may limit the extent of learning in the real world, since learning requires experimenting with behaviours that might occasionally produce undesirable results if applied there. Therefore, as shown in the diagram, the current best behaviour can be placed in the real, on-line system, while learning continues in the off-line system. Previous studies have shown that knowledge learned under simulation is robust and may be applicable to the real world if the simulation is made more general (for example, by adding more noise and distortion). If this is not possible, the differences between the real world and the simulation have to be identified.
GAs' Role
Genetic Algorithms are adaptive search techniques that can learn high-performance knowledge structures. Their strength comes from the implicitly parallel search of the solution space that they perform via a population of candidate solutions, and this population is manipulated in the simulation. The candidate solutions represent possible behaviours of the robot, and each is assigned a fitness value based on its overall performance. Genetic operators can then be applied to improve the performance of the population of behaviours. One cycle of testing all of the competing behaviours is defined as a generation, and the cycle is repeated until a good behaviour has evolved. The good behaviour is then applied to the real world. Because of the nature of GAs, the initial knowledge does not have to be very good.
Conclusion and Future Work
The system described has been used to learn behaviours for controlling simulated autonomous underwater vehicles, missile evasion, and other simulated tasks. Future work will continue to examine the process of building robotic systems through evolution. We want to know how the multiple behaviours required for a higher-level task interact, and how multiple behaviours can be evolved simultaneously. We are also examining additional ways to bias the learning, both with initial rule sets and by modifying the rule set during evolution through human interaction. Other open problems include how to evolve hierarchies of skills and how to enable the robot to evolve new fitness functions as the need for new skills arises.
Genetic Algorithms for Object Localisation in a Complex Scene
In order to provide machines with the ability to interact in complex, real-world environments, sensory data must be presented to the machine. One module dealing with sensory input is the visual data processing module, also known as the computer vision module. A central task of this module is to recognise objects in images of the environment. There are two different parts to computer vision modules, namely segmentation and recognition. Segmentation is the process of finding objects of interest, while recognition checks whether a located object matches the predefined attributes. Since objects cannot be recognised until they have been located and separated from the background, it is of paramount importance that the vision module is able to locate different objects of interest for different systems with great efficiency.
GA parameters
The task of locating a particular object of interest in a complex scene is quite simple when cast in the framework of genetic algorithms. The brute-force method for finding an object in a complex scene is to examine all positions and sizes, with varying degrees of occlusion of the objects, to determine whether the extracted sub-image matches a rough notion of what is being sought. This method is immediately dismissed as it is far too computationally expensive. The use of genetic methodology, however, can turn the brute-force setup into an elegant solution to this complex problem. Since the GA approach does well in very large search spaces by working only with a sampled population, the computational limitation of the brute-force method, which enumerates the full search space, does not apply.
Conclusion and Future Work
It has been shown that the genetic algorithm performs well in finding areas of interest even in a complex, real-world scene. Genetic Algorithms are adaptive to their environments, and as such this type of method is appealing to the vision community, who must often work in a changing environment. However, several improvements must be made before GAs can be more generally applicable. Gray coding the fields of the chromosome would greatly improve the mutation operation, while combining segmentation with recognition would allow the object of interest to be evaluated in a single step. Finally, timing improvements could be achieved by exploiting the implicit parallelism of multiple independent generations evolving at the same time.
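The reason Gray coding helps mutation can be shown with a short sketch. In plain binary, neighbouring integers such as 7 and 8 differ in four bits, so no single bit flip moves from one to the other; in reflected Gray code, neighbouring integers always differ in exactly one bit. The function names below are chosen for this example only.

```python
def binary_to_gray(n):
    """Reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def gray_to_binary(g):
    """Inverse of binary_to_gray."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

plain = hamming(7, 8)                                   # 0111 vs 1000
gray = hamming(binary_to_gray(7), binary_to_gray(8))    # 0100 vs 1100
print(plain, gray)
```

With a Gray-coded position or size field, a single mutation can therefore always reach an adjacent value, which makes small local adjustments of a candidate window more likely.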
Conclusion
If the idea of a computer algorithm being based on the evolution of organisms is surprising, the extent to which this algorithm is applied in so many areas is no less than astonishing. These applications, be they commercial, educational or scientific, are increasingly dependent on Genetic Algorithms. Their usefulness and gracefulness in solving problems have made them a favoured choice over traditional methods such as gradient search and random search. GAs are very helpful when the developer does not have precise domain expertise, because GAs possess the ability to explore and learn from their domain. In this report, we have placed more emphasis on explaining the use of GAs in many areas of engineering and commerce. We believe that, by working through these interesting examples, one can grasp the idea of GAs with greater ease. We have also discussed the uncertainty about whether computer-generated life could exist as a real life form. The discussion is far from conclusive, and whether artificial life will become real life remains to be seen. In future, we expect to witness the development of variants of GAs tailored to very specific tasks. This might defy the very principle that a GA is ignorant of the problem domain when used to solve a problem, but this practice could make GAs even more powerful.
Support Vector Machines
Introduction to Support Vector Machines
A support vector machine (SVM) is a supervised learning technique from the field of machine learning applicable to both classification and regression. Introduced in 1995 by Vladimir Vapnik and co-workers at AT&T Bell Laboratories and rooted in statistical learning theory, SVMs are based on the principle of Structural Risk Minimization. They work in two steps:
1. Non-linearly map the input space into a very high dimensional feature space (the “kernel trick”).
2. In this space:
a. In the case of classification, construct an optimal separating hyperplane (a maximal margin classifier); or
b. In the case of regression, perform linear regression, but without penalising small errors.
Sewell (2005)

"The support vector machine (SVM) is a universal constructive learning procedure based on the statistical learning theory (Vapnik, 1995)." Cherkassky and Mulier (1998)

"The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimensional feature space. In this feature space a linear decision surface is constructed." Cortes and Vapnik (1995)

"Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory." Cristianini and Shawe-Taylor (2000)

"These techniques are then generalized to what is known as the support vector machine, which produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space." Hastie, Tibshirani and Friedman (2001)

"Support Vector Machines have been developed recently [34]. Originally it was worked out for linear two-class classification with margin, where margin means the minimal distance from the separating hyperplane to the closest data points. SVM learning machine seeks for an optimal separating hyperplane, where the margin is maximal. An important and unique feature of this approach is that the solution is based only on those data points, which are at the margin. These points are called support vectors. The linear SVM can be extended to nonlinear one when first the problem is transformed into a feature space using a set of nonlinear basis functions. In the feature space - which can be very high dimensional - the data points can be separated linearly. An important advantage of the SVM is that it is not necessary to implement this transformation and to determine the separating hyperplane in the possibly very-high dimensional feature space, instead a kernel representation can be used, where the solution is written as a weighted sum of the values of certain kernel function evaluated at the support vectors." Horváth (2003) in Suykens et al.

"With their introduction in 1995, Support Vector Machines (SVMs) marked the beginning of a new era in the learning from examples paradigm. Rooted in the Statistical Learning Theory developed by Vladimir Vapnik at AT&T, SVMs quickly gained attention from the pattern recognition community due to a number of theoretical and computational merits. These include, for example, the simple geometrical interpretation of the margin, uniqueness of the solution, statistical robustness of the loss function, modularity of the kernel function, and overfit control through the choice of a single regularization parameter." Lee and Verri (2002)

"Support Vector machines (SVM) are a new statistical learning technique that can be seen as a new method for training classifiers based on polynomial functions, radial basis functions, neural networks, splines or other functions. Support Vector machines use a hyper-linear separating plane to create a classifier. For problems that cannot be linearly separated in the input space, this machine offers a possibility to find a solution by making a non-linear transformation of the original input space into a high dimensional feature space, where an optimal separating hyperplane can be found. Those separating planes are optimal, which means that a maximal margin classifier with respect to the training data set can be obtained." Rychetsky (2001)

"A learning machine that is based on the principle of Structural Risk Minimization described above is the Support Vector Machine (SVM). The SVM has been developed by Vapnik and co-workers at AT&T Bell Laboratories [9, 115, 116, 19]." Rychetsky (2001)

"The support vector network implements the following idea [21]: Map the input vectors into a very high dimensional feature space Z through some non-linear mapping chosen a priori. Then construct an optimal separating hyperplane in this space." Vapnik (2003) in Suykens et al.

"The support vector machine (SVM) is a supervised learning method that generates input-output mapping functions from a set of labelled training data." Wang (2005)

"Support vector machines (SVMs) are a set of related supervised learning methods, applicable to both classification and regression." Wikipedia (2004)
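Several of the quotations above describe the solution as a weighted sum of kernel values evaluated at the support vectors. The following sketch shows what evaluating such a classifier looks like with a Gaussian kernel. It is an illustration only: the support vectors, labels, coefficients (alphas), offset b and kernel width gamma are hypothetical values, not the result of training, and the sign convention (subtracting b) is an assumption chosen for this example.

```python
import math

def gaussian_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def decision_value(x, support_vectors, labels, alphas, b=0.0,
                   kernel=gaussian_kernel):
    """Weighted sum of kernel values at the support vectors, minus b.
    The sign of the result is the predicted class."""
    total = sum(alpha * label * kernel(sv, x)
                for sv, label, alpha in zip(support_vectors, labels, alphas))
    return total - b

# Hypothetical, hand-picked values (not obtained by training).
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [1.0, 1.0]

near_positive = decision_value((0.8, 1.2), support_vectors, labels, alphas)
near_negative = decision_value((-1.0, -0.5), support_vectors, labels, alphas)
print(near_positive > 0, near_negative < 0)
```

Only the support vectors appear in the sum, so the cost of classifying a new point depends on their number rather than on the dimension of the feature space.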
Support Vector Machines vs. Artificial Neural Networks
The development of ANNs followed a heuristic path, with applications and extensive experimentation preceding theory. In contrast, the development of SVMs involved sound theory first, then implementation and experiments. A significant advantage of SVMs is that whilst ANNs can suffer from multiple local minima, the solution to an SVM is global and unique. Two more advantages of SVMs are that they have a simple geometric interpretation and give a sparse solution. Unlike ANNs, the computational complexity of SVMs does not depend on the dimensionality of the input space. ANNs use empirical risk minimization, whilst SVMs use structural risk minimization. The reason that SVMs often outperform ANNs in practice is that they address the biggest problem with ANNs: SVMs are less prone to overfitting.

"They differ radically from comparable approaches such as neural networks: SVM training always finds a global minimum, and their simple geometric interpretation provides fertile ground for further investigation." Burges (1998)

"Most often Gaussian kernels are used, when the resulted SVM corresponds to an RBF network with Gaussian radial basis functions. As the SVM approach “automatically” solves the network complexity problem, the size of the hidden layer is obtained as the result of the QP procedure. Hidden neurons and support vectors correspond to each other, so the centre problems of the RBF network are also solved, as the support vectors serve as the basis function centres." Horváth (2003) in Suykens et al.

"In problems when linear decision hyperplanes are no longer feasible (section 2.4.3), an input space is mapped into a feature space (the hidden layer in NN models), resulting in a nonlinear classifier." Kecman p 149
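Formalization
We are given some training data, a set of points of the form
\[ \mathcal{D} = \{ (\mathbf{x}_i, c_i) \mid \mathbf{x}_i \in \mathbb{R}^p,\ c_i \in \{-1, 1\} \}_{i=1}^{n} \]
where $c_i$ is either 1 or −1, indicating the class to which the point $\mathbf{x}_i$ belongs. Each $\mathbf{x}_i$ is a p-dimensional real vector. We want to find the maximum-margin hyperplane which divides the points having $c_i = 1$ from those having $c_i = -1$. Any hyperplane can be written as the set of points $\mathbf{x}$ satisfying
\[ \mathbf{w} \cdot \mathbf{x} - b = 0 \]
where $\cdot$ denotes the dot product. The vector $\mathbf{w}$ is a normal vector: it is perpendicular to the hyperplane. The parameter $b / \|\mathbf{w}\|$ determines the offset of the hyperplane from the origin along the normal vector $\mathbf{w}$.
[Figure: Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.]
We want to choose $\mathbf{w}$ and $b$ to maximize the margin, or distance between the parallel hyperplanes that are as far apart as possible while still separating the data. These hyperplanes can be described by the equations
\[ \mathbf{w} \cdot \mathbf{x} - b = 1 \]
and
\[ \mathbf{w} \cdot \mathbf{x} - b = -1. \]
Note that if the training data are linearly separable, we can select the two hyperplanes of the margin in a way that there are no points between them and then try to maximize their distance. By using geometry, we find the distance between these two hyperplanes is $2 / \|\mathbf{w}\|$, so we want to minimize $\|\mathbf{w}\|$. As we also have to prevent data points falling into the margin, we add the following constraint: for each $i$ either
\[ \mathbf{w} \cdot \mathbf{x}_i - b \ge 1 \quad \text{for } \mathbf{x}_i \text{ of the first class} \]
or
\[ \mathbf{w} \cdot \mathbf{x}_i - b \le -1 \quad \text{for } \mathbf{x}_i \text{ of the second.} \]
This can be rewritten as:
\[ c_i (\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 \quad \text{for all } 1 \le i \le n. \]
We can put this together to get the optimization problem:
Minimize (in $\mathbf{w}, b$)
\[ \|\mathbf{w}\| \]
subject to (for any $i = 1, \dots, n$)
\[ c_i (\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1. \]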
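A small numerical check can make the formalization concrete. The sketch below tests the constraint c_i (w · x_i − b) ≥ 1 for every training point and computes the margin width 2 / ||w||. The four data points and the three candidate hyperplanes are hypothetical values chosen for this example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin_width(w):
    """Distance between the hyperplanes w.x - b = 1 and w.x - b = -1."""
    return 2.0 / math.sqrt(dot(w, w))

def satisfies_constraints(w, b, points, classes):
    """True if c_i * (w.x_i - b) >= 1 holds for every training point."""
    return all(c * (dot(w, x) - b) >= 1 for x, c in zip(points, classes))

# Hypothetical, linearly separable training data.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
classes = [1, 1, -1, -1]

wide = ((0.5, 0.5), 1.0)       # feasible, small ||w||, wide margin
narrow = ((1.0, 1.0), 2.0)     # feasible, larger ||w||, narrower margin
too_wide = ((0.25, 0.25), 0.5) # ||w|| too small: points fall inside the margin

for w, b in (wide, narrow, too_wide):
    print(w, b, satisfies_constraints(w, b, points, classes), margin_width(w))
```

Both of the first two hyperplanes separate the data, but the one with the smaller norm has the wider margin, which is why the problem is stated as minimizing ||w|| subject to the constraints.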
Primal form
The optimization problem presented in the preceding section is difficult to solve because it depends on $\|\mathbf{w}\|$, the norm of $\mathbf{w}$, which involves a square root. Fortunately it is possible to alter the problem by substituting $\|\mathbf{w}\|$ with $\tfrac{1}{2}\|\mathbf{w}\|^2$ without changing the solution (the minima of the original and the modified problem have the same $\mathbf{w}$ and $b$). This is a quadratic programming (QP) optimization problem. More clearly:
Minimize (in $\mathbf{w}, b$)
\[ \frac{1}{2} \|\mathbf{w}\|^2 \]
subject to (for any $i = 1, \dots, n$)
\[ c_i (\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1. \]
The factor of 1/2 is used for mathematical convenience. This problem can now be solved by standard quadratic programming techniques and programs.
Dual form
Writing the classification rule in its unconstrained dual form reveals that the maximum-margin hyperplane, and therefore the classification task, is a function only of the support vectors, the training data that lie on the margin. The dual of the SVM can be shown to be the following optimization problem:
Maximize (in $\alpha_i$)
$\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$
subject to (for any $i = 1, \dots, n$)
$\alpha_i \ge 0$
and
$\sum_{i=1}^{n} \alpha_i y_i = 0$
The $\alpha$ terms constitute a dual representation for the weight vector in terms of the training set:
$\mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i$
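To make the dual representation concrete, here is a small sketch that recovers the weight vector as $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ and compares the dual objective with the primal objective $\frac{1}{2}\|\mathbf{w}\|^2$. The two-point data set and its multipliers were worked out by hand for this illustration, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def weight_from_dual(alphas, xs, ys):
    """Recover w = sum_i alpha_i * y_i * x_i."""
    w = [0.0] * len(xs[0])
    for a, x, y in zip(alphas, xs, ys):
        for k, xk in enumerate(x):
            w[k] += a * y * xk
    return w

def dual_objective(alphas, xs, ys):
    """sum_i alpha_i - 1/2 * sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    n = len(xs)
    quad = sum(alphas[i] * alphas[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
               for i in range(n) for j in range(n))
    return sum(alphas) - 0.5 * quad

# Toy problem (assumed for illustration): one point per class, both are
# support vectors. The multipliers satisfy alpha_i >= 0 and
# sum_i alpha_i * y_i = 0.
xs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
alphas = [0.25, 0.25]

w = weight_from_dual(alphas, xs, ys)
# For a support vector, y_i * (w.x_i - b) = 1, hence b = w.x_i - y_i.
b = dot(w, xs[0]) - ys[0]
primal = 0.5 * dot(w, w)
dual = dual_objective(alphas, xs, ys)
print(w, b, primal, dual)
```

For this toy problem the primal and dual objective values coincide, as expected at the optimum of a convex QP.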
Biased and unbiased hyperplanes
For simplicity, it is sometimes required that the hyperplane passes through the origin of the coordinate system. Such hyperplanes are called unbiased, whereas general hyperplanes not necessarily passing through the origin are called biased. An unbiased hyperplane can be enforced by setting b = 0 in the primal optimization problem. The corresponding dual is identical to the dual given above without the equality constraint
$\sum_{i=1}^{n} \alpha_i y_i = 0$
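Transductive support vector machines
Transductive support vector machines extend SVMs in that they also take into account structural properties (e.g. correlational structures) of the data set to be classified. Here, in addition to the training set $\mathcal{D}$, the learner is also given a set
$\mathcal{D}^\star = \{ \mathbf{x}^\star_j \mid \mathbf{x}^\star_j \in \mathbb{R}^p \}_{j=1}^{k}$
of test examples to be classified. Formally, a transductive support vector machine is defined by the following primal optimization problem:
Minimize (in $\mathbf{w}, b, \mathbf{y}^\star$)
$\frac{1}{2}\|\mathbf{w}\|^2$
subject to (for any $i = 1, \dots, n$ and any $j = 1, \dots, k$)
$y_i (\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1$,
$y^\star_j (\mathbf{w} \cdot \mathbf{x}^\star_j - b) \ge 1$,
and
$y^\star_j \in \{-1, 1\}$
Transductive support vector machines were introduced by Vladimir Vapnik in 1998.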
Properties SVMs belong to a family of generalized linear classifiers. They can also be considered a special case of Tikhonov regularization. A special property is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers.
Extensions to the linear SVM
Soft margin
In 1995, Corinna Cortes and Vladimir Vapnik suggested a modified maximum-margin idea that allows for mislabelled examples. If there exists no hyperplane that can split the "yes" and "no" examples, the soft margin method will choose a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. The method introduces slack variables, $\xi_i$, which measure the degree of misclassification of the datum $\mathbf{x}_i$:
$y_i (\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 - \xi_i$, for all $1 \le i \le n$ (2)
The objective function is then increased by a function which penalizes non-zero $\xi_i$, and the optimization becomes a trade-off between a large margin and a small error penalty. If the penalty function is linear, the equation (3) now transforms to
$\min \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i} \xi_i$ such that $y_i (\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 - \xi_i$, $\xi_i \ge 0$, for all $1 \le i \le n$
The constraint in (2), along with the objective of minimizing $\|\mathbf{w}\|$, can be handled using Lagrange multipliers. The key advantage of a linear penalty function is that the slack variables vanish from the dual problem, with the constant C appearing only as an additional constraint on the Lagrange multipliers. Non-linear penalty functions have been used, particularly to reduce the effect of outliers on the classifier, but unless care is taken, the problem becomes non-convex, and thus it is considerably more difficult to find a global solution.
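As a rough illustration of the trade-off, the sketch below evaluates the slack variables and the linearly penalized objective for a fixed candidate hyperplane. It assumes the slack takes its smallest feasible value, $\xi_i = \max(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i - b))$, and the objective $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i$; the data and function names are invented for this example.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def slack(w, b, x, y):
    """Smallest xi_i >= 0 with y_i * (w.x_i - b) >= 1 - xi_i."""
    return max(0.0, 1.0 - y * (dot(w, x) - b))

def soft_margin_objective(w, b, xs, ys, C):
    """1/2 * ||w||^2 + C * sum_i xi_i for a given (w, b)."""
    penalty = sum(slack(w, b, x, y) for x, y in zip(xs, ys))
    return 0.5 * dot(w, w) + C * penalty

# Hypothetical data: the third sample is labelled -1 but lies on the
# positive side of the candidate hyperplane, so it needs slack.
xs = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
ys = [1, -1, -1]
w, b = (1.0, 0.0), 0.0

slacks = [slack(w, b, x, y) for x, y in zip(xs, ys)]
print(slacks)                                       # per-sample slack
print(soft_margin_objective(w, b, xs, ys, C=1.0))   # mild penalty
print(soft_margin_objective(w, b, xs, ys, C=10.0))  # heavy penalty
```

A slack above 1 means the sample is on the wrong side of the separating hyperplane; raising C makes such violations more expensive relative to the margin term.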
Non-linear classification
The original optimal hyperplane algorithm proposed by Vladimir Vapnik in 1963 was a linear classifier. However, in 1992, Bernhard Boser, Isabelle Guyon and Vapnik suggested a way to create non-linear classifiers by applying the kernel trick (originally proposed by Aizerman et al.) to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a non-linear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in the transformed feature space. The transformation may be non-linear and the transformed space high dimensional; thus, though the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in the original input space. If the kernel used is a Gaussian radial basis function, the corresponding feature space is a Hilbert space of infinite dimension. Maximum-margin classifiers are well regularized, so the infinite dimension does not spoil the results. Some common kernels include:
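Polynomial (homogeneous): $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j)^d$
Polynomial (inhomogeneous): $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^d$
Radial basis function: $k(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$, for γ > 0
Gaussian radial basis function: $k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$
Hyperbolic tangent: $k(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\kappa \, \mathbf{x}_i \cdot \mathbf{x}_j + c)$, for some (not every) κ > 0 and c < 0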
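The kernels listed above can be written as plain functions. This is a minimal sketch in standard-library Python; the function names, default parameter values and sample points are assumptions chosen for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def squared_distance(u, v):
    """Squared Euclidean distance ||u - v||^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def polynomial_kernel(x, z, d=2, c=0.0):
    """(x.z + c)^d; c = 0 is homogeneous, c = 1 inhomogeneous."""
    return (dot(x, z) + c) ** d

def rbf_kernel(x, z, gamma=0.5):
    """exp(-gamma * ||x - z||^2), gamma > 0."""
    return math.exp(-gamma * squared_distance(x, z))

def gaussian_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2)): an RBF with gamma = 1/(2 sigma^2)."""
    return rbf_kernel(x, z, gamma=1.0 / (2.0 * sigma ** 2))

def tanh_kernel(x, z, kappa=1.0, c=-1.0):
    """tanh(kappa * x.z + c), for suitable kappa > 0 and c < 0."""
    return math.tanh(kappa * dot(x, z) + c)

def gram_matrix(kernel, xs):
    """Matrix of kernel values K[i][j] = kernel(xs[i], xs[j])."""
    return [[kernel(a, b) for b in xs] for a in xs]

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(rbf_kernel, points)
print(polynomial_kernel((1.0, 2.0), (3.0, 4.0)))          # homogeneous, d = 2
print(polynomial_kernel((1.0, 2.0), (3.0, 4.0), c=1.0))   # inhomogeneous
print(K)
```

In the kernelized algorithm, every dot product between samples is replaced by a call to one of these functions, so only the Gram matrix is needed and the feature map is never computed explicitly.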
Multiclass SVM
Multiclass SVM aims to assign labels to instances by using support vector machines, where the labels are drawn from a finite set of several elements. The dominant approach is to reduce the single multiclass problem to multiple binary problems. Each of the problems yields a binary classifier, which is assumed to produce an output function that gives relatively large values for examples from the positive class and relatively small values for examples belonging to the negative class. Two common methods to build such binary classifiers are where each classifier distinguishes (i) between one of the labels and the rest (one-versus-all) or (ii) between every pair of classes (one-versus-one). Classification of new instances in the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class. Classification in the one-versus-one case is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, the vote for the assigned class is increased by one, and finally the class with the most votes determines the instance classification.
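The two decision rules can be sketched as follows. The binary classifiers here are hypothetical stand-ins (simple distance-based scores on a one-dimensional input) rather than trained SVMs, and all names are assumptions made for this illustration; only the winner-takes-all and max-wins logic is the point.

```python
from collections import Counter
from itertools import combinations

def one_vs_all_predict(scorers, x):
    """Winner-takes-all: pick the label whose classifier scores highest."""
    return max(scorers, key=lambda label: scorers[label](x))

def one_vs_one_predict(pairwise, x):
    """Max-wins voting: each pairwise classifier casts one vote."""
    votes = Counter()
    for (label_a, label_b), decide in pairwise.items():
        # A positive output votes for label_a, otherwise for label_b.
        votes[label_a if decide(x) > 0 else label_b] += 1
    # Ties are broken by insertion order, an arbitrary choice here.
    return votes.most_common(1)[0][0]

# Stand-in classifiers: each class has a centre on the real line.
centres = {"low": 0.0, "mid": 5.0, "high": 10.0}

def make_scorer(centre):
    # Larger (less negative) output means "more like this class".
    return lambda x: -abs(x - centre)

def make_pairwise(centre_a, centre_b):
    # Positive when x is closer to centre_a than to centre_b.
    return lambda x: abs(x - centre_b) - abs(x - centre_a)

scorers = {label: make_scorer(c) for label, c in centres.items()}
pairwise = {(a, b): make_pairwise(centres[a], centres[b])
            for a, b in combinations(centres, 2)}

print(one_vs_all_predict(scorers, 4.0))
print(one_vs_one_predict(pairwise, 4.0))
```

With k classes, one-versus-all needs k binary classifiers, while one-versus-one needs k(k-1)/2, one per pair of classes.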
Structured SVM Support vector machines have been generalized to Structured SVM, where the label space is structured and of possibly infinite size.
Regression
A version of an SVM for regression was proposed in 1996 by Vladimir Vapnik, Harris Drucker, Chris Burges, Linda Kaufman and Alex Smola. This method is called support vector regression (SVR). The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by SVR depends only on a subset of the training data, because the cost function for building the model ignores any training data that are close (within a threshold ε) to the model prediction.
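The thresholding behaviour described above corresponds to what is commonly called the ε-insensitive loss, $\max(0, |y - f(x)| - \varepsilon)$. The sketch below assumes that form; the function name and the sample numbers are invented for illustration.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon-tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Hypothetical targets and model predictions.
targets = [1.0, 2.0, 3.0]
predictions = [1.05, 2.5, 3.0]

losses = [epsilon_insensitive_loss(t, p) for t, p in zip(targets, predictions)]
# Only samples with non-zero loss influence the fitted model.
contributing = [i for i, loss in enumerate(losses) if loss > 0.0]
print(losses, contributing)
```

Here only the second sample lies outside the tube of half-width ε around the prediction, so it is the only one that contributes to the cost.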
Implementation
The parameters of the maximum-margin hyperplane are derived by solving the optimization problem. There exist several specialized algorithms for quickly solving the QP problem that arises from SVMs, mostly reliant on heuristics for breaking the problem down into smaller, more manageable chunks. A common method for solving the QP problem is Platt's Sequential Minimal Optimization (SMO) algorithm, which breaks the problem down into 2-dimensional sub-problems that may be solved analytically, eliminating the need for a numerical optimization algorithm such as conjugate gradient methods. Another approach is to use an interior point method that uses Newton-like iterations to find a solution of the Karush-Kuhn-Tucker conditions of the primal and dual problems. Instead of solving a sequence of broken-down problems, this approach directly solves the problem as a whole. To avoid solving a linear system involving the large kernel matrix, a low-rank approximation to the matrix is often used in the kernel trick.