Diploma Thesis - June 2009 - Draft

"Gheorghe Asachi" Technical University, Faculty of Automatic Control and Computer Engineering, Department of Automatic Control and Applied Informatics, Iaşi

University Duisburg – Essen Faculty Automatic Control and Complex Sciences Duisburg

Fault diagnosis in linear and nonlinear systems Diploma thesis Lucian-Adrian Huluță

2009

Table of Contents

1. Introduction .......................................................... 3
2. The Learning Methodology .............................................. 4
   2.1. Supervised Learning .............................................. 4
   2.2. Learning and Generalization ...................................... 6
   2.3. Improving Generalization ......................................... 7
   2.4. Fault diagnosis as a classification problem ...................... 8
3. Support Vector Machines for fault diagnosis .......................... 8
   3.1. Two-Class Support Vector Machines ................................ 8
      3.1.1. Hard-Margin Support Vector Machines ......................... 9
      3.1.2. L1 Soft-Margin Support Vector Machines ...................... 14
      3.1.3. Mapping to a High-Dimensional Space ......................... 18
         3.1.3.1. Kernel tricks .......................................... 18
         3.1.3.2. Kernels ................................................ 18
      3.1.4. L2 Soft-Margin Support Vector Machines ...................... 19
   3.2. Multiclass Support Vector Machines ............................... 22
      3.2.1. One-against-all ............................................. 22
      3.2.2. One-against-one ............................................. 22
   3.3. Support Vector Regression ........................................ 22
   3.4. Advantages and Disadvantages ..................................... 23
4. Data-driven methods for fault diagnosis .............................. 24
   4.1. Principal component analysis ..................................... 24
   4.2. Projection algorithms ............................................ 24
5. Application to Tennessee Eastman process as a benchmark .............. 24
   5.1. Problem formulation .............................................. 26
   5.2. Support vector machines & Kernel methods on TE ................... 27
   5.3. PCA on TE ........................................................ 28
6. List of Figures ...................................................... 30
7. Abbreviations ........................................................ 30
8. Nomenclature ......................................................... 30
9. References ........................................................... 31
10. Annex 1 – Code listing .............................................. 32
   10.1. PCA, T2 statistic and Q statistic – algorithms .................. 32
   10.2. Support vector machines & Kernel methods ........................ 34


1. Introduction

Given the serious consequences of accidents in chemical plants caused by faulty operating conditions, the importance of incipient fault detection and isolation is well recognized. Although catastrophes and disasters due to chemical plant failures may be infrequent, minor incidents may be very common, resulting in personal injury, illness, and material losses that cost society large amounts of resources every year.

It is difficult for process operators to detect the existence of a fault by examining time-sequenced data among the hundreds of variables recorded in modern chemical processes. The isolation of the fault, for discerning its root causes, is an even more difficult problem, as it requires managing a great amount of information. These supervision difficulties point out the necessity of developing decision-support tools to help process operators manage plant incidents.

Different techniques have been developed to detect and diagnose faults in complex chemical plants, together with computer-aided decision-support tools. In parallel with the growing computational power of recent decades, diverse statistical techniques have been developed to carry out effective process fault detection. The difficulty of developing accurate first-principle models, the highly coupled nature of chemical processes, and the overwhelming amount of stored data have made these statistical techniques one of the most successfully applied tools for process monitoring in the chemical industry. On the other hand, the fault diagnosis problem, the isolation and characterization of the detected anomalies, has not been properly solved yet, as it still presents important practical limitations. Three main methodologies have been followed in fault diagnosis of chemical processes:

1. Quantitative model-based methods,
2. Qualitative model-based methods,
3. Process history-based methods.
Several approaches based on the third alternative, such as neural networks or fuzzy systems, have been implemented to overcome the lack of useful quantitative and qualitative process knowledge required by the first two FDS approaches. Besides not requiring first-principles or qualitative knowledge, data-based models can deal with the intrinsic non-linearity of the variables involved in most chemical processes at no additional cost. Even though they are also limited by difficulties in class generalization, multiple


fault diagnosis, or the handling of data uncertainties, they are the most popular techniques applied in industry because of their simplicity and their capability to handle non-linearity. In any case, none of the previously reported works in the literature guarantees enough confidence and reliability in a general diagnosis problem; all the existing practical limitations make this area an open field requiring further work. This work proposes a new classification approach to the fault diagnosis problem using a novel technique recently developed in the machine learning area. This novel technique is a process history-based methodology that has shown higher performance than other reported pattern classifiers in different technical areas. The proposed approach uses support vector machines (SVM), a kernel-based method from Statistical Learning Theory that has succeeded in many classification problems (e.g., computational linguistics). Recently, it has been applied to fault diagnosis in chemical plants using the Tennessee Eastman (TE) benchmark.

2. The Learning Methodology

The construction of machines capable of learning from experience has long been the object of both philosophical and technical debate. The technical side of the debate has received an enormous impetus from the advent of electronic computers. They have demonstrated that machines can display a significant level of learning ability, though the boundaries of this ability are far from being clearly defined. The availability of reliable learning systems is of strategic importance, as there are many tasks that cannot be solved by classical programming techniques, since no mathematical model of the problem is available. For example, it is not known how to write a computer program to perform hand-written character recognition, though there are plenty of examples available. It is therefore natural to ask whether a computer could be trained to recognize the letter 'A' from examples - after all, this is the way humans learn to read. We will refer to this approach to problem solving as the learning methodology. The same reasoning applies to the problem of finding genes in a DNA sequence, filtering email, detecting or recognizing objects in machine vision, and so on. Solving each of these problems has the potential to revolutionize some aspect of our life, and for each of them machine learning algorithms could provide the key to its solution.

2.1. Supervised Learning

When computers are applied to solve a practical problem it is usually the case that the method of deriving the required output from a set of inputs can be described explicitly. The


task of the system designer, and eventually of the programmer implementing the specifications, will be to translate that method into a sequence of instructions which the computer will follow to achieve the desired effect. As computers are applied to solve more complex problems, however, situations can arise in which there is no known method for computing the desired output from a set of inputs, or where that computation may be very expensive. Examples of this type of situation might be modeling a complex chemical reaction, where the precise interactions of the different reactants are not known, or classification of protein types based on the DNA sequence from which they are generated, or the classification of credit applications into those who will default and those who will repay the loan. These tasks cannot be solved by a traditional programming approach since the system designer cannot precisely specify the method by which the correct output can be computed from the input data. An alternative strategy for solving this type of problem is for the computer to attempt to learn the input/output functionality from examples, in the same way that children learn which cars are sports cars simply by being told which of a large number of cars are sporty, rather than by being given a precise specification of sportiness. The approach of using examples to synthesize programs is known as the learning methodology, and in the particular case when the examples are input/output pairs it is called supervised learning. The examples of input/output functionality are referred to as the training data. The input/output pairings typically reflect a functional relationship mapping inputs to outputs, though this is not always the case, as for example when the outputs are corrupted by noise. When an underlying function from inputs to outputs exists, it is referred to as the target function.
The estimate of the target function which is learnt or output by the learning algorithm is known as the solution of the learning problem. In the case of classification this function is sometimes referred to as the decision function. The solution is chosen from a set of candidate functions which map from the input space to the output domain. Usually we will choose a particular set or class of candidate functions known as hypotheses before we begin trying to learn the correct function. For example, so-called decision trees are hypotheses created by constructing a binary tree with simple decision functions at the internal nodes and output values at the leaves. Hence, we can view the choice of the set of hypotheses (or hypothesis space) as one of the key ingredients of the learning strategy. The algorithm which takes the training data as input and selects a hypothesis from the hypothesis space is the second important ingredient. It is referred to as the learning algorithm. In the case of learning to distinguish sports cars the output is a simple yes/no tag which we can think of as a binary output value. For the problem of recognizing protein types, the output value will be one of a finite number of categories, while the output values when modeling a chemical reaction might be the concentrations of the reactants given as real values. A learning problem with binary outputs is referred to as a binary classification problem, one with a finite number of categories as multi-class classification, while for real-valued outputs the problem becomes known as regression.
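As a deliberately tiny sketch of these two ingredients (not taken from the thesis; the data and function names are invented), consider a hypothesis space of one-dimensional threshold functions and a learning algorithm that selects the hypothesis with the fewest mistakes on the training data:

```python
# Tiny illustration (invented data): the hypothesis space is the set of
# threshold functions h_t(x) = +1 if x > t else -1, and the learning
# algorithm returns the candidate threshold with the fewest training mistakes.

def learn_threshold(training_data):
    """training_data: list of (input, label) pairs with labels +1 / -1."""
    candidates = [x for x, _ in training_data]   # thresholds taken from the data
    def mistakes(t):
        return sum((1 if x > t else -1) != y for x, y in training_data)
    return min(candidates, key=mistakes)

# Toy binary classification problem: the target concept is "x > 5".
data = [(1, -1), (2, -1), (4, -1), (6, 1), (7, 1), (9, 1)]
t = learn_threshold(data)                        # a consistent hypothesis: t = 4
decision = lambda x: 1 if x > t else -1          # the learnt decision function
```

Here `decision` is the solution of the learning problem: a hypothesis selected from the hypothesis space using only the training pairs, never an explicit specification of the target function.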


There are other types of learning, for example unsupervised learning considers the case where there are no output values and the learning task is to gain some understanding of the process that generated the data. This type of learning includes density estimation, learning the support of a distribution, clustering, and so on. There are also models of learning which consider more complex interactions between a learner and their environment. Perhaps the simplest case is when the learner is allowed to query the environment about the output associated with a particular input. The study of how this affects the learner's ability to learn different tasks is known as query learning. Further complexities of interaction are considered in reinforcement learning, where the learner has a range of actions at their disposal which they can take to attempt to move towards states where they can expect high rewards. The learning methodology can play a part in reinforcement learning if we treat the optimal action as the output of a function of the current state of the learner. There are, however, significant complications since the quality of the output can only be assessed indirectly as the consequences of an action become clear.

2.2. Learning and Generalization

The quality of an on-line learning algorithm can be assessed in terms of the number of mistakes it makes during the training phase. It is not immediately clear, however, how we can assess the quality of a hypothesis generated during batch learning. Early machine learning algorithms aimed to learn representations of simple symbolic functions that could be understood and verified by experts. Hence, the goal of learning in this paradigm was to output a hypothesis that performed the correct classification of the training data, and early learning algorithms were designed to find such an accurate fit to the data. Such a hypothesis is said to be consistent. There are two problems with the goal of generating a verifiable consistent hypothesis. The first is that the function we are trying to learn may not have a simple representation and hence may not be easily verified in this way. An example of this situation is the identification of genes within a DNA sequence. Certain subsequences are genes and others are not, but there is no simple way to categorize which are which. The second problem is that training data are frequently noisy, and so there is no guarantee that there is an underlying function which correctly maps the training data. The example of credit checking is clearly in this category, since the decision to default may be a result of factors simply not available to the system. A second example would be the classification of web pages into categories, which again can never be an exact science. The type of data that is of interest to machine learning practitioners is increasingly of these two types, hence rendering the proposed measure of quality difficult to implement. There is, however, a more fundamental problem with this approach: even when we can find a hypothesis that is consistent with the training data, it may not make correct classifications of


unseen data. The ability of a hypothesis to correctly classify data not in the training set is known as its generalization, and it is this property that we shall aim to optimize. Shifting our goal to generalization removes the need to view our hypothesis as a correct representation of the true function. If the hypothesis gives the right output it satisfies the generalization criterion, which in this sense has now become a functional measure rather than a descriptional one. In this sense the criterion places no constraints on the size or on the ‘meaning' of the hypothesis - for the time being these can be considered to be arbitrary.

2.3. Improving Generalization

The generalization criterion places an altogether different constraint on the learning algorithm. This is most amply illustrated by the extreme case of rote learning. Many classical algorithms of machine learning are capable of representing any function and for difficult training sets will give a hypothesis that behaves like a rote learner. By a rote learner we mean one that correctly classifies the data in the training set, but makes essentially uncorrelated predictions on unseen data. For example, decision trees can grow so large that there is a leaf for each training example. Hypotheses that become too complex in order to become consistent are said to overfit. One way of trying to control this difficulty is to restrict the size of the hypothesis, for example pruning the size of the decision tree. The approach that we will adopt is to motivate the trade-off by reference to statistical bounds on the generalization error. These bounds will typically depend on certain quantities such as the margin of the classifier, and hence motivate algorithms which optimize the particular measure. The drawback of such an approach is that the algorithm is only as good as the result that motivates it. On the other hand the strength is that the statistical result provides a well-founded basis for the approach, hence avoiding the danger of a heuristic that may be based on a misleading intuition. The fact that the algorithm design is based on a statistical result does not mean that we ignore the computational complexity of solving the particular optimization problem. We are interested in techniques that will scale from toy problems to large realistic datasets of hundreds of thousands of examples. It is only by performing a principled analysis of the computational complexity that we can avoid settling for heuristics that work well on small examples, but break down once larger training sets are used. The theory of computational complexity identifies two classes of problems. 
For the first class there exist algorithms that run in time polynomial in the size of the input, while for the second the existence of such an algorithm would imply that any problem for which we can check a solution in polynomial time can also be solved in polynomial time. This second class of problems is known as the NP-complete problems and it is generally believed that these problems cannot be solved efficiently. If no restriction is placed over the set of all possible hypotheses (that is all possible functions from the input space to the output domain), then learning is impossible since no


amount of training data will tell us how to classify unseen examples. Problems also arise if we allow ourselves the freedom of choosing the set of hypotheses after seeing the data, since we can simply place all of the prior probability on the correct hypothesis. In this sense it is true that all learning systems have to make some prior assumption of a Bayesian type, often called the learning bias.
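The contrast between a rote learner and a restricted hypothesis space can be seen in a small, hypothetical experiment (all names and data invented): the true concept is a threshold on the integers, the training labels carry noise, and generalization is measured on held-out inputs:

```python
# Hypothetical experiment (invented setup): the true concept is "x > 50" on
# integers 0..99, and 10% of training labels are flipped to simulate noise.
import random

random.seed(0)

def true_label(x):
    return 1 if x > 50 else -1

train = [(x, true_label(x) if random.random() > 0.1 else -true_label(x))
         for x in random.sample(range(100), 40)]
test = [(x, true_label(x)) for x in range(100)]

# Rote learner: consistent with the (noisy) training set, uninformed elsewhere.
memory = dict(train)
rote = lambda x: memory.get(x, 1)                # arbitrary guess on unseen inputs

# Restricted hypothesis space: a single threshold, so the noise cannot be fitted.
def mistakes(t, data):
    return sum((1 if x > t else -1) != y for x, y in data)

t = min(range(100), key=lambda t: mistakes(t, train))
threshold = lambda x: 1 if x > t else -1

rote_acc = sum(rote(x) == y for x, y in test) / len(test)
threshold_acc = sum(threshold(x) == y for x, y in test) / len(test)
```

The rote learner is consistent on the training set yet generalizes poorly, while the threshold hypothesis, unable to fit the noise, tracks the true concept much more closely on unseen inputs.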

2.4. Fault diagnosis as a classification problem

Following the previous analysis, the fault diagnosis problem can be formulated as a classical classification problem in order to extend the concepts successfully applied in machine learning theory. In this sense, better generalization capabilities or more accurate and flexible pattern-training algorithms could overcome limitations on fault isolation or multiple-fault management. By adopting this classification view, the faults to be managed in a process industry become different classes in a general classification problem. As in any classification problem, it is crucial to choose meaningful variables or features that properly represent the diagnosis problem, to adopt suitable learning methodologies, and to follow rigorous validation procedures to avoid over-fitting of the FDS. Under this view, the classical fault detection problem is a binary classification (BC) problem, consisting of determining whether the process is in or out of control. The global fault diagnosis problem is a multi-class classification (MC) problem, in which many classes are involved. To solve both kinds of problems, different learning algorithms have been developed in the machine learning area, obtaining very promising results (e.g., Naive Bayes, K-Nearest Neighbours, SVM, AdaBoost.MH).
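As an illustrative sketch of this classification view (entirely hypothetical sensor values; a nearest-centroid rule stands in for any trained classifier such as an SVM), fault detection reduces to labelling feature vectors of process measurements:

```python
# Entirely hypothetical sensor snapshots: each tuple is a feature vector of
# two process measurements. A nearest-centroid rule stands in for a trained
# classifier (SVM, k-NN, ...); "normal" vs. "faulty" is the binary label.
import math

normal = [(1.0, 0.2), (0.9, 0.3), (1.1, 0.1)]    # in-control snapshots
faulty = [(2.0, 1.5), (2.2, 1.4), (1.9, 1.6)]    # out-of-control snapshots

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

c_normal, c_faulty = centroid(normal), centroid(faulty)

def diagnose(x):
    # classify a new snapshot by the nearer class centroid
    return "normal" if math.dist(x, c_normal) < math.dist(x, c_faulty) else "faulty"
```

In the multi-class (fault isolation) setting, each fault type would simply contribute its own labelled class of snapshots instead of the single "faulty" class.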

3. Support Vector Machines for fault diagnosis

3.1. Two-Class Support Vector Machines

In training a classifier, we usually try to maximize classification performance on the training data. But if the classifier fits the training data too closely, the classification ability for unknown data, i.e., the generalization ability, is degraded. This phenomenon is called overfitting. Namely, there is a trade-off between the generalization ability and fitting to the training data. Determination of decision functions using input-output pairs is called training. Conventional training methods determine the indirect decision functions so that each training input is correctly classified into the class designated by the associated training output. Figure 1 shows an example of the decision functions obtained when the training data of two classes do not overlap.


Figure 1 – Simple example of classification

Thus there are infinitely many possible positions of decision functions that correctly classify the training data. Although the generalization ability is directly affected by these positions, conventional training methods do not consider this. In an SVM, the direct decision function that maximizes the generalization ability is determined for a two-class problem. Assuming that the training data of different classes do not overlap, the decision function is determined so that its distance from the training data is maximized. We call this the optimal decision function (Figure 2). For a two-class problem, a support vector machine is trained so that the direct decision function maximizes the generalization ability.

Figure 2 – Optimal decision function

3.1.1. Hard-Margin Support Vector Machines

Let M m-dimensional training inputs x_i (i = 1..M) belong to Class 1 or Class 2, and let the associated labels be y_i = 1 for Class 1 and y_i = −1 for Class 2. If these data are linearly separable, we can determine the decision function:


D(x) = w^T x + b,   (1)

where w is an m-dimensional vector, b is a bias term, and for i = 1..M

w^T x_i + b > 0 for y_i = 1,   w^T x_i + b < 0 for y_i = −1.   (2)

Because the training data are linearly separable, no training data satisfy w^T x + b = 0. Thus, to control separability, instead of (2) we consider the following inequalities:

w^T x_i + b ≥ 1 for y_i = 1,   w^T x_i + b ≤ −1 for y_i = −1.   (3)

Here, the 1 and −1 on the right-hand sides of the inequalities can be replaced by any constant a (> 0) and −a, respectively; dividing both sides of the inequalities by a recovers (3). Equation (3) is equivalent to

y_i (w^T x_i + b) ≥ 1,   i = 1..M.   (4)

The hyperplane

D(x) = w^T x + b = c,   −1 < c < 1,   (5)

forms a separating hyperplane that separates x_i (i = 1..M). When c = 0, the separating hyperplane lies in the middle of the two hyperplanes with c = 1 and c = −1. The distance between the separating hyperplane and the training datum nearest to it is called the margin (Figure 3).

Figure 3 – Margin

Assuming that the hyperplanes D(x) = 1 and D(x) = −1 each include at least one training datum, the hyperplane D(x) = 0 has the maximum margin for −1 < c < 1. The region {x | −1 ≤ D(x) ≤ 1} is the generalization region for the decision function. The generalization ability depends on


the location of the separating hyperplane, and the hyperplane with the maximum margin is called the optimal separating hyperplane (Figure 4). Because we can obtain the same optimal separating hyperplane even if we delete all the data that satisfy the strict inequalities in (3), the data that satisfy the equalities are called support vectors¹ (Figure 4).

Figure 4 – Optimal hyperplane

Assume that no outliers are included in the training data and that unknown test data will obey the same probability law as that of the training data. Then it is intuitively clear that the generalization ability is maximized if the optimal separating hyperplane is selected as the separating hyperplane. Now consider determining the optimal separating hyperplane. The Euclidean distance from a training datum x to the separating hyperplane is given by |D(x)| / ‖w‖. This can be shown as follows. Because the vector w is orthogonal to the separating hyperplane, the line that goes through x and is orthogonal to the hyperplane is given by a w/‖w‖ + x, where |a| is the Euclidean distance from x to the hyperplane. It crosses the hyperplane at the point where

D(a w/‖w‖ + x) = 0   (6)

is satisfied. Solving (6) for a, we obtain a = −D(x)/‖w‖. Then all the training data must satisfy

y_k D(x_k) / ‖w‖ ≥ δ,   k = 1..M,   (7)

where δ is the margin. Now if (w, b) is a solution, so is (a w, a b) for any positive scalar a. Thus we impose the following constraint:

δ ‖w‖ = 1.   (8)
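The distance formula |D(x)|/‖w‖ and the projection argument behind (6) can be checked numerically; the following sketch uses assumed values for w, b, and x:

```python
# Numeric check of the distance formula (assumed values for w, b, and x):
# the Euclidean distance from x to the hyperplane D(x) = w^T x + b = 0
# equals |D(x)| / ||w||, the foot of the perpendicular being x + a w/||w||
# with a = -D(x)/||w|| as in (6).
import numpy as np

w = np.array([3.0, 4.0])
b = -5.0
x = np.array([2.0, 1.0])

norm_w = np.linalg.norm(w)                 # ||w|| = 5
D = w @ x + b                              # D(x) = 6 + 4 - 5 = 5
dist_formula = abs(D) / norm_w             # 5 / 5 = 1

a = -D / norm_w                            # signed step along w/||w||
foot = x + a * w / norm_w                  # lies on the hyperplane: D(foot) = 0
dist_direct = np.linalg.norm(x - foot)     # agrees with the formula
```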

From (7) and (8), to find the optimal hyperplane we need to find the w with the minimum Euclidean norm that satisfies (4). Therefore, the optimal separating hyperplane can be obtained by minimizing

Q(w) = (1/2) ‖w‖²   (9)

with respect to w and b subject to the constraints

y_i (w^T x_i + b) ≥ 1,   i = 1..M.   (10)

Here, the square of the Euclidean norm of w in (9) is taken to make the optimization problem quadratic programming. The assumption of linear separability means that there exist w and b that satisfy (10). We call the solutions that satisfy (10) feasible solutions. Because the optimization problem has a quadratic objective function with inequality constraints, even if the solutions are nonunique, the value of the objective function is unique. Thus nonuniqueness is not a problem for support vector machines. The variables of the convex optimization problem given by (9) and (10) are w and b, so the number of variables is the number of input variables plus 1: m + 1. When the number of input variables is small, we can solve (9) and (10) by the quadratic programming technique. But, as will be discussed later, because we map the input space into a high-dimensional feature space, in some cases with infinite dimensions, we convert (9) and (10) into the equivalent dual problem, whose number of variables is the number of training data. To do this, we first convert the constrained problem given by (9) and (10) into the unconstrained problem

Q(w, b, α) = (1/2) w^T w − Σ_{i=1}^{M} α_i (y_i (w^T x_i + b) − 1),   (11)

where α = (α_1, …, α_M)^T and the α_i are the nonnegative Lagrange multipliers. The optimal solution of (11) is given by the saddle point, where (11) is minimized with respect to w and b and maximized with respect to α_i (≥ 0), and it satisfies the following Karush-Kuhn-Tucker (KKT) conditions:

¹ This definition is imprecise: data can satisfy y_i (w^T x_i + b) = 1 and yet be deleted without changing the optimal separating hyperplane. Support vectors are defined precisely using the solution of the dual problem, as discussed later.


∂Q(w, b, α)/∂w = 0,   (12)

∂Q(w, b, α)/∂b = 0,   (13)

α_i (y_i (w^T x_i + b) − 1) = 0,   i = 1..M,   (14)

α_i ≥ 0,   i = 1..M.   (15)

Especially, the relations between the inequality constraints and their associated Lagrange multipliers given by (14) are called KKT complementarity conditions. In the following, if there is no confusion, the KKT complementarity conditions are called the KKT conditions. From (14), either α_i = 0, or α_i ≠ 0 and y_i (w^T x_i + b) = 1, must be satisfied. The training data x_i with α_i ≠ 0 are called support vectors.² Using (11), we reduce (12) and (13), respectively, to

w = Σ_{i=1}^{M} α_i y_i x_i   (16)

and

Σ_{i=1}^{M} α_i y_i = 0.   (17)

Substituting (16) and (17) into (11), we obtain the following dual problem. Maximize

Q(α) = Σ_{i=1}^{M} α_i − (1/2) Σ_{i,j=1}^{M} α_i α_j y_i y_j x_i^T x_j   (18)

with respect to α_i subject to the constraints

Σ_{i=1}^{M} α_i y_i = 0,   α_i ≥ 0,   i = 1..M.   (19)
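For a minimal two-point instance (data invented for illustration), the equality constraint in (19) collapses the dual (18) to a one-variable concave function, which can be maximized on a grid and mapped back to w via (16); the bias b is recovered by averaging over the support vectors:

```python
# Minimal two-point instance of the dual (18)-(19), with invented data.
# For x1 = (1, 0), y1 = +1 and x2 = (-1, 0), y2 = -1, the constraint
# sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a, so Q(alpha) collapses
# to a concave function of the single variable a.
import numpy as np

x = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])

H = (y[:, None] * y[None, :]) * (x @ x.T)        # H_ij = y_i y_j x_i^T x_j

def Q(a):
    alpha = np.array([a, a])                      # satisfies sum_i alpha_i y_i = 0
    return alpha.sum() - 0.5 * alpha @ H @ alpha  # the dual objective (18)

grid = np.linspace(0.0, 2.0, 2001)                # alpha_i >= 0 region
a_star = grid[np.argmax([Q(a) for a in grid])]    # maximizer: a = 0.5

alpha = np.array([a_star, a_star])
w = (alpha * y) @ x                               # (16): w = sum_i alpha_i y_i x_i
b = float(np.mean(y - x @ w))                     # bias averaged over support vectors
```

Both training points are support vectors here (positive α), and they satisfy y_i (w^T x_i + b) = 1, i.e., they lie exactly on the margin hyperplanes D(x) = ±1.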

² In the definition of support vectors, we exclude the data for which both α_i = 0 and y_i (w^T x_i + b) = 1 hold.

The formulated support vector machine is called the hard-margin support vector machine. Because

(1/2) Σ_{i,j=1}^{M} α_i α_j y_i y_j x_i^T x_j = (1/2) (Σ_{i=1}^{M} α_i y_i x_i)^T (Σ_{i=1}^{M} α_i y_i x_i) ≥ 0,   (20)

maximizing (18) under the constraints (19) is a concave quadratic programming problem. If a solution exists, namely, if the classification problem is linearly separable, the global optimal solution α_i (i = 1..M) exists. For quadratic programming, the values of the primal and dual objective functions coincide at the optimal solutions if they exist. This is called the zero duality gap. Data that are associated with positive α_i are support vectors for Classes 1 and 2. Then from (16) the decision function is given by

D(x) = Σ_{i∈S} α_i y_i x_i^T x + b,   (21)

where S is the set of support vector indices, and from the KKT conditions given by (14), b is given by

b = y_i − w^T x_i,   (22)

where x_i is a support vector. From the standpoint of precision of calculations, it is better to take the average among the support vectors as follows:

b = (1/|S|) Σ_{i∈S} (y_i − w^T x_i).   (23)

Then an unknown datum x is classified into:

Class 1 if D(x) > 0,   Class 2 if D(x) < 0.   (24)

If D(x) = 0, x is on the boundary and thus is unclassifiable. When the training data are separable, the region {x | −1 ≤ D(x) ≤ 1} is a generalization region.

3.1.2. L1 Soft-Margin Support Vector Machines

In hard-margin support vector machines, we assumed that the training data are linearly separable. When the data are linearly inseparable, there is no feasible solution, and the hard-margin support vector machine is unsolvable. Here we extend the support vector machine so that it is applicable to the inseparable case. To allow inseparability, we introduce the nonnegative slack variables ξ_i (≥ 0) into (4):


y_i (w^T x_i + b) ≥ 1 − ξ_i,   i = 1..M.   (25)

Figure 5 – L1 Soft-Margin slack variables

By the slack variables 𝜉𝑖 , feasible solutions always exist. For the training data 𝑥𝑖 , if 0 < 𝜉𝑖 < −1 (𝜉𝑖 in Figure 5), the data do not have the maximum margin but still correctly classified. But if 𝜉𝑖 > 1 (𝜉𝑗 in Figure 5) the data are misclassified by the optimal hyperplane. To obtain the optimal hyperplane in which the number of training data that do not have the maximum margin as minimum, we need to minimize 𝑀

𝑄 𝑤 =

𝜃 𝜉𝑖 , 𝑖=1

where 𝜃 𝜉𝑖 =

1, 𝜉𝑖 > 0 0, 𝜉𝑖 = 0

But this is a combinatorial optimization problem and difficult to solve. Instead, we consider minimizing 1 𝑄 𝑤, 𝑏, 𝜉 = 𝑤 2

𝑀 2

+𝐶

𝜉𝑖𝑝

(26)

𝑖 = 1. . 𝑀,

(27)

𝑖=1

subject to the constraints 𝑦𝑖 𝑤 𝑇 𝑥𝑖 + 𝑏 ≥ 1 − 𝜉𝑖 ,

where 𝜉 = (𝜉1 , . . . , 𝜉𝑀 )𝑇 and C is the margin parameter that determines the trade-off between the maximization of the margin and minimization of the classification error. We select the value of 𝑝 as wither 1 or 2. We call the obtained hyperplane the soft-margin hyperplane. When 𝑝 = 1, we call the support vector machine L1 soft-margin support vector machine (Figure 5).
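As a minimal illustration of (26) and (27), the sketch below (not from the thesis; the data, hyperplane, and parameter values are hypothetical) computes the slack variables needed to satisfy the constraints and evaluates the L1 primal objective:

```python
def slack(w, b, x, y):
    """Slack xi_i = max(0, 1 - y_i (w^T x_i + b)) needed to satisfy (27)."""
    return max(0.0, 1.0 - y * (sum(wk * xk for wk, xk in zip(w, x)) + b))

def primal_objective(w, b, X, Y, C, p=1):
    """Q(w, b, xi) = 0.5 ||w||^2 + C * sum(xi_i^p), cf. equation (26)."""
    norm_sq = sum(wk * wk for wk in w)
    return 0.5 * norm_sq + C * sum(slack(w, b, x, y) ** p for x, y in zip(X, Y))

# Hypothetical 1-D toy data: the third point lies inside the margin.
X = [[2.0], [-2.0], [0.5]]
Y = [+1, -1, +1]
w, b, C = [1.0], 0.0, 1.0
# Slacks are 0, 0, and 0.5, so Q = 0.5*1 + 1*0.5 = 1.0.
print(primal_objective(w, b, X, Y, C))  # 1.0
```

Larger C penalizes the slacks more heavily, pushing the solution toward the hard-margin behaviour.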


Similar to the linearly separable case, introducing the nonnegative Lagrange multipliers α_i and β_i, we obtain

Q(w, b, ξ, α, β) = (1/2)‖w‖² + C Σ_{i=1}^{M} ξ_i − Σ_{i=1}^{M} α_i (y_i(w^T x_i + b) − 1 + ξ_i) − Σ_{i=1}^{M} β_i ξ_i,    (28)

where α = (α_1, ..., α_M)^T and β = (β_1, ..., β_M)^T. For the optimal solution, the following KKT conditions are satisfied:

∂Q(w, b, ξ, α, β)/∂w = 0,    (29)

∂Q(w, b, ξ, α, β)/∂b = 0,    (30)

∂Q(w, b, ξ, α, β)/∂ξ = 0,    (31)

α_i (y_i(w^T x_i + b) − 1 + ξ_i) = 0,   i = 1..M,    (32)

β_i ξ_i = 0,   i = 1..M,    (33)

α_i ≥ 0,  β_i ≥ 0,  ξ_i ≥ 0,   i = 1..M.    (34)

Using (28), we reduce (29) to (31), respectively, to

w = Σ_{i=1}^{M} α_i y_i x_i,    (35)

Σ_{i=1}^{M} α_i y_i = 0,    (36)

α_i + β_i = C,   i = 1..M.    (37)

Thus substituting (35) to (37) into (28), we obtain the following dual problem. Maximize

Q(α) = Σ_{i=1}^{M} α_i − (1/2) Σ_{i,j=1}^{M} α_i α_j y_i y_j x_i^T x_j    (38)

subject to the constraints

Σ_{i=1}^{M} α_i y_i = 0,   0 ≤ α_i ≤ C,   i = 1..M.    (39)
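To make the dual problem concrete, the following Python sketch (illustrative only, not the solver used in this work) maximizes (38) under the constraints (39) by projected gradient ascent on a hypothetical two-sample problem; because the data are symmetric, the equality constraint stays satisfied throughout:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical separable data: one point per class on the x-axis.
X = [[1.0, 0.0], [-1.0, 0.0]]
Y = [+1, -1]
C, lr = 10.0, 0.1
M = len(X)
K = [[dot(X[i], X[j]) for j in range(M)] for i in range(M)]

alpha = [0.0] * M
for _ in range(500):
    # Gradient of (38): dQ/dalpha_i = 1 - sum_j alpha_j y_i y_j x_i^T x_j
    g = [1.0 - sum(alpha[j] * Y[i] * Y[j] * K[i][j] for j in range(M))
         for i in range(M)]
    # Project the gradient onto the hyperplane sum_i alpha_i y_i = 0
    # so the equality constraint in (39) is preserved, then clip to [0, C].
    gy = sum(gi * yi for gi, yi in zip(g, Y)) / M
    alpha = [min(C, max(0.0, a + lr * (gi - gy * yi)))
             for a, gi, yi in zip(alpha, g, Y)]

# Recover w from (35); for this example alpha -> [0.5, 0.5] and w -> [1, 0].
w = [sum(alpha[i] * Y[i] * X[i][k] for i in range(M)) for k in range(2)]
```

A production solver would use a dedicated QP or SMO-type method; this loop only shows that the dual variables, not w itself, are what the optimization manipulates.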

The only difference between L1 soft-margin support vector machines and hard-margin support vector machines is that α_i cannot exceed C. Equations (32) and (33) are called the KKT (complementarity) conditions. From these and (37), there are three cases for α_i:

1. α_i = 0. Then ξ_i = 0 and x_i is correctly classified.
2. 0 < α_i < C. Then y_i(w^T x_i + b) − 1 + ξ_i = 0 and ξ_i = 0. Therefore y_i(w^T x_i + b) = 1 and x_i is a support vector. We call a support vector with 0 < α_i < C an unbounded support vector.
3. α_i = C. Then y_i(w^T x_i + b) − 1 + ξ_i = 0 and ξ_i ≥ 0. Thus x_i is a support vector. We call a support vector with α_i = C a bounded support vector. If 0 ≤ ξ_i < 1, x_i is correctly classified, and if ξ_i ≥ 1, x_i is misclassified.

The decision function is the same as that of the hard-margin support vector machine and is given by

D(x) = Σ_{i∈S} α_i y_i x_i^T x + b,    (40)

where S is the set of support vector indices. Because the α_i are nonzero only for the support vectors, the summation in (40) runs only over the support vectors. For unbounded α_i,

b = y_i − w^T x_i    (41)

is satisfied. To ensure the precision of calculations, we take the average of b calculated over the unbounded support vectors,

b = (1/|U|) Σ_{i∈U} (y_i − w^T x_i),    (42)

where U is the set of unbounded support vector indices. Then an unknown datum x is classified into

Class 1 if D(x) > 0,
Class 2 if D(x) < 0.    (43)

If D(x) = 0, x is on the boundary and thus is unclassifiable. When there are no bounded support vectors, the region {x | −1 ≤ D(x) ≤ 1} is the generalization region, the same as for the hard-margin support vector machine.
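A small Python sketch of (40) to (43) follows; the support vector values are hypothetical, not results from the thesis. It averages b over the unbounded support vectors, as in (42), and classifies a new point:

```python
def decision(x, sv, b):
    """D(x) = sum_{i in S} alpha_i y_i x_i^T x + b, equation (40)."""
    return sum(a * y * sum(xk * zk for xk, zk in zip(xi, x))
               for a, y, xi in sv) + b

# Assumed support vectors as (alpha_i, y_i, x_i); C bounds the alphas.
C = 10.0
sv = [(0.5, +1, [1.0, 0.0]), (0.5, -1, [-1.0, 0.0])]

# w from (35), restricted to the support vectors.
w = [sum(a * y * xi[k] for a, y, xi in sv) for k in range(2)]

# Average y_i - w^T x_i over the unbounded support vectors (0 < alpha_i < C).
unbounded = [(y, xi) for a, y, xi in sv if 0.0 < a < C]
b = sum(y - sum(wk * xk for wk, xk in zip(w, xi))
        for y, xi in unbounded) / len(unbounded)

# Classification rule (43): D(x) = 0 would be the unclassifiable boundary.
label = 1 if decision([2.0, 3.0], sv, b) > 0 else -1   # Class 1 here
```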


3.1.3. Mapping to a High-Dimensional Space

Figure 6 – Linear inseparability

Figure 7 – Mapping input data

3.1.3.1. Kernel tricks

3.1.3.2. Kernels

Examples of kernels:
1. Linear: K(x, x′) = x^T x′
2. Polynomial: K(x, x′) = (x^T x′ + 1)^d
3. Radial Basis Function (RBF): K(x, x′) = e^{−γ‖x − x′‖²}
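The three kernels above can be sketched in a few lines of Python; the values of d and γ below are user-chosen hyperparameters, not values prescribed by the thesis:

```python
import math

def linear_kernel(x, z):
    """K(x, z) = x^T z."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, d=2):
    """K(x, z) = (x^T z + 1)^d."""
    return (linear_kernel(x, z) + 1.0) ** d

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

x, z = [1.0, 2.0], [2.0, 0.0]
print(linear_kernel(x, z))        # 2.0
print(polynomial_kernel(x, z))    # (2 + 1)^2 = 9.0
print(rbf_kernel(x, x))           # 1.0 at zero distance
```

Note that the RBF kernel equals 1 only when x = z and decays with the squared distance, which is what makes it a similarity measure in the induced feature space.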

3.1.4. L2 Soft-Margin Support Vector Machines

Instead of the linear sum of the slack variables ξ_i in the objective function, the L2 soft-margin support vector machine uses the square sum of the slack variables (Figure 8). Namely, training is done by minimizing

Q(w, b, ξ) = (1/2) w^T w + (C/2) Σ_{i=1}^{M} ξ_i²    (44)

with respect to w, b, and ξ subject to the inequality constraints

y_i(w^T Φ(x_i) + b) ≥ 1 − ξ_i,   i = 1..M.    (45)

Here, w is an l-dimensional vector, b is the bias, Φ(x) is the mapping function that maps the m-dimensional vector x into the l-dimensional feature space (discussed in Section 3.1.3, Mapping to a High-Dimensional Space), ξ_i is the slack variable for x_i, and C is the margin parameter.

Figure 8 – L2 Soft-Margin slack variables

Introducing the Lagrange multipliers α_i (≥ 0), we obtain

Q(w, b, ξ, α) = (1/2) w^T w + (C/2) Σ_{i=1}^{M} ξ_i² − Σ_{i=1}^{M} α_i (y_i(w^T Φ(x_i) + b) − 1 + ξ_i).    (46)

Here, we do not need to introduce the Lagrange multipliers associated with ξ. As is shown immediately, Cξ_i = α_i is satisfied for the optimal solution; hence ξ_i is nonnegative so long as α_i is nonnegative. For the optimal solution the following KKT conditions are satisfied:

∂Q(w, b, ξ, α)/∂w = w − Σ_{i=1}^{M} y_i α_i Φ(x_i) = 0,    (47)

∂Q(w, b, ξ, α)/∂ξ_i = C ξ_i − α_i = 0,    (48)

∂Q(w, b, ξ, α)/∂b = Σ_{i=1}^{M} y_i α_i = 0,    (49)

α_i (y_i(w^T Φ(x_i) + b) − 1 + ξ_i) = 0,   i = 1..M.    (50)

Equation (50) gives the KKT complementarity conditions; from (47), (48), and (50), the optimal solution must satisfy either α_i = 0 or

y_i ( Σ_{j=1}^{M} y_j α_j (K(x_i, x_j) + δ_ij / C) + b ) − 1 = 0,    (51)

where K(x, x′) = Φ^T(x) Φ(x′) and δ_ij is Kronecker's delta, with δ_ij = 1 for i = j and 0 otherwise. Thus the bias term b is calculated for α_i > 0:

b = y_i − Σ_{j=1}^{M} y_j α_j (K(x_i, x_j) + δ_ij / C),    (52)

which is different from that of the L1 support vector machine. But the decision function is the same:

D(x) = Σ_{i=1}^{M} α_i y_i K(x_i, x) + b.    (53)
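The asymmetry between (52) and (53) — the δ_ij/C correction appears in the bias but not in the decision function — is easy to miss, so here is a Python sketch with toy numbers (the kernel matrix and multipliers are hypothetical):

```python
def bias_l2(i, alpha, Y, K, C):
    """b = y_i - sum_j y_j alpha_j (K(x_i, x_j) + delta_ij / C), eq. (52)."""
    M = len(alpha)
    return Y[i] - sum(Y[j] * alpha[j] * (K[i][j] + (1.0 / C if i == j else 0.0))
                      for j in range(M))

def decision_l2(Kx, alpha, Y, b):
    """D(x) = sum_i alpha_i y_i K(x_i, x) + b, eq. (53) -- no delta term."""
    return sum(a * y * k for a, y, k in zip(alpha, Y, Kx)) + b

# Hypothetical two-sample problem with a linear kernel matrix.
K = [[1.0, -1.0], [-1.0, 1.0]]
alpha, Y, C = [0.4, 0.4], [+1, -1], 4.0

b = bias_l2(0, alpha, Y, K, C)              # 1 - (0.4*1.25 - 0.4*(-1)*... ) = 0.1
d0 = decision_l2([1.0, -1.0], alpha, Y, b)  # plain kernel values, no 1/C shift
```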

Substituting (47) to (49) into (46), we obtain the dual objective function:


Q(α) = Σ_{i=1}^{M} α_i − (1/2) Σ_{i,j=1}^{M} α_i α_j y_i y_j (K(x_i, x_j) + δ_ij / C).    (54)

Thus the following dual problem is obtained. Maximize (54) subject to

Σ_{i=1}^{M} α_i y_i = 0,   α_i ≥ 0,   i = 1..M.    (55)

This is similar to a hard-margin support vector machine. The difference is the addition of δ_ij / C in (54). Therefore, for the L1 support vector machine, if we replace K(x_i, x_j) with K(x_i, x_j) + δ_ij / C and remove the upper bound C on α_i, we obtain the L2 support vector machine. But we must notice that when we calculate the decision function in (53) we must not add δ_ij / C.
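In code, this kernel-substitution trick amounts to shifting the diagonal of the kernel matrix; the sketch below (illustrative values only) shows how the 1/C shift turns a singular kernel matrix into a nonsingular one:

```python
def l2_kernel_matrix(K, C):
    """Return K + (1/C) I: the diagonal shift used to solve the L2 machine
    with an L1-style solver (drop the upper bound on alpha as well)."""
    M = len(K)
    return [[K[i][j] + (1.0 / C if i == j else 0.0) for j in range(M)]
            for i in range(M)]

K = [[1.0, -1.0], [-1.0, 1.0]]     # a singular (rank-1) kernel matrix
K2 = l2_kernel_matrix(K, C=2.0)    # diagonal becomes 1.5, now nonsingular
det = K2[0][0] * K2[1][1] - K2[0][1] * K2[1][0]   # 1.5^2 - 1 = 1.25 > 0
```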

Because 1/C is added to the diagonal elements of the kernel matrix K = (K(x_i, x_j)), the resulting matrix becomes positive definite. Thus the associated optimization problem is more computationally stable than that of the L1 support vector machine. L2 soft-margin support vector machines look similar to hard-margin support vector machines. Actually, letting

ŵ = (w^T, √C ξ^T)^T,   b̂ = b,   Φ̂(x_i) = (Φ(x_i)^T, y_i e_i^T / √C)^T,    (56)

where e_i is the M-dimensional vector with the ith element being 1 and the remaining elements 0, training the L2 support vector machine given by (44) and (45) is converted into the following problem. Minimize

(1/2) ŵ^T ŵ    (57)

subject to

y_i(ŵ^T Φ̂(x_i) + b̂) ≥ 1,   i = 1..M.    (58)

Therefore, the L2 support vector machine is equivalent to a hard-margin support vector machine in the augmented feature space. Because the L2 support vector machine always has a solution thanks to the slack variables, the equivalent hard-margin support vector machine also has a solution. But this only means that the classes do not overlap in the augmented feature space; there may be cases where they overlap in the original feature space, and thus the recognition rate of the training data for the L2 support vector machine is not 100 percent.

3.2. Multiclass Support Vector Machines

3.2.1. One-against-all

3.2.2. One-against-one

3.3. Support Vector Regression

The Support Vector method can also be applied to the case of regression, maintaining all the main features that characterize the maximal margin algorithm: a non-linear function is learned by a linear learning machine in a kernel-induced feature space, while the capacity of the system is controlled by a parameter that does not depend on the dimensionality of the space. As in the classification case, the learning algorithm minimizes a convex functional and its solution is sparse.

Figure 9 – The insensitive band for a one dimensional linear regression problem


Figure 10 – The linear (red) and quadratic (green) 𝝐-insensitive loss for zero and non-zero 𝝐
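The two loss functions in Figure 10 can be sketched directly (the residual and ε values below are made-up examples): errors inside the band of width ε cost nothing, and outside it the cost grows linearly or quadratically.

```python
def eps_insensitive(residual, eps):
    """Linear epsilon-insensitive loss |r|_eps = max(0, |r| - eps)."""
    return max(0.0, abs(residual) - eps)

def eps_insensitive_sq(residual, eps):
    """Quadratic epsilon-insensitive loss (max(0, |r| - eps))^2."""
    return eps_insensitive(residual, eps) ** 2

print(eps_insensitive(0.3, eps=0.5))      # 0.0 -- inside the band
print(eps_insensitive(2.0, eps=0.5))      # 1.5
print(eps_insensitive_sq(2.0, eps=0.5))   # 2.25
```

Setting ε = 0 recovers the ordinary absolute and squared losses; a positive ε is what produces the sparse solution mentioned above, since points inside the band contribute nothing.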

3.4. Advantages and Disadvantages

The advantages of support vector machines over multilayer neural network classifiers are as follows:

1. Maximization of generalization ability. In training a multilayer neural network classifier, the sum-of-squares error between the outputs and the desired training outputs is minimized. Thus the class boundaries change as the initial weights change, and so does the generalization ability. Especially when training data are scarce and linearly separable, the generalization ability deteriorates considerably. But because a support vector machine is trained to maximize the margin, its generalization ability does not deteriorate very much.

2. No local minima. A multilayer neural network classifier is known to have numerous local minima, and there have been extensive discussions on how to avoid a local minimum in training. But because a support vector machine is formulated as a quadratic programming problem, there is a global optimum solution.

3. Robustness to outliers. Multilayer neural network classifiers are vulnerable to outliers because they use the sum-of-squares errors. Thus, to prevent the effect of outliers, outliers need to be eliminated before training, or some mechanism for suppressing them needs to be incorporated in training. In support vector machines the margin parameter C controls the misclassification error: if a large value is set to C, misclassification is suppressed, and if a small value is set, training data that are away from the gathered data are allowed to be misclassified. Thus by properly setting the value of C, we can suppress outliers.

The disadvantages of support vector machines explained so far are as follows:

1. Extension to multiclass problems. Unlike multilayer neural network classifiers, support vector machines use direct decision functions. Thus an extension to multiclass problems is not straightforward, and there are several formulations.

2. Long training time. Because training of a support vector machine is done by solving the associated dual problem, the number of variables is equal to the number of training data. Thus for a large number of training data, solving the dual problem becomes difficult from the standpoints of both memory size and training time.

3. Selection of parameters. In training a support vector machine, we need to select an appropriate kernel and its parameters, and then we need to set the value of the margin parameter C. Selecting the optimal parameters for a given problem is called model selection. The situation is similar for neural network classifiers, where we need to set the number of hidden units, the initial values of the weights, and so on. In support vector machines, model selection is done by estimating the generalization ability through repeatedly training support vector machines. But because this is time-consuming, several indices for estimating the generalization ability have been proposed.

4. Data-driven methods for fault diagnosis

4.1. Principal component analysis

4.2. Projection algorithms

5. Application to the Tennessee Eastman process as a benchmark

The case involves the production of two products from four reactants. There are 41 measured process variables and 12 manipulated variables. The process (Figure 11) is composed of five main unit operations: an exothermic two-phase reactor, a product condenser, a flash separator, a reboiled stripper, and a recycle compressor. All 20 faults proposed in the original paper (Downs & Vogel, 1993) were considered, in order to face a global and complex diagnosis problem. The main diagnosis difficulties come from the number of different faults to be managed and their wide diversity: random and step biases as well as soft drifts on process variables must be handled in this complex diagnosis case study.


Figure 11 – Case study: Tennessee Eastman process flowsheet

Besides, some of the considered disturbances cause insignificant biases in the variables because of the adjusted control strategy, which makes it very difficult to find the source causes of the biases. Therefore, it represents a challenging real diagnosis problem for testing and comparing the approach proposed in this work. Three main limitations were found when reviewing former approaches to compare the performance of a novel fault diagnosis system:

1. Reported results are very limited from the quantitative point of view; no common or standard quantitative indices have been used for that purpose.
2. Unclear distinctions between the detection and diagnosis steps reduce the chances for systematic comparisons.
3. Since many simulation runs may be obtained from the same case study (different ways of sampling, several initial and final conditions, simulation span, etc.), complete data sets need to be agreed upon for establishing a unique comparative basis.


Classification tools are among the most widespread fault diagnosis systems in the literature. They are based on mapping input data from the process plant to different process states. Classification methodologies have succeeded because of their implementation simplicity: they require neither process operators' experience nor first-principles process knowledge. Among these methodologies, data-based rule systems and machine learning are the most commonly applied. However, techniques coming from the machine learning area have not been properly exploited to overcome significant limitations in chemical process fault diagnosis. Only neural networks, a technique superseded in other technical fields, have been widely applied as a pattern learning technique. Thus, considering fault diagnosis of chemical processes as a classification problem would allow taking advantage of the benefits gained by the learning community throughout the last decades.

5.1. Problem formulation

The fault diagnosis problem can be formulated as a classical classification problem in order to extend the concepts successfully applied in machine learning theory. In this sense, better generalization capabilities or more accurate and flexible pattern training algorithms could overcome limitations in fault isolation or multiple-fault management. By adopting this classification view, the faults to be managed in a process industry become the different classes of a general classification problem. As in any classification problem, it is crucial to choose meaningful variables or features that properly represent the diagnosis problem, to adopt optimal learning methodologies, and to follow rigorous validation procedures to avoid overfitting the FDS. Under this view the classical fault detection problem is a binary classification (BC) problem, consisting of determining whether the process is in or out of control. The global fault diagnosis problem is considered a multi-class classification (MC) problem, in which many classes are involved. To solve both kinds of problems, different learning algorithms have been developed in the machine learning area, obtaining very promising results (e.g., Naive Bayes, K-Nearest Neighbours, SVM, AdaBoost.MH). Some of these algorithms employ BC techniques, but they can also be applied to solve multi-class problems when adapted by techniques such as "one versus all" or "constraint classification". In order to deal with the multi-class, or fault diagnosis, problem, the "one versus all" methodology was applied throughout this work. It consists of decomposing the problem into as many binary problems as the original problem has classes; then, one classifier is trained for each class, trying to separate the examples of that class (positives) from the examples of all other classes (negatives). This method generates classifiers between each class and the set of all other classes. When classifying a new example, all binary classifiers predict a class and the one with the highest confidence is selected (the "winner-take-all" strategy).
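The "one versus all" decision step described above can be sketched as follows; the linear scorers stand in for trained per-fault SVM decision functions, and the class names and weights are hypothetical:

```python
def linear_scorer(w, b):
    """A stand-in for a trained binary SVM decision function D_k(x)."""
    return lambda x: sum(wk * xk for wk, xk in zip(w, x)) + b

def winner_take_all(x, classifiers):
    """Return the class whose decision function is largest (most confident)."""
    return max(classifiers, key=lambda k: classifiers[k](x))

# One binary classifier per process state: "that class" versus all others.
classifiers = {
    "normal":  linear_scorer([1.0, 0.0], 0.0),
    "fault_1": linear_scorer([-1.0, 0.0], 0.0),
    "fault_2": linear_scorer([0.0, 1.0], -0.5),
}

print(winner_take_all([2.0, 1.0], classifiers))    # "normal" wins here
```

Note that the confidences of different binary classifiers are only comparable if their outputs are on a common scale, which is one known caveat of the winner-take-all strategy.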

5.2. Support vector machines & Kernel methods on TE

Figure 12 – Evolution of Error when changing arguments on


5.3. PCA on TE

Figure 13 – PCA method for datasets D00, D10 and D12


Figure 14 – Q statistic for fault 12

Figure 15 – T² statistic for fault 12


6. List of Figures Figure 1 – Simple example of classification .................................................................................... 9 Figure 2 – Optimal decision function .............................................................................................. 9 Figure 3 – Margin .......................................................................................................................... 10 Figure 4 – Optimal hyperplane ..................................................................................................... 11 Figure 5 – L1 Soft-Margin slack variables ..................................................................................... 15 Figure 6 – Linear inseparability ..................................................................................................... 18 Figure 7 – Mapping input data...................................................................................................... 18 Figure 8 – L2 Soft-Margin slack variables ..................................................................................... 19 Figure 9 – The insensitive band for a one dimensional linear regression problem ..................... 22 Figure 10 – The linear (red) and quadratic (green) 𝝐-insensitive loss for zero and non-zero 𝝐 ... 23 Figure 11 – Case study: Tennessee Eastman process flowsheet .................................................. 25 Figure 12 – Evolution of Error when changing arguments on ...................................................... 27 Figure 13 – PCA Method on for datasets D00, D10 and D12 ....................................................... 28 Figure 14 – Q statistic for fault 12 ................................................................................................ 29 Figure 15 – T2 statistic for fault 12 ................................................................................................ 29

7. Abbreviations

SVM   Support Vector Machines
FDS   Fault Diagnosis System
TE    Tennessee Eastman
PCA   Principal Component Analysis
PSVM  Parallel Support Vector Machines
FDA   Fisher Discriminant Analysis
BC    Binary Classification
MC    Multi-Class Classification
KKT   Karush-Kuhn-Tucker

8. Nomenclature

Lowercase letters denote vectors and uppercase italic letters denote matrices. The following list shows the symbols used in this diploma project.

x_i       training datum
α_i       Lagrange multiplier for x_i
β_i       Lagrange multiplier for x_i
ξ_i       slack variable associated with x_i
C         margin parameter
d         degree of a polynomial kernel
Φ(x)      mapping function from x to the feature space
γ         parameter of a radial basis function kernel
K(x, x′)  kernel
l         dimension of the feature space
M         number of training data
m         number of input variables
n         number of classes
S         set of support vector indices
U         set of unbounded support vector indices
‖x‖       Euclidean norm of vector x
b         bias
D(x)      decision function
Q(·)      cost function
w         weight vector

9. References

1. Nello Cristianini, John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000;
2. John Shawe-Taylor, Nello Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004;
3. Support vector machines (SVMs), http://en.wikipedia.org;
4. Ovidiu Ivanciuc, Applications of Support Vector Machines in Chemistry. In: Reviews in Computational Chemistry, Volume 23, Eds.: K. B. Lipkovitz and T. R. Cundary, Wiley-VCH, Weinheim, 2007, pp. 291-400;
5. Shigeo Abe, Support Vector Machines for Pattern Classification, Springer-Verlag London Limited, 2005;
6. Chih-Wei Hsu, Chih-Jen Lin, A Comparison of Methods for Multiclass Support Vector Machines, IEEE Transactions on Neural Networks, vol. 13, no. 2, March 2002;
7. Abhijit Kulkarni, V. K. Jayaraman, B. D. Kulkarni, Knowledge incorporated support vector machines to detect faults in Tennessee Eastman Process, Computers and Chemical Engineering 29, pp. 2128–2133, Elsevier Ltd., 2005;
8. A. Mathur, G. M. Foody, Multiclass and Binary SVM Classification: Implications for Training and Classification Users, IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, April 2008;
9. Shawn Martin, The Numerical Stability of Kernel Methods, Sandia National Laboratories, Albuquerque, NM, USA, December 2005;
10. A. Mauricio Sales Cruz, Tennessee Eastman Plant-wide Industrial Process. Challenge Problem. Complete Model, Department of Chemical Engineering, Technical University of Denmark, January 2004;
11. Ignacio Yélamos, Gerard Escudero, Moisès Graells, Luis Puigjaner, Performance assessment of a novel fault diagnosis system based on support vector machines, Computers and Chemical Engineering 33, pp. 244–255, 2009.

10. Annex 1 – Code listing

10.1. PCA, T² statistic and Q statistic – algorithms

clear; clc; close all;

load data/d00.dat
load data/d00_te.dat
load data/d10_te.dat
load data/d12_te.dat

X_1 = d10_te;
X_2 = d12_te;   % with faults
X_3 = d00_te;   % without faults

num_row_X_3 = size(X_3,1); num_col_X_3 = size(X_3,2);
num_row_X_2 = size(X_2,1); num_col_X_2 = size(X_2,2);
num_row_X_1 = size(X_1,1); num_col_X_1 = size(X_1,2);

Mn_x_3  = mean(X_3);
Std_x_3 = std(X_3);

% Normalization (zero mean, unit variance, using the fault-free statistics)
X_norm_3 = (X_3 - kron(Mn_x_3, ones(num_row_X_3,1))) ./ kron(Std_x_3, ones(num_row_X_3,1));
X_norm_2 = (X_2 - kron(Mn_x_3, ones(num_row_X_2,1))) ./ kron(Std_x_3, ones(num_row_X_2,1));
X_norm_1 = (X_1 - kron(Mn_x_3, ones(num_row_X_1,1))) ./ kron(Std_x_3, ones(num_row_X_1,1));

TestMn_x_3 = mean(X_norm_3); TestStd_x_3 = std(X_norm_3);
TestMn_x_2 = mean(X_norm_2); TestStd_x_2 = std(X_norm_2);
TestMn_x_1 = mean(X_norm_1); TestStd_x_1 = std(X_norm_1);

% Sample covariance matrix of the normal (fault-free) data
S_3 = (X_norm_3' * X_norm_3) / (num_row_X_3 - 1);

% Eigendecomposition: eigenvalues and eigenvectors
[V_3, D_3] = eigs(S_3);

a = 2;   % number of principal components
Represented_variance_2 = (D_3(1,1) + D_3(2,2)) / sum(diag(D_3));
P = V_3(:, 1:a);
T_3 = X_norm_3 * P;
T_2 = X_norm_2 * P;
T_1 = X_norm_1 * P;

% Control limit for the T^2 statistic
T_threshold = a*(num_row_X_3-1)*(num_row_X_3+1) / (num_row_X_3*(num_row_X_3-a)) ...
              * finv(0.99, a, num_row_X_3 - a);

% Plot the scores and the confidence ellipse
b = sqrt(D_3(1,1) * T_threshold);
c = sqrt(D_3(2,2) * T_threshold);
theta = 0:pi/50:2*pi;
t_1 = b*cos(theta);
t_2 = c*sin(theta);
figure;
plot(T_3(:,1), T_3(:,2), 'x', T_2(:,1), T_2(:,2), '*', T_1(:,1), T_1(:,2), 'o', t_1, t_2);
xlabel('t_1'); ylabel('t_2');
legend('Class 3', 'Class 2', 'Class 1', '95% Threshold', 'location', 'northwest');
title('PCA method');

% Calculate the T^2 statistic
T_UCL_square = ((num_row_X_2^2 - 1)*a*finv(0.99, a, num_row_X_2 - a)) ...
               / (num_row_X_2*(num_row_X_2 - a));
T_Square = [];
Y = T_2 * Represented_variance_2^-.5;
for i = 1:num_row_X_2
    yy = Y(i, 1:a)';
    T_Square(i) = yy' * yy;
end

% Plot the T^2 statistic against its control limit
figure;
plot(1:num_row_X_2, T_Square, 1:num_row_X_2, T_threshold*ones(1, num_row_X_2));
title('T square statistic'); xlabel('time'); ylabel('T^2');

% Calculate the Q statistic (squared prediction error)
q1Array = [];
[n, m] = size(X_2);
I = eye(m);
X_norm_residual = X_norm_3 - X_norm_3*P*P';
Co_x = X_norm_residual' * X_norm_residual;
[M, N] = eigs(Co_x);
N = N / (num_row_X_2 - 1);
therta_1 = sum(diag(N));
therta_2 = sum(diag(N.^2));
therta_3 = sum(diag(N.^3));
h_0 = 1 - 2*therta_1*therta_3 / (3*therta_2^2);
part_1 = norminv(0.99)*sqrt(2*therta_2*h_0^2)/therta_1;
part_2 = therta_2*h_0*(h_0 - 1)/therta_1^2;

% Control limit for the squared prediction error
SPE_UCL_x = therta_1 * (part_1 + part_2 + 1)^(1/h_0);

% Plot the Q statistic against its control limit
for i = 1:n
    xi = X_norm_2(i,:);
    q1Array(i) = xi*(I - P*P')*xi';
end
figure;
plot(1:n, q1Array, 1:n, SPE_UCL_x*ones(1, n));
xlabel('time'); ylabel('Q'); title('Q statistic');

10.2. Support vector machines & Kernel methods