4
The Perceptron
4.1 Introduction

The perceptron is the simplest form of a neural network used for the classification of a special type of patterns said to be linearly separable (i.e., patterns that lie on opposite sides of a hyperplane). Basically, it consists of a single neuron with adjustable synaptic weights and threshold, as shown in Fig. 4.1. The algorithm used to adjust the free parameters of this neural network first appeared in a learning procedure developed by Rosenblatt (1958, 1962) for his perceptron brain model.¹ Indeed, Rosenblatt proved that if the patterns (vectors) used to train the perceptron are drawn from two linearly separable classes, then the perceptron algorithm converges and positions the decision surface in the form of a hyperplane between the two classes. The proof of convergence of the algorithm is known as the perceptron convergence theorem.
The single-layer perceptron depicted in Fig. 4.1 has a single neuron. Such a perceptron is limited to performing pattern classification with only two classes. By expanding the output (computation) layer of the perceptron to include more than one neuron, we may correspondingly perform classification with more than two classes. However, the classes would have to be linearly separable for the perceptron to work properly. Accordingly, insofar as the basic theory of a single-layer perceptron as a pattern classifier is concerned, we need consider only the single-neuron configuration shown in Fig. 4.1. The extension of the theory so developed to the case of more than one neuron is a trivial matter.
Organization of the Chapter

This chapter on the single-layer perceptron is relatively short, but it is important for historical and theoretical reasons. In Section 4.2 we present some basic considerations involved in the operation of the perceptron. Then, in Section 4.3 we describe Rosenblatt's original algorithm for adjusting the synaptic weight vector of the perceptron for pattern classification of linearly separable classes, and demonstrate convergence of the algorithm. This is followed by the description of a performance measure for the single-layer perceptron in Section 4.4. In Section 4.5 we consider the relationship between the single-layer perceptron and the maximum-likelihood Gaussian classifier. The chapter concludes with a general discussion in Section 4.6.

¹ The network organization of the original version of the perceptron, as envisioned by Rosenblatt (1962), has three types of units: sensory units, association units, and response units. The connections from the sensory units to the association units have fixed weights, and the connections from the association units to the response units have variable weights. The association units act as preprocessors designed to extract a pattern from the environmental input. Insofar as the variable weights are concerned, the operation of Rosenblatt's original perceptron is essentially the same as that depicted in Fig. 4.1 for the case of a single response unit.
FIGURE 4.1 Single-layer perceptron.
4.2 Basic Considerations

Basic to the operation of Rosenblatt's perceptron is the McCulloch-Pitts model of a neuron. From Chapter 1, we recall that such a neuron model consists of a linear combiner followed by a hard limiter, as depicted in Fig. 4.2. The summing node of the neuron model computes a linear combination of the inputs applied to its synapses, and also accounts for an externally applied threshold. The resulting sum is applied to a hard limiter. Accordingly, the neuron produces an output equal to +1 if the hard-limiter input is positive, and −1 if it is negative. In the signal-flow graph model of Fig. 4.2, the synaptic weights of the single-layer perceptron are denoted by w1, w2, ..., wp. (We have simplified the notation here by not including an additional subscript to identify the neuron, since we only have to deal with a single neuron.) Correspondingly, the inputs applied to the perceptron are denoted by x1, x2, ..., xp. The externally applied threshold is denoted by θ. From the model of Fig. 4.2 we thus find that the linear combiner output (i.e., hard-limiter input) is
v = Σ_{i=1}^{p} w_i x_i − θ        (4.1)
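As a concrete illustration of this neuron model, a minimal sketch in Python is given below. It is illustrative only; the function name and numerical values are hypothetical and not part of the text.

```python
def perceptron_output(w, x, theta):
    """Hard-limiting neuron of Fig. 4.2 (a sketch, not code from the text)."""
    v = sum(wi * xi for wi, xi in zip(w, x)) - theta  # linear combiner output, Eq. (4.1)
    return +1 if v > 0 else -1                        # hard limiter

# Example with hypothetical values: v = 0.5*1.0 + (-0.3)*2.0 - 0.1 = -0.2, so y = -1
y = perceptron_output([0.5, -0.3], [1.0, 2.0], 0.1)
```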
The purpose of the perceptron is to classify the set of externally applied stimuli x1, x2, ..., xp into one of two classes, 𝒞1 or 𝒞2, say. The decision rule for the classification is to assign the point represented by the inputs x1, x2, ..., xp to class 𝒞1 if the perceptron output y is +1 and to class 𝒞2 if it is −1. To develop insight into the behavior of a pattern classifier, it is customary to plot a map of the decision regions in the p-dimensional signal space spanned by the p input variables x1, x2, ..., xp. In the case of an elementary perceptron, there are two decision regions separated by a hyperplane defined by
Σ_{i=1}^{p} w_i x_i − θ = 0        (4.2)

FIGURE 4.2 Signal-flow graph of the perceptron.
This is illustrated in Fig. 4.3 for the case of two input variables x1 and x2, for which the decision boundary takes the form of a straight line. A point (x1, x2) that lies above the boundary line is assigned to class 𝒞1, and a point (x1, x2) that lies below the boundary line is assigned to class 𝒞2. Note also that the effect of the threshold θ is merely to shift the decision boundary away from the origin.
The synaptic weights w1, w2, ..., wp of the perceptron can be fixed or adapted on an iteration-by-iteration basis. For the adaptation, we may use an error-correction rule known as the perceptron convergence algorithm, developed in the next section.
4.3 The Perceptron Convergence Theorem

For the development of the error-correction learning algorithm for a single-layer perceptron, we find it more convenient to work with the modified signal-flow graph model of Fig. 4.4. In this second model, which is equivalent to that of Fig. 4.2, the threshold θ(n) is treated as a synaptic weight connected to a fixed input equal to −1. We may thus define the (p + 1)-by-1 input vector
x(n) = [−1, x1(n), x2(n), ..., xp(n)]^T        (4.3)

Correspondingly, we define the (p + 1)-by-1 weight vector

w(n) = [θ(n), w1(n), w2(n), ..., wp(n)]^T        (4.4)
Accordingly, the linear combiner output is written in the compact form
v(n) = w^T(n)x(n)        (4.5)
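To make the bookkeeping of Eqs. (4.3)-(4.5) concrete, the following short Python sketch (illustrative only, not from the text) checks that the augmented vectors reproduce the linear combiner output of Eq. (4.1):

```python
def combiner_outputs(w, x, theta):
    """Compare Eq. (4.1) with the augmented form of Eq. (4.5); a sketch only."""
    v_plain = sum(wi * xi for wi, xi in zip(w, x)) - theta  # Eq. (4.1)
    x_aug = [-1.0] + list(x)                                # Eq. (4.3): fixed input -1
    w_aug = [theta] + list(w)                               # Eq. (4.4): threshold as first weight
    v_aug = sum(wi * xi for wi, xi in zip(w_aug, x_aug))    # Eq. (4.5): v = w^T x
    return v_plain, v_aug                                   # the two values coincide
```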
For fixed n, the equation w^T x = 0, plotted in a p-dimensional space with coordinates x1, x2, ..., xp, defines a hyperplane as the decision surface between two different classes of inputs. Suppose, then, that the input variables of the single-layer perceptron originate from two linearly separable classes that fall on opposite sides of some hyperplane. Let X1 be the subset of training vectors x1(1), x1(2), ... that belong to class 𝒞1, and let X2 be the subset of training vectors x2(1), x2(2), ... that belong to class 𝒞2.
FIGURE 4.3 Illustrating linear separability for a two-dimensional, two-class pattern-classification problem. The decision boundary is the straight line w1x1 + w2x2 − θ = 0.
FIGURE 4.4 Equivalent signal-flow graph of the perceptron.
The union of X1 and X2 is the complete training set X. Given the sets of vectors X1 and X2 to train the classifier, the training process involves the adjustment of the weight vector w in such a way that the two classes 𝒞1 and 𝒞2 are separable. These two classes are said to be linearly separable if a realizable setting of the weight vector w exists. Conversely, if the two classes 𝒞1 and 𝒞2 are known to be linearly separable, then there exists a weight vector w such that we may state
w^T x ≥ 0   for every input vector x belonging to class 𝒞1        (4.6)

and

w^T x < 0   for every input vector x belonging to class 𝒞2

Given the subsets of training vectors X1 and X2, the training problem for the elementary perceptron is then to find a weight vector w such that the two inequalities of Eq. (4.6) are satisfied. The algorithm for adapting the weight vector of the elementary perceptron may now be formulated as follows:
1. If the nth member of the training set, x(n), is correctly classified by the weight vector w(n) computed at the nth iteration of the algorithm, no correction is made to the weight vector of the perceptron, as shown by

w(n + 1) = w(n)   if w^T(n)x(n) ≥ 0 and x(n) belongs to class 𝒞1        (4.7)

and

w(n + 1) = w(n)   if w^T(n)x(n) < 0 and x(n) belongs to class 𝒞2
2. Otherwise, the weight vector of the perceptron is updated in accordance with the rule

w(n + 1) = w(n) − η(n)x(n)   if w^T(n)x(n) ≥ 0 and x(n) belongs to class 𝒞2        (4.8)

and

w(n + 1) = w(n) + η(n)x(n)   if w^T(n)x(n) < 0 and x(n) belongs to class 𝒞1

where the learning-rate parameter η(n) controls the adjustment applied to the weight vector at iteration n.
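As a concrete rendering of Eqs. (4.7) and (4.8), one iteration of the rule might be sketched in Python as follows. This is an illustrative sketch only; the function name is hypothetical and the handling of the boundary case w^T(n)x(n) = 0 follows the sign conventions above rather than any code in the text.

```python
def perceptron_iteration(w, x, cls, eta):
    """One step of the rule in Eqs. (4.7)-(4.8); a sketch only.

    w, x : (p+1)-dimensional weight and augmented input vectors (sequences of floats)
    cls  : 1 or 2, the class that x(n) actually belongs to
    eta  : learning-rate parameter eta(n)
    """
    v = sum(wi * xi for wi, xi in zip(w, x))            # linear combiner output w^T(n)x(n)
    if cls == 1 and v < 0:                              # misclassified pattern from class C1
        return [wi + eta * xi for wi, xi in zip(w, x)]  # second line of Eq. (4.8)
    if cls == 2 and v >= 0:                             # misclassified pattern from class C2
        return [wi - eta * xi for wi, xi in zip(w, x)]  # first line of Eq. (4.8)
    return list(w)                                      # correctly classified: Eq. (4.7)
```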
If η(n) = η > 0, where η is a constant independent of the iteration number n, we have a fixed-increment adaptation rule for the perceptron.
In the sequel, we first prove the convergence of a fixed-increment adaptation rule for which η = 1. Clearly, the value of η is unimportant, so long as it is positive.
A value of η ≠ 1 merely scales the pattern vectors without affecting their separability. The case of a variable η(n) is considered later in the section.
The proof is presented for the initial condition w(0) = 0. Suppose that w^T(n)x(n) < 0 for n = 1, 2, ... and an input vector x(n) that belongs to the subset X1. That is, the perceptron incorrectly classifies the vectors x(1), x(2), ..., since the first condition of Eq. (4.6) is violated. Then, with the constant η(n) = 1, we may use the second line of Eq. (4.8) to write
w(n + 1) = w(n) + x(n)   for x(n) belonging to class 𝒞1        (4.9)
Given the initial condition w(0) = 0, we may iteratively solve this equation for w(n + 1), obtaining the result

w(n + 1) = x(1) + x(2) + ··· + x(n)        (4.10)
Since the classes 𝒞1 and 𝒞2 are assumed to be linearly separable, there exists a solution w_o for which w_o^T x(n) > 0 for the vectors x(1), ..., x(n) belonging to the subset X1. For a fixed solution w_o, we may then define a positive number α by the relation
α = min_{x(n) ∈ X1} w_o^T x(n)        (4.11)

where x(n) ∈ X1 stands for "x(n) belongs to subset X1." Hence, multiplying both sides of Eq. (4.10) by the row vector w_o^T, we get
w_o^T w(n + 1) = w_o^T x(1) + w_o^T x(2) + ··· + w_o^T x(n)        (4.12)
Accordingly, in light of the definition given in Eq. (4.11), we have

w_o^T w(n + 1) ≥ nα        (4.13)
Next, we make use of an inequality known as the Cauchy–Schwarz inequality. Given two vectors w_o and w(n + 1), the Cauchy–Schwarz inequality states that

||w_o||² ||w(n + 1)||² ≥ [w_o^T w(n + 1)]²        (4.14)
where ||·|| denotes the Euclidean norm of the enclosed argument vector, and the inner product w_o^T w(n + 1) is a scalar quantity. We now note from Eq. (4.13) that [w_o^T w(n + 1)]² is equal to or greater than n²α². Moreover, from Eq. (4.14) we note that ||w_o||² ||w(n + 1)||² is equal to or greater than [w_o^T w(n + 1)]². It follows therefore that

||w_o||² ||w(n + 1)||² ≥ n²α²

or, equivalently,

||w(n + 1)||² ≥ n²α²/||w_o||²        (4.15)

Next, we follow another development route. In particular, we rewrite Eq. (4.9) in the form
w(k + 1) = w(k) + x(k)   for k = 1, ..., n and x(k) ∈ X1        (4.16)
Hence, taking the squared Euclidean norm of both sides of Eq. (4.16), we get
||w(k + 1)||² = ||w(k)||² + ||x(k)||² + 2w^T(k)x(k)        (4.17)
But, under the assumption that the perceptron incorrectly classifies an input vector x(k) belonging to the subset X1, we have w^T(k)x(k) < 0. We therefore deduce from Eq. (4.17) that

||w(k + 1)||² ≤ ||w(k)||² + ||x(k)||²

or, equivalently,

||w(k + 1)||² − ||w(k)||² ≤ ||x(k)||²,   k = 1, ..., n        (4.18)
Adding these inequalities for k = 1, ..., n, and assuming the initial condition w(0) = 0, we get the following condition:

||w(n + 1)||² ≤ Σ_{k=1}^{n} ||x(k)||² ≤ nβ        (4.19)

where β is a positive number defined by

β = max_{x(k) ∈ X1} ||x(k)||²        (4.20)
Equation (4.19) states that the squared Euclidean norm of the weight vector w(n + 1) grows at most linearly with the number of iterations n. Clearly, the second result of Eq. (4.19) is in conflict with the earlier result of Eq. (4.15) for sufficiently large values of n. Indeed, we can state that n cannot be larger than some value n_max for which Eqs. (4.15) and (4.19) are both satisfied with the equality sign. That is, n_max is the solution of the equation
n_max² α²/||w_o||² = n_max β

Solving for n_max for a solution vector w_o, we find that

n_max = β ||w_o||²/α²        (4.21)

We have thus proved that for η(n) = 1 for all n, and w(0) = 0, and given that a solution vector w_o exists, the rule for adapting the synaptic weights connecting the association units to the response unit of the perceptron must terminate after at most n_max iterations. Note also from Eqs. (4.11), (4.20), and (4.21) that there is no unique solution for w_o or n_max.
We may now state the fixed-increment convergence theorem for a single-layer perceptron as follows (Rosenblatt, 1962):

Let the subsets of training vectors X1 and X2 be linearly separable. Let the inputs presented to the single-layer perceptron originate from these two subsets. The perceptron converges after some n0 iterations, in the sense that
w(n0) = w(n0 + 1) = w(n0 + 2) = ···

is a solution vector for n0 ≤ n_max.
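The bound of Eq. (4.21) can also be checked numerically. The sketch below (with hypothetical, randomly generated data; none of it comes from the text) runs the fixed-increment rule with η = 1 and w(0) = 0 on a linearly separable two-dimensional problem, counts the corrections actually made, and compares that count with β||w_o||²/α², taking the final weight vector as one admissible solution vector w_o.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical linearly separable classes in two dimensions (illustrative data only).
X1 = rng.normal(loc=[+2.0, +2.0], scale=0.5, size=(50, 2))   # samples of class C1
X2 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))   # samples of class C2

# Augment with the fixed input -1 (Eq. 4.3); negating the class-C2 vectors folds both
# conditions of Eq. (4.6) into the single requirement w^T z > 0 for every z.
Z = np.vstack([np.hstack([-np.ones((50, 1)), X1]),
               -np.hstack([-np.ones((50, 1)), X2])])

w = np.zeros(3)          # w(0) = 0
corrections = 0
while True:
    errors = 0
    for z in Z:
        if w @ z <= 0:   # misclassified (the boundary case is treated as an error here)
            w += z       # fixed-increment correction, Eq. (4.9) with eta = 1
            corrections += 1
            errors += 1
    if errors == 0:      # a full pass with no corrections: the perceptron has converged
        break

alpha = np.min(Z @ w)                      # Eq. (4.11), with w_o taken as the final w
beta = np.max(np.sum(Z ** 2, axis=1))      # Eq. (4.20)
n_max = beta * np.dot(w, w) / alpha ** 2   # Eq. (4.21)
print(f"corrections made: {corrections}, bound n_max: {n_max:.1f}")
```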
Consider next the absolute error-correction procedure for the adaptation of a single-layer perceptron, for which η(n) is variable. In particular, let η(n) be the smallest integer for which

η(n) x^T(n)x(n) > |w^T(n)x(n)|

With this procedure we find that if the inner product w^T(n)x(n) at iteration n has an incorrect sign, then w^T(n + 1)x(n) at iteration n + 1 would have the correct sign. This suggests that if w^T(n)x(n) has an incorrect sign, we may modify the training sequence at iteration n + 1 by setting x(n + 1) = x(n). In other words, each pattern is presented repeatedly to the perceptron until that pattern is classified correctly.
Note also that the use of an initial value w(0) different from the null condition merely results in a decrease or increase in the number of iterations required to converge, depending
on how w(0) relates to the solution w_o. Hence, regardless of the value assigned to w(0), the single-layer perceptron of Fig. 4.1 is assured of convergence.
Summary

In Table 4.1 we present a summary of the perceptron convergence algorithm (Lippmann, 1987). The symbol sgn(·), used in step 3 of the table for computing the actual response of the perceptron, stands for the signum function:

sgn(v) = +1  if v > 0
         −1  if v < 0        (4.22)
TABLE 4.1 Summary of the Perceptron Convergence Algorithm
Variables and Parameters
x(n) = (p + 1)-by-1 input vector
     = [−1, x1(n), x2(n), ..., xp(n)]^T
w(n) = (p + 1)-by-1 weight vector
     = [θ(n), w1(n), w2(n), ..., wp(n)]^T
θ(n) = threshold
y(n) = actual response (quantized)
d(n) = desired response
η = learning-rate parameter, a positive constant less than unity
Step 1: Initialization
Set w(0) = 0. Then perform the following computations for time n = 1, 2, ....
Step 2: Activation
At time n, activate the perceptron by applying the continuous-valued input vector x(n) and the desired response d(n).
Step 3: Computation of Actual Response
Compute the actual response of the perceptron:
y(n) = sgn[w^T(n)x(n)]
where sgn(·) is the signum function.
Step 4: Adaptation of Weight Vector
Update the weight vector of the perceptron:
w(n + 1) = w(n) + η[d(n) − y(n)]x(n)
where
d(n) = +1  if x(n) belongs to class 𝒞1
       −1  if x(n) belongs to class 𝒞2
Step 5
Increment time n by one unit, and go back to step 2.
In effect, the signum function is a shorthand notation for the asymmetric form of hard limiting. Accordingly, the quantized response y(n) of the perceptron is written in the compact form

y(n) = sgn[w^T(n)x(n)]        (4.23)
Note that the input vector x(n) is a (p + 1)-by-1 vector whose first element is fixed at −1 throughout the computation, in accordance with the definition of Eq. (4.3). Correspondingly, the weight vector w(n) is a (p + 1)-by-1 vector whose first element equals the threshold θ(n), as defined in Eq. (4.4). In Table 4.1 we have also introduced a quantized desired response d(n), defined by

d(n) = +1  if x(n) belongs to class 𝒞1
       −1  if x(n) belongs to class 𝒞2        (4.24)
Thus, the adaptation of the weight vector w(n) is summed up nicely in the form of an error-correction learning rule, as shown by

w(n + 1) = w(n) + η[d(n) − y(n)]x(n)        (4.25)
where η is the learning-rate parameter, and the difference d(n) − y(n) plays the role of an error signal. The learning-rate parameter is a positive constant limited to the range 0 < η ≤ 1. In assigning a value to it inside this range, we have to keep in mind two conflicting requirements (Lippmann, 1987):

■ Averaging of past inputs to provide stable weight estimates, which requires a small η
■ Fast adaptation with respect to real changes in the underlying distributions of the process responsible for the generation of the input vector x, which requires a large η
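As an illustration of the algorithm summarized in Table 4.1 and of the error-correction rule of Eq. (4.25), a compact Python sketch is given below. It is a sketch only: the function names, the NumPy implementation, and the treatment of v(n) = 0 as a negative response are choices made here and not part of the text.

```python
import numpy as np

def train_perceptron(X, d, eta=0.5, max_epochs=100):
    """Perceptron convergence algorithm of Table 4.1; an illustrative sketch.

    X : (N, p) array of input patterns, d : (N,) array of desired responses in {-1, +1},
    eta : learning-rate parameter (0 < eta <= 1).
    """
    N, p = X.shape
    Xa = np.hstack([-np.ones((N, 1)), X])        # augmented inputs of Eq. (4.3)
    w = np.zeros(p + 1)                          # Step 1: w(0) = 0; w[0] plays the role of theta
    for _ in range(max_epochs):
        mistakes = 0
        for x_n, d_n in zip(Xa, d):              # Step 2: present x(n) and d(n)
            y_n = 1.0 if w @ x_n > 0 else -1.0   # Step 3: y(n) = sgn[w^T(n)x(n)], Eq. (4.22)
            if y_n != d_n:
                w += eta * (d_n - y_n) * x_n     # Step 4: error-correction update, Eq. (4.25)
                mistakes += 1
        if mistakes == 0:                        # every pattern classified correctly
            break                                # Step 5 would otherwise continue with n + 1
    return w
```

For linearly separable patterns the loop stops once a full pass produces no corrections, in agreement with the fixed-increment convergence theorem of Section 4.3.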
4.4 Performance Measure

In the previous section we derived the perceptron convergence algorithm without any reference to a performance measure, yet in Chapter 2 we emphasized that improvement in the performance of a learning machine takes place over time in accordance with a performance measure. What, then, is a suitable performance measure for the single-layer perceptron? An obvious choice would be the average probability of classification error, defined as the average probability of the perceptron making a decision in favor of a particular class when the input vector is drawn from the other class. Unfortunately, such a performance measure does not lend itself readily to the analytic derivation of a learning algorithm. Shynk (1990) has proposed the use of a performance function that befits the operation of a perceptron, as shown by the statistical expectation

J = −E[e(n)v(n)]        (4.26)
where e(n) is the error signal defined as the difference between the desired response d(n) and the actual response y(n) of the perceptron, and v(n) is the linear combiner output (i.e., hard-limiter input). Note that the error signal e(n) is not a quantized signal, but rather the difference between two quantized signals. The instantaneous estimate of the performance function is the time-varying function

Ĵ(n) = −e(n)v(n) = −[d(n) − y(n)]v(n)        (4.27)
The instantaneous gradient vector is defined as the derivative of the estimate Ĵ(n) with respect to the weight vector w(n), as shown by

∇_w Ĵ(n) = ∂Ĵ(n)/∂w(n)        (4.28)

Hence, recognizing that the desired response d(n) is independent of w(n), and the fact that the actual response y(n) has a constant value equal to −1 or +1,² we find that the use of Eq. (4.27) in (4.28) yields

∇_w Ĵ(n) = −[d(n) − y(n)] ∂v(n)/∂w(n)        (4.29)
For the defining equation (4.5), we readily find that³

∂v(n)/∂w(n) = x(n)        (4.30)

Thus the use of Eq. (4.30) in (4.29) yields the result

∇_w Ĵ(n) = −[d(n) − y(n)]x(n)        (4.31)

Hence, according to the error-correction rule described in Chapter 2, we may now express the change applied to the weight vector as

Δw(n) = −η ∇_w Ĵ(n) = η[d(n) − y(n)]x(n)        (4.32)

where η is the learning-rate parameter. This is precisely the correction made to the weight vector as we progress from iteration n to n + 1, as described in Eq. (4.25). We have thus demonstrated that the instantaneous performance function Ĵ(n) defined in Eq. (4.27) is indeed the correct performance measure for a single-layer perceptron (Shynk, 1990).
² To be precise, we should also include a Dirac delta function (i.e., impulse function) whenever the perceptron output y(n) goes through a transition at v(n) = 0. Averaged over a long sequence of iterations, however, such a transition has a negligible effect.

³ Consider a scalar quantity v defined as the inner product of two real-valued p-by-1 vectors w and x, as shown by

v = w^T x = Σ_{i=1}^{p} w_i x_i

where w_i and x_i are the ith elements of the vectors w and x, respectively. The partial derivative of the quantity v with respect to the vector w is defined by (Haykin, 1991)

∂v/∂w = [∂v/∂w_1, ∂v/∂w_2, ..., ∂v/∂w_p]^T

From the given definition of v, we note that

∂v/∂w_i = x_i,   i = 1, 2, ..., p

We therefore have ∂v/∂w = x.
4.5 Maximum-Likelihood Gaussian Classifier

A single-layer perceptron bears a close relationship to a classical pattern classifier known as the maximum-likelihood Gaussian classifier, in that they are both examples of linear classifiers (Duda and Hart, 1973). The maximum-likelihood method is a classical parameter-estimation procedure that views the parameters as quantities whose values are fixed but unknown. The best estimate is defined to be the one that maximizes the probability of obtaining the samples actually observed. To be specific, suppose that we separate a set of samples according to class, so that we have M subsets of samples denoted by X1, X2, ..., XM. Let the samples in subset Xj be drawn independently according to a probability law of known parametric form. Let this probability law be described by the conditional probability density function f(x|wj), where x is an observation vector and wj represents an unknown parameter vector associated with class 𝒞j, j = 1, 2, ..., M. We further assume that the samples in Xi give no information about wj if i ≠ j; that is, the parameters for the different classes are functionally independent. Now, viewed as a function of w, f(x|w) is called the likelihood of w with respect to the observation vector x. The maximum-likelihood estimate of w is, by definition, the value ŵ that maximizes the likelihood f(x|w).
To proceed with the issue at hand, consider a p-by-1 observation vector x defined by
x = [x1, x2, ..., xp]^T        (4.33)
Note that the vector x as defined here is confined exclusively to the externally applied inputs. The observation vector x is described in terms of a mean vector μ and covariance matrix C, which are defined by, respectively,

μ = E[x]        (4.34)

and

C = E[(x − μ)(x − μ)^T]        (4.35)
where E is the expectation operator. Assuming that the vector x is Gaussian-distributed, we may express the joint-probability density function of the elements of x as follows (Van Trees, 1968):

f(x) = 1/[(2π)^{p/2} (det C)^{1/2}] exp[−(1/2)(x − μ)^T C^{-1}(x − μ)]        (4.36)

where det C is the determinant of the covariance matrix C, the matrix C^{-1} is the inverse of C, and p is the dimension of the vector x. To be more specific, suppose that we have a two-class problem, with the vector x characterized as follows, depending on whether it belongs to class 𝒞1 or class 𝒞2:
x ∈ X1:  mean vector = μ1,  covariance matrix = C
x ∈ X2:  mean vector = μ2,  covariance matrix = C
In other words, the vector x has a mean vector whose value depends on whether it belongs to class 𝒞1 or class 𝒞2, and a covariance matrix that is the same for both classes. It is further assumed that:

■ The classes 𝒞1 and 𝒞2 are equiprobable.
■ The samples of class 𝒞1 or 𝒞2 are correlated, so that the covariance matrix C is nondiagonal.
■ The covariance matrix C is nonsingular, so that the inverse matrix C^{-1} exists.

We may then use Eq. (4.36) to express the joint-probability density function of the input vector x as follows:

f(x|𝒞i) = 1/[(2π)^{p/2} (det C)^{1/2}] exp[−(1/2)(x − μi)^T C^{-1}(x − μi)]        (4.37)

where the index i = 1, 2. Taking the natural logarithm of both sides of Eq. (4.37) and then expanding terms, we get

ln f(x|𝒞i) = −(p/2) ln(2π) − (1/2) ln(det C) − (1/2) x^T C^{-1} x + μi^T C^{-1} x − (1/2) μi^T C^{-1} μi        (4.38)
The first three terms on the right-hand side of Eq. (4.38) are independent of the index i. Hence, for the purpose of pattern classification considered here, we may define a log-likelihood as follows:

li(x) = μi^T C^{-1} x − (1/2) μi^T C^{-1} μi        (4.39)

where i = 1, 2. For class 𝒞1, the log-likelihood is

l1(x) = μ1^T C^{-1} x − (1/2) μ1^T C^{-1} μ1        (4.40)

For class 𝒞2, the log-likelihood is

l2(x) = μ2^T C^{-1} x − (1/2) μ2^T C^{-1} μ2        (4.41)
Hence, subtracting Eq. (4.41) from (4.40), we get

l = l1(x) − l2(x)
  = (μ1 − μ2)^T C^{-1} x − (1/2)(μ1^T C^{-1} μ1 − μ2^T C^{-1} μ2)        (4.42)
which is linearly related to the input vector x. To emphasize this property of the composite log-likelihood l, we rewrite Eq. (4.42) in the equivalent form

l = ŵ^T x − θ = Σ_{i=1}^{p} ŵ_i x_i − θ        (4.43)
where ŵ is the maximum-likelihood estimate of the unknown parameter vector w, defined by

ŵ = C^{-1}(μ1 − μ2)        (4.44)

and θ is a constant threshold, defined by

θ = (1/2)(μ1^T C^{-1} μ1 − μ2^T C^{-1} μ2)        (4.45)
The maximum-likelihood Gaussian classifier, or Gaussian classifier for short, may thus be implemented by a linear combiner as shown in the signal-flow graph of Fig. 4.5.
Equations (4.44) and (4.45) simplify considerably for the following special case: all the samples of the vector x are uncorrelated with each other, and also have a common variance σ², regardless of the associated class; that is, the covariance matrix C is a diagonal matrix defined by

C = σ²I        (4.46)
where I is the p-by-p identity matrix. We now have

C^{-1} = (1/σ²)I        (4.47)
Accordingly, Eqs. (4.44) and (4.45) simplify as, respectively,

ŵ = (1/σ²)(μ1 − μ2)        (4.48)
and

θ = (1/(2σ²))(||μ1||² − ||μ2||²)        (4.49)

where ||·|| denotes the Euclidean norm of the enclosed vector.
In any case, treating l as the overall classifier output, the Gaussian classifier, characterized by the parameter vector ŵ and threshold θ, solves the pattern-classification problem in accordance with the following rule:

If l ≥ 0, then l1 ≥ l2, and therefore assign x to class 𝒞1
If l < 0, then l2 > l1, and therefore assign x to class 𝒞2        (4.50)
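The classifier of Eqs. (4.43)-(4.45) together with the decision rule of Eq. (4.50) translates directly into a few lines of code. The sketch below is illustrative only (the function names are not from the text) and assumes that the class means and the common covariance matrix are known:

```python
import numpy as np

def gaussian_classifier_parameters(mu1, mu2, C):
    """Weight vector and threshold of the maximum-likelihood Gaussian classifier (a sketch)."""
    C_inv = np.linalg.inv(C)                               # requires C to be nonsingular
    w = C_inv @ (mu1 - mu2)                                # Eq. (4.44)
    theta = 0.5 * (mu1 @ C_inv @ mu1 - mu2 @ C_inv @ mu2)  # Eq. (4.45)
    return w, theta

def classify(x, w, theta):
    l = w @ x - theta            # composite log-likelihood l of Eq. (4.43)
    return 1 if l >= 0 else 2    # decision rule of Eq. (4.50)
```

For the special case C = σ²I, the same computations reproduce Eqs. (4.48) and (4.49).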
The operation of the Gaussian classifier is analogous to that of the single-layer perceptron in that they are both linear classifiers; see Eqs. (4.2) and (4.43). There are, however, some subtle and important differences between them, which should be carefully noted, as discussed here:

■ The single-layer perceptron operates on the premise that the patterns to be classified are linearly separable. The Gaussian distributions of the two patterns assumed in the derivation of the maximum-likelihood Gaussian classifier certainly overlap each other and are therefore not exactly separable; the extent of the overlap is determined by the mean vectors μ1 and μ2 and the covariance matrices C1 and C2. The nature of this overlap is illustrated in Fig. 4.6 for the special case of a scalar input x (i.e., input dimension p = 1). When the inputs are nonseparable and their distributions overlap as described here, the perceptron convergence algorithm develops a problem in that the decision boundaries between the different classes may oscillate continuously (Lippmann, 1987).
FIGURE 4.5 Signal-flow graph of the Gaussian classifier.

FIGURE 4.6 Two overlapping, one-dimensional Gaussian distributions.
■ The maximum-likelihood Gaussian classifier minimizes the average probability of classification error. This minimization is independent of the overlap between the underlying Gaussian distributions of the two classes. For example, in the special case illustrated in Fig. 4.6, the maximum-likelihood Gaussian classifier always positions the decision boundary at the point where the two Gaussian distributions cross each other. This positioning of the decision boundary is perfectly in accord with the optimum Bayes hypothesis-testing procedure, on the assumption that the two classes are equiprobable (i.e., there is no prior information to be incorporated into the decision making).
■ The perceptron convergence algorithm is nonparametric in the sense that it makes no assumptions concerning the form of the underlying distributions; it operates by concentrating on errors that occur where the distributions overlap. It may thus be more robust than classical techniques, and work well when the inputs are generated by nonlinear physical mechanisms whose distributions are heavily skewed and non-Gaussian (Lippmann, 1987). In contrast, the maximum-likelihood Gaussian classifier is parametric; its derivation is contingent on the assumption that the underlying distributions are Gaussian, which may limit its area of application.
■ The perceptron convergence algorithm is both adaptive and simple to implement; its storage requirement is confined to the set of synaptic weights and threshold. On the other hand, the design of the maximum-likelihood Gaussian classifier is fixed; it can be made adaptive, but at the expense of increased storage requirement and more complex computations (Lippmann, 1987).
4.6 Discussion

The study of the single-layer perceptron presented in this chapter has been formulated in the context of the McCulloch-Pitts formal model of a neuron. The nonlinear element of this model is represented by a hard limiter at its output end. It would be tempting to think that we can do better if we were to use a sigmoidal nonlinear element in place of the hard limiter. Well, it turns out that the steady-state, decision-making characteristics of a single-layer perceptron are basically the same, regardless of whether we use hard-limiting or soft-limiting as the source of nonlinearity in the neural model (Shynk and Bershad, 1991, 1992; Shynk, 1990).
We may therefore state formally that so long as we limit ourselves to the model of a neuron that consists of a linear combiner followed by a nonlinear element, then, regardless of the form of nonlinearity used, a single-layer perceptron can perform pattern classification only on linearly separable patterns.
Linear separability requires that the patterns to be classified be sufficiently separated from each other to ensure that the decision surfaces consist of hyperplanes. This requirement is illustrated in Fig. 4.7a for the case of a two-dimensional, single-layer perceptron. If the two patterns 𝒞1 and 𝒞2 are now allowed to move too close to each other, as in Fig. 4.7b, they become nonlinearly separable and the elementary perceptron fails to classify them.
The first real critique of Rosenblatt's perceptron was presented by Minsky and Selfridge (1961). Minsky and Selfridge pointed out that the perceptron as defined by Rosenblatt could not even generalize toward the notion of binary parity, let alone make general abstractions. They also suggested the roles that connectionist networks might play in the implementation of larger systems. The computational limitations of Rosenblatt's perceptron were subsequently put on a solid mathematical foundation in the famous book by Minsky and Papert (1969, 1988). This is a thorough and well-written book. After the presentation of some brilliant and highly detailed mathematical analysis of the perceptron, Minsky and Papert proved that the perceptron as defined by Rosenblatt is inherently incapable of making some global generalizations on the basis of locally learned examples. In the last chapter of their book, Minsky and Papert go on to make the conjecture that the limitations of the kind they had discovered for Rosenblatt's perceptron would also hold true for its variants, more specifically, multilayer neural networks. Quoting from Section 13.2 of the book by Minsky and Papert (1969):
FIGURE 4.7 (a) A pair of linearly separable patterns. (b) A pair of nonlinearly separable patterns.
The perceptron has shown itself worthy of study despite (and even because of !) its severe limitations. It has many features to attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgement that the extension to multilayer systems is sterile.
This conjecture cast serious doubt on the computational capabilities of not only the perceptron but neural networks in general up to the mid-1980s. History has shown, however, that the conjecture made by Minsky and Papert seems to be unjustified in that we now have several advanced forms of neural networks that are computationally more powerful than Rosenblatt’s perceptron. For example, multilayer perceptrons trained with the back-propagation algorithm discussed in Chapter 6 and the radial basis-function networks discussed in Chapter 7 overcome the computational limitations of the single-layer perceptron.
PROBLEMS

4.1 The single-layer perceptron is a linear pattern classifier. Justify this statement.
4.2 Consider two one-dimensional, Gaussian-distributed classes 𝒞1 and 𝒞2 that have a common variance equal to 1. Their mean values are as follows:

μ1 = −10
μ2 = +10

These two classes are essentially linearly separable. Design a classifier that separates these two classes.
4.3 Verify that Eqs. (4.22)-(4.25), summarizing the perceptron convergence algorithm, are consistent with Eqs. (4.7) and (4.8).
4.4 Equations (4.44) and (4.45) define the weight vector and threshold of a maximum-likelihood Gaussian classifier, assuming that the samples of classes 𝒞1 or 𝒞2 are correlated. Suppose now that these samples are uncorrelated but have different variances, so that for both classes we may write

C = diag[σ1², σ2², ..., σp²]

Find the weight vector and threshold of the classifier so defined.
4.5 In Eq. (4.27) we defined one form of an instantaneous performance function for a single-layer perceptron. Another way of expressing this performance measure is as follows (Shynk, 1990):

Ĵ(n) = |v(n)| − d(n)v(n)

where v(n) is the hard-limiter input and d(n) is the desired response. Using this performance measure, find the instantaneous gradient vector ∇_w Ĵ(n), and show that the expression so derived is identical with that in Eq. (4.31).

4.6 The perceptron may be used to perform numerous logic functions. Demonstrate the implementation of the binary logic functions AND, OR, and COMPLEMENT. However, a basic limitation of the perceptron is that it cannot implement the EXCLUSIVE OR function. Explain the reason for this limitation.