NEURAL NETWORKS STUDY NOTES

Chapter 1

This chapter covers the principal concepts of pattern recognition. It deals with:
- Polynomial curve fitting
- Parameter optimization
- Generalization
- Model complexity
- Pattern recognition -> probability, decision criteria, Bayes' theorem
1.1 Character recognition

Pattern recognition problems are a class of information processing problems. An approach will be developed based on sound theoretical concepts. Information is normally probabilistic in nature, and a statistical framework provides a means of both operating on the data and representing the results.

Consider image data represented by a vector x = (x_1, ..., x_d)^T, where d is the total number of variables. The image has to be classified into one of the classes C1, C2, ... up to Ck, where k is the number of classes. The aim is to develop an algorithm or system to classify such data. To do this, a collection of known data (a data set or sample) is gathered for each class and used in the development of a classification algorithm. The collection of available data is known as a training set. It is normally smaller than the total number of possible variations within a class, so the algorithm should be able to correctly classify data which was not used in the training set - this is generalization.

e.g. for a d-dimensional space with d = 20, a four-bit representation would mean 2^20 x 4 elements in a sample - too big a number. To reduce this, variables can be combined into features, which reduces the data set considerably. Features could be ratios, averages, etc., depending on the problem. A feature can then be used to determine a decision boundary or threshold, provided there is a marked distinction between the classes for that particular feature, e.g. elements of class C2 have bigger values of feature x than elements of class C1. Overlaps may occur. See fig 1.2.
Fig 1.1, fig 1.2. Misclassifications are inevitable. They can be reduced by increasing the number of features - better decision boundaries can be obtained with more features - but the number of features itself should not exceed a certain limit (see section 1.4).
1.2 Classification and regression

The overall system maps a set of input variables (from which features are extracted) to an output variable y, or to several outputs y_k, where k = 1..c and c is the number of outputs. The exact mapping is not known in advance, so it is determined with the help of a set of training data. The data is fed into some mapping function with adjustable weights: y_k = y_k(x; w), where w is a vector/matrix of parameters or weights. A neural network is a particular choice of a set of functions (or functional forms) y_k(x; w), together with procedures for optimizing the mapping.

In classification problems we seek to approximate the probabilities of membership of the different classes, expressed as functions of the input variables. Apart from classification, another pattern recognition task is regression. In regression, future or unknown values are predicted given past values; the output represents the value of a continuous variable. Put differently: given a data set, the relationship between the data is determined, i.e. the function connecting the data. This approximated function is then used to determine values not present in the data set. In regression, it is the regression function which is approximated (see example).

A classical definition of regression: a term coined by Galton for the tendency of the quantitative traits of offspring to be closer to the population mean than are their parents' traits.
It arises from a combination of factors - dominance, gene interactions, and environmental influences on traits.
1.3 Pre-processing and feature extraction

The sequence of operations on the data is:
- pre-processing / feature extraction: a fixed transformation of the variables
- the network itself: contains the adaptive parameters
- post-processing: converting the outputs to the required form

Some definitions:
Translational invariance - a classification system whose decisions are insensitive to the location of the object (data) in an image (space).
Scale invariance - a feature does not depend on the size of the object.
Prior knowledge - information known about the desired form of the solution.
1.4 The curse of dimensionality
Increasing the number of features can improve the performance of the classification system; however, beyond a critical point, adding more features reduces its performance. Imagine two input variables x1 and x2 (d = 2). If each variable is divided into M divisions, the number of cells grows as M^d:
M = 2: M^d = 2^2 = 4;  M = 3: M^d = 3^2 = 9.
Fig 1.3
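The exponential growth of the cell count is easy to check directly; a minimal sketch (plain Python, nothing assumed beyond M and d from the text):

```python
# Number of cells when each of d input variables is split into M divisions.
def cells(M: int, d: int) -> int:
    return M ** d

# d = 2 reproduces the worked example above.
print(cells(2, 2))   # 4
print(cells(3, 2))   # 9

# For higher-dimensional inputs the count explodes.
for d in (2, 5, 10, 20):
    print(d, cells(3, d))
```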
The number of cells grows exponentially with the dimensionality of the input space, which means the amount of training data must grow by at least the same factor. That is the curse. Two properties come to the rescue:
- Since real data is somewhat correlated, it tends to fill only a subspace of lower dimensionality - the intrinsic dimensionality.
- Outputs do not change arbitrarily from one region of input space to another, so output values at intermediate points can be inferred, as in interpolation.
These properties make a reduction of the number of inputs possible, with the aim of reducing dimensionality.

Polynomial curve fitting

We are trying to fit a polynomial of order M to N data points:

y(x; w) = w_0 x^0 + w_1 x^1 + ... + w_M x^M = Σ_{j=0}^{M} w_j x^j

The parameters w_j are determined by minimizing the error function
E = (1/2) Σ_{n=1}^{N} ( y(x_n; w) - t_n )^2

where x_n is the nth data point and t_n is the desired output (it can be x_n itself or some other 'prior' value).

Supervised learning - learning which involves a known target value.
Unsupervised learning - no target; the goal is not an input-output mapping but to model the probability distribution of the data or some other inherent structure.
Reinforcement learning - information is supplied to determine the quality of the output, but no actual target values are given.

A system must generalize well in order to cope with noise. Test data is used to determine a system's generalization capabilities; it is generated in the same way as the training data (including noise) but is not used for training. The order of the polynomial gives the number of degrees of freedom. Curve fitting improves as the degrees of freedom increase - up to a certain point. The curve should not fit the data exactly: that would be over-fitting, which gives a poor representation of the original underlying function.
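Since y(x; w) is linear in the weights, minimizing E is a linear least-squares problem. A minimal sketch using NumPy (the quadratic data here is invented for illustration):

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Minimize E = 1/2 * sum_n (y(x_n; w) - t_n)^2 for an order-M polynomial."""
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix: columns x^0 .. x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # least-squares solution for w
    return w

def predict(w, x):
    return np.polyval(w[::-1], x)                # y(x; w) = sum_j w_j x^j

# Invented example: noisy samples of 1 + 2x - 3x^2.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = 1.0 + 2.0 * x - 3.0 * x**2 + 0.01 * rng.standard_normal(20)

w = fit_polynomial(x, t, M=2)
print(w)                 # close to the generating coefficients [1, 2, -3]
print(predict(w, 0.5))   # y(0.5; w)
```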
1.5 Generalization

E_RMS = sqrt( (1/N) Σ_{n=1}^{N} ( y(x_n; w*) - t_n )^2 )

This is known as the root-mean-square error, where w* denotes the best set of weights (minimum error).

Example. Consider the function h(x) = 0.5 + 0.4 sin(2πx).
Fig 1.6: M = 1, bad fit.
Fig 1.7: M = 3, good fit.
Fig 1.8: M = 10, over-fitting.
Fig 1.9: The test-set error can be used to determine the best order; M = 3 gives the minimum.
For 2-D data:
Fig 1.11: good decision boundary.
Fig 1.12: over-fitted boundary.
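The behaviour in figs 1.6-1.9 can be reproduced numerically: fit polynomials of increasing order to noisy samples of h(x) = 0.5 + 0.4 sin(2πx) and compare training and test RMS errors. A sketch (NumPy; the sample sizes and noise level are assumptions for illustration):

```python
import numpy as np

def h(x):
    return 0.5 + 0.4 * np.sin(2 * np.pi * x)

def rms(w, x, t):
    """E_RMS from section 1.5 (w in np.polyfit order, highest power first)."""
    return np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 30)
t_train = h(x_train) + 0.05 * rng.standard_normal(x_train.size)
x_test = np.linspace(0.01, 0.99, 50)          # similarly generated, unseen data
t_test = h(x_test) + 0.05 * rng.standard_normal(x_test.size)

for M in (1, 3, 10):
    w = np.polyfit(x_train, t_train, M)       # least-squares fit of order M
    print(M, rms(w, x_train, t_train), rms(w, x_test, t_test))
# Training error keeps falling as M grows, while test error is typically
# smallest at a moderate order (compare fig 1.9).
```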
A model with little flexibility has a high bias. A model with too much flexibility has a high variance.
1.6 Model complexity

As can be seen above (and from Occam's razor), simple models are better than complex ones - but they should not be too simple (M = 1). Apart from choosing the order M, another approach to controlling the effective complexity of a model is to add a penalty Ω (a regularization term) to the error function:

Ê = E + νΩ

where ν determines how much Ω influences the solution.
Ω increases with complexity (curvature):

Ω = (1/2) ∫ ( d²y/dx² )² dx

so after substitution,

Ê = (1/2) Σ_{n=1}^{N} ( y(x_n; w) - t_n )^2 + ν (1/2) ∫ ( d²y/dx² )² dx

This helps reduce both bias and variance. The limit is set by the amount of noise.
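The curvature integral above is awkward to code directly; a common practical stand-in (an assumption here, not the text's exact penalty) is weight decay, Ω = (1/2)||w||², which also penalizes over-complex solutions and gives a closed-form answer. A sketch:

```python
import numpy as np

def fit_regularized(x, t, M, nu):
    """Minimize E_hat = E + nu * (1/2)*||w||^2 - a weight-decay stand-in
    for the curvature penalty Omega in the text."""
    Phi = np.vander(x, M + 1, increasing=True)   # columns x^0 .. x^M
    A = Phi.T @ Phi + nu * np.eye(M + 1)         # penalized normal equations
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 15)
t = 0.5 + 0.4 * np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(15)

w_weak = fit_regularized(x, t, M=10, nu=1e-4)
w_strong = fit_regularized(x, t, M=10, nu=1e-1)
# Larger nu pulls the weights towards zero, smoothing the fitted curve.
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```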
1.7 Multivariate non-linear functions

Mapping polynomials can be extended to higher dimensions. E.g. for d input variables and 1 output variable, the mapping could be represented as

y = w_0 + Σ_{i1=1}^{d} w_{i1} x_{i1} + Σ_{i1=1}^{d} Σ_{i2=1}^{d} w_{i1 i2} x_{i1} x_{i2} + Σ_{i1=1}^{d} Σ_{i2=1}^{d} Σ_{i3=1}^{d} w_{i1 i2 i3} x_{i1} x_{i2} x_{i3} + ...

However, the number of independent parameters would grow as d^M, which would need huge amounts of training data. The importance of neural networks lies in how they deal with this problem of scaling and dimensionality: they represent non-linear functions of many variables in terms of superpositions of non-linear functions of a single variable. These are known as hidden functions and are adapted to the data. We shall consider the multi-layer perceptron and the radial basis function network. Their error falls as O(1/M), where M is the number of hidden units, whereas for polynomials the error decreases only as O(1/M^(2/d)). The price is that training is computationally intensive and the error function has multiple minima.
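The d^M growth can be verified by counting terms. For comparison, a network with one hidden layer of M units and one output has only about M(d + 1) + M + 1 parameters (a standard count, assumed here for illustration). A sketch:

```python
def poly_params(d: int, M: int) -> int:
    # Naive term count for an order-M polynomial in d variables:
    # 1 + d + d^2 + ... + d^M (one weight per ordered index tuple).
    return sum(d ** m for m in range(M + 1))

def mlp_params(d: int, M: int) -> int:
    # One hidden layer of M units, one output:
    # M*(d+1) hidden weights and biases, plus M+1 output weights and bias.
    return M * (d + 1) + M + 1

# Parameter counts grow explosively for the polynomial, linearly for the network.
for d in (2, 10, 100):
    print(d, poly_params(d, 3), mlp_params(d, 3))
```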
1.8 Bayes' Theorem

Without knowing anything about the features or characteristics of the data, prior probabilities can be determined from a set of data. In a classification setting, the fraction of a class in the total number of data samples is taken as its prior probability. e.g. if a sample of letters contains 3 'A's and 4 'B's, the prior probabilities of A and B are P(A) = 3/7 and P(B) = 4/7 respectively. Using priors alone, every new character would be assigned to 'B', since it has the higher prior probability. If 2 more characters were sampled and the prior probabilities became A: 3/11, B: 4/11, C: 4/11, then each new character would have to be assigned arbitrarily between B and C. This shows that prior probabilities are an insufficient means of classification - although they still have to be taken into account. Other criteria must be introduced, and this is where the features of the data come in.

Some definitions:

Joint probability P(C_k, X_l) - the probability that an object has feature value X_l and belongs to class C_k, where k = 1, 2, ..., n (n is the total number of classes) and l runs over the feature values.

Conditional probability P(X_l | C_k) - the probability that an observation has feature value X_l given that it belongs to class C_k. It is a measure of how predominant a feature is within a particular class - how much this feature is synonymous with that class.

Example. Consider two classes C1 and C2 and a feature X:
Feature    X1    X2    Total
C1         19    41     60
C2         12    28     40
Total      31    69    100
Row C1 = 60, row C2 = 40, so the total number of samples = 60 + 40 = 100.

P(C1) = 60 / Total samples = 60/100 {probability of a randomly selected sample belonging to C1}
P(X2) = 69/100 {probability that a randomly selected sample has feature value X2}
P(C1|X2) = 41/69, from the definition. Similarly, P(X2|C1) = 41/60. P(X2|C1) is the probability that a sample has feature value X2 given that its class is C1. It is also known as the class-conditional probability of X2 for class C1.
P(C1, X2) = P(X2|C1) P(C1) = (41/60) × (60/100) = 0.41

or

P(C1, X2) = P(C1|X2) P(X2) = (41/69) × (69/100) = 0.41
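These quantities can all be reproduced directly from the count table; a minimal sketch, with the counts hard-coded from the example:

```python
# Counts from the example table: keys are (class, feature value).
counts = {
    ("C1", "X1"): 19, ("C1", "X2"): 41,
    ("C2", "X1"): 12, ("C2", "X2"): 28,
}
N = sum(counts.values())                        # 100 samples in total

def prior(c):                                   # P(C): row total / N
    return sum(v for (ck, _), v in counts.items() if ck == c) / N

def marginal(x):                                # P(X): column total / N
    return sum(v for (_, xl), v in counts.items() if xl == x) / N

def class_conditional(x, c):                    # P(X|C): cell / row total
    return counts[(c, x)] / sum(v for (ck, _), v in counts.items() if ck == c)

def posterior(c, x):                            # Bayes' theorem: P(C|X)
    return class_conditional(x, c) * prior(c) / marginal(x)

print(posterior("C1", "X2"))                    # 41/69
print(class_conditional("X2", "C1") * prior("C1"))  # joint P(C1, X2) = 0.41
```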
It can also be seen directly from the table that the (joint) probability of a sample having feature X2 and belonging to class C1 is 41 out of 100 samples = 0.41. Since P(X2|C1) P(C1) = P(C1|X2) P(X2), we have

P(C1|X2) = P(X2|C1) P(C1) / P(X2)

The conditional probability on the left is known as the posterior probability, and the relation above is known as Bayes' theorem.

An excerpt from the web:
Let's use the same example, but shorten each event to its one-letter initial, i.e. A, B, C and D instead of Aberations, Brochmailians, Chompieliens and Defective. P(D|B) is not a Bayes problem - it is given in the problem. Bayes' formula finds the reverse conditional probability, P(B|D). It is based on the fact that the given event D is made up of three parts: the part of D in A, the part of D in B, and the part of D in C.

P(B|D) = P(B and D) / [ P(A and D) + P(B and D) + P(C and D) ]

Inserting the multiplication rule for each of these joint probabilities gives

P(B|D) = P(D|B) P(B) / [ P(D|A) P(A) + P(D|B) P(B) + P(D|C) P(C) ]
However, and I hope you agree, it is much easier to take the joint probability divided by the marginal probability. The table does the adding for you and makes the problems doable without having to memorize the formulas.

Company              Good                    Defective             Total
(A) Aberations       0.50 - 0.025 = 0.475    0.05(0.50) = 0.025    0.50
(B) Brochmailians    0.30 - 0.021 = 0.279    0.07(0.30) = 0.021    0.30
(C) Chompieliens     0.20 - 0.020 = 0.180    0.10(0.20) = 0.020    0.20
Total                0.934                   0.066                 1.00
N.B. The marginal probability here is P(D) = 0.066. P(C1) and P(C2) in our earlier example are also marginal probabilities. Note that the total of the marginal probabilities adds to 1.00, whereas P(A and D) + P(B and D) + P(C and D) = 0.025 + 0.021 + 0.020 = 0.066 = P(D).
Thus it can be said that the denominator acts as a normalizing factor, i.e. the conditional (posterior) probabilities all add up to unity. Therefore

P(C1|X2) + P(C2|X2) = 1

From the excerpt,

P(B|D) + P(C|D) + P(A|D) = P(B,D)/P(D) + P(C,D)/P(D) + P(A,D)/P(D) = 0.021/0.066 + 0.020/0.066 + 0.025/0.066 = 1
N.B. Dividing by P(D) ensures that the sum of the conditional probabilities is unity; without it, the sum would be 0.066. P(D) is a NORMALIZING FACTOR.
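The normalization can be checked numerically; a sketch using the priors and defect rates from the excerpt's table:

```python
# Priors P(company) and defect rates P(D|company) from the excerpt's table.
prior = {"A": 0.50, "B": 0.30, "C": 0.20}
p_def = {"A": 0.05, "B": 0.07, "C": 0.10}

# Marginal P(D): the normalizing factor.
pD = sum(p_def[c] * prior[c] for c in prior)
print(pD)                                          # ~ 0.066

# Posteriors P(company|D) sum to unity because of the division by P(D).
post = {c: p_def[c] * prior[c] / pD for c in prior}
print(post["B"], sum(post.values()))
```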
From the example it is clear that P(C1, X2) + P(C2, X2) = P(X2): these are 41% {C1 and X2} + 28% {C2 and X2} = 69%.
The relation above, P(C1, X2) + P(C2, X2) = P(X2), can also be written as

P(X2) = P(X2|C1) P(C1) + P(X2|C2) P(C2)

so P(X2) has been expressed in terms of the prior probabilities and the class-conditional probabilities.

1.8.1 Inference and decision

The importance of Bayes' theorem is that posterior probabilities can be calculated and expressed using easily obtainable quantities - in the examples, the values in the tables, which come directly from the data. An object with a particular feature value X_l is classified by assigning it to the class C_k for which the posterior probability P(C_k|X_l) is greatest. This is Bayes' rule. In practice, care must be taken in data collection to avoid large disparities between our expectations (calculated posterior probabilities) and reality; in particular, prior probabilities must be correctly accounted for. Implementing Bayes' theorem means evaluating the class-conditional and prior probabilities separately, then using Bayes' theorem to evaluate the posterior probabilities. The outputs of neural networks can be interpreted as posterior probabilities provided that the error function is chosen appropriately.
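Bayes' rule - assign to the class with the largest posterior - is a one-liner once the posteriors are available. A sketch using the C1/C2 count table from earlier:

```python
# Joint counts from the earlier table: class -> {feature value: count}.
joint_counts = {"C1": {"X1": 19, "X2": 41}, "C2": {"X1": 12, "X2": 28}}

def classify(x):
    """Bayes' rule: pick the class C_k maximizing P(C_k|X_l).
    P(X_l) is common to all classes, so comparing joint counts suffices."""
    return max(joint_counts, key=lambda c: joint_counts[c][x])

print(classify("X2"))   # C1, since P(C1|X2) = 41/69 > P(C2|X2) = 28/69
print(classify("X1"))   # C1, since 19/31 > 12/31
```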
1.8.2 Bayesian versus frequentist statistics
So far we have implicitly assumed that the number of samples or observations tends to infinity; this is the frequentist view of probability. More generally, probabilities express our degree of belief that a particular event will occur, and Bayes' theorem is a precise, quantitative prescription for updating these beliefs when new data is presented.
1.8.3 Probability densities

Feature variables can also be regarded as continuous. Notation: upper-case P denotes a probability, lower-case p a probability density.

P(x ∈ [a, b]) = ∫_a^b p(x) dx    (1.14)

p(x) is normalized so that the probability of x lying somewhere in its range is 1. The more general form, for a region R of x-space, is

P(x ∈ R) = ∫_R p(x) dx    (1.15)

The average or expectation of a function Q(x) is

E[Q] = ∫ Q(x) p(x) dx    (1.16)

which, for a finite set of data points drawn from p(x), can be approximated as

E[Q] = ∫ Q(x) p(x) dx ≈ (1/N) Σ_{n=1}^{N} Q(x_n)    (1.17)
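Approximation (1.17) is easy to check with an invented example: take Q(x) = x² under a standard normal density, for which the exact expectation is 1. A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)      # x_n drawn from p(x) = N(0, 1)

est = np.mean(x ** 2)                 # (1/N) * sum_n Q(x_n), with Q(x) = x^2
print(est)                            # close to 1.0, the exact E[x^2]
```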
1.8.4 Bayes' theorem in general

Bayes' theorem can be rewritten to incorporate probability densities:

P(C_k|x) = p(x|C_k) P(C_k) / p(x)    (1.21)

p(x) = Σ_{k=1}^{c} p(x|C_k) P(C_k)    (1.22)

Σ_{k=1}^{c} P(C_k|x) = 1    (1.23)

When the class-conditional densities are viewed as parameterized functions, they are referred to as likelihood functions. Bayes' theorem can then be summarized as

posterior = (likelihood × prior) / normalization factor    (1.24)
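With continuous features, the same computation uses class-conditional densities in place of table counts. A sketch with Gaussian class-conditionals (the means, standard deviations and priors are invented for illustration):

```python
import math

def gauss(x, mu, sigma):
    """Gaussian density p(x|C) with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = {"C1": 0.6, "C2": 0.4}                    # P(C_k), assumed
params = {"C1": (0.0, 1.0), "C2": (2.0, 1.0)}      # (mu, sigma) per class, assumed

def posteriors(x):
    # Eq. (1.21)-(1.22): P(C_k|x) = p(x|C_k)P(C_k) / sum_j p(x|C_j)P(C_j)
    joint = {c: gauss(x, *params[c]) * priors[c] for c in priors}
    px = sum(joint.values())                       # normalizing factor p(x)
    return {c: j / px for c, j in joint.items()}

post = posteriors(0.5)
print(post, sum(post.values()))                    # posteriors sum to 1
```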
1.9 Decision Boundaries
1.10 Minimizing risk