Pattern Recognition 34 (2001) 2459–2466
Automatic extraction of eye and mouth fields from a face image using eigenfeatures and multilayer perceptrons

Yeon-Sik Ryu, Se-Young Oh*

Department of Electrical Engineering, Intelligent Systems Laboratory, Pohang University of Science and Technology (POSTECH), Pohang, KyungBuk 790-784, South Korea

Received 22 September 1999; received in revised form 18 September 2000; accepted 6 October 2000
Abstract

This paper presents a novel algorithm for the extraction of the eye and mouth (facial features) fields from 2-D gray-level face images. The fundamental philosophy is that eigenfeatures, derived from the eigenvalues and eigenvectors of the binary edge data set constructed from the eye and mouth fields, are very good features for locating these fields efficiently. The eigenfeatures extracted from the positive and negative training samples of the facial features are used to train a multilayer perceptron whose output indicates the degree to which a particular image window contains an eye or a mouth. It turns out that only a small number of frontal faces are sufficient to train the networks. Furthermore, they lend themselves to good generalization to non-frontal poses and even other people's faces. It has been experimentally verified that the proposed algorithm is robust against facial size and slight variations of pose. © 2001 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Facial feature; Eye and mouth fields; Eigenfeature; Multilayer perceptron; Positive (negative) sample
1. Introduction

It is known that viewing a person's eyes and mouth is essential in understanding the information and feelings they convey. Because of this, automatic extraction of the eyes and mouth from a person's face can be very useful in many applications. Many researchers have proposed methods to find the eye and mouth regions [1–3] or to locate the face region [4–8] in an image. These methods can be classified by their use of three types of information: template matching, intensity and geometrical features. In general, template matching requires many templates to accommodate varying pose, whereas the intensity method requires good lighting conditions. In his pioneering work, Kanade [5] used a Laplacian operator to find an edge image. Since the Laplacian operator does not give directional information on edges, Brunelli [7]
* Corresponding author. Tel.: +82-54-279-2214; fax: +82-54-279-2903. E-mail address: [email protected] (S.-Y. Oh).
used horizontal gradients to detect the head top, eyes, nose base and mouth. He found the location of the eyes using template matching. However, this requires many templates to correctly locate the eyes while accommodating varying pose. Beymer [6] used a five-level hierarchical system to locate the eyes and nose lobes. He used many facial feature templates covering different poses and different people. Juell et al. [8] proposed a neural network structure to detect human faces in an image. They used three child-level neural networks to locate the eyes, nose and mouth. But they used enhanced gray images as input to the neural networks, thus requiring too many input neurons and training samples to work well with various poses. In this paper, we present a novel algorithm for the extraction of the eye and mouth fields from 2-D gray-level face images. The fundamental philosophy is that eigenfeatures, derived from the eigenvalues and eigenvectors of the binary edge data set constructed from the eye and mouth fields, are very good features for locating these fields. The eigenfeatures are extracted from the positive and negative training samples of the facial features and
are used to train a multilayer perceptron (MLP) whose output indicates the degree to which a particular image window contains the complete portion of an eye or the mouth within itself. The window template within a test image is then shifted in image space, and the one with the maximum MLP output represents the eye or mouth field. It turns out that only a small number of frontal faces are sufficient to train the networks, since they lend themselves to good generalization to non-frontal poses and even other people's faces. Section 2 describes coarse extraction of the eye and mouth fields using geometrical constraints. Section 3 presents fine extraction of the facial features using neural networks. Experimental results are reported in Section 4.
2. Coarse extraction of the eye and mouth fields from geometrical constraints

The facial images used in this research consist of 92×112 pixels and have been obtained from the Olivetti Research Lab (ORL) database. They consist of frontal views as well as side views containing both eyes and the mouth. The basic assumption in the database is that the eyes and mouth reside around the center of the image against a uniform background. Fig. 1 shows the overall system for extraction of the facial features. Fig. 2 shows some of the original images from which to extract the three facial features of interest. The first five images of Fig. 2(a) were used to extract the eye and mouth regions while the last was used only to extract the mouth. Note
Fig. 2. The face images used for training and testing the proposed algorithm: (a) the images used to extract the training set for the MLP detector and (b) the images used to extract the test set.
that only frontal images were used to train the networks since otherwise, there would be too many pose variations to consider. Fig. 2(b) shows a portion of the test set used to estimate the generalization performance.

2.1. Extraction of the face region
Fig. 1. The system block diagram.
It was decided not to go through an a priori normalization and edge enhancement process to make the eye location and facial size fixed, since the locations of the eyes and mouth all differ according to the facial size and pose. Instead, a binary edge dominance map [7] was used first to roughly identify the face region, excluding the hair and ears, within which the coarse locations of the eyes and mouth are to be found. This limits the size of the coarse regions containing these three features, thus easing the computational burden. The horizontal and vertical projection information obtained from the edge map will be used to estimate the approximate locations of the eyes and mouth [7]. From the gray image I(i, j), the vertical and horizontal edge maps I_{VE} and I_{HE} are obtained first. These are the binary images which have been computed from the original image through the application of the vertical (horizontal) gradients followed by proper thresholding. Then, the vertical and horizontal projections of these maps, shown
in Fig. 3, are calculated as

H_v(i) = \sum_{j=1}^{N_R} I_{VE}(i, j), \quad 1 \le i \le N_C, \qquad (1)

H_h(j) = \sum_{i=1}^{N_C} I_{HE}(i, j), \quad 1 \le j \le N_R, \qquad (2)

where N_R = 112 and N_C = 92. From these histograms, the face region R(x_L, y_H, x_R, y_L) is established, where

y_H = \arg\max_j H_h(j), \quad 1 \le j \le N_R/3, \qquad (3)

y_L = \arg\max_j H_h(j) + \Delta y, \quad N_R/2 \le j \le N_R, \qquad (4)

x_L = \arg\max_i H_v(i), \quad 1 \le i \le N_C/2, \qquad (5)

x_R = \arg\max_i H_v(i), \quad N_C/2 \le i \le N_C, \qquad (6)

and R(x_L, y_H, x_R, y_L) is the rectangle with (x_L, y_H), (x_R, y_H), (x_L, y_L) and (x_R, y_L) as its four vertices. The search boundary y_H in Eq. (3) is determined from the geometrical observation that the hair boundary line exists in the upper region of the guessed face region. The margin \Delta y in Eq. (4) has been set to allow enough room around y_L to enclose the entire mouth, since otherwise y_L could pass through the upper, middle or lower part of the lip according to different poses of the face. A code sketch of this projection step is given after the assumption list below.

Fig. 3. Determination of a face boundary: (a) vertical edge dominance map I_{VE}(i, j); (b) horizontal edge dominance map I_{HE}(i, j); (c) vertical integral projection of (a), H_v(i); and (d) horizontal integral projection of (b), H_h(j).

2.2. Extraction of the eyes and mouth

The following heuristic assumptions are made in order to simplify the process (see Fig. 4(a)):

• The eyes remain within the upper half of the face region. Their exact positions, however, can vary according to the height of the forehead and the facial pose.
• The mouth remains within the lower half of the face region.
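As an illustration of the coarse face-region step of Section 2.1, the following is a minimal NumPy/SciPy sketch of Eqs. (1)–(6). The Sobel operator, the binarization threshold `thresh` and the mouth margin `dy` (the \Delta y of Eq. (4)) are illustrative assumptions; the paper states only that "proper thresholding" and a margin are applied, without giving numbers.

```python
import numpy as np
from scipy import ndimage

def face_region(gray, dy=10, thresh=40):
    """Coarse face region R(x_L, y_H, x_R, y_L) from edge projections.

    gray: 2-D array indexed [j, i] (row j, column i).
    dy and thresh are illustrative values, not taken from the paper.
    """
    n_r, n_c = gray.shape                  # N_R = 112, N_C = 92 for ORL images
    g = gray.astype(float)
    # Binary edge dominance maps: vertical edges respond to the horizontal
    # gradient (axis 1), horizontal edges to the vertical gradient (axis 0).
    i_ve = np.abs(ndimage.sobel(g, axis=1)) > thresh
    i_he = np.abs(ndimage.sobel(g, axis=0)) > thresh

    h_v = i_ve.sum(axis=0)                 # Eq. (1): H_v(i), column sums
    h_h = i_he.sum(axis=1)                 # Eq. (2): H_h(j), row sums

    y_h = int(np.argmax(h_h[: n_r // 3]))                   # Eq. (3)
    y_l = n_r // 2 + int(np.argmax(h_h[n_r // 2:])) + dy    # Eq. (4)
    x_l = int(np.argmax(h_v[: n_c // 2]))                   # Eq. (5)
    x_r = n_c // 2 + int(np.argmax(h_v[n_c // 2:]))         # Eq. (6)
    return x_l, y_h, x_r, min(y_l, n_r - 1), i_ve, i_he
```

The edge maps are returned along with the region so that the eye and mouth search of Section 2.2 can reuse them.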
In order to locate the eyes, it is necessary to separate the eyebrow from the eye using the information contained in the vertical edge map I_{VE}. This is facilitated by observing that although the horizontal edge components dominate in both the eye and the eyebrow, the eye also contains more vertical edge components than the eyebrow. Therefore, the horizontal integration of I_{VE} is performed and the row index E_c resulting in the maximum value is found as

H_{hv}(j) = \sum_{i=1}^{N_C} I_{VE}(i, j), \quad 1 \le j \le \tfrac{2}{3}(y_L - y_H), \qquad (7)

E_c = \arg\max_j H_{hv}(j). \qquad (8)
Since E_c represents the row with the strongest vertical edge component within the region containing the two eyes, the eye region is centered around that row. However, even this statement may not be correct when the eyes are not horizontal due to changes in pose or the lighting conditions. Therefore, the eye region is tentatively set as the one centered around E_c with a ±15 pixel safety margin to prevent the eye or mouth from being cut. The final eye region is then found by assuming that the eyes are more likely to reside in the upper or lower eye
Fig. 4. The 3-stage process of determining the left eye, right eye and the mouth regions.
regions with respect to the center line E_c. Under this assumption, the Eye field is determined from the horizontal edge components H_h(j) defined in Eq. (2) as

Eye = \begin{cases} R(x_L, E_c - 5, x_R, E_c + 10) & \text{if } A \ge \alpha B, \\ R(x_L, E_c - 10, x_R, E_c + 5) & \text{if } B \ge \alpha A, \\ R(x_L, E_c - 10, x_R, E_c + 10) & \text{otherwise}, \end{cases} \qquad (9)

where

A = \sum_{j=E_c}^{E_c+10} H_h(j), \qquad B = \sum_{j=E_c-10}^{E_c} H_h(j)

and \alpha is set to 1.5. The size of the eye field thus found becomes (x_R - x_L) × (15 or 20), and the left and right eye fields, Left_Eye and Right_Eye, are obtained by vertically splitting the Eye region in half (Fig. 4(c)). Finally, the mouth region is obtained by searching for strong horizontal edge components. Similar to the eyes, the row M_c with the strongest sum of the horizontal edge components H_h(j) is searched for within the coarse mouth region. Then the mouth field is established as follows:

M_c = \arg\max_j H_h(j), \quad (y_L - y_H) \le j \le y_L, \qquad (10)

Mouth = R(x_L, M_c - 15, x_R, M_c + 15), \qquad (11)

with the size of the mouth field being (x_R - x_L) × 30 (Fig. 4(c)).
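Continuing the sketch, the coarse Eye and Mouth fields of Eqs. (7)–(11) can be derived from the same projections. The ±10-row windows for A and B follow the reconstruction of Eq. (9) above and, like the search ranges, should be read as assumptions rather than the paper's exact constants.

```python
import numpy as np

def eye_mouth_fields(i_ve, i_he, x_l, y_h, x_r, y_l, alpha=1.5):
    """Coarse Left_Eye, Right_Eye and Mouth fields (Eqs. (7)-(11))."""
    h_h = i_he.sum(axis=1)                    # H_h(j) from Eq. (2)

    # Eq. (7): horizontal integration of I_VE over the upper part of the
    # face region; Eq. (8): the strongest row E_c.
    top = y_h + 2 * (y_l - y_h) // 3
    h_hv = i_ve[y_h:top, :].sum(axis=1)
    e_c = y_h + int(np.argmax(h_hv))

    # Eq. (9): shift the eye band toward the stronger side of E_c.
    a = h_h[e_c:e_c + 10].sum()
    b = h_h[max(e_c - 10, 0):e_c].sum()
    if a >= alpha * b:
        eye = (x_l, e_c - 5, x_r, e_c + 10)
    elif b >= alpha * a:
        eye = (x_l, e_c - 10, x_r, e_c + 5)
    else:
        eye = (x_l, e_c - 10, x_r, e_c + 10)
    x_mid = (x_l + x_r) // 2                  # split the Eye field in half
    left_eye = (x_l, eye[1], x_mid, eye[3])
    right_eye = (x_mid, eye[1], x_r, eye[3])

    # Eqs. (10)-(11): strongest horizontal-edge row in the coarse mouth region.
    lo = y_l - y_h
    m_c = lo + int(np.argmax(h_h[lo:y_l + 1]))
    mouth = (x_l, m_c - 15, x_r, m_c + 15)
    return left_eye, right_eye, mouth
```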
3. Fine tuning of the eye and mouth fields using neural networks

In general, two approaches exist to search for objects: (1) matching against the stored database and (2) eliminating non-matching regions [9]. One way to combine these approaches is the use of positive and negative samples for training. The positive samples contain the complete objects whereas the negative samples do not. In this research, the positive samples are the eye and mouth regions extracted from the frontal faces while the
negative samples are obtained from the surrounding areas of the positive samples. Fig. 5 shows these examples. The eye window is 20×10 and the mouth window is 40×20. The facial feature extractor consists of 3 MLPs: one each for the left eye, the right eye and the mouth. The MLPs are trained with a resilient backpropagation algorithm [10] to produce a linear output whose value represents the degree of similarity to the positive samples. That is, the output is +1 for positive and -1 for negative samples. Physically, the output will be +1 if the candidate window contains the full set of the relevant facial features.

3.1. Eigenfeature extraction

The input and output representation has a major influence on the neural network performance [11]. Hence, it is more efficient to present relevant features rather than the raw image in order to reduce dimensionality as well as to facilitate training of the network. Fig. 6 shows the binary edge map representing a 2-D elliptic data distribution in (u, v) space. The edge pixels in a search window may be considered as a 2-D pattern in (u, v). The distribution of these pixels for the eye and mouth regions is distinctive in nature compared to other regions of the face. Herein, the Canny edge extractor was used to precisely locate the eyes while the Sobel horizontal operator was used for the mouth. In the human face, the eyes and mouth have distinct shapes compared to other features. That is, the general shape of the eyes and mouth resides within a certain range of the space spanned by the eigenvectors and the eigenvalues describing their shapes. These important features lead to a facial feature detector that not only uses a relatively small number of training samples, but also lends itself to robustness against pose variations. The features used for the neural network training are then obtained as follows. The correlation matrix M is obtained first from the N_P edge point samples:

P = (p_1, \ldots, p_i, \ldots, p_{N_P}), \qquad (12)
Fig. 5. Examples of the training set for the MLPs: (a) the positive eye samples; (b) some negative eye samples; (c) the positive mouth samples and (d) some negative mouth samples.
Fig. 6. The binary edge map and the corresponding eigenvectors for each region.
Fig. 7. (a) The MLP output profile map and (b) the resulting regions of interest found.
where

p_i = \begin{pmatrix} u_i \\ v_i \end{pmatrix}, \qquad M = P \cdot P^T, \qquad (13)
where (u_i, v_i)^T is the coordinate value of the ith sample point. Let the eigenvalues and the eigenvectors of M be \lambda_1, \lambda_2, e_1, e_2, where \lambda_1 > \lambda_2. The eigenvectors represent the mutually orthogonal principal directions of the data distribution while the eigenvalues represent the strength of variation along the directions of the eigenvectors [12]. From these, the following nine eigenfeatures are derived:

x_1 = \lambda_1/\bar\lambda_1, \quad x_2 = \lambda_2/\bar\lambda_2, \quad x_3 = \lambda_2/\lambda_1, \quad x_4 = e_{1u}, \quad x_5 = e_{1v}, \quad x_6 = e_{2u}, \quad x_7 = e_{2v},

x_8 = \frac{1}{N_u} \sum_{i=1}^{N_P} \frac{u_i}{N_P}, \quad x_9 = \frac{1}{N_v} \sum_{i=1}^{N_P} \frac{v_i}{N_P}, \qquad (14)

where N_u and N_v are the horizontal and vertical sample sizes for the eye or the mouth respectively, and \bar\lambda_1 and \bar\lambda_2 are the maxima of \lambda_1 and \lambda_2 over the entire training set. All these features are significant in some sense to represent the shape of the facial features. The first and second eigenfeatures are the normalized eigenvalues while the third one is their ratio. These three represent the overall shape of the facial features. The fourth to seventh components represent the orientations of the facial features. Finally, the eighth and ninth components are the two coordinates of the centroid of the candidate region containing the facial features. Section 4 will examine the influence of using different sets of eigenfeatures on the extraction performance.
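A minimal NumPy sketch of Eqs. (12)–(14) follows. The arguments `lam1_max` and `lam2_max` stand for the training-set maxima \bar\lambda_1 and \bar\lambda_2; these names, and the choice of \lambda_2/\lambda_1 for the ratio feature, are mine.

```python
import numpy as np

def eigenfeatures(window_edges, lam1_max, lam2_max):
    """Nine eigenfeatures (Eq. (14)) of a binary edge window (N_v x N_u)."""
    n_v, n_u = window_edges.shape
    v, u = np.nonzero(window_edges)        # edge-pixel coordinates (u_i, v_i)
    p = np.vstack([u, v]).astype(float)    # Eq. (12): P, a 2 x N_P matrix
    m = p @ p.T                            # Eq. (13): correlation matrix M
    lam, vecs = np.linalg.eigh(m)          # eigenvalues in ascending order
    lam1, lam2 = lam[1], lam[0]            # lambda_1 > lambda_2
    e1, e2 = vecs[:, 1], vecs[:, 0]
    return np.array([
        lam1 / lam1_max,                   # x_1: normalized lambda_1
        lam2 / lam2_max,                   # x_2: normalized lambda_2
        lam2 / lam1,                       # x_3: eigenvalue ratio
        e1[0], e1[1], e2[0], e2[1],        # x_4-x_7: eigenvector components
        u.mean() / n_u,                    # x_8: normalized centroid (u)
        v.mean() / n_v,                    # x_9: normalized centroid (v)
    ])
```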
3.2. Fine localization of the facial features from the neural networks based on eigenfeatures

A small window of either 20×10 for the eyes or 40×20 for the mouth is shifted across and down within the
Fig. 8. The centering process for the region detected by the MLP: (a) before the centering and (b) after the centering.
coarse Left_Eye, Right_Eye, and Mouth regions found in Section 2. While being shifted, the eigenfeatures of these small windows are formed and input to the MLP. As stated earlier, the output of the MLP denotes the similarity between the eigenfeatures of the current window and those of the typical facial features. Then, for each facial feature, the index of the sliding window resulting in maximum similarity is determined and identified as the finer eye or mouth field. Fig. 7(a) shows the output distribution of the MLP_Mouth versus the amount of window shift, while Fig. 7(b) shows the results of extraction of the three facial features. Zero outputs occur because test windows with a number of edge pixels below a threshold (20 for the eye and 40 for the mouth) have been skipped, since in that case the possibility of finding the facial features of interest is quite low. Finally, these finer regions are re-centered as shown in Fig. 8, thereby reducing the load of learning many different shifts of the same feature images.
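The sliding-window search can be sketched as follows, reusing the `eigenfeatures` function above. Here `mlp` is assumed to be any callable mapping the 9 eigenfeatures to a similarity score in [-1, +1]; the function and argument names are mine.

```python
import numpy as np

def locate_feature(edge_map, region, mlp, win_w, win_h, min_edges,
                   lam1_max, lam2_max):
    """Slide a win_w x win_h window over the coarse region and keep the
    position with the maximum MLP response (a sketch, not the paper's code).
    """
    x0, y0, x1, y1 = region
    best_score, best_box = -np.inf, None
    for y in range(y0, y1 - win_h + 1):
        for x in range(x0, x1 - win_w + 1):
            win = edge_map[y:y + win_h, x:x + win_w]
            # Windows with too few edge pixels are skipped (their MLP
            # output is treated as zero), as described in the text.
            if win.sum() < min_edges:
                continue
            score = mlp(eigenfeatures(win, lam1_max, lam2_max))
            if score > best_score:
                best_score, best_box = score, (x, y, x + win_w, y + win_h)
    return best_box
```

Per the text, the eyes would use 20×10 windows with an edge threshold of 20, and the mouth 40×20 windows with a threshold of 40.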
4. Experimental results and discussion

Fig. 9. Experimental results of extracting the regions of interest. The adjacent pairs of images represent the results of coarse finding before matching with neural nets (on the left) and those of finer finding after matching (on the right).

The structure of the neural nets used is (9, 50, 30, 10, 1): five-layer neural nets having an input layer of 9 nodes, an output layer of 1 node showing the degree of similarity between the learned features and the input features, and three intermediate layers with 50, 30 and 10 nodes respectively, with a learning rate of 0.01. As for the facial database, 61 images taken from 14 persons in total were used. The database contains the training set as well as the test set containing faces with different poses (looking left,
right or down) and/or from persons other than those in the training set. The subjects wore no glasses and had no beards (Fig. 2(b)). The positive samples were taken from the eyes and mouths in Fig. 2(a). The negative samples (70 for the eyes and 29 for the mouth), partially shown in Fig. 5(b) and (d), were taken from around the eyes and the mouth, plus the ones that resulted in misrecognition by the neural network.

To save time, only those test windows with more than a certain number of edge pixels were processed, and when the features were not found in the coarse feature window, this threshold was lowered and the search retried. Sometimes the eyes could not be found when the coarse eye region was erroneously set, when the region was too dark, or when the eyes were too small. Fig. 9, due to lack of space, shows only some of the results of finding the three facial features of test faces using the full set of 9 eigenfeatures.

Table 1
The detection performance (%) according to the number of eigenfeatures used

MLP             Case 1^a       Case 2^b       Case 3^c       Case 4^d
MLP_Left_Eye    77.1 (39.3)    83.6 (24.6)    95.1 (24.6)    96.8 (16.4)
MLP_Right_Eye   72.1 (37.7)    70.5 (36.0)    72.2 (34.4)    96.8 (19.6)
MLP_Mouth       88.6 (55.7)    88.6 (40.9)    96.8 (22.9)    100 (11.4)

^a Six eigenfeatures (x_1, x_2, x_4–x_7) used without re-centering.
^b Seven eigenfeatures (x_1–x_7) used without re-centering.
^c Nine eigenfeatures (x_1–x_9) used without re-centering.
^d Nine eigenfeatures (x_1–x_9) used with re-centering.

Table 1 compares the performance of the four cases of different combinations of input features selected for the MLP. Successful localization of the facial features is defined as the case when at least 4/5 of these features fall within the region, namely when the features are reasonably centered. In Table 1, the numbers within parentheses show the ratios of the non-centered cases over all cases. The differences of 14.7% for the left eye, 1.7% for the right eye and 14.8% for the mouth between Cases 1 and 2, for example, signify that the eigenvalue ratio is important for accurate localization of the eyes and the mouth. In addition, the use of the normalized centroid of the data in the image space (Case 3 versus Case 2) locates the mouth 18% better. Finally, the re-centering process (Case 4 versus Case 3) also improves the localization performance by 8.2, 14.8 and 11.5% for the left eye, right eye and mouth, respectively. The experimental data suggest that the use of eigenfeatures and the neural network leads to accurate finding of the three facial features, and further, that each eigenfeature plays a certain role in the process.
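For concreteness, the following is a minimal PyTorch sketch of the (9, 50, 30, 10, 1) detector described above. The sigmoid hidden activations and the mean-squared-error loss are assumptions; the paper specifies only the layer sizes, the linear output neuron, RPROP training [10] and the 0.01 learning rate.

```python
import torch
import torch.nn as nn

# (9, 50, 30, 10, 1): 9 eigenfeatures in, one linear output neuron whose
# value represents the similarity to the positive samples (+1 / -1).
detector = nn.Sequential(
    nn.Linear(9, 50), nn.Sigmoid(),
    nn.Linear(50, 30), nn.Sigmoid(),
    nn.Linear(30, 10), nn.Sigmoid(),
    nn.Linear(10, 1),            # linear output, as stated in the text
)
# Resilient backpropagation (RPROP [10]) with the stated 0.01 learning rate.
optimizer = torch.optim.Rprop(detector.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

def train_step(features, targets):
    """One full-batch RPROP step; features: (N, 9), targets: +/-1, shape (N,)."""
    optimizer.zero_grad()
    loss = loss_fn(detector(features).squeeze(1), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```

RPROP adapts a per-weight step size from gradient signs, which is one reason a small training set of eigenfeature vectors suffices here; full-batch updates (rather than mini-batches) match the algorithm's standard formulation.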
5. Summary and conclusions

A facial feature extraction algorithm for black and white facial images has been presented, based on
eigenfeatures and neural networks. The eigenfeatures were obtained from an expanded set of the eigenvectors and eigenvalues calculated from the binary edge data set for each of the facial features. The three MLPs for the facial features were trained with positive and negative samples so that each MLP can decide whether the current test window contains the facial feature of interest. The particular window resulting in the maximum MLP output is designated as the final facial feature location after image centering. As a preprocessing step, the vertical and horizontal projections of the binary edge map from the face image have been used to narrow down the search. The salient features of the proposed methodology are as follows:

• No a priori image normalization is necessary with respect to intensity distribution, size, and position.
• The coarse eye and mouth regions are determined within the facial image. While existing research finds the approximate eye position using the horizontal edge map, the proposed research utilizes the horizontal projection of the vertical edge map to capture the eye region within a smaller window.
• The eigenfeatures extracted from the 2-D data distribution of the eye and mouth edge features have been introduced as important and significant features for locating the eye and mouth.
• The need for a large training set in template matching has been avoided by using eigenfeatures and sliding windows within the coarse feature fields.
• The MLPs have a linear output neuron whose output represents the similarity to any of the positive samples.
• The feature localization experiments have been performed on 61 facial images of 14 people with varying facial size and pose, and the recognition rate is 96.8% for the eyes and 100% for the mouth.
• Good generalization to the same or different persons with largely varying size and pose was achieved despite using a small training data set taken from only five or six persons.

Facial feature extraction has a wide spectrum of applications, from user verification and recognition to emotion recognition in the framework of man–machine interfaces. Future work includes the development of a robust pose- and lighting-invariant face recognition system as an extension of the current research.
Acknowledgements

The authors wish to acknowledge the financial support of the Korea Research Foundation made in the program year of 1998, and also in part by the Ministry of Education of Korea toward the Electrical and Computer Engineering Division at POSTECH through its BK21 program.
References

[1] Yankang Wang, H. Kuroda, M. Fujumura, A. Nakamura, Automatic extraction of eye and mouth fields from monochrome face image using fuzzy technique, Proceedings of the 1995 Fourth IEEE International Conference on Universal Personal Communications Record, Tokyo, Japan, 1995, pp. 778–782.
[2] R. Pinto-Elias, J.H. Sossa-Azuela, Automatic facial feature detection and location, Proceedings of the 14th International Conference on Pattern Recognition, Vol. 2, 1998, pp. 1360–1364.
[3] Weimin Huang, Q. Sun, C.P. Lam, J.K. Wu, A robust approach to face and eyes detection from images with cluttered background, Proceedings of the 14th International Conference on Pattern Recognition, Vol. 1, 1998, pp. 110–113.
[4] Kin Choong Yow, R. Cipolla, Feature-based human face detection, Image Vision Comput. 15 (9) (1997) 713–735.
[5] Takeo Kanade, Picture processing by computer complex and recognition of human faces, Technical Report, Department of Information Science, Kyoto University, 1973.
[6] D.J. Beymer, Face recognition under varying pose, A.I. Memo No. 1461, 1993.
[7] R. Brunelli, T. Poggio, Face recognition: features versus templates, IEEE Trans. PAMI 15 (10) (1993) 1042–1052.
[8] P. Juell, R. Marsh, A hierarchical neural network for human face detection, Pattern Recognition 29 (5) (1996) 781–787.
[9] D.F. McCoy, V. Devarajan, Artificial immune systems and aerial image segmentation, IEEE Int. Conf. Systems Man Cybernet. 1 (1997) 867–872.
[10] M. Riedmiller, H. Braun, A direct adaptive method for faster backpropagation learning: the RPROP algorithm, IEEE Int. Conf. Neural Networks 1 (1993) 586–591.
[11] E. Fiesler, R. Beale, Handbook of Neural Computation, Oxford University Press, Oxford, 1997.
[12] S. Romdhani, Face recognition using principal components analysis, MS Thesis, http://www.elec.gla.ac.uk/~romdhani/pca.htm, 1996.
About the Author: YEON-SIK RYU received the BS degree in Electrical Engineering from HanYang University, Seoul, Korea in 1988, and the MS degree from Pohang University of Science and Technology (POSTECH), Pohang, Korea in 1990. He is currently a research engineer at LG Electronics Inc. and is working toward a Ph.D. in Electrical Engineering. His research interests are in the areas of neural networks, immune systems, and evolutionary computation and their application to automation systems.

About the Author: SE-YOUNG OH received the BS degree in Electronics Engineering from Seoul National University, Seoul, Korea in 1974, and the MS and Ph.D. degrees in Electrical Engineering from Case Western Reserve University, Cleveland, OH, USA, in 1978 and 1981 respectively. From 1981 to 1984, he was an Assistant Professor in the Department of Electrical Engineering and Computer Science, University of Illinois at Chicago, Chicago, IL. From 1984 to 1988, he was an Assistant Professor in the Department of Electrical Engineering at the University of Florida, Gainesville, FL. In 1988, he joined the Department of Electrical Engineering, Pohang University of Science and Technology, Pohang, Korea, where he is currently a Professor. His research interests include soft computing technology including neural networks, fuzzy logic, and evolutionary computation and their applications to robotics, control, and intelligent vehicles.