Using Radial Basis Probabilistic Neural Network for Speech Recognition
Nima Yousefian, Morteza Analoui
Computer Department, Iran University of Science & Technology, Tehran, Iran
[email protected],
[email protected]
Abstract— Automatic speech recognition (ASR) has been a subject of active research over the last few decades. In this paper we study the applicability of a special model of radial basis probabilistic neural network (RBPNN) as a classifier for speech recognition. This type of network is a combination of the Radial Basis Function (RBF) network and the Probabilistic Neural Network (PNN): it applies characteristics of both networks and finally uses a competitive function to compute the final result. The proposed network has been tested on a Persian single-digit numbers dataset and produced a significantly lower recognition error rate than other common pattern classifiers. All of the classifiers use Mel-scale Frequency Cepstrum Coefficients (MFCC) and a special type of Perceptual Linear Predictive (PLP) coefficients as their features for classification. Results show that for our proposed network the MFCC features yield better performance than PLP.
I. INTRODUCTION
A major problem in speech recognition systems is the choice of a suitable feature set that can accurately describe, in an abstract way, the original highly redundant speech signal. In non-parametric spectral analysis, mel-frequency cepstral coefficients (MFCC) are one of the most popular spectral features in ASR. In parametric spectral analysis, the LPC mel-cepstrum based on an all-pole model is widely used because of its computational simplicity and high efficiency [1]. Another popular speech feature representation is known as RASTA-PLP, an acronym for Relative Spectral Transform - Perceptual Linear Prediction [2]. PLP was originally proposed by Hynek Hermansky as a way of warping spectra to minimize the differences between speakers while preserving the important speech information. RASTA is a separate technique that applies a band-pass filter to the energy in each frequency sub-band in order to smooth over short-term noise variations and to remove any constant offset resulting from static spectral coloration in the speech channel, for example from a telephone line [3]. RASTA-PLP outperforms PLP for recognition of channel-distorted speech. Many pattern classifiers have been proposed for speech recognition. During the last several years, Gaussian Mixture Models (GMMs) became very popular in speech recognition systems and have proven to perform very
well for clean wideband speech [4]. However, in noisy environments or for noisy band-limited telephone speech the performance of GMMs degrades considerably. Another well-known contemporary classification technique, Vector Quantization (VQ), was tested on our dataset and showed a notably high recognition rate compared with the other classifiers. We therefore decided to design a new probabilistic neural network (PNN) that, like VQ networks, uses a competitive function as its transfer function. Results show that this network outperforms all other pattern classifiers in recognition of the Persian single-digit numbers dataset. PNNs are known to have good generalization properties and are trained faster than back-propagation ANNs. The faster training is achieved at the cost of increased complexity and higher computational and memory requirements [5].

II. SYSTEM CONCEPT

A. Dataset
A sequence of 10 isolated digits (0, 1, 2, ..., 9) was recorded from 35 different speakers in the Computer Department of Iran University of Science and Technology. All voices were saved as wave files with a 16-bit sample size, a 16 kHz sample rate, and a mono channel, giving 350 wave files in total. We divided them into two separate parts: 20 speakers (200 wave files) for training and the 15 remaining speakers (150 wave files) for testing, so the ratio of training to test data is 4:3.

B. Feature Extraction
The goal of feature extraction is to represent the speech signal by a finite number of measures of the signal. This is because the entirety of the information in the acoustic signal is too much to process, and not all of the information is relevant for specific tasks. In present ASR systems, the approach to feature extraction has generally been to find a representation that is relatively stable across different examples of the same speech sound, despite differences in the speaker or various environmental characteristics, while keeping the part that represents the message in the speech signal relatively intact.
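As a concrete illustration of the segment-wise feature computation detailed in the next paragraph, the following minimal sketch extracts MFCC features for one wave file. It assumes the librosa library (not used in the original work); averaging the per-frame coefficients within each segment is our own assumption, since the text does not state how a single 12-coefficient vector per segment is obtained.

    # Hedged sketch of segment-wise MFCC extraction (not the authors' original code).
    import numpy as np
    import librosa

    def mfcc_features(path, n_segments=4, n_mfcc=12):
        y, sr = librosa.load(path, sr=16000)            # 16 kHz, mono
        feats = []
        for seg in np.array_split(y, n_segments):       # four equal-length parts
            c = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc,
                                     window="hamming", fmax=sr / 2)
            feats.append(c.mean(axis=1))                # assumption: average over frames
        return np.concatenate(feats)                    # 48-dimensional feature vector

For the RASTA-Δ2 PLP variant described below, the first- and second-order time derivatives of the 12 per-segment coefficients would be appended in the same way, giving the 144-dimensional vectors used later.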
In this paper two well-known feature extraction methods are used. For the MFCC features, each speech signal is divided into four equal-length segments and 12th-order coefficients are computed for each segment, using a 16 kHz sampling rate, a Hamming window, and 0.5 of the sampling rate as the high end of the highest filter. Thus 48 features per signal contribute to the training and test phases. A more sophisticated technique is the RASTA-PLP approach. RASTA basically consists of additional steps applied after the critical-band integration of PLP. Based on the observation that human perception is more sensitive to relative changes, the static or slowly changing part of the signal (in each critical band) is effectively filtered out. Because most distortions due to unknown channel characteristics are convolutive, RASTA is best implemented in a logarithmic domain of the spectrum, which makes the different parts additive. As with MFCC, for calculating the RASTA-PLP coefficients the signal is separated into four segments, and a 16 kHz sampling rate is again used to compute 12th-order RASTA-PLP coefficients. For some classifiers, adding the first- and second-order time derivatives of the coefficients gives better performance than pure PLP [6], but when calculating derivatives a shorter window than normal should be used. This yields 144 (12*4*3) features per signal for the training and test phases. We refer to this method as RASTA-Δ2 PLP.

C. Recognition Methodology
In a multi-class setting such as ours, each classifier tries to identify whether the set of input feature vectors derived from the current signal belongs to a specific class of numbers or not, and to which class exactly. For samples that cannot be assigned to a specific class, a random class is selected.

III. CLASSIFIERS
Several classifiers were tested on the dataset described above. The structures of the classifiers that were successful in recognition are described in the following subsections.

A. Gaussian Mixture Models
1) Structure: Gaussian mixture models (GMMs) are a semi-parametric technique for estimating probability density functions (pdf) [7]. The output of a Gaussian mixture model is the weighted sum of R component densities, as shown in Fig. 1. The training of a GMM can be formulated as a maximum-likelihood problem in which the mean vectors, covariance matrices, and prior probabilities are estimated by the Expectation-Maximization (EM) algorithm. Given a set of N independent and identically distributed patterns Xi = {x(t); t = 1, 2, ..., N} associated with class ω_i, it is assumed that the class likelihood function p(x(t) | ω_i) for class ω_i is a mixture of
Figure 1: Architecture of a GMM. The output p(x(t) | ω_i) is the weighted sum of R component densities p(x(t) | ω_i, θ_{r|i}) of the input x(t), with weights P(θ_{r|i} | ω_i).
Gaussian distributions, denoted by:

p(x(t) | ω_i) = Σ_{r=1}^{R} P(θ_{r|i} | ω_i) p(x(t) | ω_i, θ_{r|i})    (1)

where θ_{r|i} represents the parameters of the rth mixture component and R is the total number of mixture components, and

p(x(t) | ω_i, θ_{r|i}) = N(μ_{r|i}, Σ_{r|i})    (2)
Here N(μ_{r|i}, Σ_{r|i}) is typically a Gaussian distribution with mean μ_{r|i} and covariance Σ_{r|i}; Equation (2) denotes the probability density function of the rth component, and P(θ_{r|i} | ω_i) is the prior probability of the rth component.
2) Effect of parameters on recognition: Here we set the number of mixtures to R = 3; increasing this parameter does not have a significant effect on performance. The method of initializing the means and variances is important: after examining various methods, k-harmonic-means initialization produced the best results in comparison with k-means and randomly selected data points. The number of iterations is set to 10, or training stops earlier when the increase in log-likelihood falls below 0.001. In the training phase the GMM usually recognizes all samples without mistakes, so its training recognition rate is nearly 1, which surpasses all other classifiers in this respect. The GMM is one of the classifiers in our work that achieves better results with RASTA-PLP than with MFCC.
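The per-class mixture of Eq. (1) can be sketched with scikit-learn as below. This is only an illustration under stated assumptions: scikit-learn does not provide k-harmonic-means initialization, so ordinary k-means initialization is used as a stand-in, and diagonal covariances are assumed.

    # Hedged sketch: one 3-component GMM per digit class, classification by maximum likelihood.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_gmms(X_train, y_train, n_classes=10, R=3):
        gmms = {}
        for c in range(n_classes):
            g = GaussianMixture(n_components=R, covariance_type="diag",
                                max_iter=10, tol=1e-3,      # stop after 10 iterations or when
                                init_params="kmeans")       # the log-likelihood gain < 0.001
            g.fit(X_train[y_train == c])
            gmms[c] = g
        return gmms

    def classify(gmms, x):
        # pick the class whose mixture assigns the highest log-likelihood to x
        return max(gmms, key=lambda c: gmms[c].score(x.reshape(1, -1)))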
Figure 2: Architecture of LVQ
B. Learning Vector Quantization
1) Structure: Learning Vector Quantization (LVQ) is a prototype-based supervised classification algorithm. LVQ can be understood as a special case of an artificial neural network; more precisely, it applies a winner-take-all, Hebbian-learning-based approach. It is a precursor of self-organizing maps (SOM) and is related to the k-nearest-neighbor algorithm (k-NN) [8]. An LVQ network has a first competitive layer and a second linear layer. The competitive layer learns to classify input vectors in much the same way as the competitive layers of self-organizing nets. The linear layer transforms the competitive layer's classes into target classifications defined by the user. The network is given by prototypes W = (w(1), ..., w(n)) and changes the weights of the network in order to classify the data correctly. For each data point, the prototype (neuron) closest to it is determined (the winner neuron). The weights of the connections to this neuron are then adapted, i.e. moved closer if it correctly classifies the data point or moved further away if it classifies it incorrectly (a minimal code sketch of this rule is given at the end of this section). For a test sample that LVQ cannot relate to any class, a random class is assigned. An advantage of LVQ is that it creates prototypes that are easy to interpret for experts in the field.
2) Effect of parameters on recognition: There are various types of LVQ; in this work LVQ1 is applied for classification. The number of neurons in the first layer is set to the number of training speakers. It is better to keep the learning rate near zero; increasing it has an unpredictable influence on recognition accuracy. For initialization of the weights, LVQ needs the prior class percentages; assuming the same probability for each of the 10 classes, we set this parameter to 0.1 for all classes. The total number of epochs for training is 150; increasing it brings no improvement in recognition performance.

C. Other Classifiers
Some other neural networks, such as RBF, ARTMAP, and MLP, were tested on our dataset. All of these networks achieved accuracy rates below 70% in the test phase. However, if in the RBF network we modify the transfer function of the second layer to a special linear layer and set the layer weights to the training data, a large increase in performance occurs. This is one of the observations that led us to the probabilistic neural network proposed in this study. K-means is a clustering method; here we define a special supervised variant of k-means that can produce accuracy results for the training and test phases. If a random sample from each of the 10 classes is selected and used as the initialization seed points for k-means, we can train on the data and obtain a recognition accuracy for them. Repeating the random selection of seed points more times may improve the recognition rate. After obtaining the best performance in training, the corresponding seed points are taken as centers for computing the distances of a test sample to each class. Thus for each test signal, the Euclidean distance between its coefficients and the coefficients of all ten centers is calculated, and the center with the minimum distance determines the class of the test sample.
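To make the LVQ1 update rule of subsection B concrete, the following minimal sketch trains a set of prototypes. The prototype initialization and the concrete learning-rate value are assumptions, not taken from the paper.

    # Hedged sketch of the LVQ1 update rule (winner moves toward correct samples,
    # away from incorrect ones); lr and the prototype initialization are assumptions.
    import numpy as np

    def lvq1_train(X, y, prototypes, proto_labels, lr=0.01, epochs=150):
        W = prototypes.copy()
        for _ in range(epochs):
            for x, t in zip(X, y):
                j = np.argmin(np.linalg.norm(W - x, axis=1))   # winner prototype
                if proto_labels[j] == t:
                    W[j] += lr * (x - W[j])    # pull winner toward the sample
                else:
                    W[j] -= lr * (x - W[j])    # push winner away from the sample
        return W

At test time the predicted class is simply the label of the nearest prototype, which is also the nearest-center decision rule used by the Ek-means classifier described above.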
IV. COMPETITIVE RADIAL BASIS PNN
As described in the previous section, LVQ and the GMM have the best recognition rates among all the classifiers tested. This led us to design a probabilistic neural network that uses a Gaussian distribution as the probability kernel function of the network and a competitive function as its transfer function, like the first layer of LVQ. In the following subsections the network procedure and architecture are described in detail.

Figure 3: Architecture of the proposed radial basis PNN: a radial basis layer (input weights IW^{1,1} of size Q*R, bias b^1, ||dist|| and radbas functions, output a^1) followed by a competitive layer (weights LW^{2,1} of size L*Q, compet function, output a^2 = y). R = number of elements in the input data; Q = number of input/target pairs = number of neurons in layer 1; L = number of classes of the input data = number of neurons in layer 2.
A. Procedure
The radial basis probabilistic neural network (RBPNN) model is in essence developed from the radial basis function neural network (RBFNN) and the probabilistic neural network. Therefore the RBPNN shares the common characteristic of the two original networks, i.e., the signal is fed forward from the input layer to the output layer without any feedback connections. At the same time the RBPNN, to some extent, reduces the demerits of the two original models. When an input is presented, the first layer computes the distances from the input vector to the training input vectors and produces a vector whose elements indicate how close the input is to each training input. The second layer sums these contributions for each class of inputs to produce, as its net output, a vector of probabilities. Finally, a competitive transfer function on the output of the second layer picks the maximum of these probabilities and produces a 1 for that class and a 0 for the other classes.

B. Network Architecture
The network architecture is illustrated in Figure 3. It is assumed that there are Q input vector/target vector pairs, and that each target vector has L elements, one of which is 1 while the others are 0. Thus each input vector is associated with one of L classes. The first-layer input weights IW^{1,1} are set to the transpose of the matrix formed from the Q training pairs, P'; this is exactly the modification mentioned in the previous section for improving the performance of RBF. When an input is presented, the ||dist|| function
produces a vector whose elements indicate how close the input is to the vectors of the training set. These elements are multiplied, element by element, by the bias and sent to the radbas transfer function, which is defined by the following equation:

radbas(n) = e^(−n²)    (3)
An input vector close to a training vector is represented by a number close to 1 in the output vector a^1. If an input is close to several training vectors of a single class, it is represented by several elements of a^1 that are close to 1.

a^1_i = radbas(‖IW^{1,1}_i − P‖ b^1_i)    (4)
The second-layer weights LW^{2,1} are set to the matrix T of target vectors. Each target vector has a 1 only in the row associated with that particular class of input, and 0's elsewhere. The multiplication T a^1 sums the elements of a^1 due to each of the L input classes. Finally, the second-layer transfer function, compet, produces a 1 corresponding to the largest element of n^2 and 0's elsewhere:

a^2 = compet(LW^{2,1} a^1)    (5)
where compet is a transfer function: compet(N) takes one input argument, a matrix of net input vectors, and returns output vectors with a 1 where each net input vector has its maximum value and 0's elsewhere. Thus the network classifies the input vector into a specific one of the L classes because that class had the maximum probability of being correct, so every input is always assigned to exactly one class.
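Equations (3)-(5) translate almost directly into code. The sketch below is a minimal NumPy illustration under stated assumptions: the spread (bias) b^1 is a free parameter whose value is not given in the text, and a single shared bias is used instead of one bias per neuron.

    # Hedged NumPy sketch of the competitive radial basis PNN (Eqs. 3-5).
    import numpy as np

    class CompetitiveRBPNN:
        def __init__(self, P, T, b1=1.0):
            self.IW = P          # layer-1 weights: the Q training vectors (Q x R)
            self.LW = T          # layer-2 weights: one-hot target matrix (L x Q)
            self.b1 = b1         # radial basis spread (assumed value)

        def predict(self, p):
            dist = np.linalg.norm(self.IW - p, axis=1)   # ||IW_i - p|| for each neuron
            a1 = np.exp(-(dist * self.b1) ** 2)          # Eqs. (3)-(4)
            n2 = self.LW @ a1                            # sum evidence per class
            a2 = np.zeros_like(n2)
            a2[np.argmax(n2)] = 1.0                      # Eq. (5): compet, winner takes all
            return a2

Training amounts to storing the training vectors in IW^{1,1} and the one-hot targets in LW^{2,1}; only the spread b^1 needs to be chosen, e.g. by validation.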
V. EXPERIMENTS AND RESULTS

A. Classifier Comparison
As mentioned in the previous sections, the 350 wave files containing the voices of 35 speakers (all male) were divided into training and test parts with a ratio of 4:3. Here the 4 classifiers with the best performance are selected. We call the modified version of k-means introduced in Section III edited k-means, or Ek-means; its results are good and in most cases better than those of the GMM. In training, nearly all of the described classifiers recognized the training patterns with accuracy above 95%; the best performance belongs to the GMM, with accuracy close to 1 in this phase. In the test phase each classifier was evaluated with each of the feature sets described before. The performance of the different classifiers in the test phase, for all three introduced feature types, is shown in Table 1. The RBPNN with MFCC clearly obtains the best result among all the methods. It is also noticeable that for all classifiers, adding the time derivatives of PLP does not improve the recognition rates. Ek-means, like GMM and LVQ, produces better outputs with RASTA-PLP than with MFCC; the proposed method is the only one that achieves better accuracy with MFCC. As described in the feature extraction subsection, the original signal is divided into four parts because the average number of phonemes in the Persian words for the digits 1 to 10 is about four; however, some classifiers produce better output when this factor is set to three. The overall performance reported for each classifier in Table 1 is therefore its best configuration.

Table 1: Performance comparison (%)
Classifier | MFCC | RASTA-PLP | RASTA-Δ2 PLP
RBPNN      | 99.1 | 95.3      | 90.1
LVQ        | 96.7 | 97.3      | 86.3
GMM        | 90.8 | 91.7      | 87.2
Ek-means   | 94.3 | 96.8      | 78.3
B. Capability of the Proposed Method
The accuracy of this version of the RBPNN is remarkable: if the three other classifiers compared in Table 1 are combined into a single classifier with a majority-vote approach, the combination reaches an accuracy of about 98.2%, which is still about 1% lower than the proposed competitive RBPNN.
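For reference, the majority-vote combination referred to above can be sketched in a few lines; this helper is purely illustrative (ties are broken arbitrarily) and is not part of the original system.

    # Hedged sketch of the majority-vote combination of three classifiers.
    from collections import Counter

    def majority_vote(predictions):
        # predictions: list of class labels, one per individual classifier
        return Counter(predictions).most_common(1)[0][0]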
Some methods have been proposed for improving the performance of RBPNNs. Recursive orthogonal least squares algorithms combined with micro-genetic algorithms (μ-GA) were tested in [9] for optimization, with improved results on the spirals and IRIS classification problems. The results show that, after optimization, the quality of the network improves and it clearly outperforms traditional PNN and RBF networks. Such approaches can be considered in future work.

ACKNOWLEDGMENT
The authors would like to thank Dr. A. Akbari for his permission to use the wave files as a dataset. The authors would also like to thank the reviewers for their comments, which greatly improved the manuscript.

REFERENCES
[1] H. Matsumoto, M. Moroto, "Evaluation of Mel-LPC cepstrum in a large vocabulary continuous speech recognition", Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. 117-120, 2001.
[2] C. Gaudard, G. Aradilla, "Speech Recognition based on Template Matching and Phone Posterior Probabilities", IDIAP-Com 07-02, 2007.
[3] H. Hermansky, "Perceptual Linear Predictive (PLP) Analysis of Speech", The Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738-1752, April 1990.
[4] T. Ganchev, A. Tsopanoglou, N. Fakotakis, G. Kokkinakis, "Probabilistic neural networks combined with GMMs for speaker recognition over telephone channels", 14th International Conference on Digital Signal Processing, vol. II, pp. 1081-1084, July 2002.
[5] T. Ganchev, D. K. Tasoulis, M. N. Vrahatis, "Locally Recurrent Probabilistic Neural Network for Text-Independent Speaker Verification", Proc. EuroSpeech 2003, vol. 3, pp. 762-766, 2003.
[6] Q. Li, F. K. Soong, O. Siohan, "An Auditory System-based Feature for Robust Speech Recognition", 7th European Conference on Speech Communication and Technology, Aalborg, Denmark, pp. 619-621, September 2001.
[7] K. K. Yiu, M. W. Mak, C. K. Li, "Gaussian mixture models and probabilistic decision-based neural networks for pattern classification: a comparative study", Neural Computing and Applications, vol. 8, no. 3, pp. 235-245, 1999.
[8] T. Kosaka, S. Omatu, "Classification of the Italian Lira using the LVQ method", Proc. IEEE International Conference on Systems, Man, and Cybernetics, vol. 4, pp. 2769-2774, 2000.
[9] W. Zhao, D.-S. Huang, L. Guo, "Optimizing Radial Basis Probabilistic Neural Networks Using Recursive Orthogonal Least Squares Algorithms Combined with Micro-Genetic Algorithms", Proc. International Joint Conference on Neural Networks, vol. 3, pp. 2277-2282, July 2003.
[10] L. Rutkowski, "Adaptive Probabilistic Neural Networks for Pattern Classification in Time-Varying Environment", IEEE Transactions on Neural Networks, vol. 15, no. 4, pp. 811-827, 2004.
[11] J. B. Gomm, D. L. Yu, "Selecting Radial Basis Function Network Centers with Recursive Orthogonal Least Squares Training", IEEE Transactions on Neural Networks, vol. 11, no. 2, March 2000.
[12] F. Gorunescu, "Architecture of probabilistic neural networks: estimating the adjustable smoothing factor", Research Notes in Artificial Intelligence and Digital Communications, vol. 104, pp. 56-62, 2004.