Feature Decorrelation on Speech Recognition
Hung-Shin Lee
Institute of Information Science, Academia Sinica
Dept. of Electrical Engineering, National Taiwan University
2009-10-09 @ IIS, Academia Sinica
References - 1
1) J. Psutka and L. Muller, "Comparison of various feature decorrelation techniques in automatic speech recognition," in Proc. CITSA 2006.
2) Dr. Berlin Chen's lecture slides: http://berlin.csie.ntnu.edu.tw
3) Batlle et al., "Feature decorrelation methods in speech recognition - a comparative study," in Proc. ICSLP 1998.
4) K. Demuynck et al., "Improved feature decorrelation for HMM-based speech recognition," in Proc. ICSLP 1998.
5) W. Krzanowski, Principles of Multivariate Analysis - A User's Perspective, Oxford University Press, 1988.
6) R. Gopinath, "Maximum likelihood modeling with Gaussian distributions for classification," in Proc. ICASSP 1998.
7) M. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Trans. on Speech and Audio Processing, vol. 7, no. 3, pp. 272-281, 1999.
8) P. Olsen and R. Gopinath, "Modeling inverse covariance matrices by basis expansion," in Proc. ICASSP 2002.
References - 2
9) P. Olsen and R. Gopinath, "Modeling inverse covariance matrices by basis expansion," IEEE Trans. on Speech and Audio Processing, vol. 12, no. 1, pp. 37-46, 2004.
10) N. Kumar and R. Gopinath, "Multiple linear transforms," in Proc. ICASSP 2001.
11) N. Kumar and A. Andreou, "Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition," Speech Communication, vol. 26, no. 4, pp. 283-297, 1998.
12) A. Ljolje, "The importance of cepstral parameter correlations in speech recognition," Computer Speech and Language, vol. 8, pp. 223-232, 1994.
13) B. Flury, "Common principal components in k groups," Journal of the American Statistical Association, vol. 79, no. 388, 1984.
14) B. Flury and W. Gautschi, "An algorithm for simultaneous orthogonal transformation of several positive definite symmetric matrices to nearly diagonal form," SIAM Journal on Scientific and Statistical Computing, vol. 7, no. 1, 1986.
Outline
• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrices, MLT)
• Common Principal Components (CPC)
Outline
• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrices, MLT)
• Common Principal Components (CPC)
Introduction to Feature Decorrelation - 1
• Definition of the covariance matrix
  – Random vector: X = [x_1, ..., x_n]^T, where each x_i is a random variable
  – Covariance matrix and mean vector (E[.] denotes the expected value):
      Σ ≡ E[(X − μ)(X − μ)^T],   μ ≡ E[X]
Introduction to Feature Decorrelation - 2
• Feature Decorrelation (FD)
  – To find a transformation Θ that makes all transformed variables (nearly) uncorrelated:
      cov(X̃_i, X̃_j) = 0  for all i ≠ j
  – Equivalently, to make the covariance matrix diagonal (not necessarily the identity):
      Θ^T Σ Θ = D,  where D is a diagonal matrix
  – The covariance matrix Σ can be global or can depend on each class.
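As a quick numerical illustration of the two definitions above (not part of the original slides), the following Python sketch estimates a sample covariance matrix from synthetic data and uses its eigenvector matrix as a decorrelating transform Θ, checking that Θ^T Σ Θ comes out diagonal.

```python
# Minimal sketch: estimate a covariance matrix and a decorrelating transform.
# Synthetic data; the eigenvector matrix of Sigma serves as Theta.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, 4))  # correlated samples

mu = X.mean(axis=0)                         # mean vector, mu = E[X]
Sigma = np.cov(X, rowvar=False)             # covariance, Sigma = E[(X - mu)(X - mu)^T]

eigvals, Theta = np.linalg.eigh(Sigma)      # Sigma is symmetric, so eigh applies
D = Theta.T @ Sigma @ Theta                 # should be (numerically) diagonal
print(np.allclose(D, np.diag(np.diag(D))))  # True: off-diagonal terms vanish
```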
Introduction to Feature Decorrelation - 3
• Why Feature Decorrelation (FD)?
  – In many speech recognition systems, the observation density function of each HMM state is modeled as a mixture of diagonal-covariance Gaussians
  – For computational simplicity, the off-diagonal elements of the Gaussian covariance matrices are assumed to be close to zero:
      f(x) = Σ_i π_i N(x; μ_i, Σ_i)
           = Σ_i π_i (2π)^{-d/2} |Σ_i|^{-1/2} exp( -1/2 (x − μ_i)^T Σ_i^{-1} (x − μ_i) ),
    where each Σ_i is a diagonal matrix
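A small illustration of why diagonal covariances are computationally convenient (synthetic parameters, not from the slides): with diagonal Σ_i the quadratic form in the mixture density reduces to a per-dimension sum, so no matrix inversion is needed.

```python
# Minimal sketch: log-density of a diagonal-covariance Gaussian mixture.
# With diagonal covariances the quadratic form is just a per-dimension sum.
import numpy as np
from scipy.special import logsumexp

def diag_gmm_logpdf(x, weights, means, variances):
    """x: (d,), weights: (m,), means/variances: (m, d) with diagonal variances."""
    d = x.shape[0]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    quad = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)  # no matrix inverse
    return logsumexp(np.log(weights) + log_norm + quad)

x = np.array([0.3, -1.2, 0.5])
w = np.array([0.6, 0.4])
mu = np.array([[0.0, 0.0, 0.0], [1.0, -1.0, 0.5]])
var = np.array([[1.0, 2.0, 0.5], [0.3, 1.5, 1.0]])
print(diag_gmm_logpdf(x, w, mu, var))
```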
FD on Speech Recognition - 1
• Approaches to FD can be divided into two categories: feature-space and model-space
• Feature-space schemes
  – It is hard to find a single transform that decorrelates all elements of the feature vector for all states
• Model-space schemes
  – A different transform is selected depending on which component the observation was hypothesized to be generated from
  – In the limit, a transform may be used for each component, which is equivalent to a full-covariance system
FD on Speech Recognition - 2
• Feature-space schemes on LVCSR
  (Block diagram: the speech signal passes through front-end preprocessing and feature-space decorrelation (DCT, PCA, LDA, MLLT, ...); the decorrelated features feed both AM training on the training data and speech decoding on the test data, which uses the AM, LM, and lexicon to produce the textual results.)
FD on Speech Recognition - 3
• Model-space schemes on LVCSR
  (Block diagram: the same LVCSR pipeline, but the decorrelation (MLLT, EMLLT, MLT, Semi-Tied, ...) is applied on the model side, i.e., within AM training and the resulting acoustic model used for decoding.)
FD on Speech Recognition - 4
• A taxonomy of the decorrelation schemes:
  – Feature-space schemes
    • Without label information: Discrete Cosine Transform (DCT), Principal Component Analysis (PCA)
    • With label information: Linear Discriminant Analysis (LDA), Common Principal Components (CPC)
  – Model-space schemes (with label information): Maximum Likelihood Linear Transform (MLLT), Extended Maximum Likelihood Linear Transform (EMLLT), Semi-Tied Covariance Matrices, Multiple Linear Transforms (MLT)
  – MLLT straddles the two categories, since it has both feature-based and model-based variants (discussed below)
Outline
• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrices, MLT)
• Common Principal Components (CPC)
Discrete Cosine Transform - 1
• Discrete Cosine Transform (DCT)
  – Since the log-power spectrum is real and symmetric, the inverse DFT reduces to a Discrete Cosine Transform (DCT)
  – It is applied to the log-energies of the output filters (Mel-scaled filterbank) during MFCC parameterization:
      c_j = sqrt(2/n) Σ_{i=1}^{n} x_i cos( (π j / n)(i − 0.5) ),  for j = 0, 1, ..., m < n,
    where x_i is the i-th coordinate of the input vector x and c_j is the j-th coordinate of the output vector c
  – It achieves only partial decorrelation
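A minimal sketch (synthetic filterbank values, not a full MFCC front end): the formula above is the DCT-II up to normalization, so the cepstra can be computed with scipy's dct and compared against the explicit cosine basis.

```python
# Minimal sketch: MFCC-style cepstra as a DCT of log Mel filterbank energies.
# Synthetic filterbank values stand in for a real front end.
import numpy as np
from scipy.fft import dct

n_filters, n_ceps = 18, 13
log_mel = np.log(np.random.default_rng(0).uniform(1.0, 10.0, size=n_filters))

# DCT-II with orthonormal scaling; keep the first n_ceps coefficients.
cepstra = dct(log_mel, type=2, norm="ortho")[:n_ceps]

# Equivalent explicit form of the slide's formula (c_0 differs only by scaling).
j = np.arange(n_ceps)[:, None]
i = np.arange(1, n_filters + 1)[None, :]
basis = np.sqrt(2.0 / n_filters) * np.cos(np.pi * j / n_filters * (i - 0.5))
cepstra_manual = basis @ log_mel
print(np.allclose(cepstra[1:], cepstra_manual[1:]))  # True
```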
Discrete Cosine Transform - 2
• (Total) covariance matrix calculated using the 25-hour training data from MATBN
  (Figure: surface plots of the total covariance matrix; left panel: 18 Mel-scaled filterbank log-energies; right panel: 13 Mel-cepstral coefficients after the DCT.)
Principal Component Analysis - 1
• Principal Component Analysis (PCA)
  – Based on the major directions of variation of a set of data points in a high-dimensional space
  – Extracts the directions of greatest variance, under the assumption that directions with little variation carry little information about the features
  – The principal components V = [v_1, ..., v_p] are the eigenvectors of the total covariance matrix Σ with the largest eigenvalues:
      Σ v_i = λ_i v_i,   V^T Σ V = D  (a diagonal matrix)
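A minimal sketch with synthetic data (not the MATBN features): the principal components are taken as the leading eigenvectors of the total covariance matrix, and the covariance of the projected features is checked to be diagonal.

```python
# Minimal sketch of PCA as eigendecomposition of the total covariance matrix.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 10)) @ rng.standard_normal((10, 10))  # correlated data

Sigma = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)          # ascending eigenvalues
order = np.argsort(eigvals)[::-1]                 # sort by decreasing variance
V = eigvecs[:, order[:4]]                         # keep the top 4 principal components

Z = (X - X.mean(axis=0)) @ V                      # projected (decorrelated) features
C = np.cov(Z, rowvar=False)
print(np.allclose(C, np.diag(np.diag(C)), atol=1e-8))  # covariance is diagonal
```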
Principal Component Analysis - 2
• (Total) covariance matrix calculated using the 25-hour training data from MATBN
  (Figure: surface plots of the total covariance matrix; left panel: 162-dimensional spliced Mel-filterbank features; right panel: 39-dimensional PCA-transformed subspace.)
Linear Discriminant Analysis - 1
• Linear Discriminant Analysis (LDA)
  – Seeks a linear transformation matrix Θ that satisfies
      Θ = argmax_Θ trace( (Θ^T S_W Θ)^{-1} (Θ^T S_B Θ) )
  – Θ is formed by the eigenvectors of S_W^{-1} S_B with the largest eigenvalues
  – The LDA subspace is not orthogonal (whereas the PCA subspace is):
      Θ^T Θ ≠ I
  – Nevertheless, the LDA transform makes the transformed variables statistically uncorrelated
Linear Discriminant Analysis - 2
• Take any two distinct eigenvalue/eigenvector pairs (λ_i, θ_i) and (λ_j, θ_j):
      S_B θ_i = λ_i S_W θ_i,   S_B θ_j = λ_j S_W θ_j
  – Pre-multiplying by θ_j^T and θ_i^T, respectively:
      θ_j^T S_B θ_i = λ_i θ_j^T S_W θ_i
      θ_i^T S_B θ_j = λ_j θ_i^T S_W θ_j
  – Since θ_j^T S_B θ_i = θ_i^T S_B θ_j and θ_j^T S_W θ_i = θ_i^T S_W θ_j, subtracting gives
      (λ_i − λ_j) θ_i^T S_W θ_j = 0
  – With λ_i ≠ λ_j it follows that θ_i^T S_W θ_j = 0 for all i ≠ j
  Ref. (Krzanowski, 1988)
Linear Discriminant Analysis - 3
• To remove the arbitrary scaling of θ_i, it is usual to adopt the normalization θ_i^T S_W θ_i = 1, so that
      θ_i^T S_B θ_i = λ_i θ_i^T S_W θ_i = λ_i
  and therefore
      Θ^T S_W Θ = I,   Θ^T S_B Θ = Λ = diag(λ_1, ..., λ_p)
• The total covariance matrix S_T = S_W + S_B is thus also transformed into a diagonal matrix:
      Θ^T S_T Θ = I + Λ
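A minimal sketch with synthetic labeled data: LDA solved as the generalized eigenproblem S_B θ = λ S_W θ with scipy, verifying the normalization above, i.e., Θ^T S_W Θ = I and Θ^T S_B Θ diagonal.

```python
# Minimal sketch of LDA via the generalized eigenproblem S_B v = lambda S_W v.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
means = np.array([[0, 0, 0, 0], [2, 1, 0, -1], [-1, 2, 1, 0]], dtype=float)
X = np.vstack([rng.standard_normal((300, 4)) + m for m in means])
y = np.repeat(np.arange(3), 300)

# Within-class and between-class scatter matrices.
overall = X.mean(axis=0)
S_W = sum(np.cov(X[y == c], rowvar=False) * (np.sum(y == c) - 1) for c in range(3))
S_B = sum(np.sum(y == c) * np.outer(X[y == c].mean(0) - overall,
                                    X[y == c].mean(0) - overall) for c in range(3))

# scipy's eigh solves S_B v = lambda S_W v and returns S_W-orthonormal eigenvectors.
eigvals, Theta = eigh(S_B, S_W)
Theta = Theta[:, np.argsort(eigvals)[::-1]]       # largest eigenvalues first

print(np.allclose(Theta.T @ S_W @ Theta, np.eye(4), atol=1e-8))   # Theta^T S_W Theta = I
B = Theta.T @ S_B @ Theta
print(np.allclose(B, np.diag(np.diag(B)), atol=1e-6))             # Theta^T S_B Theta diagonal
```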
Linear Discriminant Analysis - 4
• (Total) covariance matrix calculated using the 25-hour training data from MATBN
  (Figure: surface plots of the total covariance matrix; left panel: 162-dimensional spliced Mel-filterbank features; right panel: 39-dimensional LDA-transformed subspace.)
Maximum Likelihood Linear Transform
• The Maximum Likelihood Linear Transform (MLLT) appears in two variants:
  – Feature-based: (Gopinath, 1998)
  – Model-based: (Olsen, 2002), (Olsen, 2004)
• The common goal of the two variants is to find a global linear transformation matrix that decorrelates the features.
Feature-based MLLT - 1
• Feature-based Maximum Likelihood Linear Transform (F-MLLT)
  – Tries to alleviate four problems:
    (a) Data insufficiency, implying unreliable models
    (b) Large storage requirements
    (c) Large computational requirements
    (d) ML is not discriminating between classes
  – Solutions:
    (a)-(c): sharing parameters across classes
    (d): appealing to LDA
Feature-based MLLT - 2
• Maximum likelihood modeling
  – The likelihood of the training data {(x_i, l_i)}, where l_i is the class label of x_i, is
      p(x_1^N, {μ_{l_i}}, {Σ_{l_i}}) = Π_{i=1}^{N} p(x_i; μ_{l_i}, Σ_{l_i})
                                     = Π_{i=1}^{N} (2π)^{-d/2} |Σ_{l_i}|^{-1/2} exp( -1/2 (x_i − μ_{l_i})^T Σ_{l_i}^{-1} (x_i − μ_{l_i}) )
  – The log-likelihood can be rewritten in terms of the sample mean m_j and sample covariance S_j of each class j (Appendix A):
      log p(x_1^N, {μ_j}, {Σ_j}) = -(Nd/2) log(2π)
                                   - Σ_{j=1}^{C} (n_j/2) [ (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j) + trace(Σ_j^{-1} S_j) + log|Σ_j| ]
    (m_j and S_j will turn out to be the ML estimators)
Feature-based MLLT - 3
• The idea of maximum likelihood estimation (MLE) is to choose the parameters {μ̂_j} and {Σ̂_j} so as to maximize log p(x_1^N, {μ_j}, {Σ_j})
  – ML estimators: {μ̂_j} and {Σ̂_j}
  – What is the difference between an "estimator" and an "estimate"? (An estimator is the rule, a function of the data; an estimate is the value that rule produces for a particular data set.)
Feature-based MLLT - 4
• Multiclass ML modeling
  – The training data of each class j is modeled with a Gaussian (μ̂_j, Σ̂_j)
  – The ML estimators are μ̂_j = m_j and Σ̂_j = S_j (Appendix B)
  – The resulting log-ML value is
      log p*(x_1^N) = log p(x_1^N, {μ̂_j}, {Σ̂_j})
                    = -(Nd/2)(log(2π) + 1) - Σ_{j=1}^{C} (n_j/2) log|S_j|
                    = g(N, d) - Σ_{j=1}^{C} (n_j/2) log|S_j|
  – There is no interaction between the classes, and therefore unconstrained ML modeling is not "discriminating"
Feature-based MLLT - 5
• Constrained ML - diagonal covariance
  – The ML estimators are μ̂_j = m_j and Σ̂_j = diag(S_j)
  – The log-ML value is
      log p*_diag(x_1^N) = g(N, d) - Σ_{j=1}^{C} (n_j/2) log|diag(S_j)|
  – If one linearly transforms the data of each class with Θ_j and models it with a diagonal Gaussian, the log-ML value becomes
      log p*_diag(x_1^N) = g(N, d) - Σ_{j=1}^{C} (n_j/2) log|diag(Θ_j^T S_j Θ_j)| + Σ_{j=1}^{C} n_j log|Θ_j|,
    where the last term comes from the Jacobian of the transform (Appendix C)
  – How should Θ_j be chosen for each class?
Feature-based MLLT - 6
• Multiclass ML modeling - some issues
  – If the sample size for a class is not large enough, the ML parameter estimates may have large variance and hence be unreliable
  – The storage requirement for the model is O(Cd^2)
  – The computational requirement is also O(Cd^2) (why? each likelihood evaluation involves a full d x d quadratic form, for each of the C classes)
  – The parameters of each class are obtained independently: the ML principle does not allow for discrimination between classes
Feature-based MLLT - 7
• Multiclass ML modeling - some issues (cont.)
  – Sharing parameters across classes reduces the number of parameters, the storage requirements, and the computational requirements
  – It is hard to justify that parameter sharing is more discriminating
  – We can appeal to Fisher's criterion of LDA and a result of Campbell to argue that constrained ML modeling is sometimes discriminating. But what is "discriminating"?
Feature-based MLLT - 8
• Multiclass ML modeling - some issues (cont.)
  – We can globally transform the data with a unimodular matrix Θ (det(Θ) = 1) and model the transformed data with diagonal Gaussians (there is a loss in likelihood, too)
  – Among all possible transformations Θ, we choose the one that incurs the least loss in likelihood (in essence, we find a linearly transformed, shared feature space in which the diagonal-Gaussian assumption is almost valid)
Feature-based MLLT - 9
• Other constrained MLE schemes with parameter sharing
  – Equal covariance
  – Clustering
• Covariance diagonalization and cluster transformation
  – Classes are grouped into clusters
  – Each cluster is modeled with diagonal Gaussians in a transformed feature space
  – The ML estimators are
      μ̂_j = m_j,   Σ̂_j = Θ_{C_j}^{-T} diag(Θ_{C_j}^T S_j Θ_{C_j}) Θ_{C_j}^{-1},
    where C_j denotes the cluster containing class j
  – The log-ML value is
      log p*_diag(x_1^N) = g(N, d) - Σ_{j=1}^{C} (n_j/2) log|diag(Θ_{C_j}^T S_j Θ_{C_j})| + Σ_{j=1}^{C} n_j log|Θ_{C_j}|
Feature-based MLLT - 10
• One cluster with diagonal covariance
  – When the number of clusters is one, there is a single global transformation and all classes are modeled as diagonal Gaussians in this feature space
  – The ML estimators are
      μ̂_j = m_j,   Σ̂_j = Θ^{-T} diag(Θ^T S_j Θ) Θ^{-1}
  – The log-ML value is
      log p*_diag(x_1^N) = g(N, d) - Σ_{j=1}^{C} (n_j/2) log|diag(Θ^T S_j Θ)| + N log|Θ|
  – The optimal Θ is therefore obtained by the optimization
      Θ = argmin_Θ Σ_{j=1}^{C} (n_j/2) log|diag(Θ^T S_j Θ)| - N log|Θ|
Feature-based MLLT - 11
• Optimization - the numerical approach
  – The objective function to be minimized is
      F(Θ) = Σ_{j=1}^{C} (n_j/2) log|diag(Θ^T S_j Θ)| - N log|Θ|
  – Differentiating the corresponding log-likelihood (i.e., -F) with respect to Θ gives the gradient
      G(Θ) = N Θ^{-T} - Σ_j n_j S_j Θ (diag(Θ^T S_j Θ))^{-1}
  – Directly optimizing the objective function is nontrivial: it requires numerical optimization techniques, and a full covariance matrix must be stored for each class
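A minimal numerical sketch of this optimization (an illustration, not the original implementation): F(Θ) is minimized with scipy starting from the identity transform, using the analytic gradient of F; random SPD matrices stand in for the per-class sample covariances S_j.

```python
# A minimal sketch: estimate a global F-MLLT transform by minimizing
# F(Theta) = sum_j (n_j/2) log|diag(Theta^T S_j Theta)| - N log|det Theta|.
import numpy as np
from scipy.optimize import minimize

def f_and_grad(theta_flat, S_list, n_list, d):
    Theta = theta_flat.reshape(d, d)
    N = float(sum(n_list))
    val = -N * np.linalg.slogdet(Theta)[1]
    grad = -N * np.linalg.inv(Theta).T
    for S_j, n_j in zip(S_list, n_list):
        diag_var = np.diag(Theta.T @ S_j @ Theta)     # diag(Theta^T S_j Theta)
        val += 0.5 * n_j * np.sum(np.log(diag_var))
        grad += n_j * S_j @ Theta / diag_var          # n_j S_j Theta diag(...)^{-1}
    return val, grad.ravel()

def estimate_fmllt(S_list, n_list, d):
    x0 = np.eye(d).ravel()                            # start from the identity
    res = minimize(f_and_grad, x0, args=(S_list, n_list, d),
                   jac=True, method="L-BFGS-B")
    return res.x.reshape(d, d)

# Toy usage with random SPD class covariances (hypothetical statistics).
rng = np.random.default_rng(0)
d, C = 5, 3
S_list = [(lambda A: A @ A.T + d * np.eye(d))(rng.standard_normal((d, d))) for _ in range(C)]
n_list = [100, 150, 200]
Theta = estimate_fmllt(S_list, n_list, d)
```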
Outline
• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrices, MLT)
• Common Principal Components (CPC)
Model-based MLLT - 1
• In model-based MLLT (M-MLLT), instead of having a distinct full covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:
  – A non-singular linear transformation matrix Θ shared over a set of components
  – A diagonal matrix Λ_j for each class j
• The precision matrices are constrained to be of the form
      P_j = Σ_j^{-1} = Θ^T Λ_j Θ = Σ_{k=1}^{d} λ_k^j θ_k θ_k^T,
  where Λ_j = diag(λ_1^j, ..., λ_d^j) and θ_k is the k-th row of Θ
  Ref. (Olsen, 2004)
Model-based MLLT - 2
• M-MLLT fits within the standard maximum-likelihood criterion used for training HMMs, with each state represented by a GMM:
      f(x | Θ) = Σ_{j=1}^{m} π_j N(x; μ_j, Σ_j),
  where π_j is the j-th component weight
• Log-likelihood function:
      L(Θ; X) = (1/N) Σ_{i=1}^{N} log f(x_i | Θ)
Model-based MLLT - 3
• The parameters of the M-MLLT model for each HMM state, Θ = {π_1^m, μ_1^m, Λ_1^m, Θ}, are estimated with a generalized expectation-maximization (EM) algorithm*
• The auxiliary Q-function is introduced; it satisfies the inequality
      L(Θ; X) − L(Θ̂; X) ≥ Q(Θ, Θ̂) − Q(Θ̂, Θ̂),
  where Θ = {π_1^m, μ_1^m, Λ_1^m, Θ} is the new parameter set and Θ̂ = {π̂_1^m, μ̂_1^m, Λ̂_1^m, Θ̂} the old one
* A. P. Dempster et al., "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. B, vol. 39, pp. 1-38, 1977.
Model-based MLLT - 4
• The auxiliary Q-function is
      Q(Θ; Θ̂) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{m} γ_ij L_j(x_i)
               = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{m} (γ_ij / 2) [ 2 log π_j − d log(2π) + log|P_j| − trace( P_j (x_i − μ_j)(x_i − μ_j)^T ) ],
  where
      L_j(x_i) = log( π_j N(x_i; μ_j, Σ_j) )
  is the component log-likelihood of observation i, and
      γ_ij = exp( L̂_j(x_i) ) / Σ_{k=1}^{m} exp( L̂_k(x_i) )
  is the a posteriori probability of Gaussian component j given observation i
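A small illustration (made-up component parameters, not from the slides): the posteriors γ_ij are ordinary GMM responsibilities and are best computed in the log domain with logsumexp.

```python
# Minimal sketch: responsibilities gamma_ij computed in the log domain.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 2))                        # 5 observations, d = 2
weights = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])]

# L_hat_j(x_i) = log(pi_j N(x_i; mu_j, Sigma_j)) for every (i, j) pair.
log_L = np.stack([np.log(w) + multivariate_normal(m, c).logpdf(X)
                  for w, m, c in zip(weights, means, covs)], axis=1)

gamma = np.exp(log_L - logsumexp(log_L, axis=1, keepdims=True))
print(np.allclose(gamma.sum(axis=1), 1.0))             # each row sums to one
```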
Model-based MLLT - 5
• Gales' approach for deriving Θ
  (1) Estimate the component weights and means, which are independent of the other model parameters:
      π_j = (1/N) Σ_{i=1}^{N} γ_ij,   μ_j = Σ_{i=1}^{N} γ_ij x_i / Σ_{i=1}^{N} γ_ij
  (2) Using the current estimate of the transform Θ, estimate the set of class-specific diagonal variances:
      λ_ij = θ_i S_j θ_i^T,
    where λ_ij is the i-th diagonal variance of component j and θ_i is the i-th row vector of Θ
  Ref. (Gales, 1999)
Model-based MLLT - 6
• Gales' approach for deriving Θ (cont.)
  (3) Estimate the transform Θ, row by row, using the current set of diagonal variances:
      θ̂_i = c_i G_i^{-1} sqrt( N / (c_i G_i^{-1} c_i^T) ),   G_i = Σ_j ( N_j / σ̂_{diag,i}^{(j)2} ) S_j,
    where c_i is the i-th row vector of the cofactor matrix of the current Θ (Appendix D) and N_j is the occupation count of component j
  (4) Go to step (2) until convergence, or until an appropriate criterion is satisfied
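A minimal sketch of this row-by-row update (an illustration under the formulas above, not Gales' original code): toy SPD matrices stand in for the accumulated per-component covariances S_j and soft counts N_j.

```python
# Minimal sketch of the Gales-style row-by-row update for a shared transform Theta.
import numpy as np

def update_shared_transform(Theta, S_list, N_list, n_sweeps=10):
    d = Theta.shape[0]
    N = float(sum(N_list))
    for _ in range(n_sweeps):
        for i in range(d):
            # Current diagonal variances lambda_ij = theta_i S_j theta_i^T.
            var_i = [Theta[i] @ S @ Theta[i] for S in S_list]
            # G_i accumulates covariances weighted by occupancy / current variance.
            G_i = sum(n / v * S for S, v, n in zip(S_list, var_i, N_list))
            G_inv = np.linalg.inv(G_i)
            # c_i: i-th row of the cofactor matrix of Theta, i.e. det(Theta) * Theta^{-T}.
            c_i = np.linalg.det(Theta) * np.linalg.inv(Theta).T[i]
            Theta[i] = c_i @ G_inv * np.sqrt(N / (c_i @ G_inv @ c_i))
    return Theta

# Toy usage with hypothetical per-component statistics.
rng = np.random.default_rng(4)
d, m = 4, 3
S_list = [(lambda A: A @ A.T + np.eye(d))(rng.standard_normal((d, d))) for _ in range(m)]
N_list = [120.0, 80.0, 200.0]
Theta = update_shared_transform(np.eye(d), S_list, N_list)
```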
Semi-Tied Covariance Matrices - 1
• Introduction to semi-tied covariance matrices
  – A natural extension of the state-specific rotation scheme
  – The transform is estimated in a maximum-likelihood (ML) fashion given the current model parameters
  – The optimization is performed with a simple iterative scheme, which is guaranteed to increase the likelihood of the training data
  – An alternative approach to solving the optimization problem of DHLDA
Semi-Tied Covariance Matrices - 2
• State-specific rotation
  – A full covariance matrix is calculated for each state s in the system and decomposed into its eigenvectors and eigenvalues:
      Σ_full^(s) = U^(s) Λ^(s) U^(s)T
  – All data from that state is then decorrelated using the calculated eigenvectors:
      o^(s)(τ) = U^(s)T o(τ)
  – Multiple diagonal-covariance Gaussian components are then trained on the rotated data
  Ref. (Gales, 1999), (Kumar, 1998)
Semi-Tied Covariance Matrices - 3
• Drawbacks of state-specific rotation
  – It does not fit within the standard ML estimation framework for training HMMs
  – The transforms are not related to the multiple-component models being used to model the data
Semi-Tied Covariance Matrices - 4
• Semi-tied covariance matrices
  – Instead of having a distinct full covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:
      Σ^(m) = H^(r) Σ_diag^(m) H^(r)T,
    where Σ_diag^(m) is the component-specific diagonal covariance element and H^(r) is the semi-tied, class-dependent, non-diagonal matrix, which may be tied over a set of components
Semi-Tied Covariance Matrices - 5
• Semi-tied covariance matrices (cont.)
  – It is very complex to optimize these parameters directly, so an expectation-maximization approach is adopted
  – The parameters for each component m are
      M = { π^(m), μ^(m), Σ_diag^(m), Θ^(r) },  with Θ^(r) = (H^(r))^{-1}
Semi-Tied Covariance Matrices - 6
• The auxiliary Q-function is
      Q(M, M̂) = Σ_{m∈M(r),τ} γ_m(τ) [ log( |Θ̂^(r)|^2 / |Σ̂_diag^(m)| ) − (o(τ) − μ̂^(m))^T Θ̂^(r)T Σ̂_diag^(m)-1 Θ̂^(r) (o(τ) − μ̂^(m)) ],
  where γ_m(τ) = p(q_m(τ) | M, O_T) is the posterior probability of component m at time τ; here |Σ̂_diag^(m)| / |Θ̂^(r)|^2 plays the role of |Σ^(m)| and Θ̂^(r)T Σ̂_diag^(m)-1 Θ̂^(r) that of Σ^(m)-1
Semi-Tied Covariance Matrices - 7
• If all the model parameters are to be simultaneously optimized, the Q-function may be rewritten as
      Q(M, M̂) = Σ_{m∈M(r),τ} γ_m(τ) log( |Θ̂^(r)|^2 / |diag(Θ̂^(r) W^(m) Θ̂^(r)T)| ) − dβ,
  where
      W^(m) = Σ_τ γ_m(τ) (o(τ) − μ̂^(m))(o(τ) − μ̂^(m))^T / Σ_τ γ_m(τ),
      μ̂^(m) = Σ_τ γ_m(τ) o(τ) / Σ_τ γ_m(τ),
      β = Σ_{m∈M(r),τ} γ_m(τ)
Semi-Tied Covariance Matrices - 8
• The ML estimate of the diagonal elements of the covariance matrix is given by
      Σ̂_diag^(m) = diag( Θ̂^(r) W^(m) Θ̂^(r)T )
• The reestimation formulae for the component weights and transition probabilities are identical to the standard HMM case (Rabiner, 1989)
• Unfortunately, optimizing the new Q-function directly is still nontrivial, so an alternative approach is described next
Semi-Tied Covariance Matrices - 9
• Gales' approach for optimizing the Q-function
  (1) Estimate the means, which are independent of the other model parameters
  (2) Using the current estimate of the semi-tied transform Θ̂^(r), estimate the set of component-specific diagonal variances Σ̂_diag^(m) = diag( Θ̂^(r) W^(m) Θ̂^(r)T ); this set is denoted {Σ̂_diag^(r)} = {Σ̂_diag^(m), m ∈ M(r)}
  (3) Estimate the transform Θ̂^(r) using the current set {Σ̂_diag^(r)}
  (4) Go to (2) until convergence, or until an appropriate criterion is satisfied
Semi-Tied Covariance Matrices - 10
• How to carry out step (3)?
  – Optimizing the semi-tied transform requires an iterative estimation scheme even after fixing all other model parameters
  – Selecting a particular row θ̂_i^(r) of Θ̂^(r), the former Q-function can be rewritten using the current set {Σ̂_diag^(r)} as
      Q(M, M̂; {Σ̂_diag^(r)}) = Σ_{m∈M(r),τ} γ_m(τ) [ log( (θ̂_i^(r) c_i^T)^2 ) − log|Σ̂_diag^(m)| − Σ_j (θ̂_j^(r) (o(τ) − μ̂^(m)))^2 / σ_diag,j^(m)2 ],
    where c_i is the i-th row vector of the cofactor matrix of Θ̂^(r) and σ_diag,j^(m)2 is the j-th leading diagonal element of Σ̂_diag^(m)
Semi-Tied Covariance Matrices - 11
• How to carry out step (3)? (cont.)
  – It can be shown that the ML estimate of the i-th row of the semi-tied transform, θ̂_i^(r), is given by
      θ̂_i^(r) = c_i G^(ri)-1 sqrt( β / (c_i G^(ri)-1 c_i^T) ),
    where
      G^(ri) = Σ_{m∈M(r)} ( 1 / σ̂_diag,i^(m)2 ) W^(m) Σ_τ γ_m(τ)
Semi-Tied Covariance Matrices - 12
• It can be shown that
      Q(M, M̂) ≥ Q(M, M̂; {Σ̂_diag^(r)}),
  with equality when the diagonal elements of the covariance matrix are given by Σ̂_diag^(m) = diag( Θ̂^(r) W^(m) Θ̂^(r)T )
• During recognition the log-likelihood is based on
      log( L(o(τ); μ^(m), Σ^(m), Θ^(r)) ) = log( N( o^(r)(τ); Θ^(r) μ^(m), Σ_diag^(m) ) ) + log|Θ^(r)|,
  where o^(r)(τ) = Θ^(r) o(τ) and log|Θ^(r)| = (1/2) log|Θ^(r)|^2
Extended MLLT - 1
• The extended MLLT (EMLLT) is very similar to M-MLLT; the only difference is the precision matrix modeling:
      P_j = Σ_j^{-1} = Θ^T Λ_j Θ = Σ_{k=1}^{D} λ_k^j θ_k θ_k^T,
  where the rank-one terms θ_k θ_k^T form the basis
  – Note that in EMLLT, D ≥ d, while in M-MLLT, D = d
  – The λ_k^j are not required to be positive
  – However, {λ_k^j} have to be chosen such that P_j is positive definite
  – The authors provide two algorithms for iteratively updating the parameters
  Ref. (Olsen, 2002), (Olsen, 2004)
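A small illustration (synthetic numbers, not from the EMLLT papers): a precision matrix is built as a basis expansion with D > d rank-one terms and then checked for positive definiteness, which is the constraint the EMLLT update algorithms must maintain.

```python
# Minimal sketch: precision matrix as a basis expansion of D >= d rank-one terms.
import numpy as np

rng = np.random.default_rng(5)
d, D = 4, 8                                   # EMLLT allows D > d (D = d is MLLT)
basis = rng.standard_normal((D, d))           # rows theta_k of the (extended) basis
lam = rng.uniform(-0.2, 1.0, size=D)          # coefficients may be negative

P = sum(l * np.outer(t, t) for l, t in zip(lam, basis))  # P = sum_k lam_k theta_k theta_k^T

# The coefficients must still keep P positive definite to be a valid precision matrix.
eigvals = np.linalg.eigvalsh(P)
print("positive definite:", bool(np.all(eigvals > 0)))
```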
Extended MLLT - 2
• The highlight of EMLLT
  – More flexibility: the number of basis elements {θ_k θ_k^T} can be varied gradually from d (MLLT) to d(d+1)/2 (full covariance) by controlling the value of D
Outline
• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrices, MLT)
• Common Principal Components (CPC)
Common Principal Components - 1
• Why Common Principal Components (CPC)?
  – We often deal with the situation where the same variables are measured on objects from different groups, and the covariance structure may vary from group to group
  – But sometimes the covariance matrices of the different groups look somewhat similar, and it seems reasonable to assume that they share a common basic structure
• Goal of CPC
  – To find a rotation that diagonalizes all the covariance matrices simultaneously
  Ref. (Flury, 1984), (Flury, 1986)
Common Principal Components - 2
• The CPC hypothesis:
      H_C: B^T Σ_i B = Λ_i,  i = 1, ..., k,  with each Λ_i diagonal
  – The common principal components (CPCs) are U_i = B^T x_i
  – Note that no canonical ordering of the columns of B need be given, since the rank order of the diagonal elements of Λ_i is not necessarily the same for all groups
Common Principal Components - 3
• Assume the n_i S_i are independently distributed as W_p(n_i, Σ_i). The common likelihood function is
      L(Σ_1, ..., Σ_k) = C × Π_{i=1}^{k} |Σ_i|^{-n_i/2} exp( trace( -(n_i/2) Σ_i^{-1} S_i ) )
• Instead of maximizing the likelihood function, minimize
      g(Σ_1, ..., Σ_k) = -2 log L(Σ_1, ..., Σ_k) + 2 log C
                       = Σ_{i=1}^{k} n_i ( log|Σ_i| + trace(Σ_i^{-1} S_i) )
Common Principal Components - 4
• Assume H_C holds for some orthogonal matrix β and Λ_i = diag(λ_i1, ..., λ_ip). Then
      log|Σ_i| = Σ_{j=1}^{p} log λ_ij,  i = 1, ..., k
      trace(Σ_i^{-1} S_i) = trace(β Λ_i^{-1} β^T S_i) = trace(Λ_i^{-1} β^T S_i β) = Σ_{j=1}^{p} β_j^T S_i β_j / λ_ij
• Therefore
      g(Σ_1, ..., Σ_k) = g(β_1, ..., β_p, λ_11, ..., λ_1p, λ_21, ..., λ_kp)
                       = Σ_{i=1}^{k} n_i Σ_{j=1}^{p} ( log λ_ij + β_j^T S_i β_j / λ_ij )
Common Principal Components - 5
• The function g is to be minimized under the orthonormality restrictions
      β_h^T β_j = 0 if h ≠ j,   β_h^T β_h = 1
• Introducing Lagrange multipliers γ_h and γ_hj, we thus wish to minimize the function
      G(Σ_1, ..., Σ_k) = g(Σ_1, ..., Σ_k) − Σ_{h=1}^{p} γ_h (β_h^T β_h − 1) − 2 Σ_{h<j} γ_hj β_h^T β_j
Common Principal Components - 6
• Minimization with respect to λ_ij:
      ∂G(Σ_1, ..., Σ_k) / ∂λ_ij = ∂ [ Σ_i Σ_j n_i ( log λ_ij + β_j^T S_i β_j / λ_ij ) ] / ∂λ_ij = 0
      ⇒ λ_ij = β_j^T S_i β_j,  i = 1, ..., k;  j = 1, ..., p   (key point - keep it in mind)
      ⇒ trace(Σ_i^{-1} S_i) = p,  i = 1, ..., k
Common Principal Components - 7
• Minimization (cont.), now with respect to β_j:
      ∂G(Σ_1, ..., Σ_k) / ∂β_j = 0
      ⇒ Σ_{i=1}^{k} (n_i / λ_ij) S_i β_j − γ_j β_j − Σ_{h≠j} γ_jh β_h = 0,  j = 1, ..., p
  – Pre-multiplying by β_j^T and using λ_ij = β_j^T S_i β_j gives
      γ_j = Σ_{i=1}^{k} n_i,  j = 1, ..., p
Common Principal Components - 8
• Minimization (cont.): thus
      Σ_{i=1}^{k} (n_i / λ_ij) S_i β_j = ( Σ_{i=1}^{k} n_i ) β_j + Σ_{h≠j} γ_jh β_h,  j = 1, ..., p
  – Multiplying on the left by β_l^T (l ≠ j) implies
      Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_ij = γ_jl,  j = 1, ..., p,  l ≠ j
  – Exchanging the roles of j and l, and noting that β_j^T S_i β_l = β_l^T S_i β_j and γ_jl = γ_lj, also gives
      Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_il = γ_jl,  j = 1, ..., p,  j ≠ l
Common Principal Components - 9
• Minimization (cont.): subtracting the two expressions for γ_jl,
      Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_ij − Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_il = 0,
  which can be written as
      β_l^T ( Σ_{i=1}^{k} n_i (λ_il − λ_ij) / (λ_il λ_ij) S_i ) β_j = 0,  l, j = 1, ..., p;  l ≠ j   (the optimization objective)
  – These p(p − 1)/2 equations have to be solved under the orthonormality conditions β^T β = I_p and with λ_ij = β_j^T S_i β_j
Common Principal Components - 10
• Solving procedure of CPC
  – FG Algorithm
  – F-Algorithm (pseudo-code figure)
Common Principal Components - 11
  – G-Algorithm (pseudo-code figure)
Common Principal Components - 12
• Likelihood ratio test
  – The sample common principal components: U_i = β̂^T X_i,  i = 1, ..., k
  – For the i-th group, the transformed covariance matrix is F_i = β̂^T S_i β̂,  i = 1, ..., k
  – Since Λ̂_i = diag(F_i), i = 1, ..., k, the statistic can be written as a function of the F_i alone:
      χ^2 = Σ_{i=1}^{k} n_i log( |diag(F_i)| / |F_i| ) = Σ_{i=1}^{k} n_i log( Π_{j=1}^{p} f_jj^(i) / Π_{j=1}^{p} l_ij ),
    where the f_jj^(i) are the diagonal elements of F_i and the l_ij are the eigenvalues of F_i
Common Principal Components - 13
• Likelihood ratio test (cont.)
  – The likelihood ratio criterion is a measure of the simultaneous diagonalizability of k positive definite symmetric matrices
  – The CPCs can be viewed as being obtained by a simultaneous transformation, yielding variables that are as uncorrelated as possible
  – This can also be seen from the viewpoint of Hadamard's inequality: |F_i| ≤ |diag(F_i)|
Common Principal Components - 14
• Actually, CPC can also be viewed through another measure of "deviation from diagonality":
      φ(F_i) = |diag(F_i)| / |F_i| ≥ 1
  – The CPC criterion can then be written as
      Φ(F_1, ..., F_k; n_1, ..., n_k) = Π_{i=1}^{k} ( φ(F_i) )^{n_i}
  – Let F_i = B^T A_i B for a given orthogonal matrix B; then
      Φ_0(A_1, ..., A_k; n_1, ..., n_k) = min_B Φ(B^T A_1 B, ..., B^T A_k B; n_1, ..., n_k)
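A small numerical illustration (synthetic matrices, and not Flury's ML estimation procedure): the deviation-from-diagonality measure φ and the criterion Φ are evaluated for a candidate orthogonal B, here crudely taken as the eigenvectors of the pooled matrix rather than the ML-CPC solution.

```python
# Minimal sketch: deviation-from-diagonality criterion for a candidate common basis B.
import numpy as np

def phi(F):
    """|diag(F)| / |F| >= 1, with equality iff F is diagonal."""
    return np.prod(np.diag(F)) / np.linalg.det(F)

def cpc_criterion(B, A_list, n_list):
    return np.prod([phi(B.T @ A @ B) ** n for A, n in zip(A_list, n_list)])

rng = np.random.default_rng(6)
p, k = 4, 3
A_list = [(lambda M: M @ M.T + np.eye(p))(rng.standard_normal((p, p))) for _ in range(k)]
n_list = [1, 1, 1]                                # unit weights keep the numbers small

# Crude candidate basis: eigenvectors of the pooled matrix (not the ML-CPC B).
_, B = np.linalg.eigh(sum(A_list))
print(cpc_criterion(np.eye(p), A_list, n_list))   # no rotation
print(cpc_criterion(B, A_list, n_list))           # usually closer to 1
```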
Comparison Between CPC and F-MLLT
• CPC tries to minimize
      Σ_{i=1}^{C} n_i ( log|Σ_i| + trace(Σ_i^{-1} S_i) ),
  with Σ_i = B Λ_i B^T, i = 1, ..., C, in the original space (the estimates are KNOWN)
• F-MLLT tries to minimize
      Σ_{j=1}^{C} (n_j/2) log|diag(A^T S_j A)| − N log|A|
  (the estimates are UNKNOWN)
Appendix A - 1
• Show that
      Π_{i=1}^{N} (2π)^{-d/2} |Σ_{l_i}|^{-1/2} exp( -1/2 (x_i − μ_{l_i})^T Σ_{l_i}^{-1} (x_i − μ_{l_i}) )
      = (2π)^{-Nd/2} exp( -Σ_j (n_j/2) [ (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j) + trace(Σ_j^{-1} S_j) + log|Σ_j| ] )
Appendix A - 2
• For each class C_j, the cross terms vanish, since Σ_{x_i∈C_j} (x_i − m_j) = 0:
      Σ_{x_i∈C_j} (x_i − m_j)^T Σ_j^{-1} (m_j − μ_j) = [ Σ_{x_i∈C_j} (x_i − m_j) ]^T Σ_j^{-1} (m_j − μ_j) = 0
• Hence
      Σ_{i=1}^{N} (x_i − μ_{l_i})^T Σ_{l_i}^{-1} (x_i − μ_{l_i})
      = Σ_{i=1}^{N} ( (x_i − m_{l_i}) + (m_{l_i} − μ_{l_i}) )^T Σ_{l_i}^{-1} ( (x_i − m_{l_i}) + (m_{l_i} − μ_{l_i}) )
      = Σ_{i=1}^{N} (x_i − m_{l_i})^T Σ_{l_i}^{-1} (x_i − m_{l_i}) + Σ_j n_j (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j)
      = Σ_j trace( Σ_j^{-1} Σ_{x_i∈C_j} (x_i − m_j)(x_i − m_j)^T ) + Σ_j n_j (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j)
      = Σ_j n_j ( trace(Σ_j^{-1} S_j) + (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j) ),
  and the product of the |Σ_{l_i}|^{-1/2} factors contributes the remaining Σ_j (n_j/2) log|Σ_j| term
= NΣ −j1 (m j − μ j ) = 0
μˆ j = m j
∂ log p (x1N , {μ j }, {Σ j }) ∂ (trace(Σ −j 1S j ) + log | Σ j |) = =0 ∂Σ j ∂Σ j − Σ −j T STj Σ −j T + Σ −j T = 0 ˆ j =Sj Σ 73
Appendix C -11 • Change of Variable Theorem – Consider a one-one mapping g : ℜ n → ℜ n , Y = f ( X) – Equal probability: The probability of falling in a region in space X should be the same as the p probabilityy of falling g in the corresponding region in space Y – Suppose the region dx1dx2 ...dxn maps to the region dA in the Y space Equating probabilities, space. probabilities we have f Y ( y1 ,..., yn )dA = f X ( x1 ,..., xn )dx1...dxn
– The region dA is a hyperparallelepipe described by the vectors (
dy dy dy dy1 dy dy dx1 ,..., n dx1 ), ( 1 dx2 ,..., n dx2 ),..., ( 1 dxn ,..., n dxn ) d1 dx d1 dx d 2 dx d 2 dx d n dx d n dx 74
Appendix C -2 2 • Change of Variable Theorem (cont.) – The hyper-parallelepiped dA can be calculated by dyn dyn dy1 dy1 dx1 ,,...,, dx1 ,,...,, d1 dx d1 dx d1 dx d1 dx dA = = dx1...dxn dyn dyn dy1 dy1 dxn ,..., dxn ,..., d n dx d n dx d n dx ddxn J: the Jacobian of function g
– So,
f Y ( y1 ,..., yn ) | J | dx1...dxn = f X ( x1 ,..., xn )dx1...dxn f Y ( y1 ,..., yn ) =| J |−1 f X ( x1 ,..., xn ) 75
Appendix D -11 • If A is a square matrix, then the minor entry of aij is denoted by Mij and is defined to be the determinant of the submatrix that remains after the i-th row and the j-th column are deleted from A. A The number (−1) ( 1)i + jMij is denoted by cij and is called the cofactor of aij. • Given the matrix b11 b12 B = b21 b22 b31 b32
b13 b23 b33
c33 = (−1) 3+3 ( M 33 ) M 23
c23 = (−1) 2+3 ( M 23 )
b11 b12 × b11 b12 = b11b32 − b12b31 = × × × = b31 b32 b31 b32 × 76
Appendix D -2 2 • Given the n by n matrix – The determinant of A can be written as the sum of its cofactors multiplied by the entries that generated them. a11 a 21 A= an1
a12 a22 an 2
a1n a2 n ann
(cofactor expansion along the jth column) det( A) = a1 j c1 j + a2 j c2 j + a3 j c3 j + + anj cnj = A( j )T c ( j )
((cofactor expansion p along g the ith row)) det( A) = ai1ci1 + ai 2 ci 2 + ai 3ci 3 + + ain cin = A( i ) c(Ti )
77