Feature Decorrelation On Speech Recognition

Feature Decorrelation on Speech Recognition
Hung-Shin Lee
Institute of Information Science, Academia Sinica
Dep. of Electrical Engineering, National Taiwan University
2009-10-09 @ IIS, Academia Sinica

References -1
1) J. Psutka and L. Muller, "Comparison of various feature decorrelation techniques in automatic speech recognition," in Proc. CITSA 2006.
2) Dr. Berlin Chen's lecture slides: http://berlin.csie.ntnu.edu.tw
3) Batlle et al., "Feature decorrelation methods in speech recognition - a comparative study," in Proc. ICSLP 1998.
4) K. Demuynck et al., "Improved feature decorrelation for HMM-based speech recognition," in Proc. ICSLP 1998.
5) W. Krzanowski, Principles of Multivariate Analysis - A User's Perspective, Oxford Press, 1988.
6) R. Gopinath, "Maximum likelihood modeling with Gaussian distributions for classification," in Proc. ICASSP 1998.
7) M. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Trans. on Speech and Audio Processing, vol. 7, no. 3, pp. 272-281, 1999.
8) P. Olsen and R. Gopinath, "Modeling inverse covariance matrices by basis expansion," in Proc. ICASSP 2002.

References -2
9) P. Olsen and R. Gopinath, "Modeling inverse covariance matrices by basis expansion," IEEE Trans. on Speech and Audio Processing, vol. 12, no. 1, pp. 37-46, 2004.
10) N. Kumar and R. Gopinath, "Multiple linear transform," in Proc. ICASSP 2001.
11) N. Kumar and A. Andreou, "Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition," Speech Communication, vol. 26, no. 4, pp. 283-297, 1998.
12) A. Ljolje, "The importance of cepstral parameter correlations in speech recognition," Computer Speech and Language, vol. 8, pp. 223-232, 1994.
13) B. Flury, "Common principal components in k groups," Journal of the American Statistical Association, vol. 79, no. 388, 1984.
14) B. Flury and W. Gautschi, "An algorithm for simultaneous orthogonal transformation of several positive definite symmetric matrices to nearly diagonal form," SIAM Journal on Scientific and Statistical Computing, vol. 7, no. 1, 1986.

Outline
• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)
• Common Principal Component (CPC)


Introduction to Feature Decorrelation -1
• Definition of the covariance matrix
  – Random vector (each x_i is a random variable): X = [x_1, ..., x_n]^T
  – Covariance matrix and mean vector (E[·] denotes the expected value):
    Σ ≡ E[(X − μ)(X − μ)^T],   μ ≡ E[X]

Introduction to Feature Decorrelation -2
• Feature Decorrelation (FD)
  – Find a transformation Θ that makes all transformed variables (nearly) uncorrelated:
    cov(X̃_i, X̃_j) = 0 for all i ≠ j, where X̃ denotes the transformed random vector
  – Equivalently, make the covariance matrix diagonal (not necessarily the identity):
    Θ^T Σ Θ = D, with D a diagonal matrix
  – The covariance matrix Σ may be global or may depend on each class.
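The following is a minimal numpy sketch (synthetic data, not from the slides) of this definition: the eigenvectors of a sample covariance matrix serve as a decorrelating transform Θ, and Θ^T Σ Θ comes out diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 3-D data: independent latent factors mixed by a random matrix.
Z = rng.standard_normal((1000, 3))
X = Z @ rng.standard_normal((3, 3)) + rng.standard_normal(3)

Sigma = np.cov(X, rowvar=False)            # sample covariance matrix
eigvals, Theta = np.linalg.eigh(Sigma)     # orthogonal Theta with Sigma = Theta D Theta^T

D = Theta.T @ Sigma @ Theta                # diagonal (up to numerical error)
print(np.round(D, 6))

X_tilde = (X - X.mean(axis=0)) @ Theta     # decorrelated features
print(np.round(np.cov(X_tilde, rowvar=False), 6))
```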

Introduction to Feature Decorrelation -3
• Why Feature Decorrelation (FD)?
  – In many speech recognition systems, the observation density function of each HMM state is modeled as a mixture of diagonal-covariance Gaussians
  – For the sake of computational simplicity, the off-diagonal elements of the Gaussian covariance matrices are assumed to be close to zero:
    f(x) = Σ_i π_i N(x; μ_i, Σ_i)
         = Σ_i π_i (2π)^{-d/2} |Σ_i|^{-1/2} exp( −(1/2)(x − μ_i)^T Σ_i^{-1} (x − μ_i) ),   with each Σ_i diagonal

FD on Speech Recognition -1
• Approaches to FD can be divided into two categories: feature-space and model-space
• Feature-space schemes
  – It is hard to find a single transform that decorrelates all elements of the feature vector for all states
• Model-space schemes
  – A different transform is selected depending on which component the observation was hypothesized to be generated from
  – In the limit, one transform may be used for each component, which is equivalent to a full-covariance-matrix system

FD on Speech Recognition -2
• Feature-space schemes in LVCSR
  [Block diagram of an LVCSR system with feature-space decorrelation (DCT, PCA, LDA, MLLT, ...) inserted after front-end preprocessing; the remaining blocks are: speech signal, training/test data, AM training, AM, LM, lexicon, speech decoding, textual results]

FD on Speech Recognition -3
• Model-space schemes in LVCSR
  [Block diagram of the same LVCSR system, with model-space decorrelation (MLLT, EMLLT, MLT, Semi-Tied, ...) applied on the acoustic-model side rather than on the features]

FD on Speech Recognition -4
• A taxonomy of FD methods
  – Feature-space schemes
    • Without label information: Discrete Cosine Transform (DCT), Principal Component Analysis (PCA)
    • With label information: Linear Discriminant Analysis (LDA), Common Principal Components (CPC)
  – Model-space schemes
    • Maximum Likelihood Linear Transform (MLLT), Extended Maximum Likelihood Linear Transform (EMLLT), Semi-Tied Covariance Matrices, Multiple Linear Transforms (MLT)

Outline
• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)
• Common Principal Component (CPC)

Discrete Cosine Transform -1
• Discrete Cosine Transform (DCT)
  – Since the log-power spectrum is real and symmetric, the inverse DFT reduces to a Discrete Cosine Transform (DCT)
  – It is applied to the log-energies of the output filters (Mel-scaled filterbank) during MFCC parameterization:
    c_j = √(2/n) Σ_{i=1}^{n} x_i cos( (π j / n)(i − 0.5) ),   for j = 0, 1, ..., m < n
    where x_i is the i-th coordinate of the input vector x and c_j the j-th coordinate of the output vector c
  – It achieves only partial decorrelation
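Below is a minimal numpy sketch of the cosine transform above; the toy vector of Mel filterbank log-energies and the 18-to-13 dimensionality are only illustrative, mirroring the MATBN setup on the next slide.

```python
import numpy as np

def dct_coeffs(x, m):
    """c_j = sqrt(2/n) * sum_{i=1}^{n} x_i cos(pi * j * (i - 0.5) / n), for j = 0, ..., m-1."""
    n = len(x)
    i = np.arange(1, n + 1)                       # filterbank channel index, 1..n
    return np.array([np.sqrt(2.0 / n) * np.sum(x * np.cos(np.pi * j * (i - 0.5) / n))
                     for j in range(m)])

# Toy example: 18 Mel filterbank log-energies -> 13 cepstral coefficients.
rng = np.random.default_rng(0)
log_energies = np.log(rng.random(18) + 1.0)
print(dct_coeffs(log_energies, 13))
```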

Discrete Cosine Transform -2
• (Total) covariance matrix calculated using the 25-hour training data from MATBN
  [Figure: 3-D surface plots of the total covariance matrix for the 18 Mel-scaled filterbank log-energies and for the 13 Mel-cepstral coefficients obtained with the DCT]

Principal Component Analysis -1
• Principal Component Analysis (PCA)
  – Based on the calculation of the major directions of variation of a set of data points in a high-dimensional space
  – Extracts the directions of greatest variance, under the assumption that the less the data vary along a direction, the less information that direction carries
  – The principal components V = [v_1, ..., v_p] are the largest eigenvectors of the total covariance matrix Σ:
    Σ v_i = λ_i v_i   ⇒   V^T Σ V = D, with D a diagonal matrix
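A minimal numpy sketch of PCA as described above, run on random data standing in for the spliced filterbank features; the helper name pca_transform is only for illustration.

```python
import numpy as np

def pca_transform(X, p):
    """Project the data onto the p largest eigenvectors of the total covariance matrix."""
    X_centered = X - X.mean(axis=0)
    Sigma = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
    V = eigvecs[:, np.argsort(eigvals)[::-1][:p]]     # the p largest eigenvectors
    return X_centered @ V, V

# Toy data standing in for the 162-dimensional spliced Mel-filterbank features.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 162)) @ rng.standard_normal((162, 162))
Y, V = pca_transform(X, 39)
print(np.round(np.cov(Y, rowvar=False)[:3, :3], 4))   # (nearly) diagonal covariance
```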

Principal Component Analysis -2
• (Total) covariance matrix calculated using the 25-hour training data from MATBN
  [Figure: 3-D surface plots of the total covariance matrix for the 162-dimensional spliced Mel-filterbank features and for the 39-dimensional PCA-transformed subspace]

Linear Discriminant Analysis -1
• Linear Discriminant Analysis (LDA)
  – Seeks a linear transformation matrix Θ that satisfies
    Θ = argmax_Θ trace( (Θ^T S_W Θ)^{-1} (Θ^T S_B Θ) )
  – Θ is formed by the largest eigenvectors of S_W^{-1} S_B
  – The LDA subspace is not orthogonal (but the PCA subspace is): Θ^T Θ ≠ I
  – Nevertheless, the LDA transform makes the transformed variables statistically uncorrelated.
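A minimal scipy sketch of this eigenproblem on toy scatter matrices: scipy.linalg.eigh(S_B, S_W) solves S_B θ = λ S_W θ and returns eigenvectors already scaled so that Θ^T S_W Θ = I, which matches the normalization introduced on the next slides.

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(Sw, Sb, p):
    """Largest generalized eigenvectors of Sb @ theta = lambda * Sw @ theta."""
    eigvals, eigvecs = eigh(Sb, Sw)                   # generalized symmetric eigenproblem
    order = np.argsort(eigvals)[::-1][:p]
    return eigvecs[:, order], eigvals[order]

# Toy within-class and between-class scatter matrices (not real LDA statistics).
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6)); Sw = A @ A.T + 6 * np.eye(6)
B = rng.standard_normal((6, 3)); Sb = B @ B.T
Theta, lam = lda_transform(Sw, Sb, 3)
print(np.round(Theta.T @ Sw @ Theta, 6))              # identity
print(np.round(Theta.T @ Sb @ Theta, 6))              # diagonal matrix of eigenvalues
```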

Linear Discriminant Analysis -2
• Take any two distinct eigenvalue/eigenvector pairs (λ_i, θ_i) and (λ_j, θ_j):
    S_B θ_i = λ_i S_W θ_i,   S_B θ_j = λ_j S_W θ_j
  – Pre-multiplying by θ_j^T and θ_i^T, respectively:
    θ_j^T S_B θ_i = λ_i θ_j^T S_W θ_i,   θ_i^T S_B θ_j = λ_j θ_i^T S_W θ_j
  – Since θ_j^T S_B θ_i = θ_i^T S_B θ_j and θ_j^T S_W θ_i = θ_i^T S_W θ_j, subtracting the two equations gives (λ_i − λ_j) θ_j^T S_W θ_i = 0; because λ_i ≠ λ_j,
    θ_j^T S_W θ_i = θ_i^T S_W θ_j = 0
  ∴ θ_i^T S_W θ_j = 0 for all i ≠ j
  Ref. (Krzanowski, 1988)

Linear Discriminant Analysis -3
• To remove the arbitrary scaling of θ_i, it is usual to adopt the normalization θ_i^T S_W θ_i = 1, so that
    θ_i^T S_B θ_i = λ_i θ_i^T S_W θ_i = λ_i
  ∴ Θ^T S_W Θ = I,   Θ^T S_B Θ = Λ = diag(λ_1, ..., λ_p)
• The total covariance matrix S_T = S_W + S_B is therefore also transformed into a diagonal matrix

Linear Discriminant Analysis -4
• (Total) covariance matrix calculated using the 25-hour training data from MATBN
  [Figure: 3-D surface plots of the total covariance matrix for the 162-dimensional spliced Mel-filterbank features and for the 39-dimensional LDA-transformed subspace]

Maximum Likelihood Linear Transform
• The Maximum Likelihood Linear Transform (MLLT) comes in two types:
  – Feature-based: (Gopinath, 1998)
  – Model-based: (Olsen, 2002), (Olsen, 2004)
• The common goal of the two types
  – Find a global linear transformation matrix that decorrelates the features.

Feature-based MLLT -1
• Feature-based Maximum Likelihood Linear Transform (F-MLLT)
  – Tries to alleviate four problems:
    (a) Data insufficiency, implying unreliable models
    (b) Large storage requirements
    (c) Large computational requirements
    (d) ML is not discriminating between classes
  – Solutions
    (a)-(c): sharing parameters across classes
    (d): appealing to LDA

Feature-based MLLT -2
• Maximum Likelihood Modeling
  – The likelihood of the training data {(x_i, l_i)}, with class labels l_i, is given by
    p(x_1^N, {μ_{l_i}}, {Σ_{l_i}}) = Π_{i=1}^{N} p(x_i, {μ_{l_i}}, {Σ_{l_i}})
      = Π_{i=1}^{N} (2π)^{-d/2} |Σ_{l_i}|^{-1/2} exp( −(1/2)(x_i − μ_{l_i})^T Σ_{l_i}^{-1} (x_i − μ_{l_i}) )
  – In terms of the per-class sample means m_j, sample covariances S_j, and sample counts n_j, the log-likelihood is (Appendix A)
    log p(x_1^N, {μ_j}, {Σ_j}) = −(Nd/2) log(2π) − Σ_{j=1}^{C} (n_j/2) [ (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j) + trace(Σ_j^{-1} S_j) + log|Σ_j| ]

Feature-based MLLT -3
• The idea of maximum likelihood estimation (MLE) is to choose the parameters {μ̂_j} and {Σ̂_j} so as to maximize log p(x_1^N, {μ_j}, {Σ_j})
  – ML estimators: {μ̂_j} and {Σ̂_j}
  – What is the difference between an "estimator" and an "estimate"? (An estimator is a rule computed from the data; an estimate is its value on a particular sample.)

Feature-based MLLT -4
• Multiclass ML Modeling
  – The training data of each class j is modeled with a Gaussian (μ̂_j, Σ̂_j)
  – The ML estimators: μ̂_j = m_j, Σ̂_j = S_j (Appendix B)
  – The log-ML value:
    log p*(x_1^N) = log p(x_1^N, {μ̂_j}, {Σ̂_j})
      = −(Nd/2)(log(2π) + 1) − Σ_{j=1}^{C} (n_j/2) log|S_j| = g(N, d) − Σ_{j=1}^{C} (n_j/2) log|S_j|
  – There is "no interaction" between the classes, and therefore unconstrained ML modeling is not "discriminating"

Feature-based MLLT -5
• Constrained ML: Diagonal Covariance
  – The ML estimators: μ̂_j = m_j, Σ̂_j = diag(S_j)
  – The log-ML value:
    log p*_diag(x_1^N) = g(N, d) − Σ_{j=1}^{C} (n_j/2) log|diag(S_j)|
  – If one linearly transforms the data of each class with Θ_j and models it with a diagonal Gaussian, the ML value becomes
    log p*_diag(x_1^N) = g(N, d) − Σ_{j=1}^{C} (n_j/2) log|diag(Θ_j^T S_j Θ_j)| + Σ_{j=1}^{C} n_j log|Θ_j|
    where the last term is contributed by the Jacobian of the transform (Appendix C)
  – How should Θ be chosen for each class?

Feature-based MLLT -6
• Multiclass ML Modeling: Some Issues
  – If the sample size for a class is not large enough, the ML parameter estimates may have large variance and hence be unreliable
  – The storage requirement for the model: O(Cd^2)
  – The computational requirement: O(Cd^2), since evaluating each of the C full-covariance Gaussians costs O(d^2)
  – The parameters of each class are obtained independently: the ML principle does not allow for discrimination between classes

Feature-based MLLT -7
• Multiclass ML Modeling: Some Issues (cont.)
  – Sharing parameters across classes reduces the number of parameters, the storage requirements, and the computational requirements
  – It is hard to justify that parameter sharing is more discriminating
  – We can appeal to Fisher's criterion of LDA and a result of Campbell to argue that constrained ML modeling is sometimes discriminating. But what is "discriminating"?

Feature-based MLLT -8
• Multiclass ML Modeling: Some Issues (cont.)
  – We can globally transform the data with a unimodular matrix Θ, det(Θ) = 1, and model the transformed data with diagonal Gaussians (there is a loss in likelihood, too)
  – Among all possible transformations Θ, we can choose the one that incurs the least loss in likelihood (in essence, we find a linearly transformed, shared feature space in which the diagonal-Gaussian assumption is almost valid)

Feature-based MLLT -9
• Other constrained ML estimators with parameter sharing
  – Equal covariance
  – Clustering
• Covariance diagonalization with cluster-specific transforms
  – Classes are grouped into clusters
  – Each cluster is modeled with a diagonal Gaussian in a transformed feature space
  – The ML estimators: μ̂_j = m_j,   Σ̂_j = Θ_{C_j}^{-T} diag(Θ_{C_j}^T S_j Θ_{C_j}) Θ_{C_j}^{-1}
  – The ML value:
    log p*_diag(x_1^N) = g(N, d) − Σ_{j=1}^{C} (n_j/2) log|diag(Θ_{C_j}^T S_j Θ_{C_j})| + Σ_{j=1}^{C} n_j log|Θ_{C_j}|

Feature-based MLLT -10
• One Cluster with Diagonal Covariance
  – When the number of clusters is one, there is a single global transformation, and all classes are modeled as diagonal Gaussians in this feature space
  – The ML estimators: μ̂_j = m_j,   Σ̂_j = Θ^{-T} diag(Θ^T S_j Θ) Θ^{-1}
  – The log-ML value:
    log p*_diag(x_1^N) = g(N, d) − Σ_{j=1}^{C} (n_j/2) log|diag(Θ^T S_j Θ)| + N log|Θ|
  – The optimal Θ can be obtained by the optimization
    Θ = argmin_Θ { Σ_{j=1}^{C} (n_j/2) log|diag(Θ^T S_j Θ)| − N log|Θ| }

Feature-based MLLT -11
• Optimization: the numerical approach
  – The objective function
    F(Θ) = Σ_{j=1}^{C} (n_j/2) log|diag(Θ^T S_j Θ)| − N log|Θ|
  – Setting the derivative of F with respect to Θ to zero gives the stationarity condition
    G(Θ) = N Θ^{-T} − Σ_{j=1}^{C} n_j S_j Θ ( diag(Θ^T S_j Θ) )^{-1} = 0
  – Directly optimizing the objective function is nontrivial: it requires numerical optimization techniques and a full matrix S_j to be stored for each class
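A minimal numpy/scipy sketch of this numerical approach on synthetic class covariance matrices: it hands the objective F(Θ) and its gradient to a generic quasi-Newton optimizer rather than using the specialized row-wise updates described later, so it is only an illustration of the optimization problem.

```python
import numpy as np
from scipy.optimize import minimize

def f_mllt_objective(Theta, S_list, n_list):
    """F(Theta) = sum_j (n_j/2) log|diag(Theta^T S_j Theta)| - N log|Theta|."""
    N = sum(n_list)
    val = -N * np.linalg.slogdet(Theta)[1]
    for S, n in zip(S_list, n_list):
        val += 0.5 * n * np.sum(np.log(np.diag(Theta.T @ S @ Theta)))
    return val

def f_mllt_gradient(Theta, S_list, n_list):
    """dF/dTheta = sum_j n_j S_j Theta diag(Theta^T S_j Theta)^{-1} - N Theta^{-T}."""
    N = sum(n_list)
    grad = -N * np.linalg.inv(Theta).T
    for S, n in zip(S_list, n_list):
        grad += n * S @ Theta @ np.diag(1.0 / np.diag(Theta.T @ S @ Theta))
    return grad

# Synthetic positive-definite class covariance matrices S_j and counts n_j.
rng = np.random.default_rng(0)
d, C = 6, 4
S_list = []
for _ in range(C):
    A = rng.standard_normal((d, d))
    S_list.append(A @ A.T + d * np.eye(d))
n_list = [100] * C

res = minimize(lambda t: f_mllt_objective(t.reshape(d, d), S_list, n_list),
               np.eye(d).ravel(),
               jac=lambda t: f_mllt_gradient(t.reshape(d, d), S_list, n_list).ravel(),
               method="L-BFGS-B")
Theta = res.x.reshape(d, d)
print(f_mllt_objective(np.eye(d), S_list, n_list), "->", f_mllt_objective(Theta, S_list, n_list))
```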

Outline
• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)
• Common Principal Component (CPC)

Model-based MLLT -1
• In model-based MLLT, instead of having a distinct covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:
  – A non-singular linear transformation matrix Θ shared over a set of components
  – A diagonal matrix Λ_j for each class j
• The precision matrices are constrained to be of the form
    P_j = Σ_j^{-1} = Θ^T Λ_j Θ = Σ_{k=1}^{d} λ_{kj} θ_k θ_k^T,   Λ_j = diag(λ_{1j}, ..., λ_{dj})
  Ref. (Olsen, 2004)

Model-based MLLT -2
• M-MLLT fits within the standard maximum-likelihood criterion used for training HMMs, with each state represented by a GMM:
    f(x | Θ) = Σ_{j=1}^{m} π_j N(x; μ_j, Σ_j),   where π_j is the j-th component weight
• Log-likelihood function:
    L(Θ : X) = (1/N) Σ_{i=1}^{N} log f(x_i | Θ)

Model-based MLLT -3
• The parameters of the M-MLLT model for each HMM state, Θ = {π_1^m, μ_1^m, Λ_1^m, Θ}, are estimated using a generalized expectation-maximization (EM) algorithm*
• An auxiliary "Q-function" is introduced; it satisfies the inequality
    L(Θ : X) − L(Θ̂ : X) ≥ Q(Θ, Θ̂) − Q(Θ̂, Θ̂)
  where Θ = {π_1^m, μ_1^m, Λ_1^m, Θ} are the new parameters and Θ̂ = {π̂_1^m, μ̂_1^m, Λ̂_1^m, Θ̂} the old ones

* A. P. Dempster et al., "Maximum likelihood from incomplete data via the EM algorithm," J. R. Statist. Soc. B, vol. 39, pp. 1-38, 1977.

Model-based MLLT -4
• The auxiliary Q-function is
    Q(Θ; Θ̂) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{m} γ_ij L_j(x_i)
             = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{m} (γ_ij / 2) ( 2 log π_j − d log(2π) + log|P_j| − trace( P_j (x_i − μ_j)(x_i − μ_j)^T ) )
  – L_j(x_i) = log( π_j N(x_i; μ_j, Σ_j) ) is the component log-likelihood of observation i
  – γ_ij is the a posteriori probability of Gaussian component j given observation i:
    γ_ij = exp( L̂_j(x_i) ) / Σ_{k=1}^{m} exp( L̂_k(x_i) )
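A minimal scipy sketch of the posterior computation γ_ij above, with toy GMM parameters; the component log-likelihoods L_j(x_i) are combined with log-sum-exp for numerical stability.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def responsibilities(X, weights, means, covs):
    """gamma[i, j] = p(component j | x_i) for a GMM, computed in the log domain."""
    log_lik = np.stack([np.log(w) + multivariate_normal(m, c).logpdf(X)   # L_j(x_i)
                        for w, m, c in zip(weights, means, covs)], axis=1)
    return np.exp(log_lik - logsumexp(log_lik, axis=1, keepdims=True))

# Toy 2-component GMM in 3 dimensions (illustrative parameters only).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
gamma = responsibilities(X, weights=[0.4, 0.6],
                         means=[np.zeros(3), np.ones(3)],
                         covs=[np.eye(3), 2.0 * np.eye(3)])
print(gamma, gamma.sum(axis=1))   # each row sums to 1
```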

Model-based MLLT -5
• Gales' approach for deriving Θ
  1. Estimate the component weights and means, which are independent of the other model parameters:
     π_j = (1/N) Σ_{i=1}^{N} γ_ij,   μ_j = Σ_{i=1}^{N} γ_ij x_i / Σ_{i=1}^{N} γ_ij
  2. Using the current estimate of the transform Θ, estimate the set of class-specific diagonal variances:
     λ_ij = θ_i S_j θ_i^T
     where θ_i is the i-th row vector of Θ and λ_ij is the i-th diagonal variance of component j
  Ref. (Gales, 1999)

Model-based MLLT -6
• Gales' approach for deriving Θ (cont.)
  3. Estimate the transform Θ, row by row, using the current set of diagonal covariances:
     θ̂_i = c_i G_i^{-1} √( N / (c_i G_i^{-1} c_i^T) ),   G_i = Σ_j (N_j / σ̂²_{diag,i}^{(j)}) S_j
     where c_i is the i-th row vector of the cofactors of the current Θ (Appendix D), N_j is the occupation count of component j, and σ̂²_{diag,i}^{(j)} is the current i-th diagonal variance of component j
  4. Go to step 2 until convergence, or until an appropriate stopping criterion is satisfied
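A minimal numpy sketch of steps 2-3, assuming per-class scatter matrices S_j and occupation counts N_j are given; the cofactor rows c_i are taken from det(Θ)·Θ^{-T}. This is an illustrative reading of the row-wise update, not a reference implementation.

```python
import numpy as np

def random_spd(d, rng):
    A = rng.standard_normal((d, d))
    return A @ A.T + d * np.eye(d)

def update_transform(S_list, counts, n_sweeps=10):
    """Row-by-row re-estimation of the shared transform Theta (Gales-style update)."""
    d = S_list[0].shape[0]
    N = float(sum(counts))
    Theta = np.eye(d)
    for _ in range(n_sweeps):
        # Step 2: class-specific diagonal variances lambda_ij = theta_i S_j theta_i^T.
        variances = np.array([[Theta[i] @ S @ Theta[i] for S in S_list] for i in range(d)])
        # Step 3: update each row theta_i in turn.
        for i in range(d):
            cof = np.linalg.det(Theta) * np.linalg.inv(Theta).T      # cofactor matrix of Theta
            G_i = sum((n / variances[i, j]) * S for j, (S, n) in enumerate(zip(S_list, counts)))
            ci_Ginv = cof[i] @ np.linalg.inv(G_i)
            Theta[i] = ci_Ginv * np.sqrt(N / (ci_Ginv @ cof[i]))
    return Theta

rng = np.random.default_rng(0)
S_list = [random_spd(4, rng) for _ in range(3)]            # toy class scatter matrices
Theta = update_transform(S_list, counts=[120, 80, 100])
print(np.round(Theta, 3))                                  # the shared decorrelating transform
```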

Semi-Tied Covariance Matrices -1
• Introduction to Semi-Tied Covariance Matrices
  – A natural extension of the state-specific rotation scheme
  – The transform is estimated in a maximum-likelihood (ML) fashion given the current model parameters
  – The optimization is performed using a simple iterative scheme, which is guaranteed to increase the likelihood of the training data
  – An alternative approach to solving the optimization problem of DHLDA

Semi-Tied Covariance Matrices -2
• State-Specific Rotation
  – A full covariance matrix is calculated for each state s in the system and decomposed into its eigenvectors and eigenvalues:
    Σ_full^(s) = U^(s) Λ^(s) U^(s)T
  – All data from that state is then decorrelated using the calculated eigenvectors:
    o^(s)(τ) = U^(s)T o(τ)
  – Multiple diagonal-covariance Gaussian components are then trained
  Ref. (Gales, 1999), (Kumar, 1998)

Semi-Tied Covariance Matrices -3
• Drawbacks of State-Specific Rotation
  – It does not fit within the standard ML estimation framework for training HMMs
  – The transforms are not related to the multiple-component models being used to model the data

Semi-Tied Covariance Matrices -4
• Semi-Tied Covariance Matrices
  – Instead of having a distinct covariance matrix for every component in the recognizer, each covariance matrix consists of two elements:
    Σ^(m) = H^(r) Σ_diag^(m) H^(r)T
    where Σ_diag^(m) is the component-specific diagonal covariance element and H^(r) is the semi-tied, class-dependent, non-diagonal matrix, which may be tied over a set of components

Semi-Tied Covariance Matrices -5
• Semi-Tied Covariance Matrices (cont.)
  – It is very complex to optimize these parameters directly, so an expectation-maximization approach is adopted
  – Parameters for each component m:
    M = {π^(m), μ^(m), Σ_diag^(m), Θ^(r)},   where Θ^(r) = H^(r)−1

Semi-Tied Covariance Matrices -6
• The auxiliary Q-function:
    Q(M, M̂) = Σ_{m∈M(r),τ} γ_m(τ) [ log( |Θ̂^(r)|² / |Σ̂_diag^(m)| ) − (o(τ) − μ̂^(m))^T Θ̂^(r)T Σ̂_diag^(m)−1 Θ̂^(r) (o(τ) − μ̂^(m)) ]
  – γ_m(τ) = p(q_m(τ) | M, O_T) is the posterior probability of component m at time τ
  – Note that |Σ̂_diag^(m)| / |Θ̂^(r)|² plays the role of |Σ^(m)| and Θ̂^(r)T Σ̂_diag^(m)−1 Θ̂^(r) the role of Σ^(m)−1

Semi-Tied Covariance Matrices -7
• If all the model parameters are to be simultaneously optimized, the Q-function may be rewritten as
    Q(M, M̂) = Σ_{m∈M(r),τ} γ_m(τ) log( |Θ̂^(r)|² / |diag(Θ̂^(r) W^(m) Θ̂^(r)T)| ) − dβ
  where
    W^(m) = Σ_τ γ_m(τ) (o(τ) − μ̂^(m))(o(τ) − μ̂^(m))^T / Σ_τ γ_m(τ)
    μ̂^(m) = Σ_τ γ_m(τ) o(τ) / Σ_τ γ_m(τ)
    β = Σ_{m∈M(r),τ} γ_m(τ)

Semi-Tied Covariance Matrices -8
• The ML estimate of the diagonal elements of the covariance matrix is given by
    Σ̂_diag^(m) = diag( Θ̂^(r) W^(m) Θ̂^(r)T )
• The re-estimation formulae for the component weights and transition probabilities are identical to the standard HMM case (Rabiner, 1989)
• Unfortunately, optimizing the new Q-function directly is nontrivial and rather complicated, so an alternative approach is proposed next

Semi-Tied Covariance Matrices -9
• Gales' approach for optimizing the Q-function
  1. Estimate the means, which are independent of the other model parameters
  2. Using the current estimate of the semi-tied transform Θ̂^(r), estimate the set of component-specific diagonal variances Σ̂_diag^(m) = diag(Θ̂^(r) W^(m) Θ̂^(r)T); denote this set by {Σ̂_diag^(r)} = {Σ̂_diag^(m), m ∈ M(r)}
  3. Estimate the transform Θ̂^(r) using the current set {Σ̂_diag^(r)}
  4. Go to step 2 until convergence, or until an appropriate stopping criterion is satisfied

Semi-Tied Covariance Matrices -10
• How to carry out step 3?
  – Optimizing the semi-tied transform requires an iterative estimation scheme even after all the other model parameters are fixed
  – Selecting a particular row θ̂_i^(r) of Θ̂^(r) and rewriting the former Q-function with the current set {Σ̂_diag^(r)} gives
    Q(M, M̂; {Σ̂_diag^(r)}) = Σ_{m∈M(r),τ} γ_m(τ) [ log( (θ̂_i^(r) c_i^T)² ) − log|Σ̂_diag^(m)| − Σ_j (θ̂_j^(r) o^(m)(τ))² / σ_diag,j^(m)2 ]
    where o^(m)(τ) = o(τ) − μ̂^(m), c_i is the i-th row vector of the cofactors of Θ̂^(r), and σ_diag,j^(m)2 is the j-th leading diagonal element of Σ̂_diag^(m)

Semi-Tied Covariance Matrices -11
• How to carry out step 3? (cont.)
  – It can be shown that the ML estimate of the i-th row of the semi-tied transform, θ̂_i^(r), is given by
    θ̂_i^(r) = c_i G^(ri)−1 √( β / (c_i G^(ri)−1 c_i^T) ),   G^(ri) = Σ_{m∈M(r)} (1 / σ̂_diag,i^(m)2) W^(m) Σ_τ γ_m(τ)

Semi-Tied Covariance Matrices -12
• It can be shown that
    Q(M, M̂) ≥ Q(M, M̂; {Σ̂_diag^(r)})
  with equality when the diagonal elements of the covariance matrix are given by Σ̂_diag^(m) = diag(Θ̂^(r) W^(m) Θ̂^(r)T)
• During recognition the log-likelihood is based on
    log( L(o(τ); μ^(m), Σ^(m), Θ^(r)) ) = log( N(o^(r)(τ); Θ^(r) μ^(m), Σ_diag^(m)) ) + log|Θ^(r)|
  where o^(r)(τ) = Θ^(r) o(τ) and log|Θ^(r)| = (1/2) log|Θ^(r)|²

Extended MLLT -1
• Extended MLLT (EMLLT) is very similar to M-MLLT; the only difference is the precision matrix model:
    P_j = Σ_j^{-1} = Θ^T Λ_j Θ = Σ_{k=1}^{D} λ_kj θ_k θ_k^T,   with the rank-one terms θ_k θ_k^T acting as a basis
  – Note that in EMLLT D ≥ d, while in M-MLLT D = d
  – The λ_kj are not required to be positive
  – The {λ_kj} have to be chosen such that P_j is positive definite
  – The authors provide two algorithms for iteratively updating the parameters
  Ref. (Olsen, 2002), (Olsen, 2004)

Extended MLLT -2
• The highlight of EMLLT:
  – More flexibility: the number of basis elements {θ_k θ_k^T} can be varied gradually from d (MLLT) to d(d+1)/2 (full covariance) by controlling the value of D

Outline
• Introduction to Feature Decorrelation (FD)
  – FD on Speech Recognition
• Feature-based Decorrelation (DCT, PCA, LDA, F-MLLT)
• Model-based Decorrelation (M-MLLT, EMLLT, Semi-tied covariance matrix, MLT)
• Common Principal Component (CPC)

Common Principal Components -1
• Why Common Principal Components (CPC)?
  – We often deal with the situation in which the same variables are measured on objects from different groups, and the covariance structure may vary from group to group
  – Sometimes, however, the covariance matrices of the different groups look somewhat similar, and it seems reasonable to assume that they share a common basic structure
• Goal of CPC
  – To find a rotation that diagonalizes the covariance matrices simultaneously
  Ref. (Flury, 1984), (Flury, 1986)

Common Principal Components -2
• Hypothesis of CPCs:
    H_C: B^T Σ_i B = Λ_i (diagonal),   i = 1, ..., k
  – The common principal components (CPCs): U_i = B^T x_i
  – Note that no canonical ordering of the columns of B need be given, since the rank order of the diagonal elements of Λ_i is not necessarily the same for all groups

Common Principal Components -3
• Assume the n_i S_i are independently distributed as W_p(n_i, Σ_i). The common likelihood function is
    L(Σ_1, ..., Σ_k) = C × Π_{i=1}^{k} exp( −(n_i/2) trace(Σ_i^{-1} S_i) ) |Σ_i|^{−n_i/2}
• Instead of maximizing the likelihood function, minimize
    g(Σ_1, ..., Σ_k) = −2 log L(Σ_1, ..., Σ_k) + 2 log C = Σ_{i=1}^{k} n_i ( log|Σ_i| + trace(Σ_i^{-1} S_i) )

Common Principal Components -4
• Assume H_C holds for some orthogonal matrix β, and Λ_i = diag(λ_i1, ..., λ_ip); then
    log|Σ_i| = Σ_{j=1}^{p} log λ_ij,   i = 1, ..., k
    trace(Σ_i^{-1} S_i) = trace(β Λ_i^{-1} β^T S_i) = trace(Λ_i^{-1} β^T S_i β) = Σ_{j=1}^{p} β_j^T S_i β_j / λ_ij
• Therefore
    g(Σ_1, ..., Σ_k) = g(β_1, ..., β_p, λ_11, ..., λ_1p, λ_21, ..., λ_kp)
      = Σ_{i=1}^{k} n_i Σ_{j=1}^{p} ( log λ_ij + β_j^T S_i β_j / λ_ij )

Common Principal Components -5
• The function g is to be minimized under the orthonormality restrictions
    β_h^T β_j = 0 if h ≠ j,   β_h^T β_h = 1
• Introducing Lagrange multipliers γ_h and γ_hj, we thus wish to minimize the function
    G(Σ_1, ..., Σ_k) = g(Σ_1, ..., Σ_k) − Σ_{h=1}^{p} γ_h (β_h^T β_h − 1) − 2 Σ_{h<j} γ_hj β_h^T β_j

Common Principal Components -6
• Minimization with respect to λ_ij:
    ∂G(Σ_1, ..., Σ_k)/∂λ_ij = ∂[ Σ_{i=1}^{k} n_i Σ_{j=1}^{p} ( log λ_ij + β_j^T S_i β_j / λ_ij ) ]/∂λ_ij = 0
    ⇒ λ_ij = β_j^T S_i β_j,   i = 1, ..., k; j = 1, ..., p   (key point: keep it in mind)
    ⇒ trace(Σ_i^{-1} S_i) = p,   i = 1, ..., k

Common Principal Components -7
• Minimization (cont.), now with respect to β_j:
    ∂G(Σ_1, ..., Σ_k)/∂β_j = ∂[ Σ_{i=1}^{k} n_i Σ_{j=1}^{p} ( log λ_ij + β_j^T S_i β_j / λ_ij ) − Σ_{h=1}^{p} γ_h (β_h^T β_h − 1) − 2 Σ_{h<j} γ_hj β_h^T β_j ]/∂β_j = 0
    ⇒ Σ_{i=1}^{k} n_i S_i β_j / λ_ij − γ_j β_j − Σ_{h≠j} γ_jh β_h = 0,   j = 1, ..., p
  – Pre-multiplying on the left by β_j^T gives γ_j = Σ_{i=1}^{k} n_i,   j = 1, ..., p

Common Principal Components -8
• Minimization (cont.): thus
    Σ_{i=1}^{k} n_i S_i β_j / λ_ij = ( Σ_{i=1}^{k} n_i ) β_j + Σ_{h≠j} γ_jh β_h,   j = 1, ..., p
  – Pre-multiplying on the left by β_l^T (l ≠ j) implies
    Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_ij = γ_jl,   j = 1, ..., p, l ≠ j
  – Note that β_j^T S_i β_l = β_l^T S_i β_j and γ_jl = γ_lj; the analogous equation for β_l, pre-multiplied by β_j^T, gives
    Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_il = γ_jl,   j = 1, ..., p, j ≠ l

Common Principal Components -9
• Minimization (cont.): subtracting the two equations gives
    Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_ij − Σ_{i=1}^{k} n_i β_l^T S_i β_j / λ_il = 0
    ⇒ β_l^T [ Σ_{i=1}^{k} n_i (λ_il − λ_ij)/(λ_il λ_ij) S_i ] β_j = 0,   l, j = 1, ..., p; l ≠ j   (the optimization objective)
  – These p(p − 1)/2 equations have to be solved under the orthonormality conditions β^T β = I_p and λ_ij = β_j^T S_i β_j

Common Principal Components -10
• Solving procedure of CPC: the FG algorithm
  – F-algorithm
  [Algorithm listing shown as a figure on the original slide]

Common Principal Components -11
  – G-algorithm
  [Algorithm listing shown as a figure on the original slide]

Common Principal Components -12
• Likelihood Ratio Test
  – The sample common principal components: U_i = β̂^T X_i, i = 1, ..., k
  – For the i-th group, the transformed covariance matrix is F_i = β̂^T S_i β̂, i = 1, ..., k
  – Since Λ̂_i = diag(F_i), i = 1, ..., k, the statistic can be written as a function of the F_i alone:
    χ² = Σ_{i=1}^{k} n_i log( |diag(F_i)| / |F_i| ) = Σ_{i=1}^{k} n_i log( Π_{j=1}^{p} f_jj^(i) / Π_{j=1}^{p} l_ij )
    where the f_jj^(i) are the diagonal elements of F_i and the l_ij its eigenvalues

Common Principal Components -13
• Likelihood Ratio Test (cont.)
  – The likelihood ratio criterion is a measure of the simultaneous diagonalizability of k positive definite symmetric matrices
  – The CPCs can be viewed as obtained by a simultaneous transformation, yielding variables that are as uncorrelated as possible
  – This can also be seen from the viewpoint of Hadamard's inequality: |F_i| ≤ |diag(F_i)|

Common Principal Components -14
• Actually, CPC can also be viewed through a measure of "deviation from diagonality":
    φ(F_i) = |diag(F_i)| / |F_i| ≥ 1
  – The CPC criterion can then be written as
    Φ(F_1, ..., F_k; n_1, ..., n_k) = Π_{i=1}^{k} ( φ(F_i) )^{n_i}
  – Letting F_i = B^T A_i B for a given orthogonal matrix B,
    Φ_0(A_1, ..., A_k; n_1, ..., n_k) = min_B Φ(B^T A_1 B, ..., B^T A_k B; n_1, ..., n_k)
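A minimal numpy sketch of φ and Φ above, using toy matrices that share a common rotation; when the common rotation is undone the criterion drops to about 1.

```python
import numpy as np

def deviation_from_diagonality(F):
    """phi(F) = |diag(F)| / |F| >= 1 for a positive definite F (Hadamard's inequality)."""
    return np.prod(np.diag(F)) / np.linalg.det(F)

def cpc_criterion(F_list, n_list):
    """Phi(F_1, ..., F_k; n_1, ..., n_k) = prod_i phi(F_i)^{n_i}, evaluated in the log domain."""
    return np.exp(sum(n * np.log(deviation_from_diagonality(F))
                      for F, n in zip(F_list, n_list)))

# Two covariance matrices sharing the same 2-D rotation B.
theta = 0.4
B = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A1 = B @ np.diag([4.0, 1.0]) @ B.T
A2 = B @ np.diag([1.0, 9.0]) @ B.T
print(cpc_criterion([A1, A2], [50, 50]))                      # > 1: jointly non-diagonal
print(cpc_criterion([B.T @ A1 @ B, B.T @ A2 @ B], [50, 50]))  # = 1: common rotation undone
```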

Comparison Between CPC and F-MLLT
• CPC tries to minimize
    Σ_{i=1}^{C} n_i ( log|Σ_i| + trace(Σ_i^{-1} S_i) )
  with Σ_i = B Λ_i B^T, i = 1, ..., C, in the original space (the estimates are KNOWN)
• F-MLLT tries to minimize
    Σ_{j=1}^{C} (N_j/2) log|diag(A^T S_j A)| − N log|A|
  (the estimates are UNKNOWN; only the transform A is estimated)

Appendix A -1
• Show that
    Π_{i=1}^{N} (2π)^{-d/2} |Σ_{l_i}|^{-1/2} exp( −(1/2)(x_i − μ_{l_i})^T Σ_{l_i}^{-1} (x_i − μ_{l_i}) )
      = (2π)^{-Nd/2} exp( −Σ_j (n_j/2) [ (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j) + trace(Σ_j^{-1} S_j) + log|Σ_j| ] )

Appendix A -2
• For each class C_j (with n_j samples, sample mean m_j, and sample covariance S_j), the cross terms vanish because Σ_{x_i∈C_j} (x_i − m_j) = 0, so
    Σ_{i=1}^{N} (x_i − μ_{l_i})^T Σ_{l_i}^{-1} (x_i − μ_{l_i})
      = Σ_{i=1}^{N} ( (x_i − m_{l_i}) + (m_{l_i} − μ_{l_i}) )^T Σ_{l_i}^{-1} ( (x_i − m_{l_i}) + (m_{l_i} − μ_{l_i}) )
      = Σ_{i=1}^{N} [ (x_i − m_{l_i})^T Σ_{l_i}^{-1} (x_i − m_{l_i}) + (m_{l_i} − μ_{l_i})^T Σ_{l_i}^{-1} (m_{l_i} − μ_{l_i})
                      + (x_i − m_{l_i})^T Σ_{l_i}^{-1} (m_{l_i} − μ_{l_i}) + (m_{l_i} − μ_{l_i})^T Σ_{l_i}^{-1} (x_i − m_{l_i}) ]
      = Σ_{i=1}^{N} trace( Σ_{l_i}^{-1} (x_i − m_{l_i})(x_i − m_{l_i})^T ) + Σ_j n_j (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j)
      = Σ_j n_j trace(Σ_j^{-1} S_j) + Σ_j n_j (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j)
      = Σ_j n_j ( (m_j − μ_j)^T Σ_j^{-1} (m_j − μ_j) + trace(Σ_j^{-1} S_j) )

Appendix B
• ML estimators for log p(x_1^N, {μ_j}, {Σ_j}):
    ∂ log p(x_1^N, {μ_j}, {Σ_j}) / ∂μ_j = n_j Σ_j^{-1} (m_j − μ_j) = 0   ⇒   μ̂_j = m_j
    ∂ log p(x_1^N, {μ_j}, {Σ_j}) / ∂Σ_j ∝ ∂( trace(Σ_j^{-1} S_j) + log|Σ_j| ) / ∂Σ_j = 0
      ⇒ −Σ_j^{-T} S_j^T Σ_j^{-T} + Σ_j^{-T} = 0   ⇒   Σ̂_j = S_j

Appendix C -1
• Change of Variable Theorem
  – Consider a one-to-one mapping g: ℝ^n → ℝ^n, Y = g(X)
  – Equal probability: the probability of falling in a region of the X space should be the same as the probability of falling in the corresponding region of the Y space
  – Suppose the region dx_1 dx_2 ... dx_n maps to the region dA in the Y space. Equating probabilities, we have
    f_Y(y_1, ..., y_n) dA = f_X(x_1, ..., x_n) dx_1 ... dx_n
  – The region dA is a hyper-parallelepiped described by the vectors
    ( ∂y_1/∂x_1 dx_1, ..., ∂y_n/∂x_1 dx_1 ), ( ∂y_1/∂x_2 dx_2, ..., ∂y_n/∂x_2 dx_2 ), ..., ( ∂y_1/∂x_n dx_n, ..., ∂y_n/∂x_n dx_n )

Appendix C -2
• Change of Variable Theorem (cont.)
  – The volume of the hyper-parallelepiped dA is the absolute determinant of the matrix of partial derivatives,
    dA = |det( ∂y_i/∂x_j )| dx_1 ... dx_n = |J| dx_1 ... dx_n,   where J is the Jacobian of the function g
  – So
    f_Y(y_1, ..., y_n) |J| dx_1 ... dx_n = f_X(x_1, ..., x_n) dx_1 ... dx_n   ⇒   f_Y(y_1, ..., y_n) = |J|^{-1} f_X(x_1, ..., x_n)
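A worked special case (not on the original slides): for an invertible linear transform, the Jacobian is constant, which is exactly where the log|Θ| terms in the transformed ML values of the F-MLLT slides come from.

```latex
% Linear change of variables y = \Theta^{T} x with \Theta invertible:
% J = \partial y / \partial x = \Theta^{T}, hence |J| = |\det \Theta|, and
f_Y(\mathbf{y}) = |\det \Theta|^{-1} f_X(\Theta^{-T}\mathbf{y}),
\qquad
f_X(\mathbf{x}) = |\det \Theta|\, f_Y(\Theta^{T}\mathbf{x}).
% Taking logs of the second identity over N samples adds N \log|\det \Theta|
% to the log-likelihood, i.e. the N \log|\Theta| term in the F-MLLT objective.
```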

Appendix D -1
• If A is a square matrix, the minor M_ij of entry a_ij is defined as the determinant of the submatrix that remains after the i-th row and the j-th column are deleted from A. The number c_ij = (−1)^{i+j} M_ij is called the cofactor of a_ij.
• Example: for the matrix
    B = [ b_11 b_12 b_13 ; b_21 b_22 b_23 ; b_31 b_32 b_33 ]
  we have c_33 = (−1)^{3+3} M_33 and c_23 = (−1)^{2+3} M_23, with, e.g.,
    M_23 = det[ b_11 b_12 ; b_31 b_32 ] = b_11 b_32 − b_12 b_31

Appendix D -2
• Given an n-by-n matrix A = (a_ij), the determinant can be written as the sum of its cofactors multiplied by the entries that generated them:
  – Cofactor expansion along the j-th column: det(A) = a_1j c_1j + a_2j c_2j + ... + a_nj c_nj = A_(j)^T c_(j)
  – Cofactor expansion along the i-th row: det(A) = a_i1 c_i1 + a_i2 c_i2 + ... + a_in c_in = A_(i) c_(i)^T
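A minimal numpy sketch of the cofactor and the row expansion above; indices here are 0-based, which leaves the sign (−1)^{i+j} unchanged.

```python
import numpy as np

def cofactor(A, i, j):
    """c_ij = (-1)^(i+j) * M_ij, where M_ij is det(A with row i and column j deleted)."""
    minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
    return (-1) ** (i + j) * np.linalg.det(minor)

def det_by_row_expansion(A, i=0):
    """det(A) = sum_j a_ij * c_ij (cofactor expansion along row i)."""
    return sum(A[i, j] * cofactor(A, i, j) for j in range(A.shape[1]))

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
print(det_by_row_expansion(A), np.linalg.det(A))   # both yield the same determinant
```

The row of cofactors c_i used in the Gales-style transform updates earlier in the slides can be read off the cofactor matrix det(A)·A^{-T}.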
