Technical University Carolo-Wilhelmina, Braunschweig
Institute for Communications Technology
Schleinitzstraße 22, 38106 Braunschweig

Master Thesis

Title of the Thesis

Michele Sanna

July 2008

Supervisor: Patrick Bauer

Abstract

Text of the Abstract

Contents

Abstract
1 Introduction
  1.1 Underchapter
    1.1.1 Title
2 The Artificial Bandwidth Extension Algorithm
  2.1 Features Extraction and Observation Probabilities
    2.1.1 Features Extraction with composite vector
    2.1.2 Features Extraction with AURORA
    2.1.3 Linear Discriminant Analysis
    2.1.4 Gaussian Mixture Model
  2.2 Hidden Markov Model and A Posteriori Probabilities
  2.3 Estimation of the Missing Band Envelope
  2.4 Further Subsection
    2.4.1 Framing 160 40 40
  2.5 Insertion of Graphics
  2.6 Formulas
  2.7 Citations
3 Conclusions
Abbreviations
Bibliography

Chapter 1

Introduction

Text of the Introduction

1.1 Underchapter

Text in the subsection.

1.1.1 Title

Text ... it is possible to nest even deeper (the exact commands for this are described in the tutorial I gave you).

Figure 1.1: Subjective speech quality MOS, depending on the lower and upper cut-off frequencies fc,l and fc,u

Figure 1.2: Percentage of Syllable Articulation (SA) depending on the lower/upper cut-off frequency fc

Chapter 2

The Artificial Bandwidth Extension Algorithm

The ABWE algorithm ([1], [2]) has the task of extending the bandwidth of narrow-band, telephone-like speech signals (with an upper cut-off frequency of 3.4 kHz and a sampling frequency of 8 kHz) to a signal with the characteristics of wide-band speech, which has a cut-off frequency of 7 kHz and a sampling frequency of 16 kHz. The goal is therefore to extract a set of features from the narrow-band signal in order to make a probabilistic evaluation of the most probable extension of the signal itself.

The processing is driven by a pre-trained set which contains several parameters calculated in a preliminary phase, the so-called training phase. The set consists of: a codebook of cepstral-coefficient vectors representing the spectral envelopes, whose number has to be representative in a statistical sense but not so large as to cause excessive complexity; a matrix H used for the LDA (Linear Discriminant Analysis) transformation, which reduces the number of features and is applied statically during the tests; a measure of statistical separability obtained through the LDA; a vector of state probabilities; a matrix of transition probabilities; a matrix containing the parameters used in the EM algorithm to calculate the Gaussian mixtures; and, finally, the mean frame energy of the training speech.

Each of the codebook envelope entries forms a state of a Markov chain. In this sense we can say that the speech signal is driven by a hidden Markov model (HMM, [4]), since we force the speech signal to follow a time-discrete, stochastic, state-based model (which is possible due to the framewise processing). The algorithm is thus based on framewise processing, because our analysis uses instruments such as LPC analysis, which presupposes a stationarity that is found in speech only in the short term. Certain expedients must be taken into account when choosing the windowing criteria, for example the time length and the shape of the window, as we will discuss in Section 2.4.1.

Figure 2.1 shows the block diagram of the algorithm. We can divide the algorithm into two parts. The first one, explained in Section 2.4, is the core of the digital processing of the speech. It first performs an LPC analysis of the speech (Linear Predictive Coding, see Section ), with a filter that represents the expected inverse response of the speaker's vocal tract (the vocal tract filter), taking into account both the known narrow-band (base-band) response and an estimated response in the 4-8 kHz band (which comes from the other part of the algorithm and will from now on be called the "missing band"). It then applies to the residual signal at the output a "spectral folding" operation, which is a mirroring of the spectrum around 4 kHz in order to fill the whole 0-8 kHz band. This residual then excites the vocal tract filter again to obtain a wide-band signal whose base-band is exactly the same as the original and whose upper band is, given the trained information of the algorithm, the most probable and realistic extension.

Figure 2.1: Block diagram of the BWE algorithm

The second part, explained in Sections 2.1 and 2.2, is the Markov model that supports the LPC analysis/synthesis and the vocal tract filters. A set of features x' is calculated from the narrow-band signal in order to characterize the current speech frame, and is then transformed into an equivalent set x that minimizes the complexity. The observation probability P(x|Si), calculated through a Gaussian Mixture Model, drives the course of the Markov chain, thus determining the choice of a correct set of LPC coefficients from a codebook of trained shapes, in order to finally determine the coefficients ã that are applied to the vocal tract filter. While the upper path only needs to work with an 8 kHz sampled signal, the lower one relies on an upsampled signal, still limited to 4 kHz, obtained by interpolating and filtering the narrow-band signal. This means that the two paths need to be aligned in time, in order to re-establish the frame correspondence when evaluating the missing-band envelope. The delay applied to the upper path is equal to the delay of the Finite Impulse Response (FIR) low-pass filter, which is half of the filter order N_FIR:

N_d = \frac{N_{FIR}}{2}.                                            (2.1)
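The "spectral folding" step described above can be made concrete with a brief sketch. This is only an illustrative example, not the implementation used in this work: mirroring the base-band spectrum around 4 kHz can be obtained by inserting a zero between consecutive residual samples, which doubles the sampling rate to 16 kHz and creates an image of the 0-4 kHz band in the 4-8 kHz band.

import numpy as np

def spectral_folding(residual_nb):
    """Mirror the 0-4 kHz residual spectrum into 4-8 kHz by zero insertion.

    residual_nb is the narrow-band LPC residual sampled at 8 kHz; the returned
    signal is sampled at 16 kHz and its spectrum contains the base-band plus
    its mirror image around 4 kHz (illustrative sketch only).
    """
    folded = np.zeros(2 * len(residual_nb))
    folded[::2] = residual_nb   # zero insertion doubles the sampling rate
    return folded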

2.1 Features Extraction and Observation Probabilities

The feature extraction is applied both during the training procedure and in the technical system itself, as support for the statistical evaluation of the observation probabilities. It draws on well-known techniques from the field of speech-pattern recognition, because our purpose is basically the same: characterizing speech, which can have a wide variety of realizations but keeps certain statistical characteristics on which we can rely, through a set of parameters, in order to classify and discriminate it. This concept of discrimination is what makes the automatic processing efficient and qualitatively satisfactory.

Different measures have been proposed and established for this kind of application. Two of them, proposed by P. Jax in [2] and [4], are of particular interest.

2.1.1 Features Extraction with composite vector

The first one consists of a composite vector x' formed by a 10-dimensional vector of autocorrelation coefficients x_acf plus five further scalar coefficients: the zero-crossing rate x_zcr, the gradient index x_gi, the local kurtosis x_lk, the spectral centroid x_sc and the normed relative frame energy x_nrp:

\mathbf{x}' = \left[ \mathbf{x}_{acf}^{T},\; x_{zcr},\; x_{gi},\; x_{lk},\; x_{sc},\; x_{nrp} \right]^{T}.                         (2.2)

The autocorrelation vector x_acf is composed of 10 autocorrelation function (ACF) values x_acf(λ) of the signal s_nb(n'):

x_{acf}(\lambda) = \varphi_{ss}(\lambda) = \frac{\sum_{\nu'=\lambda}^{N-1} s_{nb}(\nu') \, s_{nb}(\nu'-\lambda)}{\sum_{\nu'=0}^{N-1} s_{nb}^{2}(\nu')},                         (2.3)

where N is the number of samples in a frame and λ is the correlation distance. A maximum distance of 10 samples is used, thus λ = 1 ... 10, giving rise to our 10 ACFs.

The zero-crossing rate x_zcr is a measure of voiced content. It is defined as the number of zero crossings inside the frame, normalized with respect to the maximum possible number of crossings:

x_{zcr} = \frac{1}{N-1} \sum_{\nu'=1}^{N-1} \frac{1}{2} \left| \mathrm{sign}\left(s_{nb}(\nu'-1)\right) - \mathrm{sign}\left(s_{nb}(\nu')\right) \right|.                         (2.4)

The gradient index x_gi is also a measure of voiced content, discriminating between voiced and unvoiced sounds:

x_{gi} = \frac{1}{10} \cdot \frac{\sum_{\nu'=1}^{N-1} \Delta(\nu') \cdot \left| s_{nb}(\nu') - s_{nb}(\nu'-1) \right|}{\sqrt{\sum_{\nu'=0}^{N-1} s_{nb}^{2}(\nu')}},                         (2.5)

where \Delta(\nu') and the sign of the gradient \delta(\nu') are, respectively:

\Delta(\nu') = \frac{1}{2} \left| \delta(\nu') - \delta(\nu'-1) \right| \in \{0, 1\},

\delta(\nu') = \mathrm{sign}\left( s_{nb}(\nu') - s_{nb}(\nu'-1) \right) \in \{-1, 1\}.

The possibility of making this discrimination comes from the fact that, for unvoiced sounds, the signal presents more frequent changes in the sign of the slope \delta(\nu') = \mathrm{sign}(s_{nb}(\nu') - s_{nb}(\nu'-1)), due to its noisy characteristic. This is reflected in the magnitude of x_gi, which is higher in this case than in the case of voiced sounds.

The local kurtosis x_lk is the 4th-order standardized moment of the speech signal:

x_{lk} = \log_{10} \frac{\frac{1}{N} \sum_{\nu'=0}^{N-1} s_{nb}^{4}(\nu')}{\left( \frac{1}{N} \sum_{\nu'=0}^{N-1} s_{nb}^{2}(\nu') \right)^{2}}.                         (2.6)

It is in general a measure of the "peakedness" of a random variable and assumes a value around \log_{10}(3) for variables with a Gaussian distribution. In the case of a speech signal, its value remains under \log_{10}(3) when dealing with voiced sounds in the short term, while it presents peaks in the long term for plosive sounds and strong vowels.

The spectral centroid x_sc is the weighted mean of the amplitude spectral values from 0 to 4 kHz:

x_{sc} = \frac{\sum_{\kappa=0}^{N/2} \kappa \cdot |S_{nb}(\kappa)|}{\left( \frac{N}{2} + 1 \right) \cdot \sum_{\kappa=0}^{N/2} |S_{nb}(\kappa)|},                         (2.7)

where |S_{nb}(\kappa)| is the amplitude of the κ-th value of the speech's DFT:

S_{nb}(\kappa) = \sum_{\nu'=0}^{N-1} s_{nb}(\nu') \, e^{-j 2\pi \nu' \kappa / N} \qquad (0 \le \kappa < N).

Voiced sounds have more energy content in the lower frequencies (around 1.5 kHz), placing the centroid around 0.35. Unvoiced sounds have more high-frequency content and therefore present a higher value of x_sc.

The last scalar feature is the normalized relative frame energy:

x_{nrp} = \frac{\log_{10} E(m) - \log_{10} E_{min}(m)}{\log_{10} \bar{E}(m) - \log_{10} E_{min}(m)},                         (2.8)

where for the current frame m we take into account the absolute energy E(m),

E(m) = \sum_{\nu'=0}^{N-1} s_{nb}^{2}(\nu'),

the moving average of the energy itself, \bar{E}(m) (where the smoothing factor α is equal to 0.96),

\bar{E}(m) = \alpha \cdot \bar{E}(m-1) + (1-\alpha) \cdot E(m),

and a noise-flooring criterion

E_{min}(m) = \min_{\mu = 0 \dots N_{MIN}} \bar{E}(m - \mu).

Subtracting the noise floor from both the frame energy and the average energy ensures that variations of low-power sounds are captured relative to this minimum level, while the smoothed calculation of the average avoids long-term variations, which differ from speaker to speaker and could affect the coherency of the measure in an undesired way. The number of frames considered for the noise floor is a parameter chosen on the basis of the time variation of the noise level. A value of 625 ms, which corresponds to 31 frames, is a reasonable choice according to [7].
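For illustration, the frame-wise features above can be sketched in a few lines of Python. This is only a didactic sketch under the definitions given in this section (one frame s_nb of N samples, DFT bins 0 ... N/2); it is not the implementation used for the experiments, and the energy-based feature x_nrp is omitted because it needs state across frames.

import numpy as np

def frame_features(s_nb):
    """Illustrative computation of x_acf, x_zcr, x_gi, x_lk and x_sc for one frame."""
    s_nb = np.asarray(s_nb, dtype=float)
    N = len(s_nb)

    # First 10 normalized autocorrelation coefficients (Eq. 2.3)
    energy = np.sum(s_nb ** 2)
    x_acf = np.array([np.sum(s_nb[lam:] * s_nb[:-lam]) for lam in range(1, 11)]) / energy

    # Zero-crossing rate (Eq. 2.4)
    sgn = np.sign(s_nb)
    x_zcr = np.sum(np.abs(sgn[:-1] - sgn[1:])) / (2.0 * (N - 1))

    # Gradient index (Eq. 2.5); Delta(nu') needs delta(nu'-1), so the sum starts at nu' = 2
    grad = np.diff(s_nb)                     # s(nu') - s(nu'-1), nu' = 1 ... N-1
    delta = np.sign(grad)
    Delta = 0.5 * np.abs(np.diff(delta))     # Delta(nu') in {0, 1}, nu' = 2 ... N-1
    x_gi = np.sum(Delta * np.abs(grad[1:])) / (10.0 * np.sqrt(energy))

    # Local kurtosis (Eq. 2.6)
    x_lk = np.log10(np.mean(s_nb ** 4) / np.mean(s_nb ** 2) ** 2)

    # Spectral centroid (Eq. 2.7), DFT magnitudes for bins 0 ... N/2
    S = np.abs(np.fft.rfft(s_nb))
    kappa = np.arange(len(S))
    x_sc = np.sum(kappa * S) / ((N / 2 + 1) * np.sum(S))

    return x_acf, x_zcr, x_gi, x_lk, x_sc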

Figure 2.2: Upper path of the BWE algorithm flow graphic

2.1.2 Features Extraction with AURORA

Guglielmo gathers gravel from the rocks, hurling it over the rocks amid a thousand gurgles. I have the bait in my pocket and I am off to fish, but the fish will not take the bait, the water is far too cold. I had better stop, I will not catch a single fishbone! I put the bait back in my pocket and come home from fishing.

The upper path of the algorithm, whose critical part follows immediately after the feature extraction and reduction and will be explained in the following, is shown in detail in Fig. 2.2. The feature extraction must be related to a state-oriented model. This means that the information contained in the feature vector drives a quantitative analysis of the relationship between the characteristics of the current frame, summarized in the feature vector and whose variety comes from an infinite number of possible realizations, and those of the states contained in the codebook, whose meaning will be explained later in detail. The complexity problem is then solved with the Linear Discriminant Analysis, which reduces the feature dimension, and with a Gaussian Mixture Model (GMM), which reduces the complexity of the problem by approximating the solution in the form of a state-specific observation probability P(x|Si).

Text Text Text ...

2.1.3 Linear Discriminant Analysis

The goal of the Linear Discriminant Analysis is twofold. The first is the reduction of the feature dimension. It has been found ([1], [4]) that the composite super-vector x' of dimension b contains some statistical redundancy that can be eliminated by transforming it into a vector x with dimension β < b that still keeps the same information about the features and their separability. The separability is exactly where the second goal lies: the transformation can be made in such a way that the elements of the resulting vector x are sufficiently uncorrelated that the covariance matrix assumes a diagonal shape. This is a further gain in complexity reduction with regard to the later Gaussian Mixture Model, because it permits an easier implementation and computation. The transformation is performed with the matrix H as a matrix product:

x = H^{-1} x'.                                            (2.9)

H has dimension b × β and is calculated during the training process (having previously defined b and β) and stored in the training set, together with the information about the codebook and the measure of separability obtainable with this transformation, namely after removing the redundant information. In [4] it has been found that a good compromise is a value of β equal to 5. It brings the expected benefits with an almost untouched discriminative performance of the features.
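As a purely illustrative sketch: the dimensions b = 15 (10 ACF values plus 5 scalar features) and β = 5 follow from the text, but the matrix contents are random placeholders, and since a b-by-β matrix H has no ordinary inverse, the sketch reads the formula x = H^{-1} x' as a Moore-Penrose pseudo-inverse. This is one possible reading, not the procedure of the thesis.

import numpy as np

b, beta = 15, 5

# H would normally be loaded from the training set; random here for illustration only.
rng = np.random.default_rng(0)
H = rng.standard_normal((b, beta))

def reduce_features(x_prime, H):
    """Project the composite feature vector x' (length b) down to beta dimensions."""
    return np.linalg.pinv(H) @ x_prime

x_prime = rng.standard_normal(b)   # stand-in for the composite vector of Eq. (2.2)
x = reduce_features(x_prime, H)    # beta-dimensional reduced feature vector
print(x.shape)                     # (5,)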

2.1.4 Gaussian Mixture Model

The problem of establishing a relationship between the feature vector and the states of the HMM is posed in the form of the observation probability P(x|Si). It represents a probability density function of the feature vector x given a certain state Si of the chain, i.e. the state that follows the observation x. The observation probability must be calculated for every state of the codebook S1 ... SNs, and it is associated with the initial state probability P(Si(m))|m=1 and with the transition probabilities P(Si(m)|Sj(m-1)) determined in the training phase. We are dealing with a problem of considerable size, having to determine Ns β-dimensional PDFs. A well-known strategy to approximate and determine a multidimensional distribution is that of the Gaussian mixtures [8]. Every PDF is approximated with the weighted sum of G β-dimensional Gaussian PDFs:

p(x|S_i) \approx \sum_{g=1}^{G} \rho_{i,g} \, \mathcal{N}\left( x; \mu_{i,g}, \Sigma_{i,g} \right),                         (2.10)

whose set of parameters Θ_{i,g} = {ρ_{i,g}, μ_{i,g}, Σ_{i,g}} is calculated during the training phase through the Expectation Maximization (EM) algorithm [] for each of the G components and stored in a matrix in the training set together with the other parameters. Every component of the mixture is then:

\mathcal{N}\left( x; \mu_{i,g}, \Sigma_{i,g} \right) = \frac{1}{(2\pi)^{\beta/2} \sqrt{\det \Sigma_{i,g}}} \exp\left( -\frac{1}{2} (x - \mu_{i,g})^{T} \Sigma_{i,g}^{-1} (x - \mu_{i,g}) \right),                         (2.11)

where ρ_{i,g} is a normalization factor such that \sum_{g=1}^{G} \rho_{i,g} = 1, μ_{i,g} has the same dimension β as x and represents the mean around which every Gaussian component is centered, and Σ_{i,g} is the covariance matrix. Σ_{i,g} can in general be a full matrix but, as we already said, the LDA matrix has been built in such a way that it separates the features into the β dimensions with enough decorrelation that Σ_{i,g} is diagonal. Even if it is not exactly diagonal, the largest components all lie on the diagonal and we do not commit appreciable errors by saving it as a proper diagonal matrix, with a considerable gain regarding complexity and the space required for storage. Therefore, for the Gaussian mixture a (G + 2·G·β) × Ns matrix is stored, where column i refers to one of the Ns states and contains {Θ_{i,g}; g = 1 ... G}. G has been chosen equal to 8 according to [1] and [2].
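The diagonal-covariance evaluation of Eqs. (2.10)-(2.11) can be sketched as follows. This is an illustrative example only; the parameter arrays (weights rho, means mu, diagonal variances var) are assumed to come from an EM training step that is not shown here.

import numpy as np

def observation_probability(x, rho, mu, var):
    """Evaluate p(x|S_i) for every state i with a diagonal-covariance GMM.

    x   : feature vector of dimension beta
    rho : mixture weights, shape (Ns, G), rows summing to 1
    mu  : component means, shape (Ns, G, beta)
    var : diagonal covariance entries, shape (Ns, G, beta)
    Returns an array of length Ns with p(x|S_i) as in Eq. (2.10).
    """
    beta = x.shape[0]
    diff = x - mu                                                      # (Ns, G, beta)
    # log of Eq. (2.11) for a diagonal covariance matrix
    log_norm = -0.5 * (beta * np.log(2.0 * np.pi) + np.sum(np.log(var), axis=-1))
    log_gauss = log_norm - 0.5 * np.sum(diff ** 2 / var, axis=-1)      # (Ns, G)
    return np.sum(rho * np.exp(log_gauss), axis=-1)                    # (Ns,)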

2.2 Hidden Markov Model and A Posteriori Probabilities

The core of the algorithm is based on a statistical model often used in the field of speech pattern recognition: the hidden Markov model [4]. It supposes that the speech generation process follows a statistical criterion equivalent to a first-order Markov chain, which means that every transition depends only on the state from which the transition starts and on the event x at the same instant, not on the previous history of the system (formula 2.12). The chain is characterized by a certain number of states Ns, which correspond in our case to the different envelopes that can be used to extend the 4-8 kHz bandwidth, and by the transition and state probabilities, i.e. the probability of changing from state j to state i as we pass from frame m-1 to frame m, and the absolute probability of being in state i, respectively.

Figure 2.3: Example of Markov chain with a sub-set of transitions from states Sj1 and Sj2 to states Si1 and Si2

Given that the temporal transition from frame m-1 to frame m in the speech corresponds to the transition event in our Markov chain, and that, although these last two probabilities have a general meaning, they are not spontaneous but are excited by the event of a certain feature x, we can further characterize the process by the a posteriori probabilities, in dependence of the feature x(m). Figure 2.3 shows an example of possible transitions from an exemplary subset of states Sj to another possible subset Si, where the transitions are provoked by x. The observation probabilities are not a real property of the model, because they are a consequence of its course. In fact they are determined by parameters calculated during the training phase and have the meaning of a correlation between the states and the speech features that correspond to them. Nevertheless, they are of particular importance for determining the a posteriori probabilities. Namely, the a posteriori probability has the property:

P(S(m) \,|\, x(m), S(m-1), S(m-2), \dots, S(1)) = P(S(m) \,|\, x(m), S(m-1)).                         (2.12)

It highlights the dependence on the former frame only (a system without memory), a dependence that will become clearer when we give the mathematical formulation (formula 2.14). The rule can be applied recursively to all frames before the current one, so that we find another formulation, consistent with 2.12, which relates the state in the m-th frame to the sequence of x vectors from the first frame to the current one, X(m):

P(S_i(m) \,|\, x(m), S_j(m-1)) = P(S_i(m) \,|\, x(m), x(m-1), S_j(m-2)) = P(S_i(m) \,|\, x(m), x(m-1), \dots, x(1), S_j(1)) = P(S_i(m) \,|\, X(m), S_j(1)).                         (2.13)

Of course this notation does not negate the memoryless property typical of first-order chains, as formulated at the beginning. The initial state Sj(1) remains to be computed in a different way, as we will show shortly, but it has to be noticed, and it has been shown during the experiments, that the effect of being in a certain state (not only at the first instant but at any moment of the computation) influences only a certain number of subsequent frames (an effect that we call the "shifting effect"), and if the manifestation of certain features has to force the chain to migrate to a certain state, this will happen in any case after some tens of transitions. The a posteriori probability can be calculated from the a posteriori probability of the former frame P(Sj(m-1)|X(m-1)), the transition probability P(Si(m)|Sj(m-1)) and the observation probability P(x(m)|Si(m)) found at the step immediately before. Instead of the forward and backward recursion suggested by Jax in [2], we used Bayes' theorem as in [1] to find this simple formula:

P(S_i(m) \,|\, X(m)) = C \cdot P(x(m) \,|\, S_i(m)) \cdot \sum_{j=1}^{N_s} P(S_i(m) \,|\, S_j(m-1)) \cdot P(S_j(m-1) \,|\, X(m-1)).                         (2.14)

For the first frame it takes the form of the Chapman-Kolmogorov equation [1], so we initialize the chain with:

P(S_i(1) \,|\, X(1)) = C \cdot p(x(1) \,|\, S_i(1)).                         (2.15)

The factor C in both formulas is due to the normalization, i.e. it ensures that, at every instant,

\sum_{i=1}^{N_s} P(S_i(m) \,|\, X(m)) = 1.                         (2.16)

It is calculated from the Ns probabilities, which are first computed without it and then updated; C is then given by:

C = \frac{1}{\sum_{i=1}^{N_s} P(S_i(m) \,|\, X(m))}.                         (2.17)
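A minimal sketch of the recursion (2.14)-(2.17) is given below, assuming that the transition matrix and the per-state observation probabilities (e.g. from the GMM sketch above) are already available; it is meant only to illustrate the order of the operations, not the implementation used here.

import numpy as np

def posterior_update(p_obs, A, posterior_prev=None):
    """One step of Eqs. (2.14)-(2.17).

    p_obs          : P(x(m)|S_i(m)) for all i, shape (Ns,)
    A              : transition matrix, A[i, j] = P(S_i(m)|S_j(m-1)), shape (Ns, Ns)
    posterior_prev : P(S_j(m-1)|X(m-1)), shape (Ns,); None for the first frame (Eq. 2.15)
    Returns the normalized a posteriori probabilities P(S_i(m)|X(m)).
    """
    if posterior_prev is None:
        unnormalized = p_obs                            # Eq. (2.15) without C
    else:
        unnormalized = p_obs * (A @ posterior_prev)     # Eq. (2.14) without C
    return unnormalized / np.sum(unnormalized)          # normalization, Eqs. (2.16)-(2.17)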

2.3 Estimation of the Missing Band Envelope

Based on the statistical considerations made during the last phase, we are now able to estimate the envelope of the 4-8 kHz band in the form of a vector of cepstral coefficients y_mb, from a set of Ns vectors calculated during the training phase and assigned one to each state of the HMM. The rule of correspondence can be written as:

\hat{y}_{mb,i} = E\{ y_{mb} \,|\, S_i \}, \quad i = 1 \dots N_s.                         (2.18)

Different estimation rules have been proposed in [1], [2], [9], [10]. Three basic rules can be defined, based on the a posteriori and observation probabilities. The simplest one is called maximum likelihood (ML) and does not even take the a posteriori probabilities into account, using only the observation probabilities. It practically ignores the HMM, thus having great simplicity but unfortunately poor performance. The ML rule therefore chooses the state for which the probability that the given x(m) has forced the transition into that state S_{i_ML} is the highest:

i_{ML} = \arg\max_{i = 1 \dots N_s} p(x(m) \,|\, S_i(m)).                         (2.19)

The right cepstral vector \tilde{y}_{mb,ML} is the one associated with the state S_{i_ML}:

\tilde{y}_{mb,ML}(m) = E\{ y_{mb} \,|\, S_{i_{ML}}(m) \} = \hat{y}_{mb,i_{ML}}(m).                         (2.20)

Another simple rule is called Maximum A Posteriori (MAP), which takes the state for which the a posteriori probability is the highest:

i_{MAP} = \arg\max_{i = 1 \dots N_s} P(S_i(m) \,|\, X(m)),                         (2.21)

and thus, given the state, the corresponding spectral envelope is assigned:

\tilde{y}_{mb,MAP}(m) = E\{ y_{mb} \,|\, S_{i_{MAP}}(m) \} = \hat{y}_{mb,i_{MAP}}(m).                         (2.22)

These last two strategies have the disadvantage of a simplistic evaluation, restricting the range of possible envelopes to exactly those of the codebook. Another possibility is the Minimum Mean Square Error (MMSE) rule, also called soft classification, where the chosen envelope is a mean of the codebook entries, weighted by the probabilities of each entry:

\tilde{y}_{mb,MMSE}(m) = \sum_{i=1}^{N_s} \hat{y}_{mb,i} \, P(S_i(m) \,|\, X(m)).                         (2.23)

The name derives from the fact that the result comes from the minimization of the squared distance between the real and the estimated envelope:

E\{ \| y(m) - \tilde{y}(m) \|^{2} \,|\, X(m) \} \rightarrow \min.                         (2.24)

See [10] for details about the derivation. With this strategy
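To contrast the three estimation rules, the following purely illustrative sketch puts them side by side; the codebook matrix Y_hat (one trained envelope per state) and the probability vectors are assumed inputs for the example, not data from this work.

import numpy as np

def estimate_envelope(Y_hat, p_obs, posterior, rule="MMSE"):
    """Select or blend the missing-band envelope according to Eqs. (2.19)-(2.23).

    Y_hat     : codebook of trained envelopes, shape (Ns, n_cepstral)
    p_obs     : observation probabilities p(x(m)|S_i(m)), shape (Ns,)
    posterior : a posteriori probabilities P(S_i(m)|X(m)), shape (Ns,)
    """
    if rule == "ML":                       # Eqs. (2.19)-(2.20): ignore the HMM
        return Y_hat[np.argmax(p_obs)]
    if rule == "MAP":                      # Eqs. (2.21)-(2.22): hard choice by posterior
        return Y_hat[np.argmax(posterior)]
    # MMSE / soft classification, Eq. (2.23): posterior-weighted mean of all entries
    return posterior @ Y_hat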

2.4 Further Subsection

2.4.1 Framing 160 40 40
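Purely as an illustration of framewise processing (interpreting "160 40 40" as a 160-sample frame advance with 40 samples of look-back and look-ahead is an assumption made only for this sketch, not a value confirmed by the text), frame extraction could look like this:

import numpy as np

def frames(signal, frame_len=160, look_back=40, look_ahead=40):
    """Split a signal into overlapping analysis windows (illustrative only).

    Each window covers look_back + frame_len + look_ahead samples and the
    window start advances by frame_len samples per frame.
    """
    win_len = look_back + frame_len + look_ahead
    hop = frame_len
    out = []
    for start in range(0, len(signal) - win_len + 1, hop):
        out.append(signal[start:start + win_len])
    return np.array(out)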

2.5 Insertion of Graphics

This is a reference to Figure 2.4, which LaTeX automatically places at the best possible position (see also Figures 1.1 and 1.2).

2.6 Formulas

An example of a formula with an integral and a summation sign:

\sum_{3}^{\infty} f(x) \approx \int_{3}^{\infty} f(x) \, dx \neq \frac{1}{2}.                         (2.25)

And this is how one refers to the formula: 2.25.

Figure 2.4: Example of typical /s/ and /z/ male sound spectra (magnitude in dB over frequency in kHz)

Figure 2.5: Transition probability matrix P(Si(m)|Sj(m-1)) of the 25-entry codebook

2.7 Citations

This is how one cites a specific entry of the bibliography: [1].


Chapter 3

Conclusions

Text Text Text ...

Abbreviations

ABWE   Artificial Bandwidth Extension
ACF    Auto-correlation Function (the ACCs are the corresponding coefficients)
DFT    Discrete Fourier Transform
EM     Expectation Maximization
GMM    Gaussian Mixture Model
HMM    Hidden Markov Model
LDA    Linear Discriminant Analysis
LPC    Linear Predictive Coding
MOS    Mean Opinion Score
PDF    Probability Density Function
xxx    Name

Bibliography

[1] P. Bauer, Artificial Bandwidth Extension with Multilingual Training Process, Diploma Thesis, Technical University Braunschweig, Institute for Communications Technology, July 2007.

[2] P. Jax, Enhancement of Bandlimited Speech Signals: Algorithms and Theoretical Bounds, Ph.D. thesis, vol. 15 of P. Vary (ed.), Aachener Beiträge zu digitalen Nachrichtensystemen, Nov. 2002.

[3] T. Fingscheidt, "Speech Communication" Lecture Script, Technical University Braunschweig, Institute for Communications Technology, Winter 2007-08.

[4] P. Jax, P. Vary, "On artificial bandwidth extension of telephone speech", Signal Processing, vol. 83, no. 8, pp. 1707-1719, 2003.

[5] P. Jax, P. Vary, "Feature selection for improved bandwidth extension of speech signals", Proc. of ICASSP '04, vol. 1, 17-21 May 2004, pp. I-697-700.

[6] T. Fingscheidt, "(Artificial) Wideband Speech Communication in the Car", in Proc. of IMA 2006, Braunschweig, Germany, Oct. 2006.

[7] R. Martin, "Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics", IEEE Transactions on Speech and Audio Processing, vol. 9, no. 5, pp. 504-512, July 2001.

[8] D. A. Reynolds, R. C. Rose, "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models", IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, Jan. 1995.

[9] P. Jax, P. Vary, "Wideband Extension of Telephone Speech Using a Hidden Markov Model", IEEE Workshop on Speech Coding, Delavan, WI, USA, Sept. 2000, pp. 133-135.

[10] P. Jax, P. Vary, "Artificial bandwidth extension of speech signals using MMSE estimation based on a hidden Markov model", Proceedings of the ICASSP, vol. I, Hong Kong, April 2003, pp. 680-683.
