Introduction to Signal Processing and Some Applications in Audio Analysis
Md. Khademul Islam Molla JSPS Research Fellow Hirose-Minematsu Laboratory Email:
[email protected]
Outline of the presentation
• Basics of discrete time signals
• Frequency domain signal analysis
• Basic transformations: Fourier transform (FT), short-time FT (STFT), wavelet transform (WT), empirical mode decomposition (EMD), and Hilbert spectrum (HS)
• Notable comparisons among FT, WT, and HS
• Some applications in audio processing
• Some open problems to work on
Discrete time signal
• Continuous signals cannot be processed digitally; we need to convert them to discrete time signals with a suitable sampling frequency and quantization
• The sampling theorem: Fs ≥ 2fc, where fc is the highest expected signal frequency and Fs is the required sampling frequency
• Quantization is also required alongside sampling
Discrete time signal
[Figures: signal sampling and signal quantization]
Discrete time signal
[Figure: effects of undersampling]
Discrete time signal
[Figure: effect of using the required sampling frequency]
Discrete time signal
• Telephone speech is usually sampled at 8 kHz to capture up to 4 kHz of bandwidth
• 16 kHz is generally regarded as sufficient for speech recognition and synthesis
• The audio standard is a sample rate of 44.1 kHz (CD) or 48 kHz (Digital Audio Tape) to represent frequencies up to 20 kHz
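As a minimal sketch of the sampling theorem in action (assuming NumPy; the helper name sample_tone is illustrative), the following shows that a tone sampled above the Nyquist rate is preserved, while undersampling aliases it to a lower frequency:

```python
import numpy as np

def sample_tone(f_tone, fs, duration=1.0):
    """Sample a pure tone of frequency f_tone (Hz) at rate fs (Hz)."""
    n = np.arange(int(fs * duration))
    return np.cos(2 * np.pi * f_tone * n / fs)

# A 6 kHz tone sampled at 16 kHz (Fs >= 2*fc): spectral peak stays at 6 kHz.
# With a 1 s record, each FFT bin is exactly 1 Hz wide.
x_ok = sample_tone(6000, fs=16000)
peak_ok = np.argmax(np.abs(np.fft.rfft(x_ok)))

# The same tone sampled at 8 kHz (Fs < 2*fc) aliases to |8000 - 6000| = 2 kHz.
x_alias = sample_tone(6000, fs=8000)
peak_alias = np.argmax(np.abs(np.fft.rfft(x_alias)))

print(peak_ok, peak_alias)  # -> 6000 2000
```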
Discrete time signal
[Figure: three sinusoids illustrating the three parameters of a cosine]
Amplitude: f(x) = 5 cos(x)
Phase: f(x) = 5 cos(x + 3.14)
Frequency: f(x) = 5 cos(3x + 3.14)
Time-domain signals
• The independent variable is time; the dependent variable is the amplitude
• Most of the information is hidden in the frequency content
[Figure: time-domain plots of 2 Hz, 10 Hz, and 20 Hz sinusoids, and of their sum 2 Hz + 10 Hz + 20 Hz]
Signal Transformation
• Why: to obtain further information from the signal that is not readily available in the raw signal
• Raw signal: normally the time-domain signal
• Processed signal: a signal that has been "transformed" by one of the available mathematical transformations
Fourier Transformation The most popular transformation between time and frequency domains
Frequency domain analysis
Why frequency information is needed
• To see information that is not obvious in the time domain
Types of frequency transformation
• Fourier transform, Hilbert transform, short-time Fourier transform, the Radon transform, the wavelet transform, …
Frequency domain analysis
• Powerful and complementary to time-domain analysis methods
• The frequency-domain representation shows the signal's energy and phase with respect to frequency
• A fast and efficient way to view the signal's information
General transform as a problem-solving tool
[Block diagram: s(t) in the time domain t, mapped by analysis to S(f) in the frequency domain f, and back by synthesis]
Analysis: S(f) = F[s(t)]
s(t), S(f): transform pair
Frequency domain analysis
Complex numbers: 4.2 + 3.7i, 9.4447 − 6.7i, −5.2 (= −5.2 + 0i)
General form: Z = a + bi, Re(Z) = a, Im(Z) = b
Amplitude: A = |Z| = √(a² + b²)
Phase: φ = ∠Z = tan⁻¹(b/a)
Frequency domain analysis: polar coordinates
Z = a + bi; amplitude A = √(a² + b²); phase φ = tan⁻¹(b/a)
[Figure: Z in the complex plane, with real part a on the ℜ axis, imaginary part b on the ℑ axis, magnitude A, and angle φ]
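A quick numerical check of the amplitude and phase formulas, sketched with Python's built-in complex type:

```python
import cmath

z = 4.2 + 3.7j                      # Z = a + bi
amplitude = abs(z)                  # A = sqrt(a^2 + b^2)
phase = cmath.phase(z)              # phi = atan2(b, a), in radians

# Polar-form round trip: Z = A * exp(i*phi)
assert abs(cmath.rect(amplitude, phase) - z) < 1e-12
print(amplitude, phase)
```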
Frequency domain analysis
• Frequency spectrum: the frequency (spectral) components of a signal
• Shows what frequencies exist in the signal
Fourier Transform (FT)
• One way to find the frequency content
• Tells how much of each frequency exists in a signal
Spectrum of speech signal
Fourier Transform
• The Fourier transform decomposes a function into a spectrum of its frequency components
• The inverse transform synthesizes a function from its spectrum of frequency components
• The discrete Fourier transform pair is defined as:
X_k = Σ_{n=0}^{N−1} x_n e^{−j2πkn/N},   x_n = (1/N) Σ_{k=0}^{N−1} X_k e^{j2πkn/N}
where X_k represents the k-th frequency component and x_n represents the n-th sample in the time domain
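A minimal numerical check of this transform pair (assuming NumPy; np.fft uses the same convention, with the 1/N factor on the inverse):

```python
import numpy as np

N = 8
x = np.random.randn(N)              # x_n: time-domain samples
n = np.arange(N)

# Forward DFT: X_k = sum_n x_n * exp(-j*2*pi*k*n/N)
X = np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N)) for k in range(N)])

# Inverse DFT: x_n = (1/N) * sum_k X_k * exp(+j*2*pi*k*n/N)
x_rec = np.array([np.sum(X * np.exp(2j * np.pi * n * m / N)) for m in range(N)]) / N

assert np.allclose(X, np.fft.fft(x))   # matches the library forward transform
assert np.allclose(x_rec.real, x)      # perfect reconstruction
```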
Fourier Transform
[Figure: time-domain signals (amplitude only, and with phase) and their Fourier spectra, with components at 5, 10, and 15 Hz]
Fourier transform of a 1D signal
[Figure: a 1D signal and its Fourier spectrum]
Fourier Transform
• Fourier analysis uses sinusoids as the basis functions of the decomposition
• The Fourier transform gives the frequency information but smears time
• Samples of a function give the temporal information but smear frequency
FS synthesis
Square wave reconstruction from spectral terms:
sw_N(t) = Σ_{k=1}^{N} b_k · sin(kt),   N = 5, 7, 9, 11, 13
[Figure: the square wave sw(t) and its partial-sum reconstructions over 0 ≤ t ≤ 10]
Convergence may be slow (~1/k); ideally an infinite number of terms is needed. In practice, the series is truncated when the remainder falls below the computer tolerance (⇒ error). BUT … Gibbs' phenomenon.
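The truncated series is easy to reproduce; for an odd unit square wave the coefficients are b_k = 4/(πk) for odd k (a standard result, assumed here as the waveform in the figure). The overshoot near each jump does not shrink as terms are added, which is Gibbs' phenomenon:

```python
import numpy as np

def square_partial_sum(t, n_terms):
    """Partial Fourier series of a unit square wave: sum of b_k*sin(kt), odd k."""
    sw = np.zeros_like(t)
    for k in range(1, n_terms + 1, 2):          # odd harmonics only
        sw += (4.0 / (np.pi * k)) * np.sin(k * t)
    return sw

t = np.linspace(0, 10, 5000)
for n in (5, 7, 9, 11, 13):
    overshoot = square_partial_sum(t, n).max()
    print(n, round(overshoot, 4))   # stays near 1.09: the ~9% Gibbs overshoot
```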
Stationarity of the signal
• Stationary signal: frequency content unchanged over the entire time; all frequency components exist at all times
• Non-stationary signal: frequency changes in time; one example is the "chirp signal"
Stationarity of the signal
[Figure: stationary signal 2 Hz + 10 Hz + 20 Hz and its magnitude spectrum; all components occur at all times]
[Figure: non-stationary signal (0.0-0.4 s: 2 Hz, 0.4-0.7 s: 10 Hz, 0.7-1.0 s: 20 Hz) and its magnitude spectrum; the components do not appear at all times, yet the spectrum looks almost the same]
Chirp signals
[Figure: a chirp sweeping 2 Hz to 20 Hz and a chirp sweeping 20 Hz to 2 Hz; different in the time domain, yet the same in the frequency domain]
At what time do the frequency components occur? The FT cannot tell!
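This limitation is easy to demonstrate: a down-chirp built by time-reversing an up-chirp has an identical magnitude spectrum (a property of time reversal), even though the two signals are clearly different. A sketch assuming SciPy:

```python
import numpy as np
from scipy.signal import chirp

t = np.linspace(0, 1, 1000, endpoint=False)
up = chirp(t, f0=2, f1=20, t1=1.0)   # frequency sweeps 2 Hz -> 20 Hz
down = up[::-1]                      # time reversal: 20 Hz -> 2 Hz

# Different in time, identical in magnitude spectrum:
print(np.allclose(np.abs(np.fft.rfft(up)), np.abs(np.fft.rfft(down))))  # True
print(np.allclose(up, down))                                            # False
```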
Limitations of Fourier Transform
• The FT only gives which frequency components exist in the signal
• Time and frequency information cannot be seen at the same time
• A time-frequency representation of the signal is needed
• Most signals are non-stationary
ONE SOLUTION: SHORT-TIME FOURIER TRANSFORM (STFT)
Short Time Fourier Transform
• Dennis Gabor (1946) used the STFT to analyze only a small section of the signal at a time, a technique called windowing the signal
• Each segment of the signal is assumed stationary
• A 3D transform, a function of both time and frequency:
STFT_x^(ω)(t′, f) = ∫_t [x(t) · ω*(t − t′)] · e^{−j2πft} dt
where ω(t) is the window function
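In practice the windowed transform is computed with a library routine; a sketch assuming SciPy's scipy.signal.stft (Hann window by default, with the window length nperseg setting the time/frequency resolution trade-off):

```python
import numpy as np
from scipy.signal import stft

fs = 1000                                   # sampling rate, Hz
t = np.arange(fs) / fs
# A non-stationary test signal: 50 Hz for the first half, 200 Hz after.
x = np.where(t < 0.5, np.sin(2*np.pi*50*t), np.sin(2*np.pi*200*t))

f, frames, Zxx = stft(x, fs=fs, nperseg=128, noverlap=96)

# |Zxx| is the spectrogram: rows = frequencies f, columns = window centers.
# The dominant bin moves from near 50 Hz (early frames) to near 200 Hz (late).
print(f[np.abs(Zxx[:, 2]).argmax()], f[np.abs(Zxx[:, -3]).argmax()])
```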
Short time Fourier Transform
[Figure: the signal is windowed into segments and an FT is taken of each segment]
[Figure: a speech signal and its STFT]
Drawbacks of STFT
• Unchanged window: a dilemma of resolution
• Narrow window → poor frequency resolution
• Wide window → poor time resolution
• Heisenberg uncertainty principle: one cannot know exactly what frequency exists at what time instant
[Figure: the same signal analyzed via a narrow window and via a wide window]
Wavelet Transform
To overcome some limitations of the Fourier transform
[Figure: discrete wavelet decomposition tree; the signal S is split into approximation A1 and detail D1, A1 into A2 and D2, and A2 into A3 and D3]
Wavelet Overview
• Wavelet: a small wave
• Wavelet transforms:
  - provide a way of analyzing waveforms bounded in both frequency and duration
  - allow signals to be stored more efficiently than with the Fourier transform
  - can better approximate real-world signals
  - are well suited to approximating data with sharp discontinuities
"The Forest & the Trees"
• Notice gross features with a large "window"
• Notice small features with a small "window"
Multi-resolution analysis
• Wavelet transform: an alternative to the short-time Fourier transform, designed to overcome the resolution problem
• Similar to the STFT: the signal is multiplied with a function
• Analyzes the signal at different frequencies with different resolutions
• Good time resolution and poor frequency resolution at high frequencies
• Good frequency resolution and poor time resolution at low frequencies
• More suitable for short-duration high-frequency components and longer-duration low-frequency components
Advantages of WT over STFT
• The width of the window is changed as the transform is computed for each spectral component
• Different resolutions are provided at different frequencies
Principles of WT
• Split the signal into a bunch of signals representing the same signal but corresponding to different frequency bands
• Provides only which frequency bands exist at which time intervals
Principles of WT
CWT_x^ψ(τ, s) = Ψ_x^ψ(τ, s) = (1/√s) ∫ x(t) · ψ*((t − τ)/s) dt
• τ: translation (the location of the window); s: scale; ψ: the mother wavelet
• Wavelet: a small wave; the window function is of finite length
• Mother wavelet: a prototype for generating the other window functions; all the windows used are its dilated or compressed and shifted versions
Wavelet bases
Wavelet basis functions (shown in the time and frequency domains):
• Morlet (ω0 = frequency): ψ(η) = π^(−1/4) e^(jω0η) e^(−η²/2)
• Paul (m = order): ψ(η) = (2^m i^m m! / √(π(2m)!)) · (1 − iη)^(−(m+1))
• DOG (m = derivative): Derivative Of a Gaussian, ψ(η) = ((−1)^(m+1) / √(Γ(m + 1/2))) · d^m/dη^m e^(−η²/2); m = 2 is the Marr or "Mexican hat" wavelet
[Figure: each basis plotted in the time domain and in the frequency domain]
Scale of wavelet
• s > 1: dilates the signal; s < 1: compresses the signal
• Low frequency → high scale → non-detailed global view of the signal → spans the entire signal
• High frequency → low scale → detailed view that lasts only a short time
• Only a limited interval of scales is necessary
Computation of WT
CWT_x^ψ(τ, s) = Ψ_x^ψ(τ, s) = (1/√s) ∫ x(t) · ψ*((t − τ)/s) dt
Step 1: Place the wavelet at the beginning of the signal and set s = 1 (the most compressed wavelet)
Step 2: Multiply the wavelet function at scale 1 by the signal, integrate over all time, then multiply by 1/√s
Step 3: Shift the wavelet to t = τ and get the transform value at t = τ and s = 1
Step 4: Repeat the procedure until the wavelet reaches the end of the signal
Step 5: Increase the scale s by a sufficiently small value and repeat the above procedure for all s
Step 6: Each computation for a given s fills a single row of the time-scale plane
Step 7: The CWT is obtained once all scales s have been calculated
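These steps map directly to code; a minimal (and deliberately naive) sketch assuming NumPy, with the DOG m = 2 "Mexican hat" wavelet written inline:

```python
import numpy as np

def mexican_hat(t):
    """DOG m=2 ('Mexican hat') mother wavelet, up to normalization."""
    return (1 - t**2) * np.exp(-t**2 / 2)

def cwt(x, dt, scales):
    """CWT(tau, s) = (1/sqrt(s)) * sum_t x(t) * psi((t - tau)/s) * dt."""
    n = len(x)
    t = np.arange(n) * dt
    out = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        for j, tau in enumerate(t):              # Steps 2-4: slide the wavelet
            psi = mexican_hat((t - tau) / s)
            out[i, j] = np.sum(x * psi) * dt / np.sqrt(s)
    return out                                   # Steps 5-7: one row per scale

x = np.sin(2 * np.pi * 5 * np.arange(0, 1, 0.01))   # 5 Hz tone, dt = 0.01 s
W = cwt(x, 0.01, scales=np.arange(0.01, 0.2, 0.01))
```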
Time & frequency resolution
[Figure: time-frequency tiling of the WT; better time resolution and poor frequency resolution at high frequencies, better frequency resolution and poor time resolution at low frequencies]
• Each box represents an equal portion of the time-frequency plane
• In the STFT, the resolution is selected once for the entire analysis
Comparison of transformations
Comparison of transformations
Discretization of WT
• It is necessary to sample the time-frequency (scale) plane
• At high scale s (lower frequency f), the sampling rate N can be decreased:
N2 = (s1/s2) · N1 = (f1/f2) · N1
• The scale parameter s is normally discretized on a logarithmic grid; the most common base is 2:

s: 2, 4, 8, …
N: 32, 16, 8, …
Effective and Fast DWT
• The discretized wavelet transform is not a true discrete transform
• The discrete wavelet transform (DWT):
  - provides sufficient information for both analysis and synthesis
  - reduces the computation time significantly
  - is easier to implement
  - analyzes the signal at different frequency bands with different resolutions
  - decomposes the signal into a coarse approximation (A) and detail information (D)
[Figure: decomposition tree S → A1 + D1, A1 → A2 + D2, A2 → A3 + D3]
Decomposition with DWT
• Halves the time resolution: only half the number of samples remain at each level
• Doubles the frequency resolution: the spanned frequency band is halved at each level
[Figure: three-level filter bank. X[n], 512 samples spanning 0-1000 Hz → Filter 1 → A1 (256 samples) and D1: 500-1000 Hz (256 samples); A1 → Filter 2 → A2 (128 samples) and D2: 250-500 Hz (128 samples); A2 → Filter 3 → A3: 0-125 Hz (64 samples) and D3: 125-250 Hz (64 samples)]
Decomposition of a non-stationary signal
[Figure: db4, level-6 DWT of a signal with 0.0-0.4 s: 20 Hz, 0.4-0.7 s: 10 Hz, 0.7-1.0 s: 2 Hz, shown from low (fL) to high (fH) frequency bands]
[Figure: the same decomposition for the reversed signal, 0.0-0.4 s: 2 Hz, 0.4-0.7 s: 10 Hz, 0.7-1.0 s: 20 Hz]
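The same multi-level decomposition can be reproduced in Python with the PyWavelets package (an assumption; the slides use MATLAB's wavelet toolbox). wavedec returns the coefficient list [A_L, D_L, …, D1], and waverec performs the synthesis described on the next slide:

```python
import numpy as np
import pywt  # PyWavelets, assumed installed: pip install PyWavelets

fs = 1000
t = np.arange(fs) / fs
# Non-stationary test signal: 20 Hz, then 10 Hz, then 2 Hz.
x = np.concatenate([np.sin(2*np.pi*20*t[:400]),
                    np.sin(2*np.pi*10*t[:300]),
                    np.sin(2*np.pi*2*t[:300])])

# 6-level db4 decomposition: coeffs = [A6, D6, D5, D4, D3, D2, D1]
coeffs = pywt.wavedec(x, 'db4', level=6)
for name, c in zip(['A6', 'D6', 'D5', 'D4', 'D3', 'D2', 'D1'], coeffs):
    print(name, len(c))               # each level roughly halves the samples

x_rec = pywt.waverec(coeffs, 'db4')   # synthesis: upsampling + filtering
print(np.allclose(x, x_rec[:len(x)])) # near-perfect reconstruction
```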
Reconstruction from WT
• What: how can the components be assembled back into the original signal without loss of information?
• A process performed after decomposition or analysis; also called synthesis
• How: reconstruct the signal from the wavelet coefficients
• Where wavelet analysis involves filtering and downsampling, wavelet reconstruction consists of upsampling and filtering
• Upsampling: lengthening a signal component by inserting zeros between samples
• MATLAB commands: idwt and waverec
Wavelet Applications Typical Application Fields Astronomy, acoustics, nuclear engineering, subband coding, signal and image processing, neurophysiology, music, magnetic resonance imaging, speech discrimination, optics, fractals, turbulence, earthquake-prediction, radar, human vision, and pure mathematics applications
Sample Applications De-noising signals Breakdown detecting Detecting self-similarity Compressing images Identifying pure tone
Signal De-noising
• The highest frequencies appear at the start of the original signal
• Successive approximations appear less and less noisy, but also lose progressively more high-frequency information
• In A5, about the first 20% of the signal is truncated
Breakdown Detection
• The discontinuous signal consists of a slow sine wave abruptly followed by a medium sine wave
• The 1st- and 2nd-level details (D1 and D2) show the discontinuity most clearly
• Things to detect: the site of the change; the type of change (a rupture of the signal, or an abrupt change in its first or second derivative); the amplitude of the change
[Figure: discontinuity points revealed in D1 and D2]
Detecting Self-similarity
• Purpose: show how wavelet analysis can detect a self-similar (fractal) signal; the signal here is the Koch curve, a synthetic signal built recursively
• Analysis: if a signal is similar to itself at different scales, then the "resemblance index" (the wavelet coefficients) will also be similar at different scales; in the coefficient plot, which shows scale on the vertical axis, this self-similarity generates a characteristic pattern
Image Compression
• Fingerprints: the FBI maintains a large database of fingerprints, about 30 million sets of them; the cost of storing all this data runs to hundreds of millions of dollars
• Results: values under the threshold are forced to zero, achieving about 42% zeros while retaining almost all (99.96%) of the energy of the original image
• By turning to wavelets, the FBI has achieved a 15:1 compression ratio, better than the more traditional JPEG compression
Identifying Pure Tone
• Purpose: resolve a signal into constituent sinusoids of different frequencies; the signal is a sum of three pure sine waves
• Analysis: D1 contains signal components whose period is between 1 and 2; zooming in on detail D1 reveals that each "belly" is composed of 10 oscillations
• D3 and D4 contain the medium sine frequencies; there is a breakdown between approximations A3 and A4 → the medium frequency has been subtracted
• Approximations A1 to A3 can be used to estimate the medium sine; zooming in on A1 reveals a period of around 20
Empirical Mode Decomposition: Principle
Objective — from one observation of x(t), get an AM-FM type representation:
x(t) = Σ_{k=1}^{K} a_k(t) Ψ_k(t)
with a_k(·) amplitude modulating functions and Ψ_k(·) oscillating functions.
Idea — "signal = fast oscillations superimposed on slow oscillations".
Operating mode — ("EMD", Huang et al., '98): (1) identify, locally in time, the fastest oscillation; (2) subtract it from the original signal; (3) iterate upon the residual.
Empirical Mode Decomposition: Principle
[Figure: a low-frequency sawtooth plus a linear FM chirp summing to a composite signal]
Empirical Mode Decomposition: Algorithmic definition
[Animation: the sifting process. Envelopes are interpolated through the extrema, their mean is subtracted, and the step repeats until the first intrinsic mode function emerges; sifting the residual then yields the second and third intrinsic mode functions, leaving a final residual.]
[Figure: the signal with its 1st, 2nd, and 3rd intrinsic mode functions and the residual]
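A bare-bones sketch of the sifting loop shown in the animation (assuming NumPy and SciPy for the spline envelopes; the names sift_one_imf and emd are illustrative, and real implementations add boundary mirroring and Cauchy-type stopping criteria):

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_one_imf(x, n_sifts=10):
    """Extract one IMF: repeatedly subtract the mean of the envelopes."""
    h = x.copy()
    t = np.arange(len(x))
    for _ in range(n_sifts):
        maxima = argrelextrema(h, np.greater)[0]
        minima = argrelextrema(h, np.less)[0]
        if len(maxima) < 4 or len(minima) < 4:       # too few extrema to sift
            break
        upper = CubicSpline(maxima, h[maxima])(t)    # upper envelope
        lower = CubicSpline(minima, h[minima])(t)    # lower envelope
        h = h - (upper + lower) / 2                  # remove the local mean
    return h

def emd(x, max_imfs=5):
    """Decompose x into IMFs plus a residual: x = sum(imfs) + residual."""
    imfs, residual = [], x.copy()
    for _ in range(max_imfs):
        imf = sift_one_imf(residual)
        imfs.append(imf)
        residual = residual - imf                    # iterate on the residual
        if len(argrelextrema(residual, np.greater)[0]) < 4:
            break                                    # residual is a trend
    return imfs, residual
```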
Empirical Mode Decomposition: Intrinsic Mode Functions
— Quasi-monochromatic harmonic oscillations: #{zero crossings} = #{extrema} ± 1, with envelopes symmetric about the y = 0 axis
— An IMF ≠ a Fourier mode; in nonlinear situations, one IMF = several Fourier modes
— The output of a self-adaptive, time-varying filter (≠ a standard linear filter); example: 2 FM sinusoids + a Gaussian wave packet
Empirical Mode Decomposition: Instantaneous frequency (IF)
— The analytic version of each IMF C_i(t) is computed using the Hilbert transform as:
z_i(t) = C_i(t) + jH[C_i(t)] = a_i(t) e^{jθ_i(t)}
— Hence z_i(t) becomes complex, with phase and amplitude; the IF is then computed as:
ω_i(t) = dθ_i(t)/dt
— The Hilbert spectrum (HS) is the triplet H(ω, t) = {t, ω_i(t), a_i(t)}
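The analytic signal and instantaneous frequency of one IMF can be computed with scipy.signal.hilbert (a sketch; the phase derivative is taken by finite differences, which is noisy at the frame edges):

```python
import numpy as np
from scipy.signal import hilbert

fs = 1000.0
t = np.arange(2000) / fs
c = np.cos(2 * np.pi * 50 * t)            # stand-in for one IMF C_i(t)

z = hilbert(c)                            # z_i(t) = C_i(t) + j*H[C_i(t)]
a = np.abs(z)                             # instantaneous amplitude a_i(t)
theta = np.unwrap(np.angle(z))            # instantaneous phase theta_i(t)
inst_freq = np.diff(theta) / (2 * np.pi) * fs   # d(theta)/dt in Hz

print(inst_freq[100:110].round(1))        # ~50 Hz away from the edges
# The Hilbert spectrum H(omega, t) collects {t, inst_freq, a} over all IMFs.
```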
Empirical Mode Decomposition: Intrinsic Mode Functions
[Figure: a signal, its spectrum, and its time-frequency representation]
[Figure: the signal with its 1st, 2nd, and 3rd IMFs localized in the time-frequency plane]
•EMD is fully data adaptive decomposition for spectral and time-frequency representation of non-linear and non-stationary time series •It does not employ any basis function for decomposition •It produces perfect localization of the signal components in high resolution time-frequency space of the time series
Empirical Mode Decomposition: Comparison between HS and STFT
[Figure: time-frequency representation of two pure tones (100 Hz and 250 Hz) using the Hilbert spectrum (HS) and the STFT]
Empirical Mode Decomposition: Comparison between wavelet and HS
[Figure: the same signal analyzed by the wavelet transform and by the Hilbert spectrum (HS)]
Remarks on FT
• The Fourier transform has a mathematical foundation
• It can be used in robust analysis, since it retains phase information
• The signal detail is limited by the (sinusoidal) basis function
• STFT analysis includes some additional cross-spectral energy that degrades the performance in some applications
Remarks on WT
• The WT employs a data-adaptive basis function based on its time and frequency scales
• It can produce more detailed signal information in the T-F representation
• The WT also performs well in multi-band decomposition
• The reconstruction error of the multi-band representation is much less than that of the FT
• It cannot preserve the phase information needed for perfect reconstruction from T-F space
Remarks on EMD and HS
• EMD is a fully adaptive multi-band decomposition method
• It produces perfect localization of signal components in T-F space
• The HS can represent the instantaneous spectra of the signal
• The signal can be reconstructed with negligible error terms
• It does not yet have a mathematical foundation
• It is difficult to use EMD-based decomposition in robust analysis
Application of DSP in audio analysis •Audio source separation from mixture using independent subspace analysis (ISA) •Audio source separation by spatial localization in underdetermined case •Robust pitch estimation using EMD
Application of DSP in audio analysis
Audio source separation from mixture using independent subspace analysis (ISA)
Source separation by independent subspace analysis (ISA)
[Block diagram: audio mixture s(t) → STFT → magnitude spectrogram (phase φ retained) → PCA → basis vector selection → ICA → basis vector clustering → source spectrograms → ISTFT → individual sources]
T-F representation of mixture
[Figure: mixture audio → short-time Fourier transform (STFT) → magnitude spectrogram X plus phase information; window 30 ms, overlap 20 ms]
Proposed separation model
Mixture spectrogram: X = Σ x_i, with source spectrograms x_i = B_i A_i
B_i = [b_1^(i), b_2^(i), …, b_n^(i)]: an invariant frequency n-component basis
A_i = [a_1^(i), a_2^(i), …, a_n^(i)]^T: the corresponding amplitude envelopes
A_i = B_i^T X,   B_i = X A_i^T
** The goal is to find independent B_i or A_i
Dimension Reduction
• The number of rows/columns of X >> the number of sources, so the dimension of X must be reduced
• Singular value decomposition (SVD) is used: X_{n×k} = U_{n×n} S_{n×k} V_{k×k}^T
• U and V are (column-wise) orthogonal matrices; S is a diagonal matrix of singular values σ1 ≥ σ2 ≥ σ3 ≥ … ≥ σn ≥ 0
• p basis vectors (from U or V) are selected by setting φ = 0.5 to 0.6 in the inequality:
(Σ_{i=1}^{p} σ_i) / (Σ_{i=1}^{n} σ_i) ≥ φ
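A sketch of this selection rule in NumPy (the helper name select_basis is illustrative; the threshold range 0.5-0.6 follows the slide):

```python
import numpy as np

def select_basis(X, phi=0.55):
    """Keep the first p singular vectors whose singular values account
    for at least a fraction phi of the total (sum_{i<=p} s_i / sum s_i >= phi)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U S V^T
    p = np.searchsorted(np.cumsum(s) / s.sum(), phi) + 1
    return U[:, :p], s[:p], Vt[:p, :]

X = np.abs(np.random.randn(257, 400))     # e.g. a magnitude spectrogram
U_p, s_p, Vt_p = select_basis(X)
print(U_p.shape)                           # 257 x p reduced frequency basis
```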
Proposed separation model
• To derive the basis vectors, singular value decomposition (SVD) is applied as PCA
• Some principal components are selected as basis vectors
• Independent component analysis (ICA) is applied to make the bases independent
Independent basis vectors
[Figure: basis vectors before and after ICA; after ICA the bases are independent along time frames]
Producing source subspaces
[Figure: time bases and frequency bases of the speech signal]
Source subspaces (cont.)
[Block diagram: mixture spectrogram → PCA + basis selection + ICA → basis vectors B, A → KLd-based clustering → source subspaces B1A1, B2A2 → source spectrograms]
Source re-synthesis
Separated subspaces (spectrograms) → append phase information → inverse STFT → separated signals:
S_i(n, k) = x_i · e^{jφ(n,k)}
[Example: a mixture of speech and a bip-bip sound yields separated speech and a separated bip-bip sound]
Experimental results
Separated signals with the proposed algorithm
[Audio examples: mixtures (speech + bip-bip, male + female speech) and the corresponding separated signals]
Application of DSP in audio analysis
Audio source separation by spatial localization in underdetermined case
Localization based separation
• To avoid dependence on the spectral content of the signals in separation, and to increase the number of separable sources, the spatial location is considered
• Binaural mixtures are used instead of a single mixture
Localization based separation (cont.)
• Consider a multi-source audio situation: humans can easily localize and separate the sources via the HAS (human auditory system)
• The binaural cues ITD and ILD are mainly used in source localization
• Separation is performed by applying beamforming and a binary mask
Source localization cues
• Interaural time difference (ITD) between the two microphones' signals (like the two ears of a human)
• Interaural level difference (ILD)
[Figure: ITD and ILD geometry]
Source localization
X_r(ω) and X_l(ω) are the STFTs of x_r(t) and x_l(t). ITD and ILD are calculated as:
τ_ITD(ω) = (1/ω) {φ_l(ω) − φ_r(ω)}
κ_ILD(ω) = 20 log (|X_l(ω)| / |X_r(ω)|)
where φ_r(ω) and φ_l(ω) are the unwrapped phases of X_r(ω) and X_l(ω), respectively, at frequency ω
Source localization (cont.)
• ITD becomes ambiguous at higher frequencies (a factor of the microphones' spacing)
[Figure: ITD at low and at high frequency]
• At high frequencies, ILD dominates to resolve the ambiguity
Source localization (cont.)
• ITD and ILD are quantized into 50 levels
• Collecting the T-F points corresponding to each quantized ITD/ILD pair produces peaks
Separation by beamforming
• The ITD is derived for each of the localized sources and spatial beamforming is applied
• Linearly constrained minimum variance beamforming (LCMVB) is used
• The gain is selected based on the spatial locations
Separation by binary mask with HS
• Needed to avoid the limitations of spatial beamforming
• Separation is performed by binary mask estimation based on ITD/ILD
• The sources are considered disjoint orthogonal in T-F space: no more than one source is active at any T-F point
Computing ITD and ILD
Each mixture is transformed to the T-F domain using Hilbert spectra (H_L and H_R). ITD and ILD are measured as:
{ITD(ω, t_f), ILD(ω, t_f)} = { (1/ω) ∠[H_L(ω, t_f)/H_R(ω, t_f)], |H_R(ω, t_f)|/|H_L(ω, t_f)| }
ILD_dB(ω, t_f) = 20 log₁₀ ( ‖H_R(ω, t_f)‖² / ‖H_L(ω, t_f)‖² )
where t_f is the time frame
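A sketch of these two cue maps from left/right complex T-F representations (any STFT- or HS-like arrays H_L, H_R; the frequency vector w in rad/s is an assumption and should exclude the DC bin):

```python
import numpy as np

def binaural_cues(H_L, H_R, w):
    """Per-T-F-point ITD (seconds) and ILD (dB) from complex spectrograms.
    H_L, H_R: arrays of shape (n_freqs, n_frames); w: rad/s per bin (w > 0)."""
    ratio = H_L / (H_R + 1e-12)                 # guard against division by zero
    itd = np.angle(ratio) / w[:, None]          # (1/w) * phase difference
    ild_db = 20 * np.log10((np.abs(H_R) + 1e-12) / (np.abs(H_L) + 1e-12))
    return itd, ild_db

# Usage: quantize itd/ild into ~50 levels and histogram the T-F points;
# peaks in the 2-D histogram correspond to source locations.
```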
ITD-ILD Space Localization ITD and ILD are quantized into 50 levels Collection of T-F points from HS corresponding to each ITD/ILD quantized pair produces peaks
Source Separation
• Each peak region in the histogram refers to one source of the binaural mixtures
• Construct a binary mask M_i(ω, t), nullifying the T-F points of interfering sources
• The HS of the i-th source is separated as:
H_i(ω, t) = M_i(ω, t) H_L(ω, t)
• The time-domain i-th source is given as:
s_i(t) = Σ_ω H_i(ω, t) · cos[φ(ω, t)]
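The masking and resynthesis steps are one line each; a sketch transcribing the two formulas above (mask is a boolean array on the same T-F grid, phi the phase of the left channel; both are assumptions about the data layout):

```python
import numpy as np

def separate_source(H_L, mask):
    """Binary T-F masking: keep this source's points, null the rest.
    Implements H_i(w, t) = M_i(w, t) * H_L(w, t)."""
    return mask * H_L

def resynthesize(H_i, phi):
    """Time signal from the masked T-F amplitudes and the mixture phase:
    s_i(t) = sum_w |H_i(w, t)| * cos(phi(w, t)), evaluated per frame."""
    return np.sum(np.abs(H_i) * np.cos(phi), axis=0)
```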
Source disjoint orthogonality
• Disjoint orthogonality (DO) of audio sources assumes that no more than one source is active at any T-F point:
F_1(ω, t) F_2(ω, t) = 0, ∀ ω, t
where F_1 and F_2 are the TFRs of two signals
• SIR (signal-to-interference ratio) is used as the basis for measuring DO
Source disjoint orthogonality (cont.)
[Figure: three audio sources s1, s2, s3 recorded by microphones; their TFRs occupy largely disjoint time-frequency regions]
Source disjoint orthogonality (cont.)
The SIR of the j-th source is defined as:
SIR_j = Σ_ω Σ_t |X_j(ω, t)| / |Y_j(ω, t)|,   Y_j(ω, t) ≠ 0
Y_j(ω, t) = Σ_{i=1, i≠j}^{N} X_i(ω, t)
where Y_j is the sum of the interfering sources
Source disjoint orthogonality (cont.)
• The dimensions of the HS and STFT of the same signal may differ, so DO is defined as a percentage over the entire TFR region
• The average DO (ADO) of all sources is:
ADO = (1/N) Σ_{j=1}^{N} SIR_j
where N is the number of sources
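A sketch of this disjointness measure (X is assumed to be a list of each source's magnitude TFR on a common grid; the function name is illustrative):

```python
import numpy as np

def average_disjoint_orthogonality(X):
    """ADO = (1/N) * sum_j SIR_j, where SIR_j sums the ratio of source j's
    T-F magnitude to the summed magnitude of the interfering sources."""
    X = np.asarray(X)                        # shape (N, n_freqs, n_frames)
    sirs = []
    for j in range(len(X)):
        Y_j = X.sum(axis=0) - X[j]           # sum of interfering sources
        valid = Y_j != 0                     # the definition requires Y_j != 0
        sirs.append(np.sum(X[j][valid] / Y_j[valid]))
    return np.mean(sirs)
```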
Experimental results
The three mixtures are defined as:
m1{sp1(−40°, 0°), sp2(30°, 0°), ft(0°, 0°)}
m2{sp1(20°, 10°), sp2(0°, 10°), ft(−10°, 10°)}
m3{sp1(40°, 20°), sp2(30°, 20°), ft(−20°, 20°)}
The separation efficiency is measured as the OSSR (original-to-separated signal ratio), defined as:
OSSR = (1/T) Σ_{t=1}^{T} log₁₀ [ Σ_{i=1}^{w} s²_original(t + i) / Σ_{i=1}^{w} s²_separated(t + i) ]
Experimental results (cont.)
The comparative separation efficiency (OSSR) using HS and STFT:

Mixture  TFR   OSSR of sp1  OSSR of sp2  OSSR of ft
m1       HS    −0.0271      0.0213       0.0264
m1       STFT  0.0621       −0.0721      −0.0531
m2       HS    0.0211       −0.0851      −0.0872
m2       STFT  0.0824       0.1202       0.1182
m3       HS    0.0941       −0.0832      0.0225
m3       STFT  −0.1261      0.1092       −0.0821
Experimental results
• This experiment also compares the DO using HS and STFT as the TFR
• The STFT is affected by many factors: the window function and its length, the overlap, and the number of FFT points
• The HS is independent of such factors; it is only slightly affected by the number of frequency bins used in the TFR
Experimental results (cont.)
[Figure: the ADO of HS and STFT as a function of the number of frequency bins (N = 3)]
Experimental results (cont.)
[Figure: the ADO of the STFT alone is affected by the window overlap (%)]
Experimental results (cont.)
• The STFT includes more cross-spectral energy terms
[Figure: the TFR of two pure tones using HS and STFT]
Experimental results (cont.)
• The HS always has better DO for audio signals; DO depends on the resolution of the TFR
• The STFT has to satisfy the uncertainty inequality Δt · Δω ≥ 1/2
• The frequency resolution of the HS extends up to the Nyquist frequency, and its time resolution up to the sampling rate, hence it offers better resolution
Remarks
• The separation efficiency is independent of the signal's spectral characteristics
• The performance is affected by the angular separation and the disjointness of the sources
• The HS produces better disjointness in the T-F domain and hence better separation
• The binaural mixtures were recorded in an anechoic room at NTT
Application of DSP in audio analysis
Robust pitch estimation using EMD
Why EMD in pitch estimation?
• Pitch facilitates speech coding, enhancement, recognition, etc.
• The autocorrelation function is used in most pitch estimation algorithms
• The autocorrelation (AC) function reflects the periodic property of the speech signal
EMD in pitch estimation (cont.)
EMD in pitch estimation (cont.)
• The pitch period is the sample difference between two consecutive peaks in the AC function
• Sometimes the pitch peak may be less prominent, especially due to noise
EMD in pitch estimation (cont.)
• EMD decomposes any signal from higher- to lower-frequency components
• It produces the local and global oscillations of the signal
• The global oscillation approximately represents the envelope of the signal
• The IMF of the global oscillation is used to estimate the pitch
Pitch estimation with EMD
Pitch estimation with EMD (cont.)
• There exists an IMF in the EMD domain representing the global oscillation of the AC function
• That IMF represents a sinusoid at the pitch period; pitch is taken as the frequency of that IMF, rather than found from the pitch peak
Pitch estimation with EMD (cont.)
• In this EMD example, IMF-5 is the oscillation at the pitch period
• A crucial step is determining the target IMF representing the sinusoid with the pitch period
• IMFs of lower-frequency oscillation (than the pitch period) can be discarded by energy thresholding
Pitch estimation with EMD (cont.)
• A reference pitch is computed by the weighted AC (WAC) method
• This pitch information is used to select the IMF with the pitch period
• The periodicity of the selected IMF is computed as the pitch period
Pitch estimation with EMD (cont.)
• The peak at zero lag is selected; two cycles are taken from both sides; the average number of samples per cycle is the periodicity
Proposed pitch estimation algorithm
• Compute the normalized autocorrelation (AC) of the speech frame
• Determine a rough pitch period using the WAC method
• Apply EMD to the AC function
• Select the IMF of the pitch period on the basis of the WAC estimate
• The period of the selected IMF is the estimated pitch (see the sketch below)
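A sketch of the autocorrelation core of this algorithm (assuming NumPy; the EMD step, which would replace the raw AC curve with its pitch-period IMF, is omitted here, and the helper name estimate_pitch is illustrative):

```python
import numpy as np

def estimate_pitch(frame, fs, f_lo=50.0, f_hi=500.0):
    """Pitch from the normalized autocorrelation of one speech frame:
    the lag of the strongest AC peak inside the 50-500 Hz pitch range."""
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-12)                 # normalize so ac[0] = 1
    lag_min = int(fs / f_hi)                  # shortest admissible period
    lag_max = int(fs / f_lo)                  # longest admissible period
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / lag                           # pitch in Hz

fs = 20000                                    # Keele database rate
t = np.arange(int(0.0256 * fs)) / fs          # one 25.6 ms frame
frame = np.sin(2 * np.pi * 120 * t)           # synthetic 120 Hz "voiced" frame
print(round(estimate_pitch(frame, fs), 1))    # ~120 Hz (quantized by integer lag)
```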
Experimental results
• The Keele pitch database is used here (20 kHz sampling rate)
• The frame length is 25.6 ms with a 10 ms shift
• Each frame is band-pass filtered to the pitch range (50-500 Hz)
• Gross pitch error (GPE) is used to measure the performance
Experimental results (cont.)
The %GPE for male and female speech at different SNRs (1823 frames in total):

SNR     30 dB  20 dB  10 dB  0 dB   −5 dB  −15 dB
Female  1.90   2.83   3.93   10.12  21.83  64.24
Male    2.15   3.78   5.22   11.89  23.56  66.76
Remarks
• The use of EMD makes the pitch estimation method more robust
• EMD of the AC function can extract the fundamental oscillation of the signal
• The pitch can be easily estimated from the single sinusoid of the fundamental oscillation
• It is not affected by prominent non-pitch peaks
Future works
• The open problem is identifying the IMF carrying the pitch period
• In the present algorithm, the error in the rough ACF-based pitch estimate can propagate to the final estimate
• The performance has not yet been tested against other existing algorithms
Open Problem-1
Instantaneous Pitch (IP) estimation using EMD
• Frame-based pitch estimation is already done
• A paper has been accepted at EUROSPEECH 2007
Three methods of pitch estimation
• We have used the WAC-based pitch information to select the exact pitch IMF from the EMD space
• The problem is to compute the IP from the EMD space alone
Open Problems-2
Voiced/Unvoiced Detection with EMD
• Useful in speech enhancement and speech/speaker recognition
• A paper with preliminary results was published at ICICT 2007
[Figure: V/UV differentiation]
• The problem is to derive a better separation region for V/UV and to conduct experiments with large speech data
Open Problems-3
Robust Audio Source Localization
• Localization is done by delay-attenuation computed in the T-F space of binaural mixtures; this is NOT noise robust
[Figure: localization of three sources using TD-LD computed in T-F space]
• The problem is to derive a mathematical model for robust localization in the underdetermined situation
Open Problems-4
Speech denoising using image processing
• Noisy speech can be represented as an image via a time-frequency (T-F) representation, e.g. the spectrogram
[Figure: speech with white noise]
• Image processing algorithms can be used for denoising
• This seems easy for musical/white noise
• The problem is dealing with other noises, even using binaural mixtures
Open Problems-5
Auditory segmentation with binaural mixtures
• Auditory segmentation is the first stage of source separation using auditory scene analysis (ASA)
[Figure: source separation by ASA]
• The problem is using binaural mixtures for improved auditory segmentation as source separation
• A T-F representation other than the FT can be employed
Open Problems-6
Two stage speech enhancement
• Single-stage speech enhancement is not efficient in all noisy situations
• For example, musical noise is introduced by binary masking and some thresholding methods
• Noise may not be separated perfectly using ICA- or ISA- (independent subspace analysis) based techniques
• Multi-stage enhancement in a suitable order can improve the performance
[Block diagram: noisy speech → first-stage enhancement → second-stage enhancement → clean speech]
Open Problems-7
Informative features extraction
• To use spectral dynamics in speech/speaker recognition, special types of speech features are required
[Figure: mixed signal with its spectrogram]
• How should the speech signal be parameterized to represent speech dynamics?
• WT- and HS-based spectral analysis can be studied
Open Problems-8
Source based audio indexing
• Useful in multimedia applications and moving audio source separation
[Figure: separation of moving sources s1, s2, s3 at different azimuth angles (0 to 180 degrees), 1.5 m away]
• Several new methods could be used for indexing: AdaBoost, Tree-ICA, conditional random fields
Open Problems-9
Time-series prediction with EMD
• Applied to financial and environmental time series
• Conventional methods use a Kalman filter (for smoothing) and an AR model for prediction
[Figure: a non-stationary time series]
• EMD can be used as a smoothing filter to enhance the prediction accuracy
Open Problems-10
Heart-rate analysis with ECG data using EMD
[Figure: different parts of an ECG signal]
• ECG variability analysis in different frequency regions
• Analysis of instantaneous ECG condition
• Abnormality analysis of heart rate using EMD-based spectral modeling
The End
Questions/Suggestion Please