Title: Speech and Crosstalk Detection in Multichannel Audio
1. Speech and Crosstalk Detection in Multichannel Audio
- Stuart N. Wrigley, Member, IEEE, Guy J. Brown, Vincent Wan, and Steve Renals, Member, IEEE
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 13, NO. 1, JANUARY 2005
Presenter: Ting-Wei Hsu
2. Introduction
[Figure: meeting table with three participants (A, B, C), each wearing a headset microphone recorded on its own channel (Channels 1-3).]
This work describes two experiments related to the automatic classification of audio into four classes.
3. Introduction (cont.)
- The first experiment attempted to optimize a set of acoustic features for use with a Gaussian mixture model (GMM) classifier.
- The second experiment used these features to train an ergodic hidden Markov model (eHMM) classifier.
Goals: 1. Producing accurate labels (accuracy approximately 96%). 2. Indicating whether the local speaker (S) is active.
4. Outline
- Crosstalk Detection: Previous Approaches
- Crosstalk Detection: Ergodic Hidden Markov Model (eHMM)
- Acoustic Features
- Statistical Framework
- Experiments
- Feature Selection Experiments
- Multistream eHMM Classification Experiments
- Evaluation Using ASR
5. Crosstalk Detection: Previous Approaches
- Higher-order statistics
  - LeBlanc and de Leon used signal kurtosis to discriminate overlapped speech from nonoverlapped speech.
- Signal processing techniques
  - Morgan et al. proposed a harmonic enhancement and suppression system for separating two speakers (stronger and weaker).
  - Krishnamachari et al. proposed the spectral autocorrelation peak-valley ratio (SAPVR).
- Statistical pattern recognition
  - Zissman et al. trained a Gaussian classifier using mel-frequency cepstral coefficients (MFCCs).
These approaches assume that at most two speakers are active.
6. Crosstalk Detection: eHMM
- Pfau et al. proposed a detector using an ergodic hidden Markov model (eHMM).
- The eHMM consisted of four states: S, SC, C and SIL.
- Each state was trained using features such as critical band loudness values, energy, and zero-crossing rate.
- Short-time cross-correlation between channel pairs is computed to assess their similarity.
- For each pair which exhibited high similarity (i.e., the same speaker was active in both channels), the channel with the lower energy was assumed to be crosstalk.
[Figure: multiple speakers A, B, C, illustrating crosstalk between channels.]
7. Acoustic Features
- A. MFCC, Energy, and Zero Crossing Rate
- B. Kurtosis
- C. Fundamentalness
- D. Spectral Autocorrelation Peak-Valley Ratio (SAPVR)
- E. Pitch Prediction Feature (PPF)
- F. Features Derived From Genetic Programming
- G. Cross-Channel Correlation
Each feature is used to analyze the differences between isolated and overlapping speech.
8. A. MFCC, Energy, and Zero Crossing Rate
- MFCC features for 20 critical bands up to 8 kHz were extracted.
- The short-time log energy and zero crossing rate (ZCR) were also computed.
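The log energy and ZCR features above can be sketched per frame as follows (the floor constant and frame handling are assumptions, not taken from the paper):

```python
import numpy as np

def log_energy(frame):
    """Short-time log energy of one analysis frame."""
    return np.log(np.sum(frame ** 2) + 1e-12)  # floor avoids log(0) in silence

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    return np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
```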
9. B. Kurtosis
- Kurtosis is the fourth-order moment of a signal divided by the square of its second-order moment.
- It has been shown that the kurtosis of overlapping speech is generally less than the kurtosis of isolated speech utterances.
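The definition above can be computed directly. As an illustrative sketch, Laplacian samples are used as a crude stand-in for speech amplitudes (an assumption; the paper uses real speech): the sum of two independent sources is closer to Gaussian, so its kurtosis is lower.

```python
import numpy as np

def kurtosis(x):
    """Fourth-order moment over the squared second-order moment (non-excess)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    m2 = np.mean(x ** 2)
    m4 = np.mean(x ** 4)
    return m4 / m2 ** 2

rng = np.random.default_rng(0)
single = rng.laplace(size=50_000)            # heavy-tailed, kurtosis near 6
overlap = single + rng.laplace(size=50_000)  # sum of two sources: nearer Gaussian (3)
```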
10. C. Fundamentalness
- Kawahara et al. describe an approach to estimating the fundamentalness of a harmonic.
- If more than one fundamental is present, interference between the two components introduces modulation, thus decreasing the fundamentalness measure.
[Figure: fundamentalness for dual-speaker versus single-speaker signals.]
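The modulation effect described above can be illustrated with a simple proxy (this is not Kawahara et al.'s actual method; the band edges, filter order, and the idea of using envelope modulation depth of the band around a candidate F0 are all assumptions): a single clean fundamental yields a nearly constant envelope, while two nearby fundamentals beat against each other and raise the modulation.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def modulation_depth(x, f0, sr=16000):
    """Envelope modulation of the band around a candidate fundamental.
    Low modulation suggests a single dominant fundamental (high
    fundamentalness); beating between two fundamentals raises it."""
    sos = butter(4, [0.8 * f0, 1.2 * f0], btype='band', fs=sr, output='sos')
    env = np.abs(hilbert(sosfilt(sos, x)))      # Hilbert envelope of the band
    env = env[len(env) // 4: -len(env) // 4]    # discard filter transients
    return np.std(env) / (np.mean(env) + 1e-12)
```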
11. D. Spectral Autocorrelation Peak-Valley Ratio
- Spectral autocorrelation peak-valley ratio (SAPVR) is computed from the autocorrelation of the signal spectrum obtained from a short-time Fourier transform.
- When more than one speaker is active simultaneously, the autocorrelation function becomes flatter due to the overlapping harmonic series.
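A simplified proxy for this idea can be sketched as follows (this is not the exact SAPVR definition of Krishnamachari et al.; the window, normalization, and lag range are assumptions): one dominant harmonic series produces a strong off-zero peak in the spectral autocorrelation, whereas a flatter spectral autocorrelation lowers the value.

```python
import numpy as np

def sapvr_proxy(frame):
    """Peak of the mean-removed, lag-0-normalized spectral autocorrelation."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    spec = spec - spec.mean()
    r = np.correlate(spec, spec, mode='full')[len(spec) - 1:]
    r = r / r[0]
    return r[5: len(r) // 2].max()  # skip the lag-0 main lobe
```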
12. E. Pitch Prediction Feature (PPF)
Steps:
- Compute 12th-order linear prediction filter coefficients (LPCs).
- Use the LPCs to calculate the LP residual (error signal).
- Compute the standard deviation of the distance between successive peaks in the residual.
If a frame contains a single speaker, a regular sequence of peaks corresponding to glottal closures will occur in the LP residual; therefore, the standard deviation of the interpeak distances will be small.
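The steps above can be sketched as follows (a minimal sketch; the peak-picking threshold and minimum peak count are assumptions, not the paper's settings):

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter, find_peaks

def pitch_prediction_feature(frame, order=12):
    """Std. dev. of distances between successive peaks in the LP residual.
    Small for a single voiced speaker (regular glottal pulses)."""
    frame = np.asarray(frame, dtype=float)
    # 12th-order LPC via the autocorrelation method (Yule-Walker equations)
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:len(frame) + order]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    residual = lfilter(np.concatenate(([1.0], -a)), [1.0], frame)  # LP error signal
    peaks, _ = find_peaks(residual, height=residual.std())
    if len(peaks) < 3:
        return float('inf')  # too few pulses to measure regularity
    return float(np.std(np.diff(peaks)))
```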
13. F. Features Derived From Genetic Programming
- A genetic programming (GP) approach was also used to identify frame-based features that could be useful for signal classification.
- The GP engine identified several successful features, of which three were included in the following feature selection process:
GP1 = rms(zerocross(abs(diff(x))))
GP2 = max(autocorr(normalize(x)))
GP3 = min(log10(abs(diff(x))))
(expressed in terms of MATLAB functions)
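GP2 and GP3 can be translated literally into Python as below; the exact semantics the GP engine assumed for normalize and autocorr are assumptions here (peak normalization and non-negative lags), and GP1 is omitted because the meaning of zerocross applied to a nonnegative sequence is ambiguous.

```python
import numpy as np

def normalize(x):
    """Scale to unit peak amplitude (assumed semantics of 'normalize')."""
    return x / (np.max(np.abs(x)) + 1e-12)

def autocorr(x):
    """Autocorrelation at non-negative lags (assumed semantics)."""
    r = np.correlate(x, x, mode='full')
    return r[len(x) - 1:]

def gp2(x):
    """GP2 = max(autocorr(normalize(x)))"""
    return np.max(autocorr(normalize(x)))

def gp3(x):
    """GP3 = min(log10(abs(diff(x)))), with a small floor to avoid log10(0)."""
    return np.min(np.log10(np.abs(np.diff(x)) + 1e-12))
```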
14. G. Cross-Channel Correlation
- For each channel i, the maximum over lag of the short-time cross-correlation at frame t between channel i and each other channel j was computed:

c_ij(t) = max_l | sum_{n=0}^{N-1} w(n) x_i(tN + n) x_j(tN + n + l) |    (1)

where l is the correlation lag, x_i and x_j are the signals from channels i and j, N is the window size, and w(n) is a Hamming window.
15. G. Cross-Channel Correlation (cont.)
Two normalization methods:
- The feature set for channel i was divided by the frame energy of channel i.
- Spherical normalization, in which the cross-correlation is divided by the square root of the autocorrelations for channels i and j plus some nonzero constant to prevent information loss.
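The per-pair peak cross-correlation and its spherical normalization can be sketched as follows (the constant, the lag-0 autocorrelations, and the framing are assumptions; this is not the authors' exact implementation):

```python
import numpy as np

def peak_xcorr(xi, xj):
    """Maximum over lag of the Hamming-windowed cross-correlation."""
    w = np.hamming(len(xi))
    return np.max(np.abs(np.correlate(w * xi, w * xj, mode='full')))

def spherical_norm_xcorr(xi, xj, eps=1e-3):
    """Peak cross-correlation divided by the square root of the product of
    the two lag-0 autocorrelations, each offset by a nonzero constant."""
    w = np.hamming(len(xi))
    wi, wj = w * xi, w * xj
    den = np.sqrt((np.dot(wi, wi) + eps) * (np.dot(wj, wj) + eps))
    return peak_xcorr(xi, xj) / den
```

With this normalization a channel correlated with itself scores near 1, while independent channels score much lower.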
16. Statistical Framework
- The probability density function p(x) of each of the four states in the eHMM is modeled by a Gaussian mixture model (GMM).
- Each GMM was trained using the expectation-maximization (EM) algorithm.
- The likelihood of each state k having generated the data at time frame t is combined with the transition probabilities to determine the most likely state:

p(x_t | k) = sum_{m=1}^{M} c_km N(x_t; mu_km, Sigma_km)    (2)

k*(t) = argmax_k p(x_t | k) P(k | k*(t-1))    (3)
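Combining per-frame state likelihoods with transition probabilities to find the most likely state sequence is a standard Viterbi decode, sketched below (a generic implementation, not the authors' exact one; inputs are log-domain GMM scores, a log transition matrix, and a log initial distribution):

```python
import numpy as np

def viterbi(loglik, log_trans, log_init):
    """Most likely state path given per-frame state log-likelihoods (T x K),
    a K x K log transition matrix, and a length-K log initial distribution."""
    T, K = loglik.shape
    delta = log_init + loglik[0]          # best score ending in each state
    back = np.zeros((T, K), dtype=int)    # backpointers
    for t in range(1, T):
        cand = delta[:, None] + log_trans            # (previous, current)
        back[t] = np.argmax(cand, axis=0)
        delta = cand[back[t], np.arange(K)] + loglik[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):         # trace back through the pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```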
17. Statistical Framework (cont.)
- Transition constraints are imposed.
- When considering m observations (audio channels), the state space contains all permutations of:
  - S, (m-1)C
  - qSC, nC, where 2 <= q <= m and n = m - q
  - mSIL
- This reduces the size of the state space.
18. Statistical Framework (cont.)
- The feature selection approach is based on the area under the ROC curve (AUROC) for a particular classifier.
- The sequential forward selection (SFS) algorithm computes the AUROC for GMM classifiers trained on each candidate feature.
- The feature set resulting in the highest AUROC is selected.
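The SFS loop above can be sketched with scikit-learn (a sketch under assumptions: the GMM component count, the likelihood-ratio scoring, and the stopping rule are illustrative choices, not the paper's settings):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.mixture import GaussianMixture

def sfs_auroc(X_tr, y_tr, X_va, y_va, max_feats=6):
    """Greedy sequential forward selection; each candidate feature set is
    scored by the AUROC of a two-class GMM likelihood-ratio classifier."""
    remaining = list(range(X_tr.shape[1]))
    selected, best = [], 0.0
    while remaining and len(selected) < max_feats:
        aucs = {}
        for f in remaining:
            cols = selected + [f]
            g1 = GaussianMixture(n_components=2, random_state=0).fit(X_tr[y_tr == 1][:, cols])
            g0 = GaussianMixture(n_components=2, random_state=0).fit(X_tr[y_tr == 0][:, cols])
            llr = g1.score_samples(X_va[:, cols]) - g0.score_samples(X_va[:, cols])
            aucs[f] = roc_auc_score(y_va, llr)
        f_best = max(aucs, key=aucs.get)
        if aucs[f_best] <= best:   # stop when adding a feature no longer helps
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best = aucs[f_best]
    return selected, best
```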
19. Statistical Framework (cont.)
- Sequential Forward Selection (SFS) Algorithm
- In these experiments, the SFS algorithm always terminated with fewer than six features for all crosstalk categories.
[Figure: flowchart of the SFS algorithm; CF denotes the criterion function.]
20. Experiments
- Feature Selection Experiments
- Multistream eHMM Classification Experiments
21. Feature Selection Experiments
[Table: individual feature performance for each classification category. Values indicate the percentage of true positives at equal error rates.]
22. Feature Selection Experiments (cont.)
- The feature sets derived by the SFS algorithm were selected.
- The four figures below show the ROC performance curves for each crosstalk category's optimum feature set.
- Diagonal lines indicate equal error rates.
- Dashed curves indicate performance when log energy is excluded from the set of potential features.
- For equal false positive and false negative rates, the performance of each classifier is approximately 80%.
23. Multistream eHMM Classification Experiments
- This experiment used the selected features to train an ergodic hidden Markov model classifier.
- The eHMM classification performances are shown below.
[Figure: per-channel results, annotated "1. High true positive rate", "2. Relative", "3. Poor recording"; upper line = true positive rate, lower line = false positive rate.]
24. Multistream eHMM Classification Experiments (cont.)
- Two applications for such a classification system are speech recognition preprocessing and speaker turn analysis.
- Both of these rely on accurate detection of local speaker activity, which is largely equivalent to the speaker-alone (S) channel classification.
- These applications require the accurate classification of contiguous segments of audio.
- The segment-level performance is similar to that of the frame-level approach.
25. Evaluation Using ASR
- Segment and ASR word accuracies were evaluated on whole meetings.
- The eHMM classifier has a segment recognition accuracy of between 83% and 92% for single-speaker detection.
- The results indicate that the eHMM classifier is capable of detecting most of the frames required for optimal ASR.
- The baseline voice activity detector (VAD) relies on energy.
26. Evaluation Using ASR (cont.)
- ASR performance for meetings bmr001, bro018, and bmr018.
- Note that the VAD classifier failed on a number of channels, and hence some data points (channels 0 and 8 from bmr001 and channel 8 from bmr018) are missing.
- The inconsistent VAD ASR results emphasise that an energy-based measure for speaker detection is highly unreliable.