Title: Speaker Discrimination: The Challenge of Conversational Data
1 Speaker Discrimination: The Challenge of Conversational Data
Dissertation Committee
Advisor: Robert Yantorno, Ph.D.
Members: Dennis Silage, Ph.D.; Brian Butz, Ph.D.; Iyad Obeid, Ph.D.; Eugene Kwatny, Ph.D.
Uchechukwu O. Ofoegbu
2 Presentation Outline
- Problem Statement and Research Goal
- Scope of Research
- Distance Analysis
- Feature Analysis
- Data Analysis
- Application Systems
- Fusion of Distances
- Proposal Summary
3 Problem Statement and Research Goal
4 Conventional Speaker Recognition
- Speaker Identification
- Who is this speaker?
- Speaker Verification
- Is he who he claims to be?
System Output
5 Conversation Segmentation
- Broadcast News/Conference Data
- Conversational Data
6 Problems with Conversational Data
- No a priori information available from participating speakers.
- Training is impossible
- No a priori knowledge of change points
- Speakers alternate very rapidly.
- Limited amounts of data for single speaker representations
- Distortion
- Channel noise, co-channel data
7 Proposed Solutions
- Selective creation of data models
- Development of an optimal distance measure
- Decision-level fusion of distance measures
- Development of application-specific systems
8 Scope of Research
9 Criminal Activity Detection
- Monitoring inmate conversations
- Prevention of 3-way calls
- Notification of suspicious contacts
- Enhancement of keyword detection
- Uncooperative data collection
- Forensics
- Voiceprints
10 Commercial Services
- Automated Customer Services
- Personalized contact with customers
- Search/Retrieval of Audio Data
11 Homeland Security
- Military Activities
- Pilot-control tower communications
- Detection of unidentified speakers on pilot radio channels
- Terrorist Identification
12 Distance Analysis
13 Distance Measures
- Univariate vs. Multivariate Analysis
14 Distance Measures
- Notations
- Random variables being compared
- $X = [X_1, X_2, \ldots, X_p]$: $n_x$ by $p$ matrix
- $Y = [Y_1, Y_2, \ldots, Y_p]$: $n_y$ by $p$ matrix
- Properties
- $Q(X, Y) \geq 0$,
- $Q(X, Y) = 0$ iff $X = Y$,
- $Q(X, Y) = Q(Y, X)$,
- $Q(X, Y) \leq Q(X, Z) + Q(Z, Y)$
15 Distance Measures
- Mahalanobis Distance
- $Q_{\text{Mahalanobis}}(X, Y) = (\mu_x - \mu_y)^T S^{-1} (\mu_x - \mu_y)$
- $S$ = combined covariance matrix of X and Y
- Hotelling's T-Square Statistic
- $C^{ik}$ = element in the ith row and kth column of the inverse of C
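The two measures on this slide can be sketched as follows. This is a minimal illustration, not the presentation's implementation: it assumes the "combined" covariance is the pooled sample covariance, and it uses the standard two-sample form of Hotelling's T-Square, which is the Mahalanobis term scaled by n_x n_y / (n_x + n_y). The function names are hypothetical.

```python
import numpy as np

def pooled_covariance(x, y):
    """Pooled sample covariance of two feature matrices (rows = frames).

    Assumption: 'combined covariance' is taken to mean the pooled estimate.
    """
    nx, ny = len(x), len(y)
    sx = np.cov(x, rowvar=False)
    sy = np.cov(y, rowvar=False)
    return np.atleast_2d(((nx - 1) * sx + (ny - 1) * sy) / (nx + ny - 2))

def mahalanobis_distance(x, y):
    """(mu_x - mu_y)^T S^-1 (mu_x - mu_y) with S the pooled covariance."""
    diff = x.mean(axis=0) - y.mean(axis=0)
    return float(diff @ np.linalg.solve(pooled_covariance(x, y), diff))

def hotelling_t2(x, y):
    """Two-sample Hotelling T-Square: the Mahalanobis term scaled by sample sizes."""
    nx, ny = len(x), len(y)
    return (nx * ny) / (nx + ny) * mahalanobis_distance(x, y)
```

Both functions take n-by-p feature matrices (for example, one row of cepstral coefficients per frame) and return a scalar distance.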
16 Distance Measures
- Kullback-Leibler (KL) Distance
- Bhattacharyya Distance
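For reference, the standard closed forms of these two measures are given here under a single-Gaussian assumption, with X and Y modeled by means $\mu_x, \mu_y$ and covariances $\Sigma_x, \Sigma_y$ in $p$ dimensions. The slide does not state which form was used, so this is only a sketch of the usual definitions.

```latex
\begin{aligned}
D_{KL}(X \,\|\, Y) &= \tfrac{1}{2}\left[ \operatorname{tr}\!\left(\Sigma_y^{-1}\Sigma_x\right)
  + (\mu_y - \mu_x)^T \Sigma_y^{-1} (\mu_y - \mu_x) - p
  + \ln\frac{\det\Sigma_y}{\det\Sigma_x} \right] \\
D_{B}(X, Y) &= \tfrac{1}{8}(\mu_x - \mu_y)^T \Sigma^{-1} (\mu_x - \mu_y)
  + \tfrac{1}{2}\ln\frac{\det\Sigma}{\sqrt{\det\Sigma_x \,\det\Sigma_y}},
  \qquad \Sigma = \tfrac{1}{2}(\Sigma_x + \Sigma_y)
\end{aligned}
```

The KL form is not symmetric; a symmetric variant is commonly obtained by summing $D_{KL}(X \,\|\, Y)$ and $D_{KL}(Y \,\|\, X)$.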
17 Distance Measures
- Levene's Test
- Derived from T-Square statistics as follows
- Each set of points is transformed along each vector into absolute divergence from the mean vector
- The T-Square Statistic is then applied on the transformed features.
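The two steps above can be sketched as follows. This is a minimal illustration under assumptions: the T-Square statistic is the standard two-sample form with a pooled covariance, and the function names are hypothetical.

```python
import numpy as np

def absolute_deviations(x):
    """Transform each column into absolute divergence from that column's mean."""
    return np.abs(x - x.mean(axis=0))

def t_square(x, y):
    """Two-sample Hotelling T-Square with pooled covariance (assumed form)."""
    nx, ny = len(x), len(y)
    diff = x.mean(axis=0) - y.mean(axis=0)
    pooled = ((nx - 1) * np.cov(x, rowvar=False)
              + (ny - 1) * np.cov(y, rowvar=False)) / (nx + ny - 2)
    pooled = np.atleast_2d(pooled)
    return float(nx * ny / (nx + ny) * diff @ np.linalg.solve(pooled, diff))

def levene_distance(x, y):
    """Apply T-Square to the absolute-deviation features of both sets."""
    return t_square(absolute_deviations(x), absolute_deviations(y))
```

Because the transformation removes the means, this distance responds to differences in spread rather than differences in location.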
18 Procedural Set-up
- HTIMIT database used
- Average utterance length: 5 seconds
- Intra-speaker distance computations: randomly select 2 utterances from the same speaker
19 Procedural Set-up
- Inter-speaker, different-utterance distance computations: randomly select one utterance from each of two different speakers
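The sampling procedure on these two slides could be sketched as below. The data layout (a dictionary mapping a speaker ID to a list of utterance IDs) and the function name are assumptions made for illustration; every speaker is assumed to have at least two utterances.

```python
import random

def sample_pairs(utterances_by_speaker, n_pairs, seed=0):
    """Draw random intra-speaker and inter-speaker utterance pairs.

    utterances_by_speaker: dict of speaker ID -> list of utterance IDs
    (hypothetical layout; each speaker needs at least two utterances).
    """
    rng = random.Random(seed)
    speakers = sorted(utterances_by_speaker)
    intra, inter = [], []
    for _ in range(n_pairs):
        # Intra-speaker: two different utterances from one speaker.
        speaker = rng.choice(speakers)
        intra.append(tuple(rng.sample(utterances_by_speaker[speaker], 2)))
        # Inter-speaker: one utterance from each of two different speakers.
        first, second = rng.sample(speakers, 2)
        inter.append((rng.choice(utterances_by_speaker[first]),
                      rng.choice(utterances_by_speaker[second])))
    return intra, inter
```

Each pair would then be passed to a distance measure to build the intra-speaker and inter-speaker distance distributions.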
20 Analysis of Distance Measures
- Mahalanobis Distance: Gaussian Estimate
21 Analysis of Distance Measures
- Levene's Test: Gamma Estimate
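One simple way to obtain such estimates from a set of measured distances is the method of moments, sketched below. The slides do not say how the Gaussian and gamma estimates were fitted, so the fitting method and the function names are assumptions.

```python
import statistics

def gaussian_estimate(distances):
    """Return (mean, standard deviation) of the observed distances."""
    return statistics.fmean(distances), statistics.pstdev(distances)

def gamma_estimate(distances):
    """Method-of-moments gamma fit: returns (shape, scale).

    Uses mean = shape * scale and variance = shape * scale**2.
    """
    mean = statistics.fmean(distances)
    variance = statistics.pvariance(distances)
    return mean * mean / variance, variance / mean
```

A gamma model only makes sense for non-negative, right-skewed distance values; a Gaussian model suits roughly symmetric ones.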
22 Feature Analysis
23 Cepstral Analysis
- Frequency Analysis of Speech
- STFT of Speech = Vocal Tract Component (slowly varying formants) x Excitation Component (fast varying harmonics)
- Log of STFT = Log of Excitation + Log of Vocal Tract Component
- IDFT of Log of STFT: Excitation and Vocal tract components
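The chain on this slide (spectrum, logarithm, inverse transform) corresponds to the real cepstrum, sketched below for a single frame. The small constant `eps` and the function names are assumptions added for illustration; windowing and framing are left out.

```python
import numpy as np

def real_cepstrum(frame, eps=1e-12):
    """IDFT of the log magnitude spectrum of one speech frame.

    eps guards against log(0); it is an implementation assumption.
    """
    spectrum = np.fft.fft(frame)
    log_magnitude = np.log(np.abs(spectrum) + eps)
    return np.fft.ifft(log_magnitude).real

def split_cepstrum(cepstrum, cutoff):
    """Split the first half of the cepstrum at a quefrency cutoff.

    Low quefrencies mainly carry the slowly varying vocal tract envelope,
    higher quefrencies the fast varying excitation.
    """
    half = len(cepstrum) // 2
    return cepstrum[:cutoff], cepstrum[cutoff:half]
```

Because the logarithm turns the product of the two spectral components into a sum, scaling the input only changes the zeroth cepstral coefficient.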
24 Cepstral Features
- Linear Predictive Cepstral Coefficients
- Obtained recursively from LPC coefficients
- Mel-Scale Frequency Cepstral Coefficients
- Nonlinear warping of frequency axis to model the human auditory system
25 Cepstral Features
- Delta Cepstral Coefficients
- First and second derivatives of cepstral coefficients
- Reflect dynamic information
- Used as supplement to original cepstral features
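A minimal sketch of appending delta and delta-delta coefficients to a matrix of cepstral features is shown below. It approximates the time derivatives with `numpy.gradient` (central differences); many systems use a regression over a window of frames instead, so the derivative method and function names here are assumptions.

```python
import numpy as np

def delta(features):
    """Approximate the time derivative of each coefficient (rows = frames)."""
    return np.gradient(features, axis=0)

def add_deltas(cepstra):
    """Stack static, delta, and delta-delta coefficients column-wise."""
    first = delta(cepstra)
    second = delta(first)
    return np.hstack([cepstra, first, second])
```

For an input with p coefficients per frame, the output has 3p columns per frame.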
26 Analysis of Cepstral Features
27 Analysis of Cepstral Features
28 Feature Combination
- Proposed Investigation
- What's the best feature combination?
- Will the delta and delta-delta coefficients contribute to the speaker-differentiating ability of the features?
29 Feature Combination Analysis
- T-test Based Evaluation
- Why?
- Robust to departures from the Gaussian distribution, especially for large data sizes and when the two samples to be compared have approximately equal sizes.
- Unaffected by differences in the variances of the compared variables
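A t-test based evaluation could be sketched as follows: compute the two-sample t statistic between the intra-speaker and inter-speaker distances obtained with a given feature set, and prefer the feature set with the larger magnitude. The unequal-variance (Welch) form is an assumption, chosen because it does not require equal variances; the function name is hypothetical.

```python
import math
import statistics

def t_statistic(sample_a, sample_b):
    """Two-sample t statistic, unequal-variance (Welch) form."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error
```

For example, `t_statistic(inter_distances, intra_distances)` grows as the two distance distributions separate, where both argument names stand for lists of measured distances.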
30 Data Analysis
31 Traditional Speaker Modeling
- Examples
- Gaussian Mixture Models
- Hidden Markov Models
- Neural Networks
- Prosody-Based Models
- Disadvantages
- Require large amounts of data
- Sometimes require training procedure
- Relatively complex
32 Conversational Data Modeling
- Current Method
- Equal Segmentation of Data
- Indiscriminate use of data
- Poor performance
- Problems
- Change points unknown
- Not all speech is useful
33 Proposed Speaker Modeling
[Diagram: SEGMENT 1, ..., SEGMENT M -> FEATURE COMPUTATION -> MODEL 1, ..., MODEL M]
34 Proposed Speaker Modeling
- Why voiced only?
- Same speech class compared
- Contains the most information
- What's the appropriate number of phonemes?
- Large enough to sufficiently represent speakers
- Small enough to avoid speaker overlap
35 Modeling Analysis
N = 20 (4 seconds of voiced speech)
36 Modeling Analysis
37 Modeling Analysis
N = 5 (1 second of voiced speech)
38 Application Systems
39 Unsupervised Speaker Indexing
- The Restrained-Relative Minimum Distance (RRMD) Approach
- REFERENCE MODELS
- Distance matrix between models:
$$D = \begin{bmatrix} 0 & D_{1,2} & D_{1,3} \\ D_{2,1} & 0 & D_{2,3} \\ D_{3,1} & D_{3,2} & 0 \end{bmatrix}$$
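40 Unsupervised Speaker Indexing
- The Restrained-Relative Minimum Distance (RRMD) Approach
[Flowchart: Observe the distances to Reference 1 and Reference 2 and take the minimum distance. Restraining Condition passed: Same Speaker. Restraining Condition failed: check the Relative Distance Condition. Relative Distance Condition passed: Same Speaker. Relative Distance Condition failed: Unusable Data.]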
41 RRMD Approach
- Restraining Condition
- Distance Likelihood Ratio (DLR)
- DLR > 1 → Same Speaker
- DLR < 1 → Check Relative Distance Condition
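The slide does not define the Distance Likelihood Ratio. One plausible reading, used here purely as an assumption, is the likelihood of an observed distance under the intra-speaker (same speaker) distance distribution divided by its likelihood under the inter-speaker distribution, with both distributions modeled as Gaussians. All names below are hypothetical.

```python
import math

def gaussian_pdf(d, mean, std):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (d - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def distance_likelihood_ratio(d, intra, inter):
    """Assumed DLR: p(d | same speaker) / p(d | different speakers).

    intra and inter are (mean, std) pairs estimated from training distances.
    """
    return gaussian_pdf(d, *intra) / gaussian_pdf(d, *inter)
```

Under this reading, a ratio above 1 means the observed distance is better explained by the same-speaker distribution, which matches the "DLR > 1" rule on the slide.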
42 RRMD Approach
- Relative Distance Condition
- Relative Distance
- $D_{rel} = d_{max} - d_{min}$
- $D_{rel} >$ threshold → Same Speaker
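The two RRMD conditions can be combined into a small decision function, sketched below for two reference models. The DLR is taken as an input because the slides do not define how it is computed. Assigning a segment to the nearer reference when either condition passes is an assumption, as are the function and parameter names.

```python
def rrmd_decision(dist_to_ref1, dist_to_ref2, dlr, rel_threshold):
    """Return 1 or 2 for the matched reference, or None for unusable data."""
    d_min = min(dist_to_ref1, dist_to_ref2)
    d_max = max(dist_to_ref1, dist_to_ref2)
    nearest = 1 if dist_to_ref1 <= dist_to_ref2 else 2

    # Restraining condition: the likelihood ratio favors "same speaker".
    if dlr > 1.0:
        return nearest
    # Relative distance condition: one reference is clearly closer.
    if d_max - d_min > rel_threshold:
        return nearest
    # Neither condition passed: treat the segment as unusable.
    return None
```

For example, `rrmd_decision(4.0, 5.0, dlr=0.5, rel_threshold=5.0)` fails both conditions and returns `None`.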
43 Preliminary Results
- Experiments
- 245 telephone conversations from the SWITCHBOARD database, with an average length of 400 seconds.
- T-Square statistics used
- Ground truth obtained from Mississippi State Transcriptions
44 Preliminary Results
N = 5
45 Preliminary Results
- RRMD Experiments
- $D_{rel}$ varied from 0 to 200
- Two Errors Defined
- Indexing Error
- $I_{err} = 100 - \text{Accuracy}$
- Undecided Error
- $N_u$ = number of detected undecided/unusable samples
- $N_c$ = number labeled as co-channel data
- Undecided error (computed from $N_u$ and $N_c$)
46 Preliminary Results
47 Speaker Count System
- The Residual Ratio Algorithm (RRA)
- Process is repeated K-1 times for counting up to K speakers
[Diagram: DLR-based model comparison, repeated in each round; models with too little data are removed and another model is selected]
48 RRA Examples: 2 Speakers
49 RRA Examples: 3 Speakers
50 Comparison
[Plots: two-speaker residual and three-speaker residual; Residual Ratio after 2nd round of RRA; plot label: Speaker 2]
51 Preliminary Results
- Experiments
- HTIMIT Database
- 1000 artificially generated K-speaker conversations (each) for K = 1-4
- Average conversation length: 1 min
- Mahalanobis distance used
52 Preliminary Results
- Counting Techniques
- Stopped Residual Ratio (SRR)
- Added Residual Ratio (ARR)
- Speaker count determined based on the sum of the Residual Ratios for all K-1 rounds. The higher the ARR, the higher the speaker count.
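The ARR idea can be illustrated with the rough sketch below. The slides do not give the details of the Residual Ratio Algorithm, so several points here are assumptions: each round picks the first remaining model as the reference, removes every model judged to be the same speaker, and records the residual ratio as the fraction of the original models still remaining. `same_speaker` is a hypothetical stand-in for the DLR-based model comparison, and the handling of models with too little data is not modeled.

```python
def residual_ratios(models, same_speaker, max_speakers):
    """Run K-1 rounds and return the residual ratio after each round."""
    total = len(models)
    remaining = list(models)
    ratios = []
    for _ in range(max_speakers - 1):
        if not remaining:
            ratios.append(0.0)
            continue
        reference = remaining[0]
        # Drop the reference and every model matched to the same speaker.
        remaining = [m for m in remaining[1:] if not same_speaker(reference, m)]
        ratios.append(len(remaining) / total)
    return ratios

def added_residual_ratio(ratios):
    """ARR: the sum of the residual ratios over all K-1 rounds."""
    return sum(ratios)
```

With more speakers in a conversation, more models survive each round, so the summed ratio grows, which is the behavior the ARR bullet describes.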
53 Preliminary Results
54 Preliminary Results
55 Fusion of Distances
56 Correlation Analysis
57 Correlation Analysis
58 Best Distance
- Optimized Fusion of Distances
- Minimize intra-speaker variation
- Maximize inter-speaker variation
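- Maximize the T-test value between the intra-speaker and inter-speaker distance distributions
- $T_{max}$: the maximized T-test value, attained by the New Distance
- New Distance: weighted combination of the distance measures, with $X$ = vector consisting of the distance measure values and $a$ = vector of the weights assigned to each distance measure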
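One way to find weights that maximize the t value of a linear combination of distance measures is Fisher's linear discriminant, sketched below. That the fused distance is the linear combination $a^T X$, and that the weights come from this closed form, are assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np

def fusion_weights(intra, inter):
    """Weights a maximizing the t value of a^T X between the two classes.

    intra, inter: arrays with one row per comparison and one column per
    distance measure (intra-speaker and inter-speaker comparisons).
    """
    mean_diff = inter.mean(axis=0) - intra.mean(axis=0)
    # Sum of the class covariance matrices (equal class sizes assumed).
    p = np.atleast_2d(np.cov(intra, rowvar=False) + np.cov(inter, rowvar=False))
    a = np.linalg.solve(p, mean_diff)
    return a / np.linalg.norm(a)

def t_value(intra_scores, inter_scores):
    """Two-sample t value between fused intra- and inter-speaker distances."""
    n1, n2 = len(intra_scores), len(inter_scores)
    diff = inter_scores.mean() - intra_scores.mean()
    spread = np.sqrt(intra_scores.var(ddof=1) / n1 + inter_scores.var(ddof=1) / n2)
    return float(diff / spread)
```

With equal numbers of intra-speaker and inter-speaker comparisons, `t_value(intra @ a, inter @ a)` is at least as large as the t value of any single distance measure.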
59 Best Distance
[Plot axes: Distance Measure 1, Distance Measure 2]
60 Preliminary Experiments: LPCCs
61 Preliminary Experiments: LPCCs
62 Preliminary Experiments: MFCCs
63 Preliminary Experiments: MFCCs
64 Proposal Summary
65 Research Goal Revisited
- To overcome the following challenges faced in differentiating between speakers participating in conversations:
- No a priori information
- Limited data size
- No knowledge of change points
- Co-channel speech
66 Summary of Work Accomplished
- Practical demonstration of the existence of the problem.
- Analysis of distance measures and features
- Development of a novel model formation technique
- Development, implementation and evaluation of two conversation-based speaker differentiation systems
- Introduction to and preliminary testing of an optimal distance formation
67 Proposed Work
- Feature Combinations
- Determination of the best combination of features using univariate tests of similarity
- Enhancement of feature combinations using Principal Component Analysis.
- Fusion of Distance Measures
- Enhancement of fusion technique using mutual information suppression techniques
- Decision-level distance measure fusion
68 Proposed Work
- Further development of introduced systems
- Use of all distance measures
- Use of best feature combination
- The use of the optimal distance
- Implementation of decision-level fusion technique
69 Final Goal
- A speaker recognition system for conversations that yields results comparable to those of non-conversational systems.
70 Publications
- U. Ofoegbu, A. Iyer, R. Yantorno, "Detection of a Third Speaker in Telephone Conversations," ICSLP, INTERSPEECH 2006.
- U. Ofoegbu, A. Iyer, R. Yantorno and S. Wenndt, "Unsupervised Indexing of Noisy Conversations with Short Speaker Utterances," IEEE Aerospace Conference, March 2007.
- U. Ofoegbu, A. Iyer, R. Yantorno, "A Simple Approach to Unsupervised Speaker Indexing," IEEE ISPACS, 2006.
- U. Ofoegbu, A. Iyer, R. Yantorno, "A Speaker Count System for Telephone Conversations," IEEE ISPACS, 2006.
- A. Iyer, U. Ofoegbu, R. Yantorno, "Speaker Discriminative Distances: Comprehensive Study," IEEE Transactions on Speech and Audio Processing (submitted).
72 Cepstral Features
- Linear Predictive Cepstral Coefficients
- Obtained recursively from LPC coefficients
- Let LPC vector = $[a_0, a_1, a_2, \ldots, a_p]$ and LPCC vector = $[c_0, c_1, c_2, \ldots, c_p, \ldots, c_{n-1}]$
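The recursion itself is not legible on this slide. A commonly used form is sketched below under stated assumptions: the all-pole model is written as H(z) = 1 / (1 - sum_k a_k z^-k), so the input holds a_1 ... a_p only, and the gain-dependent term c_0 is left out. Other sign conventions for the LPC coefficients change the signs in the recursion. The function name is hypothetical.

```python
def lpc_to_lpcc(a, n_ceps):
    """Convert LPC coefficients [a_1, ..., a_p] to [c_1, ..., c_n_ceps].

    Recursion (assumed convention):
      c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}        for 1 <= n <= p
      c_n =       sum_{k=n-p}^{n-1} (k/n) c_k a_{n-k}      for n > p
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # index 0 is unused (gain term omitted)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```

For a single-pole model with a_1 = 0.5, the result follows the series 0.5^n / n, which is the known cepstrum of 1 / (1 - 0.5 z^-1).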
74 Best Distance
- Intra-speaker and inter-speaker distance lengths are always equal, therefore:
- $P$ = sum of the covariance matrices of the two classes.
- $\lambda_1$ = maximum eigenvalue obtained by solving the generalized eigenvalue problem
- $Q$ = the square of the distance between the mean vectors of the two classes
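The equation that these definitions belong to is missing from the slide. A derivation consistent with them is sketched below, assuming the fused distance is the linear combination $a^T X$, both classes contain $n$ distance vectors with means $m_1, m_2$ and covariances $S_1, S_2$, and $Q$ is read as the outer product of the mean difference.

```latex
\begin{aligned}
T^2(a) &= \frac{n\,\bigl(a^T (m_2 - m_1)\bigr)^2}{a^T (S_1 + S_2)\, a}
        = n\,\frac{a^T Q\, a}{a^T P\, a},
  \qquad Q = (m_2 - m_1)(m_2 - m_1)^T,\quad P = S_1 + S_2 \\
Q\,a &= \lambda\, P\, a
  \quad\Longrightarrow\quad
  \lambda_1 = (m_2 - m_1)^T P^{-1} (m_2 - m_1),
  \qquad a \propto P^{-1}(m_2 - m_1) \\
T_{max}^2 &= n\,\lambda_1
\end{aligned}
```

Because $Q$ has rank one under this reading, the generalized eigenvalue problem has a single nonzero eigenvalue, and the weights follow in closed form.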
75 RRMD Approach
- Relative Distance Condition