Speaker Discrimination: The Challenge of Conversational Data

1
Speaker Discrimination: The Challenge of
Conversational Data
Dissertation Committee
Advisor: Robert Yantorno, Ph.D.
Members: Dennis Silage, Ph.D.; Brian Butz, Ph.D.;
Iyad Obeid, Ph.D.; Eugene Kwatny, Ph.D.
Uchechukwu O. Ofoegbu
2
Presentation Outline
  • Problem Statement and Research Goal
  • Scope of Research
  • Distance Analysis
  • Feature Analysis
  • Data Analysis
  • Application Systems
  • Fusion of Distances
  • Proposal Summary

3
  • Problem Statement and Research Goal

4
Conventional Speaker Recognition
  • Speaker Identification
  • Who is this speaker?
  • Speaker Verification
  • Is he who he claims to be?

System Output
5
Conversation Segmentation
  • Broadcast News/Conference Data
  • Conversational Data

6
Problems with Conversational Data
  • No a priori information available from
    participating speakers.
  • Training is impossible
  • No a priori knowledge of change points
  • Speakers alternate very rapidly.
  • Limited amounts of data for single speaker
    representations
  • Distortion
  • Channel noise, co-channel data

7
Proposed Solutions
  • Selective creation of data models
  • Development of an optimal distance measure
  • Decision level fusion of distance measures
  • Development of application-specific system

8
  • Scope of Research

9
Criminal Activity Detection
  • Monitoring inmate conversations
  • Prevention of 3-way calls
  • Notification of suspicious contacts
  • Enhancement of keyword detection
  • Uncooperative data collection
  • Forensics
  • Voiceprints

10
Commercial Services
  • Automated Customer Services
  • Personalized contact with customers
  • Search/Retrieval of Audio Data

11
Homeland Security
  • Military Activities
  • Pilot-control tower communications
  • Detection of unidentified speakers on pilot radio
    channels
  • Terrorist Identification

12
  • Distance Analysis

13
Distance Measures
  • Univariate vs. Multivariate Analysis

14
Distance Measures
  • Notations
  • Random variables being compared
  • $X = [X_1, X_2, \ldots, X_p]$: $n_x \times p$ matrix
  • $Y = [Y_1, Y_2, \ldots, Y_p]$: $n_y \times p$ matrix
  • Properties
  • $Q(X, Y) \geq 0$,
  • $Q(X, Y) = 0$ iff $X = Y$,
  • $Q(X, Y) = Q(Y, X)$,
  • $Q(X, Y) \leq Q(X, Z) + Q(Z, Y)$

15
Distance Measures
  • Mahalanobis Distance
  • $Q_{\mathrm{Mahalanobis}}(X, Y) = (\mu_x - \mu_y)^T \Sigma^{-1} (\mu_x - \mu_y)$
  • $\Sigma$ = combined covariance matrix of X and Y
  • Hotelling's T-Square Statistics
  • Cik = element in the ith row and kth column of
    the inverse of C
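The T-Square formula itself did not survive in this transcript. The sketch below shows one standard two-sample form, under the assumption that the "combined covariance" is the pooled sample covariance and that T-Square is the Mahalanobis term scaled by nx*ny/(nx+ny). The function names are illustrative and not taken from the slides.

```python
def mean_vector(rows):
    """Column-wise mean of an n-by-p list of feature rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def scatter_matrix(rows, mu):
    """Sum of outer products of the deviations from the mean (p-by-p)."""
    p = len(mu)
    s = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[i] - mu[i] for i in range(p)]
        for i in range(p):
            for k in range(p):
                s[i][k] += d[i] * d[k]
    return s

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z

def mahalanobis_and_t2(x, y):
    """Return (Mahalanobis distance, Hotelling T-Square) for two feature sets.

    Assumption: the combined covariance is the pooled sample covariance.
    """
    nx, ny = len(x), len(y)
    mx, my = mean_vector(x), mean_vector(y)
    sx, sy = scatter_matrix(x, mx), scatter_matrix(y, my)
    p = len(mx)
    pooled = [[(sx[i][k] + sy[i][k]) / (nx + ny - 2) for k in range(p)]
              for i in range(p)]
    diff = [mx[i] - my[i] for i in range(p)]
    z = solve(pooled, diff)  # pooled^{-1} (mu_x - mu_y)
    d2 = sum(diff[i] * z[i] for i in range(p))
    t2 = nx * ny / (nx + ny) * d2
    return d2, t2
```

Solving the linear system avoids forming the inverse covariance explicitly. The pooled covariance must be non-singular, which requires more frames than feature dimensions.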

16
Distance Measures
  • Kullback-Leibler (KL) Distance
  • Bhattacharyya Distance
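The formulas for these two distances are missing from the transcript. As a rough illustration only, the sketch below gives the well-known closed forms for two univariate Gaussians. It assumes Gaussian models and a symmetrized KL divergence; the slides do not say which variants were used.

```python
import math

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL divergence KL(p || q) between two univariate Gaussians."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)

def symmetric_kl(mu_p, sigma_p, mu_q, sigma_q):
    """Symmetrized KL: KL(p || q) + KL(q || p), so that Q(X, Y) = Q(Y, X)."""
    return (kl_gaussian(mu_p, sigma_p, mu_q, sigma_q)
            + kl_gaussian(mu_q, sigma_q, mu_p, sigma_p))

def bhattacharyya_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """Bhattacharyya distance between two univariate Gaussians."""
    var_sum = sigma_p ** 2 + sigma_q ** 2
    mean_term = (mu_p - mu_q) ** 2 / (4 * var_sum)
    var_term = 0.5 * math.log(var_sum / (2 * sigma_p * sigma_q))
    return mean_term + var_term
```

Both functions return 0 for identical distributions and grow as the means or variances move apart.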

17
Distance Measures
  • Levene's Test
  • Derived from T-Square statistics as follows
  • Each set of points is transformed along each
    vector into absolute divergence from the mean
    vector
  • The T-Square Statistic is then applied on the
    transformed features.
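A minimal sketch of the transformation step described above, assuming each utterance is an n-by-p list of feature rows. The function name is illustrative. The T-Square statistic would then be computed on the two transformed sets in place of the raw features.

```python
def absolute_deviation_transform(rows):
    """Replace each feature value by its absolute deviation from the
    column mean of its own set (the Levene-style transformation)."""
    n = len(rows)
    mu = [sum(col) / n for col in zip(*rows)]
    return [[abs(v - m) for v, m in zip(r, mu)] for r in rows]

# Example: two frames with two features each.
frames = [[1.0, 10.0], [3.0, 14.0]]
transformed = absolute_deviation_transform(frames)
```

After the transformation, a difference in means between two sets reflects a difference in spread of the original features, so the test compares variances rather than means.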

18
Procedural Set-up
  • HTIMIT database used
  • Average utterance length: 5 seconds
  • Intra-speaker distance computations

Randomly Select 2 Utterances
19
Procedural Set-up
  • Inter-speaker, different utterances distance
    computations

Randomly Select Utterance
Randomly Select Utterance
20
Analysis of Distance Measures
  • Mahalanobis Distance: Gaussian Estimate

21
Analysis of Distance Measures
  • Levene's Test: Gamma Estimate

22
  • Feature Analysis

23
Cepstral Analysis
Frequency Analysis of Speech
  • STFT of speech = excitation component
    (fast-varying harmonics) × vocal tract component
    (slowly varying formants)
  • Log of STFT = log of excitation + log of vocal
    tract component
  • IDFT of log of STFT = excitation + vocal tract
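The chain on this slide (spectrum, then log, then inverse transform) can be sketched as a real cepstrum. This is a small illustration with a naive O(n²) DFT and circular convolution standing in for the excitation/vocal-tract product; the function names are assumptions, and a real system would use windowed frames and an FFT.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def real_cepstrum(frame):
    """IDFT of the log magnitude spectrum of one frame.

    Raises ValueError if any spectral magnitude is exactly zero.
    """
    log_mag = [math.log(abs(v)) for v in dft(frame)]
    return [c.real for c in dft(log_mag, inverse=True)]

def circular_convolve(a, b):
    """Circular convolution, which multiplies the two DFTs exactly."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]

# Toy "excitation" and "vocal tract" responses (illustrative values).
excitation = [1.0, 0.0, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0]
vocal_tract = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
speech = circular_convolve(excitation, vocal_tract)
```

Because the log turns the spectral product into a sum, `real_cepstrum(speech)` equals `real_cepstrum(excitation)` plus `real_cepstrum(vocal_tract)` element by element, which is what makes the two components separable.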


24
Cepstral Features
  • Linear Predictive Cepstral Coefficients
  • Obtained Recursively from LPC Coefficients
  • Mel-Scale Frequency Cepstral Coefficients
  • Nonlinear warping of frequency axis to model the
    human auditory system
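The slide does not say which mel-scale formula was used. As an illustration, the sketch below uses one widely used mapping, mel = 2595 log10(1 + f/700); treat the constants as an assumption rather than the presenter's choice.

```python
import math

def hz_to_mel(freq_hz):
    """Map a frequency in Hz onto the mel scale (common 2595/700 variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_spaced_frequencies(low_hz, high_hz, count):
    """Return `count` frequencies equally spaced on the mel scale,
    for example as filterbank centre frequencies."""
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (count - 1)
    return [mel_to_hz(lo + i * step) for i in range(count)]
```

Frequencies spaced equally in mel are packed densely at low frequencies and sparsely at high frequencies, which is the nonlinear warping the slide refers to.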

25
Cepstral Features
  • Delta Cepstral Coefficients
  • First and Second derivatives of cepstral
    coefficients
  • Reflects dynamic information
  • Used as supplement to original cepstral features
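A minimal sketch of how delta and delta-delta coefficients are commonly computed, using a regression over neighbouring frames with edge frames repeated. The window width of 2 and the function names are assumptions; the slides do not specify them.

```python
def delta(frames, width=2):
    """First-order regression deltas over +/- `width` neighbouring frames.

    `frames` is a list of equal-length coefficient lists; frames beyond
    either end are replaced by the first or last frame.
    """
    t_max = len(frames) - 1
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(len(frames)):
        row = []
        for i in range(len(frames[0])):
            num = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, t_max)][i]
                behind = frames[max(t - n, 0)][i]
                num += n * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out

def add_deltas(frames, width=2):
    """Append delta and delta-delta coefficients to each static frame."""
    d1 = delta(frames, width)
    d2 = delta(d1, width)
    return [c + a + b for c, a, b in zip(frames, d1, d2)]
```

For a coefficient that rises by one unit per frame, the interior delta values come out as 1, matching the slope.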

26
Analysis of Cepstral Features
  • Mahalanobis Distance

27
Analysis of Cepstral Features
  • Levene's Test

28
Feature Combination
  • Proposed Investigation
  • What's the best feature combination?
  • Will the delta and delta-delta coefficients
    contribute to the speaker-differentiating ability
    of the features?

29
Feature Combination Analysis
  • T-test Based Evaluation
  • Why?
  • Robust to departures from the Gaussian
    distribution, especially for large data sizes
    and when the two samples to be compared are of
    approximately equal size.
  • Unaffected by differences in the variances of the
    compared variables
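The slide does not state which t-test variant was used. As a sketch, the unequal-variance (Welch) t statistic below could be used to score how well a feature set separates intra-speaker from inter-speaker distances; the variable names and sample values are purely illustrative.

```python
import math
import statistics

def t_statistic(sample_a, sample_b):
    """Welch's t statistic between two samples (unequal variances allowed)."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)  # unbiased sample variance
    var_b = statistics.variance(sample_b)
    std_err = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / std_err

# Hypothetical distance values for one feature combination.
intra_speaker = [1.0, 2.0, 3.0, 4.0, 5.0]
inter_speaker = [3.0, 4.0, 5.0, 6.0, 7.0]
score = abs(t_statistic(inter_speaker, intra_speaker))
```

A larger absolute t value means the two distance distributions are better separated, so candidate feature combinations can be ranked by this score.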

30
  • Data Analysis

31
Traditional Speaker Modeling
  • Examples
  • Gaussian Mixture Models
  • Hidden Markov Models
  • Neural Networks
  • Prosody-Based Models
  • Disadvantages
  • Require large amounts of data
  • Sometimes require training procedure
  • Relatively complex

32
Conversational Data Modeling
  • Current Method
  • Equal Segmentation of Data
  • Indiscriminate use of data
  • Poor performance
  • Problems
  • Change points unknown
  • Not all speech is useful

33
Proposed Speaker Modeling
[Diagram: SEGMENT 1 ... SEGMENT M → FEATURE COMPUTATION → MODEL 1 ... MODEL M]
34
Proposed Speaker Modeling
  • Why voiced only?
  • Same speech class compared
  • Contains the most information
  • What's the appropriate number of phonemes?
  • Large enough to sufficiently represent speakers
  • Small enough to avoid speaker overlap
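One possible reading of the proposed model formation is sketched below: keep only voiced segments and pool every N consecutive ones into a single model. The input layout (a voiced flag plus a list of feature frames per segment), the dropping of a trailing incomplete group, and the function name are all assumptions, not details from the slides.

```python
def build_models(segments, n=5):
    """Group consecutive voiced segments into models of `n` segments each.

    `segments` is a list of (is_voiced, frames) pairs, where `frames` is a
    list of feature vectors. A trailing group with fewer than `n` voiced
    segments is discarded.
    """
    voiced = [frames for is_voiced, frames in segments if is_voiced]
    models = []
    for start in range(0, len(voiced) - n + 1, n):
        group = voiced[start:start + n]
        # Pool the frames of the grouped segments into one model's data.
        models.append([frame for seg in group for frame in seg])
    return models

# Toy input: every third segment is unvoiced and gets skipped.
segments = [(i % 3 != 0, [[float(i)]]) for i in range(12)]
models = build_models(segments, n=5)
```

Choosing N trades off the two bullets above: larger groups give more data per model but raise the chance that one model spans a speaker change.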

35
Modeling Analysis
N = 20: 4 seconds of voiced speech
36
Modeling Analysis
37
Modeling Analysis
N = 5: 1 second of voiced speech
38
  • Application Systems

39
Unsupervised Speaker Indexing
  • The Restrained-Relative Minimum Distance (RRMD)
    Approach

REFERENCE MODELS
Distance matrix:
  0     D1,2  D1,3
  D2,1  0     D2,3
  D3,1  D3,2  0
40
Unsupervised Speaker Indexing
  • The Restrained-Relative Minimum Distance (RRMD)
    Approach

[Flowchart: observe distances to Reference 1 and Reference 2 → minimum distance → restraining condition. Passed: same speaker. Failed: relative distance condition. Passed: same speaker. Failed: unusable data.]
41
RRMD Approach
  • Restraining Condition
  • Distance Likelihood Ratio
  • DLR > 1 → Same Speaker
  • DLR < 1 → Check Relative Distance Condition

42
RRMD Approach
  • Relative Distance Condition
  • Relative Distance
  • Drel = dmax - dmin
  • Drel > threshold → Same Speaker

dmin
dmax
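The two conditions can be combined into a single decision function, sketched below under several assumptions: the relative distance is taken as the difference dmax - dmin, the distance likelihood ratio (DLR) is supplied by the caller because the slides do not define it, and the function name and return convention are illustrative.

```python
def rrmd_decide(dist_to_ref1, dist_to_ref2, dlr, rel_threshold):
    """Assign a test model to reference 1 or 2, or reject it.

    Returns 1 or 2 for the matched reference speaker, or None when the
    model is treated as undecided/unusable data.
    """
    closest = 1 if dist_to_ref1 <= dist_to_ref2 else 2

    # Restraining condition: a DLR above 1 accepts the closest reference.
    if dlr > 1:
        return closest

    # Relative distance condition: accept only if the two reference
    # distances differ by more than the threshold.
    d_rel = max(dist_to_ref1, dist_to_ref2) - min(dist_to_ref1, dist_to_ref2)
    if d_rel > rel_threshold:
        return closest

    return None
```

For example, `rrmd_decide(30.0, 35.0, 0.5, 20.0)` returns `None`, because the model is nearly equidistant from both references and the DLR does not support a match.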
43
Preliminary Results
  • Experiments
  • 245 telephone conversations from the SWITCHBOARD
    database, with an average length of 400 seconds.
  • T-Square statistics used
  • Ground truth obtained from Mississippi State
    Transcriptions

44
Preliminary Results
  • Best N Estimation

N = 5
45
Preliminary Results
  • RRMD Experiments
  • Drel Varied from 0-200
  • Two Errors Defined
  • Indexing Error
  • Ierr = 100 - Accuracy
  • Undecided Error
  • Nu = number of detected undecided/unusable
    samples
  • Nc = number labeled as co-channel data
  • undecided error

46
Preliminary Results
47
Speaker Count System
  • The Residual Ratio Algorithm (RRA)
  • Process is repeated K-1 times for counting up to
    K speakers

[Diagram labels: "Too little data removed, select another model"; "DLR-based Model Comparison" (repeated for each round)]
48
RRA Examples: 2 Speakers

49
RRA Examples: 3 Speakers
50
Comparison
[Figure panels: two-speaker residual and three-speaker residual; each shows the residual ratio after the 2nd round of RRA]
51
Preliminary Results
  • Experiments
  • HTIMIT Database
  • 1000 artificially generated K-speaker
    conversations for each K = 1 to 4
  • Average conversation length: 1 min
  • Mahalanobis distance used

52
Preliminary Results
  • Counting Techniques
  • Stopped Residual Ratio (SRR)
  • Added Residual Ratio (ARR)
  • Speaker count is determined from the sum of the
    Residual Ratios over all K-1 rounds: the higher
    the ARR, the higher the speaker count

53
Preliminary Results
54
Preliminary Results
55
  • Fusion of Distances

56
Correlation Analysis
57
Correlation Analysis
58
Best Distance
  • Optimized Fusion of Distances
  • Minimize intra-speaker variation
  • Maximize inter-speaker variation
  • Maximize T-test value between the intra-speaker
    and inter-speaker distance distributions

At Tmax: New Distance = $a^T X$
X = vector consisting of the distance measure values
a = vector of the weights assigned to each distance measure
59
Best Distance


[Figure axes: Distance Measure 1 vs. Distance Measure 2]
60
Preliminary Experiments

LPCCs
61
Preliminary Experiments

LPCCs
62
Preliminary Experiments

MFCCs
63
Preliminary Experiments

MFCCs
64
  • Proposal Summary

65
Research Goal Revisited
  • To overcome the following challenges faced in
    differentiating between speakers participating in
    conversations:
  • No a priori information
  • Limited data size
  • No knowledge of change points
  • Co-channel speech

66
Summary of Work Accomplished
  • Practical demonstration of the existence of the
    problem
  • Analysis of distance measures and features
  • Development of a novel model formation technique
  • Development, implementation and evaluation of two
    conversation-based speaker differentiation
    systems
  • Introduction to and preliminary testing of an
    optimal distance formation

67
Proposed Work
  • Feature Combinations
  • Determination of the best combination of features
    using univariate tests of similarity
  • Enhancement of feature combinations using
    Principal Component Analysis.
  • Fusion of Distance Measures
  • Enhancement of fusion technique using mutual
    information suppression techniques
  • Decision-level distance measure fusion

68
Proposed Work
  • Further development of introduced systems
  • Use of all distance measures
  • Use of best feature combination
  • The use of the optimal distance
  • Implementation of decision-level fusion technique

69
Final Goal
  • A speaker recognition system for conversations
    that yields results comparable to those of
    non-conversational systems.

70
Publications
  • U. Ofoegbu, A. Iyer, R. Yantorno, "Detection of a
    Third Speaker in Telephone Conversations," ICSLP,
    INTERSPEECH 2006.
  • U. Ofoegbu, A. Iyer, R. Yantorno and S. Wenndt,
    "Unsupervised Indexing of Noisy Conversations
    with Short Speaker Utterances," IEEE Aerospace
    Conference, March 2007.
  • U. Ofoegbu, A. Iyer, R. Yantorno, "A Simple
    Approach to Unsupervised Speaker Indexing," IEEE
    ISPACS, 2006.
  • U. Ofoegbu, A. Iyer, R. Yantorno, "A Speaker
    Count System for Telephone Conversations," IEEE
    ISPACS, 2006.
  • A. Iyer, U. Ofoegbu, R. Yantorno, "Speaker
    Discriminative Distances Comprehensive Study,"
    IEEE Transactions on Speech and Audio Processing
    (submitted).

71
72
Cepstral Features
  • Linear Predictive Cepstral Coefficients
  • Obtained Recursively from LPC Coefficients

Let LPC vector = $[a_0, a_1, a_2, \ldots, a_p]$ and LPCC
vector = $[c_0, c_1, c_2, \ldots, c_{n-1}]$
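The recursion itself is missing from the transcript. The sketch below uses the textbook LPC-to-cepstrum recursion under an assumed convention: an all-pole model H(z) = G / (1 - Σ a_k z^-k), input coefficients a_1 ... a_p, and c_0 (which depends on the gain G) left out. The function name and the sign convention are assumptions; a different LPC sign convention flips the signs of the a_k.

```python
def lpc_to_lpcc(a, n_ceps):
    """Convert predictor coefficients a_1..a_p to cepstral coefficients
    c_1..c_n using the standard recursion.

    `a[0]` holds a_1, `a[1]` holds a_2, and so on.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] is a placeholder; c_0 is not computed
    for m in range(1, n_ceps + 1):
        # Direct term a_m exists only while m <= p.
        acc = a[m - 1] if m <= p else 0.0
        # Sum over k with 1 <= m - k <= p, that is max(1, m - p) <= k < m.
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]
```

As a sanity check, a single-pole model with a_1 = 0.5 has the known cepstrum c_m = 0.5^m / m, which the recursion reproduces.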
73
Conversational Data Modeling
  • Current Method
  • Equal Segmentation of Data
  • Indiscriminate use of data
  • Problems
  • Change points unknown
  • Not all speech is useful

74
Best Distance

  • Intra-speaker and inter-speaker distance lengths
    are always equal, therefore
  • P = sum of the covariance matrices of the
    two classes
  • $\lambda_1$ = maximum eigenvalue obtained by solving
    the generalized eigenvalue problem
  • Q = square of the distance between the
    mean vectors of the two classes
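The quantities listed above resemble a two-class Fisher criterion, so the sketch below treats the fusion that way. This is an assumption: Q is taken as the outer product of the mean difference d, in which case the generalized eigenvalue problem Q a = λ P a has the closed-form solution a ∝ P⁻¹ d with λ1 = dᵀ P⁻¹ d. The code handles exactly two distance measures, and all names are illustrative.

```python
def mean2(rows):
    """Mean of a list of 2-element rows."""
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def cov2(rows, mu):
    """Unbiased 2-by-2 sample covariance matrix."""
    n = len(rows)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for i in range(2):
            for k in range(2):
                s[i][k] += d[i] * d[k] / (n - 1)
    return s

def fusion_weights(intra, inter):
    """Return (weights a, eigenvalue lambda_1) for two distance measures.

    `intra` and `inter` are lists of [measure_1, measure_2] values from
    same-speaker and different-speaker comparisons.
    """
    m_intra, m_inter = mean2(intra), mean2(inter)
    c_intra, c_inter = cov2(intra, m_intra), cov2(inter, m_inter)
    p = [[c_intra[i][k] + c_inter[i][k] for k in range(2)] for i in range(2)]
    d = [m_inter[0] - m_intra[0], m_inter[1] - m_intra[1]]
    det = p[0][0] * p[1][1] - p[0][1] * p[1][0]
    # a = P^{-1} d, using the closed-form inverse of a 2-by-2 matrix.
    a = [(p[1][1] * d[0] - p[0][1] * d[1]) / det,
         (-p[1][0] * d[0] + p[0][0] * d[1]) / det]
    lam = a[0] * d[0] + a[1] * d[1]
    return a, lam

def fused_distance(a, x):
    """Weighted combination a^T x of the individual distance values."""
    return a[0] * x[0] + a[1] * x[1]
```

With these weights, a distance measure that does not separate the two classes receives a weight near zero, while a well-separating one dominates the fused distance.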

75
RRMD Approach
  • Relative Distance Condition