IRISA 2003 SPEAKER RECOGNITION SYSTEM

1
IRISA 2003 SPEAKER RECOGNITION SYSTEM
  • 1sp DETECTION
  • Limited Data
  • M. BEN, G. GRAVIER, A. OZEROV, F. BIMBOT
  • for the ELISA consortium

NIST Speaker Recognition Workshop, June 24-25, 2003
2
Outline
  • IRISA 2003 system
    • Introduction
    • Description
    • NIST03 SRE results
  • Experiments
    • Front-end
    • Modeling
    • Score normalization
  • Conclusions

3
IRISA 2003 system
  • Introduction
    • IRISA is a member of the ELISA consortium
    • IRISA 2003 system is based on a newly developed
      audio segmentation software: audioseg
    • Web links
      • IRISA/METISS: http://www.irisa.fr/metiss/accueil.html
      • ELISA consortium: http://elisa.ddl.ish-lyon.cnrs.fr
4
IRISA 2003 system
  • Description
    • front-end
      • 20 ms frames every 10 ms
      • 24-filter bank over 340-3400 Hz → 16 LFCC
      • RASTA filtering (secondary system)
      • deltas + delta log-energy are added
      • frame selection: bi-Gaussian modeling of the
        energy with ML classification of the frames
        (speech/silence)
      • global feature normalization (zero mean, unit
        var.)
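
The framing step above (20 ms frames every 10 ms) and the per-frame log-energy used later for frame selection can be sketched as follows. This is a minimal illustration, not the audioseg implementation: the 8 kHz sampling rate is an assumption (consistent with the 340-3400 Hz telephone band), and the function names `frame_signal` and `log_energy` are hypothetical.

```python
import numpy as np

def frame_signal(signal, sample_rate=8000, frame_ms=20, step_ms=10):
    """Split a 1-D signal into overlapping frames (20 ms every 10 ms by default).

    sample_rate=8000 is an assumption; the slides do not state it.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    n_frames = max(0, 1 + (len(signal) - frame_len) // step)
    # Row i holds samples [i*step, i*step + frame_len).
    idx = np.arange(frame_len)[None, :] + step * np.arange(n_frames)[:, None]
    return signal[idx]

def log_energy(frames, eps=1e-10):
    """Log of the sum of squared samples in each frame."""
    return np.log(np.sum(frames ** 2, axis=1) + eps)

signal = np.ones(8000)            # one second of dummy audio at 8 kHz
frames = frame_signal(signal)     # shape: (number of frames, 160)
energies = log_energy(frames)
```

With these defaults, each frame holds 160 samples and consecutive frames overlap by 80 samples.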

5
IRISA 2003 system
  • Description
    • background modeling
      • gender-dependent background models
      • 256-component GMMs with diagonal covariance matrices
      • prim. system: cellular data (NIST01)
      • second. system: cellular + landline data (NIST01)
    • speaker models
      • adapted from the background models with MAP
        estimation of the parameters (mean-only
        adaptation)
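
Mean-only MAP adaptation of a diagonal-covariance GMM can be sketched as below. This follows the commonly used relevance-factor formulation, which is an assumption here: the slides do not give the adaptation formula or the relevance value (16 is only a frequent default), and the function names are hypothetical.

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """Log-density of each frame under each diagonal Gaussian; shape (T, M)."""
    diff = x[:, None, :] - means[None, :, :]
    return -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                   + np.sum(np.log(2 * np.pi * variances), axis=1))

def map_adapt_means(x, weights, means, variances, relevance=16.0):
    """Return adapted means; weights and variances are left unchanged.

    relevance is an assumed hyperparameter, not taken from the slides.
    """
    log_p = np.log(weights) + log_gauss_diag(x, means, variances)
    log_p -= log_p.max(axis=1, keepdims=True)        # numerical stability
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)          # component posteriors
    n = post.sum(axis=0)                             # soft counts per component
    ex = post.T @ x / np.maximum(n, 1e-10)[:, None]  # posterior-weighted data means
    alpha = (n / (n + relevance))[:, None]           # data-dependent mixing weight
    return alpha * ex + (1.0 - alpha) * means

# Toy background model: two 1-D components; all adaptation data sits near the second.
weights = np.array([0.5, 0.5])
means = np.array([[-5.0], [5.0]])
variances = np.ones((2, 1))
data = np.full((100, 1), 6.0)
adapted = map_adapt_means(data, weights, means, variances)
```

Components that see little adaptation data stay close to the background means, which is the usual motivation for adapting from a background model rather than training speaker models from scratch on limited data.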

6
IRISA 2003 system
  • Description
    • scoring
      • frame score: log-likelihood ratio using the 10 best
        matching Gaussians in the background model
      • utterance score (N_T: number of frames in the utterance)

7
IRISA 2003 system
  • Description
    • score normalization: DT-norm
      • D-norm
        • D(spk): symmetric Kullback-Leibler distance
          between the speaker (spk) and the background models
      • DT-norm
        • mean and standard deviation of the D-norm scores
          of the test utterance using cohort impostor models
          (50 male + 50 female from NIST01 SRE)
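
A possible reading of DT-norm is sketched below: each raw score is first D-normalized, then the target's D-norm score is standardized with the mean and standard deviation of the cohort's D-norm scores on the same test utterance. The exact formulas are not preserved in the transcript, so dividing the score by the distance D(spk) is an assumption, the distances are taken as precomputed inputs, and the function names are hypothetical.

```python
import numpy as np

def d_norm(score, kl_distance):
    """Assumed D-norm: raw score divided by the symmetric KL distance D(spk)."""
    return score / kl_distance

def dt_norm(target_score, target_kl, cohort_scores, cohort_kls):
    """Standardize the target D-norm score with cohort D-norm statistics."""
    d_target = d_norm(target_score, target_kl)
    d_cohort = d_norm(np.asarray(cohort_scores, dtype=float),
                      np.asarray(cohort_kls, dtype=float))
    return (d_target - d_cohort.mean()) / d_cohort.std()

# Toy values: one target model and three cohort impostor models.
normalized = dt_norm(2.0, 4.0, [1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

In the system described above, the cohort would contain the 100 impostor models (50 male, 50 female), each scored against the test utterance.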

8
IRISA 2003 system
  • NIST03 SRE results: 1sp-limited
    • DET curves
    • 2 systems submitted
      • IRI_1: primary
        • baseline system
      • IRI_2: secondary
        • RASTA front-end
        • mixed cell. + land. data for world models

DCF      min      actual
IRI_1    0.3176   0.3205
IRI_2    0.3333   0.3396
9
Experiments
  • Front-end: frame selection
    • speech/silence classification based on a
      bi-Gaussian modeling of the frame energy
      • ML classification
      • or
      • threshold-based selection: t = μ2 - c·σ2
        (c: constant coef. to optimise)

[Figure: bi-Gaussian fit of the energy, G1(μ1, σ1) and G2(μ2, σ2)]
10
Experiments
  • Front-end: frame selection
    • speech/silence classification based on a
      bi-Gaussian modeling of the frame log-energy
      • ML classification
      • or
      • threshold-based selection: t = μ2 - c·σ2
        (c: constant coef. to optimise)

[Figure: bi-Gaussian fit of the log-energy, G1(μ1, σ1) and G2(μ2, σ2)]
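
Both selection rules can be sketched on a vector of per-frame energies (or log-energies). This is an illustration under assumptions: the two Gaussians are fitted with a plain EM loop, the second value of each Gaussian is read as a standard deviation, the higher-mean Gaussian G2 is taken to model speech, and frames above the threshold are kept. Function names are hypothetical.

```python
import numpy as np

def fit_bigaussian(e, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture; returns (w, mu, sigma) sorted by mean."""
    e = np.asarray(e, dtype=float)
    mu = np.percentile(e, [25, 75])
    sigma = np.full(2, e.std() + 1e-8)
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        log_p = np.log(w) - np.log(sigma) - 0.5 * ((e[:, None] - mu) / sigma) ** 2
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)          # responsibilities
        n = r.sum(axis=0) + 1e-10
        w = n / n.sum()
        mu = (r * e[:, None]).sum(axis=0) / n
        sigma = np.sqrt((r * (e[:, None] - mu) ** 2).sum(axis=0) / n) + 1e-8
    order = np.argsort(mu)
    return w[order], mu[order], sigma[order]

def select_ml(e, mu, sigma):
    """Keep frames whose likelihood is higher under G2 (assumed speech) than under G1."""
    e = np.asarray(e, dtype=float)
    log_lik = -np.log(sigma) - 0.5 * ((e[:, None] - mu) / sigma) ** 2
    return log_lik[:, 1] > log_lik[:, 0]

def select_threshold(e, mu, sigma, c):
    """Keep frames above t = mu2 - c * sigma2."""
    return np.asarray(e, dtype=float) > mu[1] - c * sigma[1]

rng = np.random.default_rng(0)
energy = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(10.0, 1.0, 500)])
w, mu, sigma = fit_bigaussian(energy)
keep_ml = select_ml(energy, mu, sigma)
keep_thr = select_threshold(energy, mu, sigma, c=2.5)
```

Larger values of c lower the threshold and keep more frames, so c trades off discarding silence against discarding low-energy speech.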
11
Experiments
  • Front-end: frame selection
    • SYS_fs1: ML selection (E)
    • SYS_fs2: optimal threshold-based selection (E), c = 0.8
    • SYS_fs3: ML selection (LogE)
    • SYS_fs4: optimal threshold-based selection (LogE), c = 2.5
  • energy (E) bi-Gaussian modeling with ML selection
    of the frames performs the best
  • drastic selection: about 50% of the frames are
    discarded!

NIST 03 SRE data
12
Experiments
  • Front-end: feature normalization
    • st-norm: short-term norm. (0 mean, unit var.)
      on a sliding window (3 sec.)
    • lt-norm: long-term norm. (0 mean, unit var.)
      on all features
  • st-norm is applied before frame selection
  • lt-norm can be applied before or after frame selection
    • SYS_fn1: lt-norm + frame selection
    • SYS_fn2: st-norm + frame selection
    • SYS_fn3: frame selection + lt-norm
NIST 02 SRE data (subset)
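
The two normalizations can be sketched as below. The 300-frame window corresponds to 3 seconds at the 10 ms frame step; centering the window on the current frame and truncating it at the utterance edges are assumptions, since the slides do not describe the window placement. Function names are hypothetical.

```python
import numpy as np

def lt_norm(features):
    """Long-term norm.: zero mean, unit variance per dimension over all frames."""
    std = features.std(axis=0)
    return (features - features.mean(axis=0)) / np.where(std > 0, std, 1.0)

def st_norm(features, window=300):
    """Short-term norm.: zero mean, unit variance over a sliding window of frames.

    The window is centered on each frame (an assumption) and shrinks at the edges.
    """
    n = len(features)
    half = window // 2
    out = np.empty(features.shape, dtype=float)
    for t in range(n):
        seg = features[max(0, t - half):min(n, t + half + 1)]
        std = seg.std(axis=0)
        out[t] = (features[t] - seg.mean(axis=0)) / np.where(std > 0, std, 1.0)
    return out

rng = np.random.default_rng(1)
feats = rng.normal(3.0, 2.0, size=(1000, 3))
feats_lt = lt_norm(feats)
feats_st = st_norm(feats)
```

Applying `lt_norm` after frame selection means the statistics are computed on the retained frames only, which is the ordering used in SYS_fn3.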
13
Experiments
  • Front-end: feature normalization
    • SYS_fn5: frame selection + lt-norm
      (baseline system, prim.)
    • SYS_fn6: st-norm + frame selection + lt-norm
  • short-term normalization does not seem to work
    well (buggy?)
  • long-term normalization at the end of front-end
    seems to be crucial
  • best results obtained with frame selection
    followed by long-term normalization of remaining
    features

NIST 03 SRE data
14
Experiments
  • Modeling
    • Does size matter?
      • SYS_nbg1: 256-component GMMs (baseline)
      • SYS_nbg2: 2048-component GMMs
    • no gain in performance with 2048 Gaussians in
      the mixture
    • may be due to the frame selection process, which
      removes a large number of frames (?)

NIST 02 SRE data (subset)
15
Experiments
  • Score normalization
    • SYS_sn1: no score norm.
    • SYS_sn2: T-norm
    • SYS_sn3: DT-norm
    • SYS_sn4: DZT-norm
    • all score normalizations improve performance
    • DT-norm seems to perform better than T-norm and
      DZT-norm at the minimum DCF point

NIST 02 SRE data (subset)
16
Conclusions
  • IRISA participation in NIST03 SRE
    • validation of the new toolkit audioseg
    • new baseline system performs well
    • frame selection is crucial for good performance
  • Perspectives
    • work on feature transformations (PCA, ICA ...)
    • model adaptation on test data
    • hierarchical structural model adaptation