Speaker Localization: introduction to system evaluation

Transcript and Presenter's Notes
1
Speaker Localization: introduction to system
evaluation
  • Maurizio Omologo
  • with contributions by
  • Alessio Brutti, Luca Cristoforetti, Piergiorgio
    Svaizer
  • ITC-irst, Povo, Trento, Italy

NIST Rich Transcription '05 Evaluation
Workshop, Edinburgh, July 13th, 2005
2
Outline
  • Acoustic source / Speaker LOCalization (SLOC) and
    tracking: general issues
  • The localization problem in the lecture scenario
    of CHIL
  • Evaluation Criteria
  • Software developed at IRST
  • Experimental results
  • Examples
  • Description of IRST systems being evaluated

3
Speaker localization: general issues
  • Problem definition: locate and track, in 2D or
    3D, active speakers (one or more) in a given
    multi-speaker scenario
  • 2D localization: the source is assumed to be
    located in the plane where the acoustic sensors
    are placed
  • Forced assumption: the speaker, and any other
    acoustic source, are assumed to be point sources,
    in general slowly moving and emitting wide-band
    non-stationary signals. Radiation effects are
    neglected.
  • Key technical aspect: acoustic sensor signals are
    very different from each other. Their
    characteristics depend on speaker positions, room
    acoustics (reflections, reverberation), background
    noise, etc.
  • Most common approach (a sketch of this pipeline
    follows below):
  • 0) detect an acoustic event (see the Speech
    Activity Detection problem)
  • 1) compute the Time Difference of Arrival (TDOA)
    at different microphone pairs
  • 2) derive a source position estimate from
    geometry, and
  • 3) apply possible constraints (e.g. ignore
    locations outside the room)

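A minimal Python sketch of the approach above, under the assumption that the SAD, TDOA and geometric-solver routines are supplied by the caller; they are placeholders here, not the IRST implementation:

    import numpy as np

    def localize_frame(frame_signals, mic_pairs, room_bounds, fs,
                       sad_fn, tdoa_fn, solver_fn):
        """One pass of the typical SLOC pipeline sketched above.
        sad_fn, tdoa_fn and solver_fn are caller-supplied callables
        (e.g. an energy-based SAD, GCC-PHAT, and a geometric solver)."""
        # 0) Detect an acoustic event (Speech Activity Detection).
        if not sad_fn(frame_signals):
            return None
        # 1) Compute the Time Difference of Arrival at each microphone pair.
        tdoas = [tdoa_fn(frame_signals[i], frame_signals[k], fs)
                 for (i, k) in mic_pairs]
        # 2) Derive a source position estimate from geometry.
        position = np.asarray(solver_fn(tdoas, mic_pairs), dtype=float)
        # 3) Apply constraints, e.g. ignore locations outside the room.
        low, high = room_bounds
        if np.any(position < low) or np.any(position > high):
            return None
        return position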
4
Example of very near-field propagation
Distance between the microphones of a pair: 12 cm.
Speed of sound: 340 m/s.
In general, the TDOA is computed on the basis of
coherence in the direct wavefront.
For far field, plane-wave propagation can be assumed
(a worked example follows below).
Animation courtesy of Dr. Dan Russell, Kettering
University
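With the numbers above (d = 12 cm, c = 340 m/s) and the far-field plane-wave assumption, the admissible TDOA range and the direction of arrival follow directly:

    |\tau| \le \tau_{\max} = \frac{d}{c}
            = \frac{0.12\,\mathrm{m}}{340\,\mathrm{m/s}} \approx 0.35\,\mathrm{ms},
    \qquad
    \theta = \arcsin\!\left(\frac{c\,\tau}{d}\right)

where theta is measured from the broadside direction of the microphone pair.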
5
T-shaped arrays in the CHIL room at IRST
(Figure: room diagram with the speaker area marked)
Animation courtesy of Dr. Dan Russell, Kettering
University
6
Speaker localization: general issues
  • Sensory system characteristics:
  • Number of microphones
  • Sensitivity, spatial and spectral response of the
    microphones
  • Position of each microphone in the room
    (requires a calibration step)
  • Information required for the evaluation:
  • Reference time stamps
  • sample-level synchronous recordings of all the
    microphones, or
  • offset information to time-align signals recorded
    by different acquisition platforms
  • Offset information to time-align audio and video
    recordings
  • Ground-truth 3D labels for each active speaker:
    in general these are derived from a set of
    calibrated video cameras, a mouth-tracking
    algorithm, and a final manual check.
    In CHIL, the 3D labels of the lecturer are updated
    every 667 ms.
  • The output: a sequence of time stamps with 3D
    labels (see the record sketch below)

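As a concrete picture of that output, a minimal record sketch in Python; the field names are illustrative, and the coordinates appear to be in millimetres (compare the reference-file excerpt on a later slide):

    from typing import NamedTuple

    class SlocLabel(NamedTuple):
        time: float      # seconds; lecturer labels are spaced 667 ms apart
        speaker: str     # e.g. "lecturer" or "audience"
        x: float         # room coordinates, millimetres
        y: float
        z: float

    # One ground-truth entry roughly every 0.667 s while the lecturer is active:
    example = SlocLabel(time=116.910, speaker="lecturer",
                        x=1904.90, y=4034.49, z=1546.38)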
7
CHIL speaker localization in lecture scenarios
  • Common sensor set-up in the CHIL consortium:
  • 3 T-shaped microphone arrays, 2 close-talk
    microphones, 1 MarkIII array
  • optional use of all the microphones available in
    a site
  • UKA set-up:
  • 4 T-shaped arrays,
  • one/two Countryman close-talk microphones,
  • one NIST MarkIII (IRST-light version),
  • table-top microphones

8
Evaluation Criteria
  • Accurate localization (lecturer) vs. rough
    localization (audience)
  • Types of localization errors:
  • Fine (error < 50 cm for the lecturer, < 100 cm for
    the audience)
  • Gross (otherwise)

9
Evaluation Criteria: fine and gross errors

(Room layout diagram: fixed camera, pan-tilt-zoom
camera, screen, table for meetings, NIST MarkIII /
IRST Light microphone array)
10
Evaluation Criteria
  • Accurate localization (lecturer) vs. rough
    localization (audience)
  • Type of localization error:
  • Fine (error < 50 cm for the lecturer, < 100 cm for
    the audience)
  • Gross (otherwise)
  • Speech Activity Detection (external vs. internal
    to the localization system)
  • False alarm rate
  • Deletion rate
  • Average frame rate (/s)

The fine/gross classification represents the most
relevant cue for SLOC accuracy evaluation.
11
SLOC error computation in a time interval
12
Evaluation Criteria
  • Accurate localization (lecturer) vs. rough
    localization (audience)
  • Type of localization error:
  • Fine (error < 50 cm for the lecturer, < 100 cm for
    the audience)
  • Gross (otherwise)
  • Speech Activity Detection (external vs. internal
    to the localization system)
  • False alarm rate
  • Deletion rate
  • Average frame rate (/s)
  • Bias (fine and gross)
  • Localization precision: Pcor = N_FineErrors /
    N_localizations (see the formulas after this list)
  • When is this evaluation meaningful? For each
    analysis segment we need to know whether one or
    more acoustic sources (persons or noise) are
    active and, only in that case, an accurate set of
    x-y-z coordinates!
  • Related evaluation software (developed at
    ITC-irst and available on the NIST and CHIL web
    sites)

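A plausible reading of the main metrics, consistent with the summary output shown on the following slides; the exact definitions used by the evaluation tool may differ in detail. Here N_loc is the number of localized frames, N_fine the number of them with a fine error, and p-hat_n, p_n the estimated and reference positions:

    \mathrm{Pcor} = \frac{N_{\mathrm{fine}}}{N_{\mathrm{loc}}}, \qquad
    \mathrm{bias}_x = \frac{1}{N}\sum_{n=1}^{N}(\hat{x}_n - x_n), \qquad
    \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}
                    \lVert \hat{\mathbf{p}}_n - \mathbf{p}_n \rVert^2}

The "fine" statistics restrict the sums to frames with a fine error; "fine+gross" uses all localized frames.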
13
Evaluation software
  • It consists of two steps:
  • an XML converter from the manual transcription
    file and the 3D label file to a reference file
  • C code to derive the results
  • First step:

Transcriptions

<Turn startTime="19.637" endTime="548.798" speaker="lecturer">
<Sync time="109.246"/> report some results at this
<Sync time="110.861"/> <Event desc="nc-s" type="noise" extent="next"/> level that is very preliminary
<Sync time="113.843"/> <Event desc="pap" type="noise" extent="instantaneous"/>
<Sync time="114.753"/> so uh starting from a brief introd <Event desc="()" type="lexical" extent="previous"/> introduction of what are the main problems we have to face with
<Sync time="125.269"/>

Reference

116.910 1 1 lecturer 1904.90 4034.49 1546.38
117.576 1 1 lecturer 1669.07 4121.32 1571.96
118.243 1 1 lecturer 1371.42 4297.96 1585.04
118.910 1 1 lecturer 1339.13 4478.81 1564.58
119.575 1 1 lecturer 1225.10 4581.53 1574.26
120.243 1 1 lecturer 1065.48 4678.85 1562.49
120.908 1 1 lecturer 1116.63 4696.75 1569.60
121.575 1 1 lecturer 1294.30 4687.07 1566.99
122.242 1 1 lecturer 1369.35 4618.37 1523.87
122.908 1 1 lecturer 1369.59 4646.71 1579.83
128.908 1 0 audience 1165.48 4678.85 1562.49
128.908 1 0 audience 965.48 4228.85 1532.49
-
130.242 1 1 lecturer 1449.35 4638.37 1533.87
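A minimal Python sketch for reading the reference excerpt above; the meaning of the two integer flag columns is assumed here, the rest follows the visible layout:

    def read_reference(path):
        """Parse reference lines: time, two flags, speaker label, x y z (mm)."""
        frames = []
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) != 7:          # skip separators such as "-"
                    continue
                t = float(fields[0])
                flags = (int(fields[1]), int(fields[2]))
                speaker = fields[3]
                xyz = tuple(float(v) for v in fields[4:7])
                frames.append((t, flags, speaker, xyz))
        return frames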
14
Evaluation software
  • Second step:
  • Evaluation software parameters: reference
    seminar.ref, inputFile seminar.loc, evalOutput
    seminar.out, evalSummary seminar.sum,
    thresholdLecturer 500, thresholdAudience 1000,
    timestep 667

Localization output

116.75 1554.0 4190.2 1700
117.96 1403.5 4353.0 1700
118.05 1398.4 4353.1 1700
118.14 1355.9 4371.6 1700
118.24 1312.8 4374.5 1700
118.52 1216.0 4502.9 1700
123.62 1886.0 4475.2 1700
124.37 2037.3 1558.2 1700
124.46 2029.0 1540.4 1700
124.65 1993.0 1437.4 1700

Evaluation

322.53 ND  Ignored (Multiple Speakers)
323.19 ND  False Alarm
323.86 ND  No Speaker
325.86 ND  No Speaker
326.52 ND  Deletion Lecturer
331.19 ND  Ignored (Multiple Speakers)
331.86 ND  Ignored (Multiple Speakers)
332.52 135 Fine Error Lecturer

Summary

                                  Lecturer      Audience         Overall
Pcor                              0.94          0.83             0.94
Bias fine (x,y,z) mm              (79,-3,-1)    (106,-241,-22)   (80,-7,-2)
Bias fine+gross (x,y,z) mm        (115,35,-3)   (177,-34,-12)    (116,34,-3)
RMSE fine mm                      236           377              238
RMSE fine+gross mm                532           579              532
Deletion rate                     0.35          0.81             0.37
False alarm rate                  0.47
Loc. frames for error statistics  402           6                408

N. output loc. frames = 2242
Reference duration = 930.0
Average frames/sec = 2.41
N. reference frames = 1283
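A simplified Python sketch of the per-frame decision implied by the evaluation output above; the actual C code also handles the multi-speaker "Ignored" segments and SAD bookkeeping omitted here (thresholds in mm, as in the invocation above):

    import numpy as np

    def score_frame(speech_active, ref_xyz, hyp_xyz, threshold_mm=500.0):
        """Classify one reference frame against the localization output.
        ref_xyz / hyp_xyz are (x, y, z) in mm, or None when absent."""
        if hyp_xyz is None:
            return "Deletion" if speech_active else "No Speaker"
        if not speech_active:
            return "False Alarm"
        err = np.linalg.norm(np.asarray(hyp_xyz) - np.asarray(ref_xyz))
        return "Fine Error" if err < threshold_mm else "Gross Error"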
15
NIST evaluation '05 of SLOC systems
  • Participants: IRST, TU, UKA
  • Seminar segments:
  • 13 seminars recorded on November 23rd 2004, and
    in January and February 2005, at Karlsruhe
    University
  • In this NIST evaluation, performance was measured
    only on lecturers
  • Evaluation software parameters:
  • Thresholds for fine and gross errors: 50 cm
    (lecturer), 100 cm (audience)
  • Time step: 667 ms
  • Evaluation summary metrics:
  • Average frame rate, N. of loc. frames for
    statistics on the lecturer, false alarm rate,
    deletion rate, localization rate (Pcor), RMSE
    fine, RMSE fine+gross

16
Experimental Results
  • 13 seminars, E1 segments
  • N. of reference frames: 5788 (4014 s)
  • Time step: 667 ms

17
x-coordinate output examples
Seminar 20041123-09
18
x-coordinate output examples
Seminar 20041124-09
19
x-coordinate output examples
Seminar 20041123-10
20
IRST speaker localization and tracking systems
Maurizio Omologo, with contributions by Alessio
Brutti, Luca Cristoforetti, Piergiorgio Svaizer
ITC-irst, Povo, Trento, Italy
NIST Rich Transcription '05 Evaluation Workshop,
Edinburgh, July 13th, 2005
21
System description
  • Two techniques (a 2D triangulation sketch follows
    this list):
  • 1a) Use of two T-shaped arrays (B and D), two
    pairs for the 2D (x-y) location
  • 1b) Use of two pairs for the z coordinate;
    directions derived by CSP (GCC-PHAT) TDOA
    analysis
  • 2) Use of three T-shaped arrays and of the Global
    Coherence Field (GCF)

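A sketch of the 2D (x-y) step of technique 1a, assuming each array has already produced a direction of arrival from its horizontal pairs; the array positions and angles in the usage comment are made up, not the actual room geometry:

    import numpy as np

    def intersect_bearings(p1, theta1, p2, theta2):
        """Intersect two 2D bearing lines p_i + t * d_i to estimate the source (x, y).
        p1, p2: array positions (m); theta1, theta2: DOA angles (rad) in room coordinates."""
        d1 = np.array([np.cos(theta1), np.sin(theta1)])
        d2 = np.array([np.cos(theta2), np.sin(theta2)])
        A = np.column_stack((d1, -d2))
        t = np.linalg.solve(A, np.asarray(p2, float) - np.asarray(p1, float))
        return np.asarray(p1, float) + t[0] * d1

    # Example with made-up geometry:
    # xy = intersect_bearings((0.0, 0.0), np.radians(60), (4.0, 0.0), np.radians(120))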
22
TDOA estimate based on microphone pairs and CSP
(GCC-PHAT) analysis
  • (see Knapp and Carter 1976; Omologo and Svaizer,
    ICASSP 1994-1996 and IEEE Trans. on SAP 1997;
    U.S. Patent 5,465,302, October 1992)

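A compact NumPy sketch of GCC-PHAT (CSP) TDOA estimation as it is commonly implemented; it follows the Knapp-Carter / Omologo-Svaizer formulation cited above but is not the IRST code:

    import numpy as np

    def gcc_phat(x1, x2, fs, max_tau=None):
        """Estimate the TDOA between two microphone signals with GCC-PHAT."""
        n = len(x1) + len(x2)
        X1 = np.fft.rfft(x1, n=n)
        X2 = np.fft.rfft(x2, n=n)
        cross = X1 * np.conj(X2)
        cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep the phase only
        cc = np.fft.irfft(cross, n=n)         # generalized cross-correlation
        max_shift = n // 2
        if max_tau is not None:
            max_shift = min(int(fs * max_tau), max_shift)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # lags -max..+max
        lag = np.argmax(cc) - max_shift       # peak position in samples
        return lag / float(fs), cc            # TDOA in seconds, correlation vector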
23
IRST T-shaped microphone array
  • Technique based on CSP analysis
  • Use of four microphones (3 pairs)
  • Accurate 3D speaker localization using few
    microphones (a geometry sketch follows this list)
  • Since 1999 it has been a commercial product
    (AETHRA, Italy)

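A hypothetical sketch of such an array geometry and its three pairs; the 12 cm spacing is borrowed from the near-field slide, and the exact IRST layout may differ:

    import numpy as np

    D = 0.12  # inter-microphone spacing (m), assumed
    # Array-local coordinates of the four microphones of a T-shaped array.
    mics = np.array([
        [-D, 0.0, 0.0],   # left end of the horizontal bar
        [0.0, 0.0, 0.0],  # centre
        [ D, 0.0, 0.0],   # right end of the horizontal bar
        [0.0, 0.0, -D],   # vertical leg of the T
    ])
    # Three pairs: two horizontal pairs for the x-y direction,
    # one vertical pair for the z (elevation) information.
    pairs = [(0, 1), (1, 2), (1, 3)]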
24
Global Coherence Field

GCF(x, y, z) = \frac{1}{Q} \sum_{(i,k)} C_{ik}\big(\tau_{ik}(x, y, z)\big)

(x, y, z): hypothesized sound source position
Q: number of sensors
C_ik: coherence at a given microphone pair (i, k)
tau_ik(x, y, z): time delay at pair (i, k) assuming that the source
is in (x, y, z)
  • 3D location is based on the TDOA of vertical mic
    pairs,
  • once a 2D location has been derived by maximizing
    the GCF over all x-y coordinates (see the sketch
    below)

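A sketch of evaluating the GCF on a grid of candidate positions, reusing the gcc_phat correlation vectors from the earlier sketch; the normalization and search details are assumptions, not the IRST implementation:

    import numpy as np

    C_SOUND = 340.0  # speed of sound (m/s), as on the earlier slide

    def gcf_map(cc_list, mic_pairs, grid, fs):
        """Global Coherence Field over candidate source positions.
        cc_list  : one GCC-PHAT correlation vector per microphone pair
                   (lags -max_shift..+max_shift, as returned by gcc_phat).
        mic_pairs: list of (p_i, p_k) microphone coordinates (m) per pair.
        grid     : candidate (x, y, z) positions, shape (N, 3)."""
        grid = np.asarray(grid, float)
        Q = len(mic_pairs)
        gcf = np.zeros(len(grid))
        for cc, (p_i, p_k) in zip(cc_list, mic_pairs):
            p_i, p_k = np.asarray(p_i, float), np.asarray(p_k, float)
            max_shift = (len(cc) - 1) // 2
            # Theoretical delay tau_ik(x, y, z) for every candidate position.
            tau = (np.linalg.norm(grid - p_i, axis=1) -
                   np.linalg.norm(grid - p_k, axis=1)) / C_SOUND
            lags = np.clip(np.round(tau * fs).astype(int), -max_shift, max_shift)
            gcf += cc[lags + max_shift]
        return gcf / Q  # the position estimate is grid[np.argmax(gcf)]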
25
Recent results on UKA lectures
26
CSP analysis of a segment with two speakers (from
Seminar_2003-11-25_A_4)
27
Conclusions
  • This NIST evaluation has been very important to
    establish the evaluation approach introduced in
    CHIL during the last year
  • and to better understand the potential of the
    SLOC technologies under study
  • Need to further improve the reference
    transcriptions
  • Need to reduce the number of metrics: for
    instance, combining the false alarm rate and the
    deletion rate into a single figure, imposing the
    same external SAD, etc.
  • Need to address a real multi-speaker lecture
    scenario
  • much more challenging
  • new annotation tools are needed
  • For meetings, different evaluation criteria may be
    necessary
  • Person tracking based on audio-video fusion will
    also require other evaluation criteria