Title: Speaker Localization: introduction to system evaluation
1. Speaker Localization: introduction to system evaluation
- Maurizio Omologo
- with contributions by Alessio Brutti, Luca Cristoforetti, Piergiorgio Svaizer
- ITC-irst, Povo, Trento, Italy
NIST Rich Transcription 2005 Evaluation Workshop, Edinburgh, July 13th, 2005
2. Outline
- Acoustic source/Speaker LOCalization (SLOC) and tracking: general issues
- The localization problem in the lecture scenario of CHIL
- Evaluation criteria
- Software developed at IRST
- Experimental results
- Examples
- Description of IRST systems being evaluated
3. Speaker localization: general issues
- Problem definition: locate and track, in 2D or 3D, active speakers (one or more) in a given multi-speaker scenario
- 2D localization: the source is assumed to be located in the plane where the acoustic sensors are placed
- Forced assumption: the speaker, and any other acoustic source, are assumed to be point sources, in general slowly moving and emitting wide-band non-stationary signals. Radiation effects are neglected.
- Key technical aspect: the acoustic sensor signals are very different from each other. Their characteristics depend on the speaker position, room acoustics (reflections, reverberation), background noise, etc.
- Most common approach:
  - 0) detect an acoustic event (see the Speech Activity Detection problem)
  - 1) compute the Time Difference of Arrival (TDOA) at different microphone pairs
  - 2) derive a source position estimate from the geometry (a minimal sketch of this step follows below)
  - 3) apply possible constraints (e.g. to ignore locations outside the room)
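As an illustration of step 2, here is a minimal sketch assuming the TDOAs of step 1 are already available (e.g. from GCC-PHAT; see slide 22). The function name and the use of scipy are illustrative choices, not part of the systems described in these slides.

```python
import numpy as np
from scipy.optimize import least_squares

def locate_from_tdoas(mic_pairs, tdoas, x0, c=340.0):
    """Find the point whose predicted inter-microphone delays best match
    the measured TDOAs (nonlinear least squares).
    mic_pairs: list of (pos_i, pos_k) microphone coordinates in metres;
    tdoas: measured delays in seconds; x0: initial guess (e.g. room centre)."""
    def residuals(p):
        return [(np.linalg.norm(p - np.asarray(mi))
                 - np.linalg.norm(p - np.asarray(mk))) / c - tau
                for (mi, mk), tau in zip(mic_pairs, tdoas)]
    sol = least_squares(residuals, np.asarray(x0, dtype=float))
    # Step 3 would then reject or clip estimates falling outside the room.
    return sol.x
```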
4. Example of very near-field propagation
Distance between the microphones of a pair: 12 cm. Speed of sound: 340 m/s.
In general, the TDOA is computed on the basis of the coherence in the direct wavefront.
For the far field, plane-wave propagation can be assumed.
Animation courtesy of Dr. Dan Russell, Kettering
University
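As a worked illustration of these figures (not from the slide itself): with d = 12 cm and c = 340 m/s, the TDOA is bounded by |tau| <= d/c ≈ 0.35 ms, i.e. only a handful of samples at typical sampling rates (about 5.6 samples at 16 kHz). Under the far-field plane-wave assumption, the direction of arrival theta follows from tau = (d/c) sin(theta), i.e. theta = arcsin(c*tau/d).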
5. T-shaped arrays in the CHIL room at IRST
Speaker area
6. Speaker localization: general issues
- Sensory system characteristics:
  - number of microphones
  - sensitivity, spatial and spectral response of the microphones
  - position of each microphone in the room (a calibration step is needed)
- Information required for the evaluation:
  - reference time stamps
  - sample-level synchronous recordings of all the microphones, or offset information to time-align signals recorded by different acquisition platforms
  - offset information to time-align the audio and video recordings
  - ground-truth 3D labels for each active speaker: in general they are derived from a set of calibrated video cameras, a mouth-tracking algorithm, and a final manual check. In CHIL, the 3D labels of the lecturer are updated every 667 ms.
- The output: a sequence of time stamps + 3D labels
7. CHIL: speaker localization in lecture scenarios
- Common sensor set-up in the CHIL consortium:
  - 3 T-shaped mic. arrays, 2 close-talk mics, 1 MarkIII array
  - optional use of all the microphones available at a site
- UKA set-up:
  - 4 T-shaped arrays
  - one/two Countryman close-talk mics
  - one NIST MarkIII (IRST-Light version)
  - table-top mics
8. Evaluation Criteria
- Accurate localization (lecturer) vs rough localization (audience)
- Type of localization errors:
  - fine (error < 50 cm for the lecturer, 100 cm for the audience)
  - gross (otherwise)
9. Evaluation Criteria: fine and gross errors
[Figure: CHIL room map illustrating fine and gross errors, showing a fixed camera, a pan-tilt-zoom camera, the screen, the meeting table, and the NIST MarkIII / IRST-Light microphone array]
10. Evaluation Criteria
- Accurate localization (lecturer) vs rough localization (audience)
- Type of localization error:
  - fine (error < 50 cm for the lecturer, 100 cm for the audience)
  - gross (otherwise)
- Speech Activity Detection (external vs internal to the localization system):
  - false alarm rate
  - deletion rate
- Average frame rate (/s)
The fine/gross distinction represents the most relevant cue for SLOC accuracy evaluation.
11. SLOC error computation in a time interval
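The computation is presumably along the following lines (the matching rule stated here is an assumption; the thresholds are those given above): each reference 3D label at time t, p_ref(t), is paired with the system output p_hyp(t) falling within the same 667 ms interval; the localization error is the Euclidean distance e(t) = ||p_hyp(t) - p_ref(t)||, classified as fine if e(t) is below the lecturer/audience threshold and as gross otherwise.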
12. Evaluation Criteria
- Accurate localization (lecturer) vs rough localization (audience)
- Type of localization error:
  - fine (error < 50 cm for the lecturer, 100 cm for the audience)
  - gross (otherwise)
- Speech Activity Detection (external vs internal to the localization system):
  - false alarm rate
  - deletion rate
- Average frame rate (/s)
- Bias (fine and fine+gross)
- Localization precision: Pcor = N_FineErrors / N_Localizations
- When is this evaluation meaningful? For each analysis segment we need to know whether one or more acoustic sources (persons or noise) are active and, only in that case, an accurate set of x-y-z coordinates!
- Related evaluation software (developed at ITC-irst and available at the NIST and CHIL web sites); a sketch of the scoring logic follows below
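A minimal sketch of the per-frame scoring implied by these criteria, in Python (the actual tool is the C code described on the next slide, so the names and the exact handling of each case here are assumptions):

```python
import numpy as np

FINE_MM = {"lecturer": 500, "audience": 1000}  # fine/gross thresholds, in mm

def score_frame(ref, hyp, who="lecturer"):
    """Classify one analysis frame. ref/hyp are (x, y, z) in mm, or None
    when no speaker is active / no location was output by the system."""
    if ref is None:
        return "false_alarm" if hyp is not None else "no_speaker"
    if hyp is None:
        return "deletion"
    err = np.linalg.norm(np.asarray(hyp) - np.asarray(ref))
    return "fine" if err < FINE_MM[who] else "gross"

def pcor(labels):
    """Localization precision: fine localizations over all localizations."""
    n_loc = sum(l in ("fine", "gross") for l in labels)
    return sum(l == "fine" for l in labels) / n_loc if n_loc else 0.0
```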
13. Evaluation software
- It consists of two steps:
  - an XML converter, from the manual transcription file and the 3D label file to the reference file
  - C code to derive the results
- First step:

Transcription (input):

<Turn startTime="19.637" endTime="548.798" speaker="lecturer">
<Sync time="109.246"/> report some results at this
<Sync time="110.861"/> <Event desc="nc-s" type="noise" extent="next"/> level that is very preliminary
<Sync time="113.843"/> <Event desc="pap" type="noise" extent="instantaneous"/>
<Sync time="114.753"/> so uh starting from a brief introd <Event desc="()" type="lexical" extent="previous"/> introduction of what are the main problems we have to face with
<Sync time="125.269"/>

Reference (output):

116.910 1 1 lecturer 1904.90 4034.49 1546.38
117.576 1 1 lecturer 1669.07 4121.32 1571.96
118.243 1 1 lecturer 1371.42 4297.96 1585.04
118.910 1 1 lecturer 1339.13 4478.81 1564.58
119.575 1 1 lecturer 1225.10 4581.53 1574.26
120.243 1 1 lecturer 1065.48 4678.85 1562.49
120.908 1 1 lecturer 1116.63 4696.75 1569.60
121.575 1 1 lecturer 1294.30 4687.07 1566.99
122.242 1 1 lecturer 1369.35 4618.37 1523.87
122.908 1 1 lecturer 1369.59 4646.71 1579.83
128.908 1 0 audience 1165.48 4678.85 1562.49
128.908 1 0 audience 965.48 4228.85 1532.49
130.242 1 1 lecturer 1449.35 4638.37 1533.87
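A sketch of what the converter has to do on the transcription side (the element and attribute names are taken from the excerpt above; everything else, including the function name, is illustrative):

```python
import xml.etree.ElementTree as ET

def turns_from_transcription(path):
    """Collect speaker turns and their Sync time stamps from the
    transcription XML, ready to be joined with the 3D label file."""
    turns = []
    for turn in ET.parse(path).getroot().iter("Turn"):
        turns.append({
            "speaker": turn.get("speaker"),
            "start": float(turn.get("startTime")),
            "end": float(turn.get("endTime")),
            "syncs": [float(s.get("time")) for s in turn.iter("Sync")],
        })
    return turns
```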
14. Evaluation software
- Second step:

Evaluation software parameters:
reference seminar.ref
inputFile seminar.loc
evalOutput seminar.out
evalSummary seminar.sum
thresholdLecturer 500
thresholdAudience 1000
timestep 667

Localization output (time stamp, x, y, z):

116.75 1554.0 4190.2 1700
117.96 1403.5 4353.0 1700
118.05 1398.4 4353.1 1700
118.14 1355.9 4371.6 1700
118.24 1312.8 4374.5 1700
118.52 1216.0 4502.9 1700
123.62 1886.0 4475.2 1700
124.37 2037.3 1558.2 1700
124.46 2029.0 1540.4 1700
124.65 1993.0 1437.4 1700

Evaluation:

322.53  ND   Ignored (Multiple Speakers)
323.19  ND   False Alarm
323.86  ND   No Speaker
325.86  ND   No Speaker
326.52  ND   Deletion Lecturer
331.19  ND   Ignored (Multiple Speakers)
331.86  ND   Ignored (Multiple Speakers)
332.52  135  Fine Error Lecturer

Summary:

                              Lecturer       Audience        Overall
Pcor                          0.94           0.83            0.94
Bias fine (x,y,z) mm          (79,-3,-1)     (106,-241,-22)  (80,-7,-2)
Bias fine+gross (x,y,z) mm    (115,35,-3)    (177,-34,-12)   (116,34,-3)
RMSE fine mm                  236            377             238
RMSE fine+gross mm            532            579             532
Deletion rate                 0.35           0.81            0.37
False alarm rate              0.47
Loc. frames for error stats   402            6               408

N. output loc. frames = 2242, reference duration = 930.0 s, average frames/sec = 2.41, N. reference frames = 1283
15. NIST 2005 evaluation of SLOC systems
- Participants: IRST, TU, UKA
- Seminar segments:
  - 13 seminars, recorded on November 23rd 2004 and in January and February 2005 at Karlsruhe University
  - in this NIST evaluation, performance regarded only the lecturers
- Evaluation software parameters:
  - thresholds for fine and gross errors: 50 cm (lecturer), 100 cm (audience)
  - time step: 667 ms
- Evaluation summary metrics:
  - average frame rate, n. of loc. frames for statistics on the lecturer, false alarm rate, deletion rate, localization rate (Pcor), RMSE fine, RMSE fine+gross
16. Experimental Results
- 13 seminars, E1 segments
- N. of reference frames: 5788 (4014 s)
- Time step: 667 ms
17. x-coordinate output examples
Seminar 20041123-09
18. x-coordinate output examples
Seminar 20041124-09
19. x-coordinate output examples
Seminar 20041123-10
20. IRST speaker localization and tracking systems
Maurizio Omologo, with contributions by Alessio Brutti, Luca Cristoforetti, Piergiorgio Svaizer
ITC-irst, Povo, Trento, Italy
NIST Rich Transcription 2005 Evaluation Workshop, Edinburgh, July 13th, 2005
21. System description
- Two techniques:
  - 1a) use of two T-shaped arrays (B and D), with two pairs for the 2D (x-y) location
  - 1b) use of two pairs for the z-coordinate; directions derived by CSP (GCC-PHAT) TDOA analysis
  - 2) use of three T-shaped arrays and of the Global Coherence Field (GCF)
22. TDOA estimation based on microphone pairs and CSP (GCC-PHAT) analysis
- (see Knapp-Carter 1976; Omologo-Svaizer, ICASSP 1994-1996 and IEEE Trans. on SAP 1997; and U.S. Patent 5,465,302, October 1992)
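A minimal self-contained sketch of the CSP (GCC-PHAT) computation in its standard Knapp-Carter formulation (this is a sketch, not the IRST implementation):

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the TDOA between two microphone signals from the peak of
    the PHAT-weighted generalized cross-correlation (CSP) function."""
    n = len(sig) + len(ref)                 # zero-pad against circular wrap
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12                  # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:                 # e.g. 0.35 ms for a 12 cm pair
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(cc) - max_shift) / fs  # delay at the coherence peak
    return tau, cc
```

Restricting the search to the physically possible delays (max_tau = d/c, cf. slide 4) ties the peak to the coherence of the direct wavefront.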
23. IRST T-shaped microphone array
- Technique based on CSP analysis
- Use of four microphones (3 pairs)
- Accurate 3D speaker localization using few microphones
- Since 1999 it has been a commercial product (AETHRA, Italy)
24. Global Coherence Field
For a candidate sound source position (x, y, z):

$$\mathrm{GCF}(x,y,z) = \frac{1}{Q} \sum_{(i,k)} C_{ik}\big(\tau_{ik}(x,y,z)\big)$$

where Q is the number of sensors, C_{ik} is the coherence at a given microphone pair, and \tau_{ik}(x,y,z) is the time delay at pair (i,k) assuming that the source is in (x,y,z).
- 3D location is based on the TDOA of the vertical mic. pairs, once a 2D location has been derived by maximizing the GCF over all x-y coordinates.
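A grid-search sketch of the GCF, reusing the gcc_phat sketch from slide 22 (the grid construction and variable names are illustrative assumptions):

```python
import itertools
import numpy as np

def gcf_scores(signals, mic_pos, grid, fs, c=340.0):
    """Score each candidate source position by averaging the pairwise
    coherence functions, each sampled at the delay that the candidate
    position would produce at that microphone pair."""
    pairs = list(itertools.combinations(range(len(signals)), 2))
    ccs = {p: gcc_phat(signals[p[0]], signals[p[1]], fs)[1] for p in pairs}
    center = len(next(iter(ccs.values()))) // 2   # index of zero lag
    scores = np.empty(len(grid))
    for g, pt in enumerate(grid):
        acc = 0.0
        for (i, k), cc in ccs.items():
            # delay pair (i, k) would observe if the source were at pt
            # (assumes the grid stays within the delay range covered by cc)
            tau = (np.linalg.norm(pt - mic_pos[i])
                   - np.linalg.norm(pt - mic_pos[k])) / c
            acc += cc[center + int(round(tau * fs))]
        scores[g] = acc / len(pairs)
    return scores  # argmax over the grid gives the GCF location estimate
```

As on the slide, the 2D estimate comes from maximizing these scores over an x-y grid; the z-coordinate is then refined from the TDOAs of the vertical pairs.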
25. Recent results on UKA lectures
26. CSP analysis of a segment with two speakers (from Seminar_2003-11-25_A_4)
27. Conclusions
- This NIST evaluation has been very important to establish the evaluation approach introduced in CHIL during the last year, and to better understand the potential of the SLOC technologies under study
- Need to further improve the reference transcriptions
- Need to reduce the number of metrics: for instance, by combining the false alarm rate and the deletion rate into a single figure, by imposing the same external SAD, etc.
- Need to address a real multi-speaker lecture scenario:
  - much more challenging
  - new annotation tools are needed
- For meetings, different evaluation criteria may be necessary
- Person tracking based on audio-video fusion will also require other evaluation criteria