Title: Speaker Localization: introduction to system evaluation
1. Speaker Localization: introduction to system evaluation
- Maurizio Omologo
- with contributions by Alessio Brutti, Luca Cristoforetti, Piergiorgio Svaizer
- ITC-irst, Povo, Trento, Italy
NIST Rich Transcription 2005 Evaluation Workshop, Edinburgh, July 13th, 2005
2. Outline
- Acoustic source/Speaker LOCalization (SLOC) and tracking: general issues
- The localization problem in the lecture scenario of CHIL
- Evaluation criteria
- Software developed at IRST
- Experimental results
- Examples
- Description of IRST systems being evaluated
3. Speaker localization: general issues
- Problem definition: locate and track, in 2D or 3D, active speakers (one or more) in a given multi-speaker scenario
- 2D localization: the source is assumed to be located in the plane where the acoustic sensors are placed
- Forced assumption: the speaker, and any other acoustic source, are assumed to be point sources, in general slowly moving and emitting wide-band non-stationary signals. Radiation effects are neglected.
- Key technical aspect: the acoustic sensor signals are very different from each other. Their characteristics depend on the speaker position, room acoustics (reflections, reverberation), background noise, etc.
- Most common approach:
  - 0) detect an acoustic event (see the Speech Activity Detection problem)
  - 1) compute the Time Difference of Arrival (TDOA) at different microphone pairs
  - 2) derive a source position estimate from the geometry (a minimal sketch of this step follows below)
  - 3) apply possible constraints (e.g. to ignore locations outside the room)
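As an illustration of step 2, here is a minimal sketch assuming the TDOAs of step 1 are already available (e.g. from GCC-PHAT; see slide 22). The function name and the use of scipy are illustrative choices, not part of the systems described in these slides.

```python
import numpy as np
from scipy.optimize import least_squares

def locate_from_tdoas(mic_pairs, tdoas, x0, c=340.0):
    """Find the point whose predicted inter-microphone delays best match
    the measured TDOAs (nonlinear least squares).
    mic_pairs: list of (pos_i, pos_k) microphone coordinates in metres;
    tdoas: measured delays in seconds; x0: initial guess (e.g. room centre)."""
    def residuals(p):
        return [(np.linalg.norm(p - np.asarray(mi))
                 - np.linalg.norm(p - np.asarray(mk))) / c - tau
                for (mi, mk), tau in zip(mic_pairs, tdoas)]
    sol = least_squares(residuals, np.asarray(x0, dtype=float))
    # Step 3 would then reject or clip estimates falling outside the room.
    return sol.x
```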
4. Example of very near-field propagation
Distance between the microphones of a pair: 12 cm. Speed of sound: 340 m/s.
In general, the TDOA is computed on the basis of the coherence in the direct wavefront.
For the far field, plane-wave propagation can be assumed.
Animation courtesy of Dr. Dan Russell, Kettering
University
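As a worked illustration of these figures (not from the slide itself): with d = 12 cm and c = 340 m/s, the TDOA is bounded by |tau| <= d/c ≈ 0.35 ms, i.e. only a handful of samples at typical sampling rates (about 5.6 samples at 16 kHz). Under the far-field plane-wave assumption, the direction of arrival theta follows from tau = (d/c) sin(theta), i.e. theta = arcsin(c*tau/d).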
5. T-shaped arrays in the CHIL room at IRST
Speaker area
6. Speaker localization: general issues
- Sensory system characteristics:
  - number of microphones
  - sensitivity, spatial and spectral response of the microphones
  - position of each microphone in the room (a calibration step is needed)
- Information required for the evaluation:
  - reference time stamps
  - sample-level synchronous recordings of all the microphones, or offset information to time-align signals recorded by different acquisition platforms
  - offset information to time-align the audio and video recordings
  - ground-truth 3D labels for each active speaker: in general they are derived from a set of calibrated video cameras, a mouth-tracking algorithm, and a final manual check. In CHIL, the 3D labels of the lecturer are updated every 667 ms.
- The output: a sequence of time stamps + 3D labels
7. CHIL: speaker localization in lecture scenarios
- Common sensor set-up in the CHIL consortium:
  - 3 T-shaped mic. arrays, 2 close-talk mics, 1 MarkIII array
  - optional use of all the microphones available at a site
- UKA set-up:
  - 4 T-shaped arrays
  - one/two Countryman close-talk mics
  - one NIST MarkIII (IRST-Light version)
  - table-top mics
8. Evaluation Criteria
- Accurate localization (lecturer) vs rough localization (audience)
- Type of localization errors:
  - fine (error < 50 cm for the lecturer, 100 cm for the audience)
  - gross (otherwise)
9. Evaluation Criteria: fine and gross errors
[Figure: CHIL room map illustrating fine and gross errors, showing a fixed camera, a pan-tilt-zoom camera, the screen, the meeting table, and the NIST MarkIII / IRST-Light microphone array]
10. Evaluation Criteria
- Accurate localization (lecturer) vs rough localization (audience)
- Type of localization error:
  - fine (error < 50 cm for the lecturer, 100 cm for the audience)
  - gross (otherwise)
- Speech Activity Detection (external vs internal to the localization system):
  - false alarm rate
  - deletion rate
- Average frame rate (/s)
The fine/gross distinction represents the most relevant cue for SLOC accuracy evaluation.
11. SLOC error computation in a time interval
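The computation is presumably along the following lines (the matching rule stated here is an assumption; the thresholds are those given above): each reference 3D label at time t, p_ref(t), is paired with the system output p_hyp(t) falling within the same 667 ms interval; the localization error is the Euclidean distance e(t) = ||p_hyp(t) - p_ref(t)||, classified as fine if e(t) is below the lecturer/audience threshold and as gross otherwise.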
12. Evaluation Criteria
- Accurate localization (lecturer) vs rough localization (audience)
- Type of localization error:
  - fine (error < 50 cm for the lecturer, 100 cm for the audience)
  - gross (otherwise)
- Speech Activity Detection (external vs internal to the localization system):
  - false alarm rate
  - deletion rate
- Average frame rate (/s)
- Bias (fine and fine+gross)
- Localization precision: Pcor = N_FineErrors / N_Localizations
- When is this evaluation meaningful? For each analysis segment we need to know whether one or more acoustic sources (persons or noise) are active and, only in that case, an accurate set of x-y-z coordinates!
- Related evaluation software (developed at ITC-irst and available at the NIST and CHIL web sites); a sketch of the scoring logic follows below
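A minimal sketch of the per-frame scoring implied by these criteria, in Python (the actual tool is the C code described on the next slide, so the names and the exact handling of each case here are assumptions):

```python
import numpy as np

FINE_MM = {"lecturer": 500, "audience": 1000}  # fine/gross thresholds, in mm

def score_frame(ref, hyp, who="lecturer"):
    """Classify one analysis frame. ref/hyp are (x, y, z) in mm, or None
    when no speaker is active / no location was output by the system."""
    if ref is None:
        return "false_alarm" if hyp is not None else "no_speaker"
    if hyp is None:
        return "deletion"
    err = np.linalg.norm(np.asarray(hyp) - np.asarray(ref))
    return "fine" if err < FINE_MM[who] else "gross"

def pcor(labels):
    """Localization precision: fine localizations over all localizations."""
    n_loc = sum(l in ("fine", "gross") for l in labels)
    return sum(l == "fine" for l in labels) / n_loc if n_loc else 0.0
```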
13. Evaluation software
- It consists of two steps:
  - an XML converter, from the manual transcription file and the 3D label file to the reference file
  - C code to derive the results
- First step:

Transcription (input):

<Turn startTime="19.637" endTime="548.798" speaker="lecturer">
<Sync time="109.246"/> report some results at this
<Sync time="110.861"/> <Event desc="nc-s" type="noise" extent="next"/> level that is very preliminary
<Sync time="113.843"/> <Event desc="pap" type="noise" extent="instantaneous"/>
<Sync time="114.753"/> so uh starting from a brief introd <Event desc="()" type="lexical" extent="previous"/> introduction of what are the main problems we have to face with
<Sync time="125.269"/>

Reference (output):

116.910 1 1 lecturer 1904.90 4034.49 1546.38
117.576 1 1 lecturer 1669.07 4121.32 1571.96
118.243 1 1 lecturer 1371.42 4297.96 1585.04
118.910 1 1 lecturer 1339.13 4478.81 1564.58
119.575 1 1 lecturer 1225.10 4581.53 1574.26
120.243 1 1 lecturer 1065.48 4678.85 1562.49
120.908 1 1 lecturer 1116.63 4696.75 1569.60
121.575 1 1 lecturer 1294.30 4687.07 1566.99
122.242 1 1 lecturer 1369.35 4618.37 1523.87
122.908 1 1 lecturer 1369.59 4646.71 1579.83
128.908 1 0 audience 1165.48 4678.85 1562.49
128.908 1 0 audience 965.48 4228.85 1532.49
130.242 1 1 lecturer 1449.35 4638.37 1533.87
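A sketch of what the converter has to do on the transcription side (the element and attribute names are taken from the excerpt above; everything else, including the function name, is illustrative):

```python
import xml.etree.ElementTree as ET

def turns_from_transcription(path):
    """Collect speaker turns and their Sync time stamps from the
    transcription XML, ready to be joined with the 3D label file."""
    turns = []
    for turn in ET.parse(path).getroot().iter("Turn"):
        turns.append({
            "speaker": turn.get("speaker"),
            "start": float(turn.get("startTime")),
            "end": float(turn.get("endTime")),
            "syncs": [float(s.get("time")) for s in turn.iter("Sync")],
        })
    return turns
```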
14. Evaluation software
- Second step:

Evaluation software parameters:
reference seminar.ref
inputFile seminar.loc
evalOutput seminar.out
evalSummary seminar.sum
thresholdLecturer 500
thresholdAudience 1000
timestep 667

Localization output (time stamp, x, y, z):

116.75 1554.0 4190.2 1700
117.96 1403.5 4353.0 1700
118.05 1398.4 4353.1 1700
118.14 1355.9 4371.6 1700
118.24 1312.8 4374.5 1700
118.52 1216.0 4502.9 1700
123.62 1886.0 4475.2 1700
124.37 2037.3 1558.2 1700
124.46 2029.0 1540.4 1700
124.65 1993.0 1437.4 1700

Evaluation:

322.53  ND   Ignored (Multiple Speakers)
323.19  ND   False Alarm
323.86  ND   No Speaker
325.86  ND   No Speaker
326.52  ND   Deletion Lecturer
331.19  ND   Ignored (Multiple Speakers)
331.86  ND   Ignored (Multiple Speakers)
332.52  135  Fine Error Lecturer

Summary:

                              Lecturer       Audience        Overall
Pcor                          0.94           0.83            0.94
Bias fine (x,y,z) mm          (79,-3,-1)     (106,-241,-22)  (80,-7,-2)
Bias fine+gross (x,y,z) mm    (115,35,-3)    (177,-34,-12)   (116,34,-3)
RMSE fine mm                  236            377             238
RMSE fine+gross mm            532            579             532
Deletion rate                 0.35           0.81            0.37
False alarm rate              0.47
Loc. frames for error stats   402            6               408

N. output loc. frames = 2242, reference duration = 930.0 s, average frames/sec = 2.41, N. reference frames = 1283
15. NIST 2005 evaluation of SLOC systems
- Participants: IRST, TU, UKA
- Seminar segments:
  - 13 seminars, recorded on November 23rd 2004 and in January and February 2005 at Karlsruhe University
  - in this NIST evaluation, performance regarded only the lecturers
- Evaluation software parameters:
  - thresholds for fine and gross errors: 50 cm (lecturer), 100 cm (audience)
  - time step: 667 ms
- Evaluation summary metrics:
  - average frame rate, n. of loc. frames for statistics on the lecturer, false alarm rate, deletion rate, localization rate (Pcor), RMSE fine, RMSE fine+gross
16. Experimental Results
- 13 seminars, E1 segments
- N. of reference frames: 5788 (4014 s)
- Time step: 667 ms
17. x-coordinate output examples
Seminar 20041123-09
18. x-coordinate output examples
Seminar 20041124-09
19. x-coordinate output examples
Seminar 20041123-10
20. IRST speaker localization and tracking systems
Maurizio Omologo, with contributions by Alessio Brutti, Luca Cristoforetti, Piergiorgio Svaizer
ITC-irst, Povo, Trento, Italy
NIST Rich Transcription 2005 Evaluation Workshop, Edinburgh, July 13th, 2005
21. System description
- Two techniques:
  - 1a) use of two T-shaped arrays (B and D), with two pairs for the 2D (x-y) location
  - 1b) use of two pairs for the z-coordinate; directions derived by CSP (GCC-PHAT) TDOA analysis
  - 2) use of three T-shaped arrays and of the Global Coherence Field (GCF)
22. TDOA estimation based on microphone pairs and CSP (GCC-PHAT) analysis
- (see Knapp-Carter 1976; Omologo-Svaizer, ICASSP 1994-1996 and IEEE Trans. on SAP 1997; and U.S. Patent 5,465,302, October 1992)
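A minimal self-contained sketch of the CSP (GCC-PHAT) computation in its standard Knapp-Carter formulation (this is a sketch, not the IRST implementation):

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the TDOA between two microphone signals from the peak of
    the PHAT-weighted generalized cross-correlation (CSP) function."""
    n = len(sig) + len(ref)                 # zero-pad against circular wrap
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12                  # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:                 # e.g. 0.35 ms for a 12 cm pair
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(cc) - max_shift) / fs  # delay at the coherence peak
    return tau, cc
```

Restricting the search to the physically possible delays (max_tau = d/c, cf. slide 4) ties the peak to the coherence of the direct wavefront.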
23. IRST T-shaped microphone array
- Technique based on CSP analysis
- Use of four microphones (3 pairs)
- Accurate 3D speaker localization using few microphones
- Since 1999 it has been a commercial product (AETHRA, Italy)
24. Global Coherence Field
For a candidate sound source position (x, y, z):

$$\mathrm{GCF}(x,y,z) = \frac{1}{Q} \sum_{(i,k)} C_{ik}\big(\tau_{ik}(x,y,z)\big)$$

where Q is the number of sensors, C_{ik} is the coherence at a given microphone pair, and \tau_{ik}(x,y,z) is the time delay at pair (i,k) assuming that the source is in (x,y,z).
- 3D location is based on the TDOA of the vertical mic. pairs, once a 2D location has been derived by maximizing the GCF over all x-y coordinates.
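A grid-search sketch of the GCF, reusing the gcc_phat sketch from slide 22 (the grid construction and variable names are illustrative assumptions):

```python
import itertools
import numpy as np

def gcf_scores(signals, mic_pos, grid, fs, c=340.0):
    """Score each candidate source position by averaging the pairwise
    coherence functions, each sampled at the delay that the candidate
    position would produce at that microphone pair."""
    pairs = list(itertools.combinations(range(len(signals)), 2))
    ccs = {p: gcc_phat(signals[p[0]], signals[p[1]], fs)[1] for p in pairs}
    center = len(next(iter(ccs.values()))) // 2   # index of zero lag
    scores = np.empty(len(grid))
    for g, pt in enumerate(grid):
        acc = 0.0
        for (i, k), cc in ccs.items():
            # delay pair (i, k) would observe if the source were at pt
            # (assumes the grid stays within the delay range covered by cc)
            tau = (np.linalg.norm(pt - mic_pos[i])
                   - np.linalg.norm(pt - mic_pos[k])) / c
            acc += cc[center + int(round(tau * fs))]
        scores[g] = acc / len(pairs)
    return scores  # argmax over the grid gives the GCF location estimate
```

As on the slide, the 2D estimate comes from maximizing these scores over an x-y grid; the z-coordinate is then refined from the TDOAs of the vertical pairs.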
25. Recent results on UKA lectures
26. CSP analysis of a segment with two speakers (from Seminar_2003-11-25_A_4)
27. Conclusions
- This NIST evaluation has been very important to establish the evaluation approach introduced in CHIL during the last year, and to better understand the potential of the SLOC technologies under study
- Need to further improve the reference transcriptions
- Need to reduce the number of metrics: for instance, by combining the false alarm rate and the deletion rate into a single figure, by imposing the same external SAD, etc.
- Need to address a real multi-speaker lecture scenario:
  - much more challenging
  - new annotation tools are needed
- For meetings, different evaluation criteria may be necessary
- Person tracking based on audio-video fusion will also require other evaluation criteria