Title: Speaker Recognition Research in Joensuu
Slide 1: Speaker Recognition Research in Joensuu
Puheteknologian talviseminaari (Speech Technology Winter Seminar)
Pasi Fränti
Joensuu, 10.3.2006
- Speech and Image Processing Unit (SIPU)
http://cs.joensuu.fi/sipu/
Slide 2: Goals for PUMS season 3 (1/2)
- Usability of automatic speaker identification in forensic applications
- Compatibility with large databases
- Automation of LTAS fusion with MFCC
- Voice activity detection
Slide 3: Goals for PUMS season 3 (2/2)
- Speaker verification in a real (noisy) environment
- Prototype for access control
- Solving technical requirements for a prototype in an elevator
- Usability for detecting sound sources in general
- Keyword search (using HTK or Lingsoft Recognizer)
Slide 4: PUMS personnel
Pasi Fränti, Professor
Ilja Sidoroff
Marko Tuononen, BSc
Rosa Gonzalez-Hautamäki, MSc
Doctoral researchers
Collaborators
Juhani Saastamoinen, PhLic
Ismo Kärkkäinen, MSc
Ville Hautamäki, MSc
Tomi Kinnunen, PhD (Singapore)
Victoria Yanulevskaya
Evgeny Karpov, MSc (NRC)
Slide 5: 1. Applicability to forensic applications
- A study of automatic speaker recognition has been done.
- Results are not reported separately, but actions have been taken within tasks 3 and 4.
- Material can be found in Kinnunen's PhD thesis [4] and Niemi-Laitinen's presentation.
Slide 6: 2. Support for large databases
Slide 7: 3. LTAS and other features
- Automatic calculation of LTAS done. Integration into WinSprofiler in progress. Reporting in progress.
- The benefit of LTAS is merely its speed and ease of use: no difficult control parameters.
- No additional benefit to recognition accuracy; MFCC includes the same information.
- Could be used for preliminary pruning in case of large datasets.
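The LTAS bullets above can be illustrated with a minimal sketch. It assumes that LTAS is computed as the mean power spectrum over windowed frames and that pruning keeps the speakers with the smallest mean squared error to the test LTAS. The frame length, hop size, function names and the `keep` parameter are illustrative assumptions, not taken from WinSprofiler or SRLIB.

```python
import numpy as np

def ltas(signal, frame_len=512, hop=256):
    """Long-term average spectrum: mean power spectrum over all frames, in dB."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10.0 * np.log10(power.mean(axis=0) + 1e-12)

def prune_candidates(test_ltas, speaker_ltas, keep=2):
    """Return the `keep` speaker names whose LTAS is closest (MSE) to the test LTAS."""
    dist = {name: float(np.mean((test_ltas - ref) ** 2))
            for name, ref in speaker_ltas.items()}
    return sorted(dist, key=dist.get)[:keep]
```

Only the speakers that survive the pruning would then be scored with the slower MFCC-based matcher, which is where the speed advantage of LTAS would matter for a large database.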
Slide 8: Noise robustness of F0 feature
Results reported in [3, 5]
Slide 9: 4. Voice activity detection
- Software for speech segmentation (VoiceGrep).
- Command line version for Linux.
- Windows version in WinSprofiler.
- Testing done in SIPU laboratory.
- Labtec PC Mic 333, 44.1 kHz
- Recordings were amplified by 24 dB with the Audacity audio editor
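The slides do not describe how VoiceGrep decides between speech and non-speech. As a generic illustration only, the sketch below shows a simple frame-energy threshold detector, plus the conversion of a 24 dB amplitude gain to a linear factor. The function names, frame length and threshold are assumptions.

```python
import math

def db_to_gain(db):
    """Convert an amplitude gain in dB to a linear factor."""
    return 10.0 ** (db / 20.0)

def energy_vad(samples, frame_len, threshold_db):
    """Mark each non-overlapping frame as speech (True) when its RMS level,
    in dB relative to full scale 1.0, exceeds threshold_db."""
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        level_db = 20.0 * math.log10(max(rms, 1e-10))
        labels.append(level_db > threshold_db)
    return labels

# A 24 dB boost multiplies the amplitude by roughly 15.8.
gain = db_to_gain(24.0)
```

A pure energy threshold reacts to any loud sound, so sources such as doors or running water would also trigger it; a practical detector needs features beyond energy.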
Slide 10: 4a. Test material and results
- Material
- 4 hours in total.
- Bad quality recordings: 11 bits of data, of which 4-5 bits are information and the rest noise.
- VoiceGrep made 168 detections
- 56 speech (33 %)
- 112 non-speech (67 %)
- Material included 71 real speech segments
- Average segment length 16 s.
- VoiceGrep found 25 of these (35 %)
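The percentages above follow directly from the reported counts. The short calculation below reproduces them; calling the two ratios "precision" and "recall" is an interpretation added here, not terminology used in the slides.

```python
detections = 168          # all VoiceGrep detections
speech_detections = 56    # detections that were actually speech
true_segments = 71        # real speech segments in the material
found_segments = 25       # real segments that VoiceGrep found

precision = speech_detections / detections    # share of detections that are speech
false_share = 1.0 - precision                 # share of detections that are non-speech
recall = found_segments / true_segments       # share of real segments that were found

print(f"speech detections:     {100 * precision:.1f} %")    # 33.3 %
print(f"non-speech detections: {100 * false_share:.1f} %")  # 66.7 %
print(f"segments found:        {100 * recall:.1f} %")       # 35.2 %
```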
Slide 11: 4b. VoiceGrep overall results
Slide 12: 4c. VoiceGrep example (correct detection)
End of the speech is missed
Start of the speech is detected correctly
Play sample 1
Slide 13: 4d. VoiceGrep example (false detections)
Door opening
Running water
Walking
Door
Play sample 2
Play sample 3
Slide 14: 4e. VoiceGrep example (missed speech segment)
Door
Door
Speech and walking
Play sample 4
Slide 15: 4f. Entire data set (4 hours)
Data
Speech segments
Result of VoiceGrep
Slide 16: 5. Speaker verification in noisy environment
- Systematic testing of the effective parameters has been reported in [1].
- Applicability of speaker verification in a real environment has been reported in [2] and in Kinnunen's PhD thesis [5].
- Additional testing will be done if time permits.
Slide 17: 5a. Text-dependent verification in access control
- Utilizing time series information improves recognition.
- Best result if everyone has their own password.
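The slides do not say how the time series information is used. One common way to exploit the temporal order of feature vectors in text-dependent verification is dynamic time warping (DTW), sketched below as an assumption. A lower distance between the enrolled password utterance and the test utterance means a better match.

```python
import math

def dtw_distance(a, b):
    """DTW distance between two sequences of feature vectors (lists of floats).

    Unlike an order-free average distance, the alignment must follow the
    time order of both sequences, so the same sounds in a different order
    give a larger distance.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # local Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch sequence b
                                 cost[i][j - 1],      # stretch sequence a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

With this kind of matching, two users with different passwords are separated both by voice and by what they say, which is consistent with the observation that individual passwords give the best result.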
Slide 18: 6. Prototype for access control
Emergency button
Microphone
Motion detector
Slide 19: 7. Calling elevator (technical requirements)
- Communication with the OPC server
- Implemented with the Matrikon server.
- Program logic for the elevator implemented
- Reads variables from the OPC server.
- Interprets and shows elevator status.
- Includes recording logic.
- Speaker and voice related functionality
- Not yet implemented.
- Main window does not show anything yet.
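The program logic listed above (read variables, interpret status, decide when to record) might look roughly like the sketch below. It uses a plain dictionary as a stand-in for values read from the OPC server. All tag names and the recording rule are hypothetical and are not taken from the actual prototype or from any OPC library.

```python
def interpret_status(tags):
    """Turn raw variable values into a readable elevator status line.

    `tags` is a dict standing in for variables read from the OPC server;
    the keys used here are hypothetical.
    """
    door = "open" if tags["Lift.DoorOpen"] else "closed"
    state = "moving" if tags["Lift.Moving"] else "idle"
    return f"Floor {tags['Lift.Floor']}, door {door}, {state}"

def should_record(tags):
    """Hypothetical recording rule: record when a person approaches
    and the car is not moving."""
    return bool(tags["Lift.ApproachDetected"]) and not tags["Lift.Moving"]

# Example snapshot of values as they might arrive from the server.
snapshot = {"Lift.Floor": 2, "Lift.DoorOpen": False,
            "Lift.Moving": False, "Lift.ApproachDetected": True}
status = interpret_status(snapshot)
```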
Slide 20: 8. Usability for detecting sound sources in general
Slide 21: 9. Keyword search
Slide 22: Publications (season 3)
- J. Saastamoinen, Z. Fiedler, T. Kinnunen and P. Fränti, "On factors affecting MFCC-based speaker recognition accuracy", Int. Conf. on Speech and Computer (SPECOM'05), Patras, Greece, 503-506, October 2005.
- H. Gupta, V. Hautamäki, T. Kinnunen and P. Fränti, "Field evaluation of text-dependent speaker recognition in an access control application", Int. Conf. on Speech and Computer (SPECOM'05), Patras, Greece, 551-554, October 2005.
- T. Kinnunen, R. Gonzalez-Hautamäki, "Long-Term F0 Modeling for Text-Independent Speaker Recognition", Int. Conf. on Speech and Computer (SPECOM'05), Patras, Greece, 567-570, October 2005.
Slide 23: Theses (season 3)
- T. Kinnunen, "Optimizing Spectral Feature Based Text Independent Speaker Recognition", PhD thesis, University of Joensuu, June 2005.
- R. Gonzalez-Hautamäki, "Fundamental Frequency Estimation and Modeling for Speaker Recognition", MSc thesis, University of Joensuu, July 2005.
Slide 24: Application scenarios
Speaker Recognition
- Speaker Identification: Whose voice is this?
- Speaker Verification: Is this Bob's voice? (Claim)
[Figure: identification and verification examples; a rejected claim is labeled "Imposter!"]
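The difference between the two scenarios can be shown in a few lines. The sketch below assumes each speaker is modeled by a small VQ codebook and scored by average squared distortion (VQ and MSE are listed among the SRLIB components elsewhere in the slides). The function names, toy data and threshold are illustrative assumptions.

```python
def avg_distortion(features, codebook):
    """Average squared distance from each feature vector to its nearest code vector."""
    total = 0.0
    for x in features:
        total += min(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in codebook)
    return total / len(features)

def identify(features, models):
    """Identification: whose voice is this? Return the best-matching speaker."""
    return min(models, key=lambda name: avg_distortion(features, models[name]))

def verify(features, models, claimed, threshold):
    """Verification: is this the claimed speaker? Accept if distortion is low enough."""
    return avg_distortion(features, models[claimed]) <= threshold

models = {"Bob": [[0.0, 0.0], [1.0, 1.0]], "Minna": [[5.0, 5.0], [6.0, 6.0]]}
test_features = [[0.1, 0.0], [0.9, 1.0]]
```

Identification compares the test features against every enrolled model, while verification compares against the claimed model only and rejects the claim as an impostor attempt when the distortion exceeds the threshold.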
Slide 25: Software 1: Console program
Slide 26: Software 2: WinSprofiler
Slide 27: Software 3: Symbian
Port to Symbian OS with Series 60 UI platform
Slide 28: Software 4: Door SProfiler
Opening laboratory door by speaking
Slide 29: Software 5: Lift SProfiler (to appear in season 4, perhaps)
Slide 30: Future development (1)
Software integration
Keyword search
WinSprofiler: Windows (JoY)
Mobile: Series 60 (JoY)
DB support
SRLIB
VAD
MSE
F0 extraction
Fusion by weighted MSE
GMM
VQ
MFCC
LTAS
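The slides do not define "fusion by weighted MSE". One possible reading, sketched below purely as an assumption, is that the mean squared errors computed separately for each feature type (for example MFCC, LTAS and F0) are combined into a single score by a normalized weighted sum. The function names, feature keys and weights are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def fused_score(test, model, weights):
    """Weighted average of per-feature MSE values.

    `test` and `model` map feature names to vectors, and `weights` maps the
    same names to non-negative weights. A lower score means a closer match.
    """
    total_weight = sum(weights.values())
    return sum(w * mse(test[name], model[name])
               for name, w in weights.items()) / total_weight
```

The weights would have to be tuned on development data so that the more reliable feature dominates the fused score.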
Slide 31: Future development (2)
Applications
Call center
Forensic applications
Calling elevator
Speech analyzer tool
Access control
common speaker recognition app. interface
Verification
Classifier fusion
Segmentation
Keyword search
srlib
VAD
DB
Slide 32: Future development (3)
Technical development
- Implement and integrate F0, maybe also formants (F1, F2).
- Automatic voiced/unvoiced segmentation.
- User enrollment.
- Use of sequence information (triplets).
- Development of the WinSprofiler software in the direction of a voice profiler and speech analyzer tool!
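As an illustration of the F0 item above, the sketch below estimates the fundamental frequency of one voiced frame from the peak of its autocorrelation. This is a textbook method chosen for brevity, not necessarily the one planned for SRLIB; the function name and the search range of 60 to 400 Hz are assumptions.

```python
import numpy as np

def estimate_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate F0 in Hz from the autocorrelation peak of one voiced frame."""
    frame = frame - np.mean(frame)
    # Keep non-negative lags only: ac[k] is the autocorrelation at lag k.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / fmax)                        # shortest period searched
    lag_max = min(int(fs / fmin), len(ac) - 1)      # longest period searched
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return fs / lag
```

The height of the autocorrelation peak relative to the value at lag zero could also serve as a simple cue for voiced/unvoiced segmentation, since unvoiced frames show no strong periodic peak.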
Slide 33: Future development (4)
Machine room
Lift car hardware
CAN
GW box
Ethernet TCP/IP
Display
Microphone
Our PC
Approach detection
OPC server
SRLIB 3.0
DCOM
Elevator prototype
OPC client
LiftCaller
Slide 34: Vision 1: Teleconferencing
Speaker Recognition
Unknown
Minna
Bob
Slide 35: Vision 2: Call center
- Speech is the main tool for people in a call center
- Voice login of personnel
- Removes the need for manual entry
Slide 36: Vision 3: Language recognition
- A problem related to speaker recognition: the same research groups usually study both problems.
- Not trivial to solve.
- Studied a lot for Asian languages, even for rare languages that do not have any written form.
Slide 37: Vision 4: Medical applications
- Doctors use voice to record summaries of patient meetings.
- Access by keyword search.
- Annotation.
- Authentication of speaker.
Slide 38: Thank you for your patience!