Title: Review of ICASSP 2004
1. Review of ICASSP 2004
2. Part I of this presentation (6 pages)
- Pointers to ICASSP 2004 (2 pages)
- NIST Meeting Transcription Workshop (2 pages)
3. Session Summary
- Speech Processing Sessions (SP-L1 to L11, SP-P1 to P16)
  - Many people attended this year, because last year's conference in Hong Kong was affected by SARS.
  - Speech/speaker recognition, TTS/voice morphing, speech coding
- Signal Processing Sessions (SAM, SPTM, AE-P6)
- Image Processing Sessions (IMDSP)
- Machine Learning Sessions (MLSP)
- Multimedia Processing Sessions (MSP)
- Applications (ITT)
4. Quick Speech Paper Pointer
- Acoustic Modeling and Adaptation (SP-P2, SP-P3, SP-P14)
- Noisy Speech Processing/Recognition (SP-P6, SP-P13)
- Language Modeling (SP-L11)
- Speech Processing in the Meeting Domain
  - RT-04 Rich Transcription in the meeting domain; the handbook can be obtained from Arthur.
- Speech Applications/Systems (ITT-P2, MSP-P1, MSP-P2)
- Speech Understanding (SP-P4)
- Feature Analysis (SP-P6, SP-L6)
- Voice Morphing (SP-L1)
- TTS
5. Meeting Transcription Workshop
- Message: meeting transcription is hard.
- Problems in core technology
  - Cross-talk causes a lot of trouble for speech recognition and speaker segmentation.
- Problems in evaluation
  - Cross-talk causes a lot of trouble in string evaluation.
- Problems in resource creation
  - Transcription becomes very hard.
  - Tools are not yet available.
6. Speech Recognition
- Big challenge for speech recognition
  - 65% average error rate, even using state-of-the-art technology:
    - Acoustic modeling and language modeling
    - Speaker adaptation
    - Discriminative training
    - Signal processing using multiple distant microphones
- Observations
  - Speech recognition becomes poorer when there are more speakers.
  - Multiple distant microphones are a big win; microphone arrays may also be.
7. End of Part I
- At the Jun 18, 2004 meeting, Jim asked why false alarms (FA) are counted.
  - Q: Is it reasonable to give the same weighting to FA as to Missing Speaker and Wrong Speaker?
8. Part II
- More on Diarization Error Measurement (7 pages)
- Is the current DER reasonable?
- Lightly Supervised Training (6 pages)
9. More on Diarization Error Measurement (7 pages)
- Goals
  - Discover how many persons are involved in the conversation.
  - Assign each speech segment to a particular speaker.
  - Usually assumes no prior knowledge of the speakers.
- Applications
  - Unsupervised speaker adaptation
  - Automatic archiving and indexing of acoustic data
10. Usual Procedure of Speaker Diarization
- 1. Speaker Segmentation
  - Segment an N-speaker audio document into segments, each believed to be spoken by a single speaker.
- 2. Speaker Clustering
  - Assign segments to hypothesized speakers (a minimal sketch of the two-stage pipeline follows below).
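A minimal sketch of this two-stage pipeline, assuming frame-level feature vectors are already extracted. The windowed mean-distance change detector and all thresholds are illustrative stand-ins for the BIC-style criteria and GMMs used in real diarization systems.

```python
# Minimal sketch of the two-stage pipeline above (illustrative thresholds).
import numpy as np

def segment(features, win=100, threshold=2.0):
    """Stage 1: hypothesize a speaker change wherever the mean feature
    vectors of two adjacent windows differ strongly."""
    changes = [0]
    for t in range(win, len(features) - win + 1, win):
        left_mean = features[t - win:t].mean(axis=0)
        right_mean = features[t:t + win].mean(axis=0)
        if np.linalg.norm(left_mean - right_mean) > threshold:
            changes.append(t)
    changes.append(len(features))
    return list(zip(changes, changes[1:]))

def cluster(features, segments, merge_threshold=1.0):
    """Stage 2: agglomerative clustering of segments, merging any two
    whose mean feature vectors are close enough."""
    means = [features[s:e].mean(axis=0) for s, e in segments]
    labels = list(range(len(segments)))      # each segment starts as its own speaker
    merged = True
    while merged:
        merged = False
        for i in range(len(means)):
            for j in range(i + 1, len(means)):
                if labels[i] != labels[j] and \
                        np.linalg.norm(means[i] - means[j]) < merge_threshold:
                    labels = [labels[i] if l == labels[j] else l for l in labels]
                    merged = True
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic "speakers" taking turns (100 frames of 20-dim features per turn).
    spk_a, spk_b = rng.normal(0, 1, 20), rng.normal(3, 1, 20)
    feats = np.vstack([rng.normal(m, 0.3, (100, 20)) for m in (spk_a, spk_b, spk_a)])
    segs = segment(feats)
    print(list(zip(segs, cluster(feats, segs))))   # first and third segments share a label
```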
11. Diarization Process
(Figure: reference speaker turns (Ref_Spk1, Ref_Spk2) on the "Ref" track aligned against system output (Hyp_Spk1, Hyp_Spk2) on the "Sys" track, with regions marked as False Alarm, Miss, and Speaker Error.)
12. Definition of Diarization Error
- A rough segmentation is first provided as the reference.
- Another stage of acoustic segmentation is also applied to that segmentation.
- Definition:
  DER = [ Σ_s dur(s) · ( max(N_ref(s), N_hyp(s)) − N_correct(s) ) ] / [ Σ_s dur(s) · N_ref(s) ]
  where, for each segment s:
  - dur(s): duration of the segment
  - N_ref(s): number of speakers in the reference
  - N_hyp(s): number of speakers provided by the system
  - N_correct(s): number of speakers in the reference that are hypothesized correctly by the system
13. Breakdown into Three Types of Errors
- Speaker Error time: sum of segment time attributed to the wrong speaker.
- Missed Speaker time: sum of segments where there are more reference speakers than system speakers.
- False Alarm time: sum of segments where there are more system speakers than reference speakers.
  (A minimal scoring sketch follows below.)
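A minimal sketch of this scoring, under two simplifying assumptions that are mine, not the slides': the hypothesis speaker labels are already mapped to the reference labels (NIST's md-eval scorer finds that mapping itself), and the 250 ms collar discussed later is ignored.

```python
# Minimal DER sketch (mapped labels assumed, no collar).
from collections import namedtuple

Turn = namedtuple("Turn", "start end speaker")   # times in seconds

def der(ref, hyp):
    """Return (DER, per-error-type times) for mapped ref/hyp speaker turns."""
    bounds = sorted({t for turn in ref + hyp for t in (turn.start, turn.end)})
    miss = fa = spk_err = ref_time = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        dur = hi - lo
        r = {t.speaker for t in ref if t.start <= lo and t.end >= hi}
        h = {t.speaker for t in hyp if t.start <= lo and t.end >= hi}
        miss    += dur * max(0, len(r) - len(h))             # more ref than sys speakers
        fa      += dur * max(0, len(h) - len(r))             # more sys than ref speakers
        spk_err += dur * (min(len(r), len(h)) - len(r & h))  # wrongly attributed time
        ref_time += dur * len(r)
    return (miss + fa + spk_err) / ref_time, {"miss": miss, "fa": fa, "spk_err": spk_err}

if __name__ == "__main__":
    ref = [Turn(0.0, 10.0, "A"), Turn(10.0, 20.0, "B")]
    hyp = [Turn(0.0, 12.0, "A"), Turn(12.0, 18.0, "B")]
    print(der(ref, hyp))   # speaker error on 10-12 s, miss on 18-20 s -> DER = 4/20 = 0.2
```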
14. Re Jim: Possible Extension of the Measure
- The current measure is weighted only by the amount of time in error; every error type counts equally.
- A possible way to extend the definition is sketched below.
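The slides do not spell the extension out, so the following is only one illustrative reading of Jim's question: give each error type its own weight, so that false-alarm time need not count as much as missed or wrongly attributed speaker time.

```latex
% Illustrative only: separate weights per error type;
% alpha = beta = gamma = 1 recovers the current DER.
\mathrm{DER}_w \;=\;
  \frac{\alpha\, T_{\mathrm{FA}} \;+\; \beta\, T_{\mathrm{Miss}} \;+\; \gamma\, T_{\mathrm{SpkErr}}}
       {\sum_{s} \mathrm{dur}(s)\, N_{\mathrm{ref}}(s)}
```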
15. Other Practical Concerns in Measuring DER
- In the NIST evaluation guidelines:
  - Only a rough segmentation is provided at the beginning.
  - A 250 ms time collar is applied in the evaluation.
  - Breaks of less than 0.3 s within a speaker's turn do not count (a sketch of both conventions follows below).
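A minimal sketch of the two conventions just listed; the NIST md-eval scorer's exact interval handling differs in detail, so treat this as an illustration only.

```python
# Illustrative collar and minimum-gap handling (not md-eval's exact behaviour).
COLLAR = 0.25    # seconds excluded around every reference boundary
MIN_GAP = 0.3    # pauses shorter than this within one speaker are bridged

def merge_short_gaps(turns, min_gap=MIN_GAP):
    """Join consecutive (start, end, speaker) turns of the same speaker
    that are separated by less than min_gap seconds."""
    merged = []
    for start, end, spk in sorted(turns):
        if merged and spk == merged[-1][2] and start - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], end, spk)
        else:
            merged.append((start, end, spk))
    return merged

def collar_regions(ref_turns, collar=COLLAR):
    """Intervals excluded from scoring: +/- collar around every
    reference segment boundary."""
    excluded = []
    for start, end, _ in ref_turns:
        excluded += [(start - collar, start + collar), (end - collar, end + collar)]
    return excluded

if __name__ == "__main__":
    ref = [(0.0, 4.1, "A"), (4.3, 8.0, "A"), (8.5, 12.0, "B")]
    print(merge_short_gaps(ref))                   # A's 0.2 s pause is bridged
    print(collar_regions(merge_short_gaps(ref)))   # intervals a scorer would skip
```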
16. My Conclusion
- Weaknesses of the current measure:
  - Because of false alarms, DER can be larger than 100% (see the worked example after this list).
  - But most systems perform much better than that.
  - Constraints are also provided to make the measure reasonable.
- Also, as with WER:
  - It is pretty hard to decide how to weigh deletion and insertion errors.
- So:
  - The current measure is imperfect.
  - However, it might be possible to extend it to be more reasonable.
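A worked example with made-up numbers showing how false alarms can push DER past 100%: suppose the reference contains only 10 s of speech, but the system also claims 15 s where nobody speaks.

```latex
% Made-up numbers: 10 s of reference speech, 15 s of false-alarm time.
\mathrm{DER} \;=\; \frac{T_{\mathrm{FA}} + T_{\mathrm{Miss}} + T_{\mathrm{SpkErr}}}{T_{\mathrm{ref}}}
             \;=\; \frac{15 + 0 + 0}{10} \;=\; 150\%
```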
17. Further References
- Spring 2004 (RT-04S) Rich Transcription Meeting Recognition Plan, http://www.nist.gov/speech/tests/rt/rt2004/spring/documents/rt04s-meeting-eval-plan-v1.pdf
- "Speaker Segmentation and Clustering in Meetings" by Qin Jin et al.
  - Can be found in the RT 2004 Spring Meeting Recognition Workshop.
18. Lightly Supervised Training (6 pages)
- Light supervision in acoustic model training
  - > 1000 hours of training (by BBN) using the TDT (Topic Detection and Tracking) corpus
- The corpus (1400 hours in total)
  - Contains news from ABC/CNN (TDT2), MSNBC and NBC (TDT3 and TDT4)
- Lightly supervised training uses only the closed-caption transcription, not human transcription.
- Decoding as a second opinion
- Adapted results: baseline (Hub4) WER 12.7
  → + TDT4: 12.0 → + TDT2: 11.6, + TDT3: 10.9
  → with MMIE: 10.5
19. How Does It Work?
- Requires a very strict automatic selection criterion.
- What kills the recognizer is insertion and deletion of phrases:
  - CC: "The republican leadership council is going to air ads promoting Ralph Nadar"
  - Actual: "The republican leadership council, a moderate group, is going to air ads the Green Party candidate, Ralph Nadar."
  - → Corrupt phoneme alignments.
20. (No transcript)
21. Pointing Out the Errors: Biased LM for Lightly Supervised Decoding
- Instead of using a standard LM, use an LM biased toward the closed captions (CC).
- Argument: a good recognizer can figure out whether there is an error.
  - However, it is not easy to know automatically that there is an error.
- A strongly biased LM gives low WER wherever the CC is correct, so it can point out errors better.
  - However, a strongly biased LM also makes the recognizer repeat the same errors as the CC, since it biases the recognizer toward the CC.
- Authors: the art is to bias the LM in such a way that the recognizer can confirm correct words and point out the errors (a small interpolation sketch follows below).
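A minimal sketch of one way such a bias could be built: interpolate a small LM estimated from the CC text with a general background LM. The unigram form, the weight lambda, and the toy texts are all illustrative assumptions, not the paper's actual recipe.

```python
# Illustrative CC-biased LM via linear interpolation with a background LM.
from collections import Counter

def unigram_lm(text):
    """Relative-frequency unigram LM estimated from a text string."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def biased_prob(word, cc_lm, background_lm, lam=0.8, floor=1e-6):
    """P(w) = lam * P_cc(w) + (1 - lam) * P_bg(w).
    A larger lam pushes the recognizer to follow the closed captions."""
    return lam * cc_lm.get(word, floor) + (1 - lam) * background_lm.get(word, floor)

if __name__ == "__main__":
    cc_lm = unigram_lm("the republican leadership council is going to air ads")
    bg_lm = unigram_lm("a moderate group is going to air ads for the candidate")
    for w in ("republican", "moderate", "candidate"):
        print(w, round(biased_prob(w, cc_lm, bg_lm), 4))
```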
22. Selection of Sentences for Lightly Supervised Decoding
- Lightly supervised decoding
  - Use a 10xRT decoder to run through 1400 hours of speech (about 1.5 years on a single-processor machine).
  - Authors: it takes some time to run.
- Selection
  - Only choose the files with 3 or more contiguous words correct, or files with no error (a sketch of this rule follows below).
  - Only 50% of the data is selected (around 700 hours).
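A minimal sketch of such a selection rule, keeping only regions where the recognizer output agrees with the closed caption for at least three contiguous words. The real selection works on aligned word sequences with timestamps; the simple positional match below is an assumption made for illustration.

```python
# Illustrative "3 or more contiguous correct words" selection rule.
def matched_runs(cc_words, asr_words, min_run=3):
    """Return (start, end) index ranges where the two word lists agree
    for at least min_run consecutive positions."""
    runs, start = [], None
    n = min(len(cc_words), len(asr_words))
    for i in range(n + 1):
        same = i < n and cc_words[i].lower() == asr_words[i].lower()
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_run:
                runs.append((start, i))
            start = None
    return runs

if __name__ == "__main__":
    cc  = "the republican leadership council is going to air ads".split()
    asr = "the republican leadership council a moderate group is going to".split()
    print(matched_runs(cc, asr))   # [(0, 4)]: only the first four words line up
```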
23. Model Scalability and Conclusion
- Hours: 141 h → 843 h
- Speakers: 7k → 31k
- Codebooks: 6k → 34k
- Gaussians: 164k → 983k
24. Conclusion and Discussion
- A new challenge for speech recognition
- Are we using the right method in this task?
- Is increasing the number of parameters correct?
- Will more complex models (n-phones, n-grams) work better in cases with > 1000 hours of data?
25. Related Work at ICASSP 2004
- Lightly supervised acoustic model training using consensus networks (LIMSI, on TDT4 Mandarin)
- Improving broadcast news transcription by lightly supervised discriminative training (very similar work, by Cambridge)
  - Uses a faster decoder (5xRT).
  - Discriminative training is the main theme.