Title: Toward Spontaneous Speech Recognition and Understanding
1. Toward Spontaneous Speech Recognition and Understanding
Sadaoki Furui
- Tokyo Institute of Technology
- Department of Computer Science
- 2-12-1, O-okayama, Meguro-ku, Tokyo, 152-8552
Japan - Tel/Fax: +81-3-5734-3480
- furui_at_cs.titech.ac.jp
- http://www.furui.cs.titech.ac.jp/
2. Outline
- Spontaneous speech corpora
- Robust speech recognition
- Spontaneous speech recognition
- Speech understanding
- Speech summarization
3. Speech recognition technology
(Figure: application areas arranged by task complexity and dialogue naturalness: voice commands, digit strings, name dialing, form fill by voice, directory assistance, office dictation, system-driven dialogue, word spotting, network agent / intelligent messaging, transcription, 2-way dialogue, natural conversation)
5. Difficulties in (spontaneous) speech recognition
- Lack of systematic understanding of variability
  - Structural or functional variability
  - Parametric variability
- Lack of complete structural representations of (spontaneous) speech
- Lack of data for understanding non-structural variability
6. Outline
- Spontaneous speech corpora
- Robust speech recognition
- Spontaneous speech recognition
- Speech understanding
- Speech summarization
7. Spontaneous speech corpora
- Spontaneous speech variations: extraneous words, out-of-vocabulary words, ungrammatical sentences, disfluency, partial words, repairs, hesitations, repetitions, style shifting, etc.
- "There's no data like more data": a large, structured collection of speech is essential.
- How to collect natural data?
- Labeling and annotation of spontaneous speech is difficult: how do we annotate the variations, how do phonetic transcribers reach a consensus when there is ambiguity, and how do we represent a semantic notion?
8. Spontaneous speech corpora (cont.)
- How to ensure the corpus quality?
- Research in automating or creating tools to assist the verification procedure is in itself an interesting subject.
- Task dependency: it is desirable to design a task-independent data set and an adaptation method for new domains → benefit of reduced application development cost.
9. Main database characteristics (Becchetti & Ricotti)
10. Further database characteristics (Becchetti & Ricotti)
11. Overview of the Science and Technology Agency Priority Program "Spontaneous Speech Corpus and Processing Technology"
12. Overall design of the Corpus of Spontaneous Japanese (CSJ)
(Figure: the CSJ consists of 7M words of spontaneous monologue (digitized speech with transcription, POS and speaker information) used for speech recognition, plus a 500k-word Core manually tagged with segmental and prosodic information, used for training a morphological analysis and POS tagging program.)
13. Outline
- Spontaneous speech corpora
- Robust speech recognition
- Spontaneous speech recognition
- Speech understanding
- Speech summarization
14. Main causes of acoustic variation in speech
15. Robust speech recognition
- Robust against voice variation due to individuality, the physical and psychological condition of the speaker, telephone sets, microphones, network characteristics, additive background noise, speaking styles, etc.
- Few restrictions on tasks and vocabulary
- Essential to develop automatic adaptation techniques
- Unsupervised, on-line, incremental adaptation is ideal: the system works as if it were a speaker/task-independent system, and it performs increasingly better as it is used.
16. Mismatch between training and testing (C.-H. Lee)
17. Principle of adaptive noise canceling
n2(n) is adaptively filtered to produce an estimate of n1(n), which is subtracted from the primary input y(n) to give the enhanced signal.
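As a concrete illustration of this principle, here is a minimal LMS-based sketch (the filter length and step size are illustrative assumptions, not values from the slide):

```python
import numpy as np

def lms_noise_canceller(primary, reference, num_taps=32, mu=0.01):
    """Adaptive noise cancelling with the LMS algorithm.

    primary   : y(n) = s(n) + n1(n), speech plus noise
    reference : n2(n), a noise signal correlated with n1(n)
    Returns the enhanced signal e(n) = y(n) - n1_hat(n).
    """
    w = np.zeros(num_taps)                    # adaptive filter weights
    enhanced = np.zeros(len(primary))
    for n in range(num_taps, len(primary)):
        x = reference[n - num_taps:n][::-1]   # most recent reference samples
        n1_hat = np.dot(w, x)                 # estimate of n1(n)
        e = primary[n] - n1_hat               # enhanced output = error signal
        w += 2 * mu * e * x                   # LMS weight update
        enhanced[n] = e
    return enhanced
```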
18. Principle of a spectral mapping method from noisy speech to clean speech
19. Environment adaptation by spectral transformation
20. HMM decomposition (after Varga and Moore)
21. HMM composition process for creating a noisy-speech HMM as a product of two source HMMs
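A minimal sketch of the composition idea, assuming the models are represented by log-spectral-domain Gaussian mean vectors (a log-add approximation in the style of parallel model combination; transition probabilities and variances are omitted, so this is not the exact procedure of the figure):

```python
import numpy as np
from itertools import product

def compose_noisy_hmm_means(speech_means, noise_means):
    """Compose noisy-speech HMM states as the product of the states of
    a speech HMM and a noise HMM (log-add approximation).

    speech_means : (S, D) log-spectral mean vectors of the speech HMM
    noise_means  : (N, D) log-spectral mean vectors of the noise HMM
    Returns an (S*N, D) array of composed state means.
    """
    composed = []
    for mu_s, mu_n in product(speech_means, noise_means):
        # Speech and noise add in the linear-spectral domain, so the
        # composed log-spectral mean is log(exp(mu_s) + exp(mu_n)).
        composed.append(np.logaddexp(mu_s, mu_n))
    return np.array(composed)
```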
22. Structure of the neural network for HMM noise adaptation
23. Speaker-independent recognition is limited
- Phone models are too broad, so phones can overlap.
- Individual speaker characteristics are not utilized.
→ Speaker adaptation / normalization
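One common realization of speaker adaptation is MAP adaptation of the Gaussian means; the sketch below illustrates that technique and is not necessarily the method behind this slide:

```python
import numpy as np

def map_adapt_means(prior_means, frames, assignments, tau=10.0):
    """MAP adaptation of Gaussian mean vectors to a new speaker.

    prior_means : (K, D) speaker-independent means
    frames      : (T, D) adaptation feature vectors (numpy array)
    assignments : (T,) index of the Gaussian each frame is aligned to
    tau         : prior weight; larger tau trusts the SI model more
    """
    adapted = prior_means.copy()
    for k in range(len(prior_means)):
        data = frames[assignments == k]
        if len(data) == 0:
            continue                       # unseen Gaussians keep the prior
        adapted[k] = (tau * prior_means[k] + data.sum(axis=0)) / (tau + len(data))
    return adapted
```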
24. Framework of adaptive learning
25. Outline
- Spontaneous speech corpora
- Robust speech recognition
- Spontaneous speech recognition
- Speech understanding
- Speech summarization
26. Test-set perplexity and OOV rate for the two language models
27. Word accuracy for each combination of models
28. Mean and standard deviation for each attribute of presentation speech
Acc: word accuracy (%), AL: averaged acoustic frame likelihood, SR: speaking rate (number of phonemes/sec), PP: word perplexity, OR: out-of-vocabulary rate, FR: filled pause rate (%), RR: repair rate (%)
29. Summary of correlations between various attributes
Acc: word accuracy, OR: out-of-vocabulary rate, RR: repair rate, FR: filled pause rate, SR: speaking rate, AL: averaged acoustic frame likelihood, PP: word perplexity
30. Linear regression models of the word accuracy (%) with the six presentation attributes
Speaker-independent recognition: Acc = 0.12AL - 0.88SR - 0.020PP - 2.2OR + 0.32FR - 3.0RR + 95
Speaker-adaptive recognition: Acc = 0.024AL - 1.3SR - 0.014PP - 2.1OR + 0.32FR - 3.2RR + 99
Acc: word accuracy, SR: speaking rate, PP: word perplexity, OR: out-of-vocabulary rate, FR: filled pause rate, RR: repair rate
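Such a regression model can be fitted by ordinary least squares; the sketch below uses made-up attribute values purely for illustration (with so few toy rows the system is underdetermined, and lstsq returns the minimum-norm solution):

```python
import numpy as np

# One row per presentation; columns: AL, SR, PP, OR, FR, RR
# (hypothetical values for illustration only)
X = np.array([[-60.1, 12.3, 80.0, 2.1, 5.0, 1.2],
              [-58.4, 14.0, 95.0, 3.5, 8.1, 2.0],
              [-61.7, 11.2, 70.0, 1.0, 3.2, 0.5],
              [-59.0, 13.1, 88.0, 2.8, 6.4, 1.6]])
acc = np.array([72.5, 61.3, 78.0, 68.2])       # word accuracy (%)

# Append a constant column so lstsq also estimates the intercept.
X1 = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(X1, acc, rcond=None)
print("weights:", coef[:-1], "intercept:", coef[-1])
```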
31. Outline
- Spontaneous speech corpora
- Robust speech recognition
- Spontaneous speech recognition
- Speech understanding
- Speech summarization
32. Human speech generation and recognition process
33. A communication-theoretic view of speech generation and recognition
(Figure: a message source with prior P(M) produces a message M; a linguistic channel P(W|M), shaped by language, vocabulary, grammar, semantics, context and habits, turns it into a word sequence W; an acoustic channel P(X|W), shaped by the speaker, reverberation, noise, transmission characteristics and the microphone, turns W into the speech signal X observed by the recognizer.)
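In this view the recognizer inverts the two channels by Bayes' rule; written with the probabilities labeled in the figure (the standard decoding formulation, stated here for completeness):

$$\hat{W} = \arg\max_W P(X \mid W)\, P(W), \qquad P(W) = \sum_M P(W \mid M)\, P(M)$$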
36. Generic block diagram for spontaneous speech understanding
37. Generic semi-automatic language acquisition for speech understanding
38. An architecture of a detection-based speech understanding system
39. Outline
- Spontaneous speech corpora
- Robust speech recognition
- Spontaneous speech recognition
- Speech understanding
- Speech summarization
40. Sayings
- "The shortest complete description is the best understanding." (Ockham)
- "If I had more time I could write a shorter letter." (B. Pascal)
- "Make everything as simple as possible." (A. Einstein)
41. From speech recognition to summarization
- LVCSR (Large Vocabulary Continuous Speech Recognition) systems can transcribe read speech with 90% word accuracy or higher.
- Current target: LVCSR systems for spontaneous speech recognition, to generate closed captions, abstracts, etc.
- Spontaneous speech features: filled pauses, disfluency, repetition, deletion, repair, etc.
- Outputs from LVCSR systems include recognition errors.
→ Automatic speech summarization / important information extraction
42. Summarization levels
(Figure: summarization levels range from indicative summarization to informative summarization, the latter approaching information extraction / speech understanding.)
43. Automatic speech summarization system
44. Approach to speech summarization: utterance by utterance
(Figure: from each transcribed utterance, a set of words is extracted (sentence compaction) at a specified ratio, e.g. extracting 7 words from a 10-word utterance = 70%, yielding the summarized (compressed) sentence.)
45. Target of summarized speech
Maintaining the original meaning of the speech as much as possible
46. Summarization score
A summarized sentence with M words: $V = v_1, v_2, \ldots, v_M$
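Putting together the component scores defined on the following slides (linguistic score $L$, word significance score $I$, confidence score $C$ and word concatenation score $T_r$, with weighting factors $\lambda_I$, $\lambda_C$, $\lambda_T$), the total summarization score takes the form below; this composite is reconstructed from the DP recursion on slide 53:

$$S(V) = \sum_{m=1}^{M} \left\{ L(v_m) + \lambda_I I(v_m) + \lambda_C C(v_m) + \lambda_T T_r(v_{m-1}, v_m) \right\}$$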
47. Linguistic score
- Linguistic likelihood of word strings (bigram/trigram) in a summarized sentence: $\log P(v_m \mid v_{m-2}\, v_{m-1})$
- The linguistic score is trained using a summarization corpus.
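A minimal sketch of this score, assuming a callable `trigram(w1, w2, w3)` that returns the probability $P(w_3 \mid w_1 w_2)$ estimated on a summarization corpus (the callable is a hypothetical stand-in for any trained language model returning positive probabilities):

```python
import math

def linguistic_score(summary_words, trigram):
    """Sum of trigram log-probabilities over a summarized word string."""
    padded = ["<s>", "<s>"] + summary_words + ["</s>"]
    return sum(math.log(trigram(padded[m - 2], padded[m - 1], padded[m]))
               for m in range(2, len(padded)))
```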
48. Word significance score
Amount of information:
- $f_i$: number of occurrences of $u_i$ in the transcribed speech
- $u_i$: topic word in the transcribed speech
- $F_i$: number of occurrences of $u_i$ in all the training articles
- $F_A$: summation of $F_i$ over all the training articles ($F_A = \sum_i F_i$)
- Significance scores of words other than topic words, and of reappearing topic words, are fixed.
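From the quantities defined above, a natural amount-of-information score is $f_i \log(F_A / F_i)$ (the exact functional form is an assumption here, since the slide gives only the definitions); a minimal sketch:

```python
import math
from collections import Counter

def significance_scores(transcript_words, topic_words, train_counts):
    """Word significance score I(u_i) = f_i * log(F_A / F_i).

    transcript_words : list of words in the transcribed speech
    topic_words      : set of words treated as topic words
    train_counts     : Counter of word occurrences over all training articles
    """
    f = Counter(transcript_words)          # f_i within the transcription
    F_A = sum(train_counts.values())       # F_A = sum over all F_i
    scores = {}
    for u in set(transcript_words):
        if u in topic_words and train_counts[u] > 0:
            scores[u] = f[u] * math.log(F_A / train_counts[u])
        else:
            scores[u] = 0.0                # fixed score for non-topic words
    return scores
```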
49. Confidence score
Acoustic and linguistic reliability of a word hypothesis, measured by its posterior probability:
- $C(w_{k,l})$: log posterior probability of $w_{k,l}$
- $k, l$: node indices in a word graph
- $w_{k,l}$: word hypothesis between node $k$ and node $l$
- $\alpha$: forward probability from the beginning node $S$ to node $k$
- $\beta$: backward probability from node $l$ to the end node $T$
- $P_{ac}$: acoustic likelihood of $w_{k,l}$
- $P_{lg}$: linguistic likelihood of $w_{k,l}$
- $P_G$: forward probability from the beginning node $S$ to the end node $T$
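Combining the quantities defined above, the confidence score is the log word posterior; reconstructed from these definitions it reads:

$$C(w_{k,l}) = \log \frac{\alpha_k \; P_{ac}(w_{k,l}) \; P_{lg}(w_{k,l}) \; \beta_l}{P_G}$$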
50. Word concatenation score
A penalty for word concatenation with no dependency in the original sentence.
(Figure: "the beautiful cherry blossoms" (phrase 1) and "in Japan" (phrase 2), with intra-phrase dependencies within each phrase and an inter-phrase dependency between them. "the beautiful Japan" is grammatically correct but incorrect as a summary.)
51. Dependency structure
- Dependency grammar: left-headed dependency and right-headed dependency (e.g. "The cherry blossoms bloom in spring")
- Phrase structure grammar for dependency: DCFG (Dependency Context-Free Grammar)
  $\alpha \to \beta\,\alpha$ (right-headed dependency), $\alpha \to \alpha\,\beta$ (left-headed dependency), $\alpha \to w$
  ($\alpha, \beta$: non-terminal symbols, $w$: terminal symbol)
52. Word concatenation score based on SDCFG
- Word dependency probability: if the dependency structure between words is deterministic, the probability is 0 or 1; if it is ambiguous, it is modeled with an SDCFG (Stochastic DCFG).
- The dependency probability between $w_m$ and $w_l$, $d(w_m, w_l, i, k, j)$, is calculated using inside-outside probabilities based on the SDCFG:

$$T(w_m, w_n) = \log \sum_{i=1}^{m} \sum_{k=m}^{n-1} \sum_{j=n}^{L} \sum_{l=n}^{j} d(w_m, w_l, i, k, j)$$

($S$: initial symbol; $\alpha, \beta$: non-terminal symbols; $w$: word)
53. Dynamic programming for summarizing each utterance
Selecting the set of words that maximizes the summarization score:

$$g(m, l, n) = \max_{k < l} \left\{ g(m-1, k, l) + \log P(w_n \mid w_k\, w_l) + \lambda_I I(w_n) + \lambda_C C(w_n) + \lambda_T T_r(w_l, w_n) \right\}$$

Ex. transcription result: <s> w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 </s>; summarized sentence: <s> w2 w4 w5 w7 w10 </s>
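The recursion above keeps a trigram history $g(m, k, l)$; the sketch below implements a simplified bigram variant of the same dynamic program (the simplification and all scoring callables are assumptions for illustration), selecting M of the N transcribed words so as to maximize the summed score:

```python
def compact_sentence(words, M, bigram, sig, conf, concat,
                     lam_i=1.0, lam_c=1.0, lam_t=1.0):
    """DP sentence compaction with a bigram history.

    bigram(a, b) : log P(b | a)   (linguistic score)
    sig(w)       : word significance score I(w)
    conf(w)      : confidence score C(w)
    concat(a, b) : word concatenation score Tr(a, b)
    """
    N = len(words)
    NEG = float("-inf")
    # g[m][n] = best score of a summary of m words ending at word n
    g = [[NEG] * N for _ in range(M + 1)]
    back = [[None] * N for _ in range(M + 1)]
    for n in range(N):
        g[1][n] = lam_i * sig(words[n]) + lam_c * conf(words[n])
    for m in range(2, M + 1):
        for n in range(N):
            for l in range(n):             # previous selected word, l < n
                if g[m - 1][l] == NEG:
                    continue
                s = (g[m - 1][l] + bigram(words[l], words[n])
                     + lam_i * sig(words[n]) + lam_c * conf(words[n])
                     + lam_t * concat(words[l], words[n]))
                if s > g[m][n]:
                    g[m][n], back[m][n] = s, l
    # Trace back from the best final word
    n = max(range(N), key=lambda j: g[M][j])
    summary = []
    for m in range(M, 0, -1):
        summary.append(words[n])
        if back[m][n] is None:
            break
        n = back[m][n]
    return list(reversed(summary))
```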
54. Summarization of multiple utterances
The method of summarizing each utterance is extended to summarize a set of multiple utterances by adding a rule that imposes a restriction at utterance boundaries.
55. Dynamic programming for summarizing multiple utterances
- Initial and terminal symbols cannot be skipped.
- The word concatenation score is not applied at utterance boundaries.
56. Evaluation experiments
40% or 70% summarization ratio
57. Word network of manual summarization results for evaluation
- Manual summarization results are merged into a network.
- The network approximately expresses all possible correct summarizations, including subjective variations.
- Summarization accuracy is defined as the word accuracy based on the word string, extracted from the word network, that is most similar to the automatic summarization result.

$$\text{Summarization accuracy} = \frac{\text{Len} - (\text{Sub} + \text{Ins} + \text{Del})}{\text{Len}} \times 100\,\%$$

(Len: number of words in the most similar word string in the network; Sub: number of substitution errors; Ins: number of insertion errors; Del: number of deletion errors)
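The evaluation extracts the most similar word string from the merged network. As a minimal stand-in, the sketch below compares the automatic summary against each manual summary string separately and keeps the best-scoring one, which approximates the network search rather than reproducing the authors' exact tool:

```python
def edit_ops(ref, hyp):
    """Levenshtein alignment; returns (sub, ins, del) counts."""
    R, H = len(ref), len(hyp)
    # d[i][j] = (cost, subs, inss, dels) aligning ref[:i] with hyp[:j]
    d = [[(0, 0, 0, 0)] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        d[i][0] = (i, 0, 0, i)
    for j in range(1, H + 1):
        d[0][j] = (j, 0, j, 0)
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                cands = [
                    (d[i - 1][j - 1][0] + 1, 'sub', d[i - 1][j - 1]),
                    (d[i][j - 1][0] + 1, 'ins', d[i][j - 1]),
                    (d[i - 1][j][0] + 1, 'del', d[i - 1][j]),
                ]
                cost, op, (c, s, n, l) = min(cands, key=lambda t: t[0])
                d[i][j] = (cost,
                           s + (op == 'sub'),
                           n + (op == 'ins'),
                           l + (op == 'del'))
    return d[R][H][1], d[R][H][2], d[R][H][3]

def summarization_accuracy(manual_summaries, automatic):
    """Score against each manual summary; keep the most similar one."""
    best = float("-inf")
    for ref in manual_summaries:
        sub, ins, dele = edit_ops(ref, automatic)
        acc = (len(ref) - (sub + ins + dele)) / len(ref) * 100
        best = max(best, acc)
    return best
```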
58. Examples of automatic summarization for manually transcribed CNN news - 1
- Transcription: "It's sulfur, and as Ed Garsten reports in today's edition of tech trends, the petroleum industry is proposing a cleanup."
- Automatic summarization (30-40% summarization ratio): "sulfur Ed Garsten reports tech petroleum proposing cleanup."
- The most similar word string in the manual summarization network: "Ed Garsten reports tech trends industry proposing cleanup."
- Automatic summarization (50-70% summarization ratio): "sulfur Ed Garsten reports in today's edition tech trends petroleum industry is proposing cleanup."
- The most similar word string in the manual summarization network: "Sulfur, Garsten reports in today's tech trends the industry is proposing cleanup."
59. Examples of automatic summarization for manually transcribed CNN news - 2
- Transcription: "We are dealing with something of such a massive uh size and potential impact, um that a lot of people wisely are saying hands off."
- Automatic summarization (20-40% summarization ratio): "We're dealing something impact lot of people saying hands."
- The most similar word string in the manual summarization network: "We're dealing something such impact lot of people saying hands off."
- Automatic summarization (50-70% summarization ratio): "We're dealing with something of a size and impact, a lot of people wisely are saying hands."
- The most similar word string in the manual summarization network: "We're dealing with something of such size and impact, a lot of people wisely are saying hands off."
60. Examples of automatic summarization for recognized CNN news (80% recognition accuracy)
61. English news speech summarization (each-utterance summarization)
62. English news speech summarization (multiple-utterance summarization)
63. Recognition error reduction
64. Summary
- How to model and recognize spontaneous speech is one of the most important issues.
- Construction of a large-scale spontaneous speech corpus is crucial.
- How to cope with additive noise and with intra- and inter-speaker variability is also a crucial issue.
- A paradigm shift from recognition to understanding is needed.
- Speech summarization is attractive as a form of information extraction and speech understanding.
65. References
- C. Becchetti and L. P. Ricotti: Speech Recognition, John Wiley & Sons, Ltd., New York, 2000.
- S. Furui: Digital Speech Processing, Synthesis, and Recognition, Second Edition, Signal Processing and Communications Series, Marcel Dekker, New York, 2000.
- D. Gibbon, I. Mertins and R. K. Moore (Eds.): Handbook of Multimodal and Spoken Dialogue Systems, Kluwer Academic Publishers, Boston, 2000.
- B.-H. Juang and S. Furui: "Automatic recognition and understanding of spoken language: a first step toward natural human-machine communication," Proc. IEEE, 88, 8, pp. 1142-1165, 2000.
- C. Hori, S. Furui, R. Malkin, H. Yu and A. Waibel: "Automatic summarization of English broadcast news speech," Proc. Human Language Technology 2002, San Diego, pp. 228-233, 2002.
- X.-D. Huang, A. Acero and H.-W. Hon: Spoken Language Processing, Prentice Hall PTR, New Jersey, 2001.
- S. Young and G. Bloothooft (Eds.): Corpus-based Methods in Language and Speech Processing, Kluwer Academic Publishers, Boston, 1997.