Title: No slide title
1 Verbmobil: Multilingual Processing of Spontaneous
Speech
Wolfgang Wahlster
German Research Center for Artificial Intelligence, DFKI GmbH
Stuhlsatzenhausweg 3, 66123 Saarbruecken, Germany
phone: (+49 681) 302-5252/4162
fax: (+49 681) 302-5341
e-mail: wahlster@dfki.de
WWW: http://www.dfki.de/wahlster
2 Mobile Speech-to-Speech Translation of
Spontaneous Dialogs
As the name Verbmobil suggests, the system
supports verbal communication with foreign dialog
partners in mobile situations:
1 face-to-face conversations
2 telecommunication
3 Mobile Speech-to-Speech Translation of
Spontaneous Dialogs
Verbmobil Speech Translation Server
Solution: Conference Call. The Verbmobil Speech
Translation Server is accessed by GSM
mobile phones.
4 Verbmobil is a Multilingual System
It supports bidirectional translation between
German and English and between German and Japanese.
5 Challenges for Language Engineering
(each dimension is listed in order of increasing complexity; Verbmobil targets the last level of each)
Input Conditions: Close-Speaking Microphone/Headset, Push-to-talk -> Telephone, Pause-based Segmentation -> Open Microphone, GSM Quality
Naturalness: Isolated Words -> Read Continuous Speech -> Spontaneous Speech
Adaptability: Speaker Dependent -> Speaker Independent -> Speaker Adaptive
Dialog Capabilities: Monolog Dictation -> Information-seeking Dialog -> Multiparty Negotiation
6 Context-Sensitive Speech-to-Speech Translation
Wann fährt der nächste Zug nach Hamburg ab?
When does the next train to Hamburg depart?
Wo befindet sich das nächste Hotel?
Where is the nearest hotel?
Verbmobil Server
Final Verbmobil Demos:
- CeBIT-2000 (Hannover)
- COLING-2000 (Saarbrücken)
- ECAI-2000 (Berlin)
7 Verbmobil: The First Speech-Only Dialog
Translation System
Mobile DECT Phone
Mobile GSM Phone
13 Verbmobil: The First Speech-Only Dialog
Translation System
Mobile DECT Phone
Mobile GSM Phone
German Speaker: "Verbmobil, neuer Teilnehmer hinzufügen." ("Verbmobil, add a new participant."; speech command to initiate a conference call)
Verbmobil: "Bitte sprechen Sie jetzt die Telephonnummer Ihres Gesprächspartners." ("Please speak the telephone number of your dialog partner now.")
German Speaker: 0681/302 5253
14 Verbmobil II: Three Domains of Discourse
Scenario 1: Appointment Scheduling
  When?
  Focus on temporal expressions
  Vocabulary Size: 2500/6000
Scenario 2: Travel Planning
  When? Where? How?
  Focus on temporal and spatial expressions
  Vocabulary Size: 7000/10000
Scenario 3: Remote PC Maintenance
  What? When? Where? How?
  Integration of special sublanguage lexica
  Vocabulary Size: 15000/30000
15 The Control Panel of Verbmobil
16 From a Multi-Agent Architecture to a
Multi-Blackboard Architecture
Verbmobil I: Multi-Agent Architecture
(diagram: modules M1 to M6 communicating directly with each other)
- Each module must know which module produces what data
- Direct communication between modules
- Each module has only one instance
- Heavy data traffic for moving copies around
- Multiparty and telecooperation applications are impossible
- Software: ICE and ICE Master
- Basic Platform: PVM
Verbmobil II: Multi-Blackboard Architecture
(diagram: modules M1 to M6 communicating via blackboards BB 1 to BB 3)
- All modules can register for each blackboard dynamically
- No direct communication between modules
- Each module can have several instances
- No copies of representation structures (word lattice, VIT chart)
- Multiparty and telecooperation applications are possible
- Software: PCA and Module Manager
- Basic Platform: PVM
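The blackboard idea on this slide can be illustrated with a small sketch. It is not the Verbmobil PCA or its Module Manager: the class `Blackboard`, the board names and the toy recognizer and parsers are all hypothetical. It only shows modules registering dynamically, communicating through blackboards instead of directly, and several parser instances reading the same entry without copies.

```python
from typing import Any, Callable, Dict, List


class Blackboard:
    """A named data pool; modules subscribe to it instead of addressing each other."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.entries: List[Any] = []
        self._subscribers: List[Callable[[Any], None]] = []

    def register(self, callback: Callable[[Any], None]) -> None:
        # Modules may register at any time (dynamic registration).
        self._subscribers.append(callback)

    def publish(self, item: Any) -> None:
        # The item is stored once; subscribers get a reference, not a copy.
        self.entries.append(item)
        for callback in list(self._subscribers):
            callback(item)


def build_demo() -> Dict[str, Blackboard]:
    boards = {name: Blackboard(name) for name in ("audio", "word_graph", "vits")}

    def recognizer(audio: str) -> None:
        # Hypothetical recognizer: treats the "audio" string as already-recognized words.
        boards["word_graph"].publish(audio.split())

    def make_parser(tag: str) -> Callable[[List[str]], None]:
        def parser(words: List[str]) -> None:
            # Hypothetical parser: publishes its tag and the number of words it saw.
            boards["vits"].publish((tag, len(words)))
        return parser

    boards["audio"].register(recognizer)
    # Two parser instances listen to the same blackboard.
    boards["word_graph"].register(make_parser("chunk"))
    boards["word_graph"].register(make_parser("hpsg"))
    return boards


boards = build_demo()
boards["audio"].publish("when does the next train depart")
print(boards["vits"].entries)
```

No module holds a reference to another module, so adding a third parser only requires one more `register` call.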
19 A Multi-Blackboard Architecture for the
Combination of Results from Deep and Shallow
Processing Modules
Command Recognizer
Channel/Speaker Adaptation
Audio Data
Spontaneous Speech Recognizer
Prosodic Analysis
Statistical Parser
Chunk Parser
Word Hypotheses Graph with Prosodic Labels
Dialog Act Recognition
HPSG Parser
Semantic Construction
Semantic Transfer
VITs: Underspecified Discourse Representations
Robust Dialog Semantics
Generation
20 Verbmobil as the First Dialog Translation System
that Uses Prosodic Information Systematically at
All Processing Stages
Speech Signal
Word Hypotheses Graph
Multilingual Prosody Module
Prosodic features:
- duration
- pitch
- energy
- pause
Boundary Information
Boundary Information
Sentence Mood
Accented Words
Prosodic Feature Vector
Dialog Act Segmentation and Recognition
Search Space Restriction
Lexical Choice
Speaker Adaptation
Constraints for Transfer
Speech Synthesis
Dialog Understanding
Translation
Parsing
Generation
23 Integrating Shallow and Deep Analysis Components
in a Multi-Blackboard Architecture
Augmented Word Hypotheses Graph
Chunk Parser
Statistical Parser
HPSG Parser
partial VITs
Chart with a combination of partial VITs
partial VITs
partial VITs
Robust Dialog Semantics: Combination and
knowledge-based reconstruction of complete VITs
Complete and Spanning VITs
24 Verbmobil's Massive Data Collection Effort
3,200 dialogs (182 hours) with 1,658 speakers
79,562 turns distributed on 56 CDs, 21.5 GB
The so-called Partitur (German word for musical
score) orchestrates fifteen strata of annotations:
- Transliteration Variant 1
- Transliteration Variant 2
- Lexical Orthography
- Canonical Pronunciation
- Manual Phonological Segmentation
- Automatic Phonological Segmentation
- Word Segmentation
- Prosodic Segmentation
- Dialog Acts
- Noises
- Superimposed Speech
- Syntactic Category
- Word Category
- Syntactic Function
- Prosodic Boundaries
25 Extracting Statistical Properties from Large
Corpora
Segmented Speech with Prosodic Labels
Treebanks, Predicate-Argument Structures
Annotated Dialogs with Dialog Acts
Aligned Bilingual Corpora
Transcribed Speech Data
Machine Learning for the Integration of
Statistical Properties into Symbolic Models for
Speech Recognition, Parsing, Dialog Processing,
Translation
Neural Nets, Multilayered Perceptrons
Probabilistic Transfer Rules
Hidden Markov Models
Probabilistic Automata
Probabilistic Grammars
31 VHG: A Packed Chart Representation of Partial
Semantic Representations
- Incremental chart construction and anytime processing
- Rule-based combination and transformation of partial UDRS coded as VITs
- Selection of a spanning analysis using a bigram model for VITs (trained on a treebank of 24k VITs)
- Chart Parser using cascaded finite-state transducers (Abney, Hinrichs)
- Statistical LR parser trained on treebank (Block, Ruland)
- Very fast HPSG parser (see two papers at ACL99, Kiefer, Krieger et al.)
Semantic Construction
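How a spanning analysis might be picked from a chart with a bigram model can be sketched as follows. The slide does not say what the VIT bigram model conditions on, so everything here is an assumption made for illustration: chart edges are reduced to `(start, end, label)` triples, the bigram model is a plain dictionary of log-probabilities over labels, and `best_spanning_path` is a hypothetical helper name.

```python
import math
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, int, str]  # (start position, end position, fragment label)


def best_spanning_path(
    edges: List[Edge],
    n: int,
    bigram_logp: Dict[Tuple[str, str], float],
    unseen: float = -10.0,
) -> Optional[List[Edge]]:
    """Return the edge sequence covering 0..n with the highest bigram score."""
    # best[(position, last label)] = (score, path) of the best partial path so far.
    best: Dict[Tuple[int, str], Tuple[float, List[Edge]]] = {(0, "<s>"): (0.0, [])}
    # Edges are visited by start position, so every state at `start` is final
    # before it is extended (this requires start < end for every edge).
    for start, end, label in sorted(edges):
        for (pos, prev), (score, path) in list(best.items()):
            if pos != start:
                continue
            cand = score + bigram_logp.get((prev, label), unseen)
            key = (end, label)
            if key not in best or cand > best[key][0]:
                best[key] = (cand, path + [(start, end, label)])
    finals = [value for (pos, _), value in best.items() if pos == n]
    return max(finals, key=lambda value: value[0])[1] if finals else None


# Toy chart over three positions with two competing spanning analyses.
edges = [(0, 1, "np"), (1, 3, "vp"), (0, 2, "pp"), (2, 3, "np")]
logp = {
    ("<s>", "np"): math.log(0.6),
    ("np", "vp"): math.log(0.7),
    ("<s>", "pp"): math.log(0.2),
    ("pp", "np"): math.log(0.5),
}
print(best_spanning_path(edges, 3, logp))
```

If no combination of edges reaches position `n`, the function returns `None`, which is the case where combination and repair of fragments would be needed.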
32 Robust Dialog Semantics: Deep Processing of
Shallow Structures
Goals of robust semantic processing (Pinkal, Worm, Rupp):
- Combination of unrelated analysis fragments
- Completion of incomplete analysis results
- Skipping of irrelevant fragments
Method: Transformation rules on VIT Hypothesis Graph
Conditions on VIT structures -> Operations on VIT structures
The rules are based on various knowledge sources:
- lattice of semantic types
- domain ontology
- sortal restrictions
- semantic constraints
Results: in 20% the analysis is improved, in 0.6% the analysis gets worse
33 Robust Dialog Semantics: Combining and
Completing Partial Representations
Let us meet (in) the late afternoon to catch the
train to Frankfurt
Fragments: [Let us] [meet] [the late afternoon] [to catch] [the train] [to Frankfurt]
The preposition "in" is missing in all paths
through the word hypothesis graph. A temporal NP
is transformed into a temporal modifier using an
underspecified temporal relation:
temporal_np(V1) -> typeraise_to_mod(V1, V2) V2
The modifier is applied to a proposition:
type(V1, prop), type(V2, mod) -> apply(V2, V1, V3) V3
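The two rules can be mimicked on toy data to make the order of operations concrete. This is only a sketch: real VITs are far richer than the `Fragment` record used here, and the string encoding of the results (`temp_rel(...)`) and the function `combine` are invented for illustration. Only the rule names `typeraise_to_mod` and `apply` come from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fragment:
    text: str
    sem_type: str  # "prop", "temporal_np" or "mod" in this toy example


def typeraise_to_mod(frag: Fragment) -> Fragment:
    # Rule 1: a temporal NP becomes a modifier with an underspecified temporal relation.
    return Fragment(f"temp_rel({frag.text})", "mod")


def apply_mod(mod: Fragment, prop: Fragment) -> Fragment:
    # Rule 2: a modifier applied to a proposition yields a new proposition.
    return Fragment(f"{mod.text}({prop.text})", "prop")


def combine(fragments: List[Fragment]) -> List[Fragment]:
    # First type-raise every temporal NP, then attach each modifier
    # to the proposition immediately to its left, if there is one.
    raised = [typeraise_to_mod(f) if f.sem_type == "temporal_np" else f for f in fragments]
    result: List[Fragment] = []
    for frag in raised:
        if frag.sem_type == "mod" and result and result[-1].sem_type == "prop":
            result[-1] = apply_mod(frag, result[-1])
        else:
            result.append(frag)
    return result


fragments = [Fragment("let_us_meet", "prop"), Fragment("the_late_afternoon", "temporal_np")]
combined = combine(fragments)
print(combined)
```

The two unrelated fragments end up as a single proposition even though the preposition was never recognized.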
34 The Understanding of Spontaneous Speech Repairs
I need a car next Tuesday oops Monday
- Original Utterance: "I need a car next Tuesday" (Reparandum: "Tuesday")
- Editing Phase: "oops" (Hesitation)
- Repair Phase: "Monday" (Reparans)
Recognition of Substitutions
Transformation of the Word Hypothesis Graph
I need a car next Monday
Verbmobil Technology: Understands speech repairs
and extracts the intended meaning.
Dictation Systems like ViaVoice,
VoiceXpress, FreeSpeech, Naturally Speaking
cannot deal with spontaneous speech
and transcribe the corrupted
utterances.
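The substitution in this example can be sketched on a flat word sequence. Verbmobil transforms the word hypothesis graph, not a single string, so this is a simplification, and the edit-term list, the weekday class and the function `repair` are all assumptions made for the illustration.

```python
from typing import Callable, List

EDIT_TERMS = {"oops", "uh", "um", "sorry"}  # assumed hesitation markers
WEEKDAYS = {"Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"}


def same_class(a: str, b: str) -> bool:
    # Toy substitution test: reparandum and reparans must both be weekdays.
    return a in WEEKDAYS and b in WEEKDAYS


def repair(tokens: List[str], compatible: Callable[[str, str], bool] = same_class) -> List[str]:
    out: List[str] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        has_next = i + 1 < len(tokens)
        if tok in EDIT_TERMS and out and has_next and compatible(out[-1], tokens[i + 1]):
            # Replace the reparandum with the reparans and drop the edit term.
            out[-1] = tokens[i + 1]
            i += 2
        else:
            out.append(tok)
            i += 1
    return out


print(" ".join(repair("I need a car next Tuesday oops Monday".split())))
```

An edit term that is not followed by a compatible word is left in place, so the function never deletes material it cannot justify replacing.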
36 Integrating a Deep HPSG-based Analysis with
Probabilistic Dialog Act Recognition for
Semantic Transfer
HPSG Analysis
Probabilistic Analysis of Dialog Acts (HMM)
Robust Dialog Semantics
Dialog Act Type
VIT
Dialog Act Type
Recognition of Dialog Plans (Plan Operators)
Semantic Transfer
Dialog Phase
37 Combining Statistical and Symbolic Processing for
Dialog Processing
Dialog-Act based Translation
Dialog Module
Context Evaluation
Statistical Prediction
Dialog Act Predictions
Context Evaluation
Main Propositional Content
Focus
Plan Recognition
Dialog Phase
Transfer by Rules
Dialog Act
Dialog-Act based Translation
Dialog Memory
Dialog Act
Generation of Minutes
38 Using Context and World Knowledge for Semantic
Transfer
Example: Platz -> room / table / seat
1 Nehmen wir dieses Hotel, ja. -> Let us take this hotel.
  Ich reserviere einen Platz. -> I will reserve a room.
2 Machen wir das Abendessen dort. -> Let us have dinner there.
  Ich reserviere einen Platz. -> I will reserve a table.
3 Gehen wir ins Theater. -> Let us go to the theater.
  Ich möchte Plätze reservieren. -> I would like to reserve seats.
All other dialog translation systems translate
word-by-word or sentence-by-sentence.
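The three cases above can be reproduced with a toy lookup that consults the dialog history before choosing a translation for "Platz". This is a sketch under strong assumptions: the cue words, the topic names and the functions `current_topic` and `translate_platz` are hypothetical, and Verbmobil's transfer relies on context and world knowledge, not on keyword spotting.

```python
from typing import Dict, List, Optional

# Assumed mapping from the current dialog topic to the English rendering of "Platz".
PLATZ_BY_TOPIC: Dict[str, str] = {"hotel": "room", "restaurant": "table", "theater": "seat"}
# Assumed cue words that signal a topic in the preceding German utterances.
TOPIC_CUES: Dict[str, str] = {"Hotel": "hotel", "Abendessen": "restaurant", "Theater": "theater"}


def current_topic(history: List[str]) -> Optional[str]:
    # Search the most recent utterances first.
    for utterance in reversed(history):
        for cue, topic in TOPIC_CUES.items():
            if cue in utterance:
                return topic
    return None


def translate_platz(history: List[str], default: str = "seat") -> str:
    topic = current_topic(history)
    return PLATZ_BY_TOPIC.get(topic, default) if topic else default


for context in ("Nehmen wir dieses Hotel, ja.",
                "Machen wir das Abendessen dort.",
                "Gehen wir ins Theater."):
    print(context, "->", translate_platz([context]))
```

The same source sentence, "Ich reserviere einen Platz.", receives a different translation depending only on what was said before it.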
39 Automatic Generation of Multilingual Protocols of
Telephone Conversations
Dialog Translation by Verbmobil
Multilingual Generation of Protocols
HTML-Document in German, Transferred
by Internet or Fax
HTML-Document in English, Transferred
by Internet or Fax
German Dialog Partner
American Dialog Partner
42 Integrating Deep and Shallow Processing:
Combining Results from Concurrent Translation
Threads
Segment 1: If you prefer another hotel,
Segment 2: please let me know.
Statistical Translation
Dialog-Act Based Translation
Semantic Transfer
Case-Based Translation
Alternative Translations with Confidence Values
Selection Module
Segment 1: Translated by Semantic Transfer
Segment 2: Translated by Case-Based Translation
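The per-segment choice shown on this slide can be sketched as a maximum over confidence values. The numbers below are invented so that the outcome matches the slide, the translation strings are placeholders, and a plain maximum is an assumption: the actual selection module may rescale or learn how to compare confidences from different threads.

```python
from typing import List, Tuple

Candidate = Tuple[str, str, float]  # (thread name, translation, confidence)


def select(candidates_per_segment: List[List[Candidate]]) -> List[Candidate]:
    # Pick, for each segment independently, the candidate with the highest confidence.
    return [max(cands, key=lambda c: c[2]) for cands in candidates_per_segment]


# Hypothetical confidence values; the translations are placeholders.
segment_1 = [
    ("statistical", "<translation 1a>", 0.61),
    ("dialog-act", "<translation 1b>", 0.40),
    ("semantic-transfer", "<translation 1c>", 0.83),
    ("case-based", "<translation 1d>", 0.35),
]
segment_2 = [
    ("statistical", "<translation 2a>", 0.58),
    ("dialog-act", "<translation 2b>", 0.52),
    ("semantic-transfer", "<translation 2c>", 0.47),
    ("case-based", "<translation 2d>", 0.90),
]

chosen = select([segment_1, segment_2])
print([thread for thread, _, _ in chosen])
```

Because segments are chosen independently, one turn can mix output from a deep thread and a shallow thread, as in the example on the slide.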
43 Verbmobil: Long-Term, Large-Scale Funding and
Its Impact
- Funding by the German Ministry for Education and Research (BMBF)
  Phase I (1993-1996): 33 M
  Phase II (1997-2000): 28 M
- 60% industrial funding according to shared cost model: 17 M
- Additional R&D investments of industrial partners: 11 M
Total: 89 M
Impact:
> 800 Publications (> 600 refereed)
> Many Patents
> 17 Commercial Spin-off Products
> 6 Spin-off Companies
> 900 trained Researchers for German Language Industry
> Product Announcement for GSM version in 2001
Philips, DaimlerChrysler and Siemens are leaders
in Spoken Dialog Applications
44 More than 80% of Verbmobil's Translations are
Approximately Correct
- Large-Scale Web-based Evaluation: 25 345 Translations, 65 Evaluators
- Sentence Length: 1 - 60 Words
Percentage of Approximately Correct Translations
Translation Thread             Word Accuracy >= 50%   Word Accuracy >= 75%   Word Accuracy >= 80%
                               (5069 Turns)           (3267 Turns)           (2723 Turns)
Case-based Translation         37                     44                     46
Statistical Translation        69                     79                     81
Dialog-Act based Translation   40                     45                     46
Semantic Transfer              40                     47                     49
Substring-based Translation    65                     75                     79
Automatic Selection            57 / 78*               66 / 83*               68 / 85*
Manual Selection               88                     95                     97
* After Training with Instance-based Learning Algorithm
45 Checklist for Final Verbmobil System I
- Three Domains: Appointment Scheduling, Travel Planning, PC Hotline
- Bi-directional and speaker-independent translation in the domains appointment scheduling and travel planning
- Translation pairs: German <-> English, German <-> Japanese
- Vocabulary Size: 10 000 for German, Equivalent English Lexicon, 2500 for Japanese
- Operational Success Criteria: Word recognition rate
  (16 kHz) German: spontaneous 75% (cooperative 85%)
           English: spontaneous 72% (cooperative 82%)
           Japanese: spontaneous 75% (cooperative 85%)
  (8 kHz)  spontaneous 70% (cooperative 80%)
- 80% of the translations are approximately correct and the dialog task success rate should be around 90%.
- The average end-to-end processing time should be four times real time (length of the input signal)
46 Checklist for Final Verbmobil System II
- The system can work in the open microphone mode and cope with speech over GSM mobile phones.
- Verbmobil can be controlled by speech commands.
- A spelling mode is integrated into the speech recognizer.
- The speech recognizers can cope with simple non-speech input (like coughing).
- Spontaneous speech phenomena like repairs, hesitations and agreement failures can be handled.
- The language identification and speech recognition components are implemented as separate components.
- A three-party conference call with Verbmobil and a foreign partner can be initiated by one speaker.
- A high-quality speech synthesis for German and American English is realized.
47 Checklist for Final Verbmobil System III
- Prosodic information is used for input segmentation.
- Unknown words can be identified and processed.
- Robust semantic processing integrates partial analysis results of the competing parsing approaches.
- The selection of the translation result is based on a dynamic choice function based on confidence values computed by competing translation threads.
- Some translation ambiguities can be resolved by the exploitation of world and context knowledge, so that the translation quality is improved.
- Verbmobil can generate various forms of dialog protocols in German and English.
48 Results of the Verbmobil Project have been used
in 17 Spin-Off Products by the Industrial
Partners DaimlerChrysler, Philips and Siemens
49 Successful Technology Transfer: 6 High-Tech
Spin-Off Companies in the Area of
Language Technology have been founded by
Verbmobil Researchers
50 Verbmobil was the Key Resource for the Education
and Training of Researchers and Engineers Needed
to Build Up Language Industry in Germany
51 SmartKom: Intuitive Multimodal Interaction
Project Budget: 34 M
Project Duration: 4 years
Main Contractor, Project Management, Testbed, Software Integration: DFKI Saarbrücken
The SmartKom Consortium
European Media Lab
MediaInterface
IMS Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart
Ludwig-Maximilians-Universität München
52 SmartKom-Public: A Multimodal Communication Booth
53 SmartKom-Mobile: A Handheld Communication
Assistant
54 Verbmobil is a Very Large Dialog System
- 69 modules communicate via 224 blackboards
- HPSG for German uses a hierarchy of 2,400 types
- 15,385 entries in the semantic database
- 22,783 transfer rules and 13,640 microplanning rules
- 30,000 templates for case-based translation
- 691,583 alignment templates
- 334 finite-state transducers
55 Additional Information about Verbmobil during
COLING
- 2 Tutorials
  - Klüter/Reithinger: Verbmobil Development and Integration
  - Müller: HPSG
- 11 Presentations at main conference (regular papers and project notes)
  - Probabilistic Parsing
  - Tense Translation
  - Selection of Translation Results
  - Statistical Translation (4)
  - HPSG Parsing
  - Semantic Construction
  - Self Corrections
- Verbmobil Demos at the COLING exhibition
56 Conclusion I
- Real-world problems in language technology like
the understanding of spoken dialogs,
speech-to-speech translation and multimodal
dialog systems can only be cracked by the
combined muscle of deep and shallow processing
approaches.
- In a multi-blackboard architecture
based on packed representations on all processing
levels (speech recognition, parsing, semantic
processing, translation, generation) using charts
with underspecified representations (e.g. UDRS),
the results of concurrent processing threads can
be combined in an incremental fashion.
57 Conclusion II
- All results of concurrent processing modules
should come with a confidence value, so that a
selection module can choose the most promising
result at each processing stage.
- Packed representations together with formalisms for
underspecification capture the uncertainties in
each processing phase, so that the uncertainties
can be reduced by linguistic, discourse and
domain constraints as soon as they become
applicable.
58 Conclusion III
- Deep processing can be used for merging,
completing and repairing the results of shallow
processing strategies.
- Shallow methods can be
used to guide the search in deep processing.
- Statistical methods must be augmented by symbolic
models (e.g. class-based language modelling, word
order normalization as part of statistical
translation).
- Statistical methods can be used to
learn operators or selection strategies for
symbolic processes.
It is much more than a balancing act... (see
Klavans and Resnik 1996)
59 Additional Results (not promised in the project
proposal)
- English speech recognition for telephone input (DaimlerChrysler)
- Two additional translation engines: case-based (ALI, DFKI) and substring-based translation (LTrans, Siemens)
- An additional protocol mode (baseline protocol, DFKI)
Open Problems
- Integrating top-down knowledge into basic speech recognition processes
- Exploiting more knowledge about human interpretation strategies
- More robust translation of turns with very low word accuracy rates
- More systematic use of expert knowledge about the domain of discourse
60 URL of this Presentation: www.dfki.de/wahlster/vm-final
Thank you for your attention