Title: Speech to Speech Machine Translation (S2SMT)
1Speech to Speech Machine Translation (S2SMT)
- Kapita Selekta, 26 November 2005
- Suyanto
2Overview
- Motivation
- S2SMT System
- Applications
- Conclusion
- Discussion
3VerbMobil
4Motivation
- 6,500 living languages (www.ling.gu.se)
- Translation market (Donald Barabé, 2003)
- - 8 billion global market
- - Doubling every five years
5Motivation (cont.)
6Motivation (cont.)
Customer service department at a company in China with 10,000 employees (Chinese and English)

Staffers              Words/Day   Translation Time (hrs)   Percentage (%)
Customer Feedback       235,000                      350               39
Prospective Orders      150,000                      228               25
Technical Support       156,000                      235               26
Dealer Feedback          60,000                       93               10
Total                   601,000                      906              100
7Questions
- Is it possible to develop S2SMT for these problems?
- What are the challenges?
8S2SMT
Source Language Utterance -> Source Language Text -> Target Language Text -> Target Language Utterance
Kurt Godden 2002
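The four stages above chain the three components covered in the following slides: ASR, MT, and TTS. A minimal sketch of that chaining, assuming hypothetical stub functions (`recognize`, `translate`, `synthesize`) and a one-entry toy dictionary in place of real engines:

```python
# Hypothetical stand-ins for real engines; none of these names come from the slides.

def recognize(utterance: bytes) -> str:
    """ASR stub: source language utterance -> source language text."""
    return utterance.decode("utf-8")  # pretend the audio is already text


def translate(text: str) -> str:
    """MT stub: source language text -> target language text."""
    toy_dictionary = {"selamat pagi": "good morning"}  # illustrative entry only
    return toy_dictionary.get(text.lower(), text)


def synthesize(text: str) -> bytes:
    """TTS stub: target language text -> target language utterance."""
    return text.encode("utf-8")  # pretend the text is audio


def s2smt(utterance: bytes) -> bytes:
    """Run the three stages in sequence: ASR, then MT, then TTS."""
    return synthesize(translate(recognize(utterance)))


result = s2smt(b"Selamat pagi")
print(result)
```

Because each stage only consumes the previous stage's output, an ASR error is passed on unchanged to MT and TTS in this arrangement.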
9ASR (Automatic Speech Recognition)
10ASR Challenges
- Co-articulation
- Speaker independence
- - Dialect variations
- - Non-native speakers
- Spontaneous speech
- Disfluencies
- Out-of-vocabulary words
- Noise robustness
- - Convolutive: recording/transmission conditions
- - Additive: recording environment, transmission SNR
- Intra-speaker variability: stress, age, mood
- Prosody
- Intonation, stress, and phrase boundaries
- Emotion
11ASR approaches
- Word-based ASR (1970s)
- Recognize a word as a whole pattern
- For special purposes: isolated digits, connected words
- How many words do you need to develop an application?
- Syllable-based ASR
- Recognize a word as a set of syllable patterns
- In English, there are about 10,000 syllables.
- Phoneme-based ASR (widely used today)
- Recognize a word as a set of phoneme patterns
- Suitable for general purposes
- In English, there are about 50 phonemes.
- In Indonesian, there are only 32 phonemes.
12Phoneme-based ASR
- Today, it is the most realistic approach.
- In English, only about 50 phonemes need to be recognized.
- To develop a speech corpus, we should build a sentence set that is tri-phone balanced (e.g. sil-a+sil, c-a+sil, c-a+n).
- For 50 phonemes, we need about 125,000 tri-phones (50^3).
- For 10,000 syllables, we would need 10^12 tri-syllables (complicated!)
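The two counts above follow from treating a context-dependent unit as an ordered triple (left neighbour, centre, right neighbour), so an inventory of N units gives at most N^3 triples. As a worked check, taking the slide's inventory sizes as given (this is an upper bound, since many triples never occur in a real language):

```latex
\begin{align*}
N_{\text{tri-phone}}    &= 50^{3} = 125{,}000 \\
N_{\text{tri-syllable}} &= (10^{4})^{3} = 10^{12}
\end{align*}
```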
13Today's ASR
- Phoneme-based approach using statistical models (HMM or hybrid HMM/ANN) for acoustics and linguistics
- Large vocabulary, speaker independent
T. Dutoit 2002
14Language Model
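- n-gram models (the trigram is widely used for ASR)
- The probability of a sentence is estimated from the conditional probabilities of each word given the n-1 preceding words (T. Dutoit 2002)
$$P(\text{The red hat linux}) \approx P(\text{The} \mid \_, \_)\, P(\text{red} \mid \text{The}, \_)\, P(\text{hat} \mid \text{red}, \text{The})\, P(\text{linux} \mid \text{hat}, \text{red})$$
- Helps resolve words confused through co-articulation or dialect: hat, had, head, heat, hate, ...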
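The trigram factorization above can be sketched with plain counts. This is a minimal, unsmoothed maximum-likelihood estimate on a made-up three-sentence corpus; the function names and the corpus are assumptions for illustration, and a real ASR language model would add smoothing and an end-of-sentence symbol.

```python
from collections import Counter

BOS = "_"  # padding symbol for a missing history word, as on the slide


def train_trigrams(sentences):
    """Count trigrams and their two-word histories."""
    tri, hist = Counter(), Counter()
    for sentence in sentences:
        words = [BOS, BOS] + sentence.split()
        for i in range(2, len(words)):
            history = (words[i - 2], words[i - 1])
            tri[history + (words[i],)] += 1
            hist[history] += 1
    return tri, hist


def sentence_probability(sentence, tri, hist):
    """Multiply P(word | two preceding words) over the sentence."""
    words = [BOS, BOS] + sentence.split()
    prob = 1.0
    for i in range(2, len(words)):
        history = (words[i - 2], words[i - 1])
        if hist[history] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        prob *= tri[history + (words[i],)] / hist[history]
    return prob


# Toy corpus, invented for this sketch.
corpus = ["the red hat linux", "the red hat", "the red head"]
tri, hist = train_trigrams(corpus)

p_hat = sentence_probability("the red hat", tri, hist)
p_head = sentence_probability("the red head", tri, hist)
print(p_hat, p_head)
```

With this corpus, "hat" follows "the red" twice and "head" once, so the model prefers "the red hat" when the acoustics alone cannot separate the two words.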
15Language Model (cont.)
- An example in Bahasa Indonesia
- - Satu kantornya
- - Satukan TOR-nya
- If the sentence is preceded by "tolong" ("please")
- - Tolong satu kantornya
- - Tolong satukan TOR-nya
- A 3-gram model handles this better than a 2-gram model
16IBM trigram example
17IBM trigram example (cont.)
18Language Model (cont.)
- Advantages
- - Robust and efficient
- - Increases accuracy from 85% to 97% (T. Dutoit)
- Problems
- - Limited to local linguistic structure
- - A vocabulary of size V has V^n possible n-grams
- - e.g. 20,000 words give 8 trillion (8 x 10^12) possible trigrams!
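The size problem above can be made concrete by comparing the number of possible trigrams (V^n) with the number actually observed in a text. A small sketch, assuming a tiny made-up token sequence built from the earlier Bahasa Indonesia example; the helper names are hypothetical:

```python
from collections import Counter


def possible_ngrams(vocab_size: int, n: int) -> int:
    """Upper bound on distinct n-grams over a vocabulary: V ** n."""
    return vocab_size ** n


def observed_trigrams(tokens):
    """Count the trigrams that actually occur in a token sequence."""
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


# Toy token sequence, invented for this sketch.
tokens = "tolong satukan tor nya tolong satu kantornya".split()
vocab = set(tokens)
seen = observed_trigrams(tokens)

print("vocabulary size:", len(vocab))
print("possible trigrams:", possible_ngrams(len(vocab), 3))
print("observed trigrams:", len(seen))
print("possible trigrams for 20,000 words:", possible_ngrams(20_000, 3))
```

Even in this toy case most possible trigrams are never observed, which is why practical models need far more text than the vocabulary size suggests and some way of handling unseen trigrams.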
19ASR Performance
T. Dutoit 2002 - Faculté Polytechnique de Mons, Belgium
20Today's ASR
- Large Vocabulary Continuous Speech Recognition (LVCSR)
- Minimum vocabulary: 10,000 words
- Continuous speech
- Speaker independent
21Machine Translation (MT)
22MT Challenges
- Orthographic Variations
- Ambiguous spelling
- Ambiguous word boundaries
- Lexical Ambiguity
- Eat -> essen (human) vs. fressen (animal)
- Arabic كتب (ktb) -> he-wrote vs. it-was-written vs. books
23MT Challenges
- Morphological Variations
- Affixation vs. Root+Pattern
- write -> written
- kill -> killed
- do -> done
- Translation Divergences
24MT Approaches
- Grammar-based
- - Interlingua-based
- - Transfer-based
- Direct
- - Example-based
- - Statistical
MT Pyramid
Nizar Habash 2004 - Columbia University
25Multi-Engine Machine Translation
- Idea: take output from different translation engines and get an overall better translation
- Get the best from different worlds
- - High quality but low coverage from translation memory, interlingua system
- - High coverage but lower quality from statistical system
- How to get a better translation?
- - Select one translation, i.e. work at sentence level
- - Create a new one, i.e. combine partial translations from different engines
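The first option above, sentence-level selection, can be sketched as picking the candidate with the highest score. This is only an illustration: the `Hypothesis` type, the engine names, the example outputs, and the assumption that confidence scores are comparable across engines are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    engine: str        # which engine produced this translation
    text: str          # the candidate translation
    confidence: float  # assumed comparable across engines (a strong assumption)


def select_best(hypotheses):
    """Sentence-level selection: keep the single most confident candidate."""
    if not hypotheses:
        raise ValueError("no hypotheses to select from")
    return max(hypotheses, key=lambda h: h.confidence)


# Made-up candidates for one input sentence.
candidates = [
    Hypothesis("translation_memory", "Please merge the TOR", 0.92),
    Hypothesis("statistical", "Please one the office", 0.41),
]

best = select_best(candidates)
print(best.engine, "->", best.text)
```

The second option, building a new sentence from partial translations, needs alignment between the engines' outputs and is not covered by this sketch.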
26Text-to-Speech (TTS)
27TTS Challenges
- Accurate automatic phonetization (more than dictionary look-up)
- Prosody generation (i.e., intonation and phoneme durations) must be coherent: it is easy to produce unnatural prosody
- Synthesize phoneme sequences with corresponding prosody
- - Co-articulation!
- - Segmental quality should be maintained after pitch and duration modification
- Engineering
- - Low design and maintenance cost
- - Low computational and memory cost
- - Easy adaptation to other languages
28TTS Diagram
T. Dutoit 2002 - Faculté Polytechnique de Mons, Belgium
29Automatic Phonetization
30Automatic Phonetization
More complex than that!
31Intonation
- Why ups and downs?
- Stress (word level) vs. accent (phrase level)
- Even slight modifications sound unnatural
32Phoneme Duration
- Not constant
- Not fixed for a given phoneme
- Linked to intonation (longer on accented
syllables)
33Applications of S2SMT
Joy (Ying Zhang) 2003 CMU
34VerbMobil
35AT&T
36S2SMT Advantages
- Data transmission
- - Voice format to text format
- - GSM: 8 kbps down to 40 bps (a 200-fold reduction!)
Can you put S2SMT on the client?
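The 200-fold figure above is simply the ratio of the two bit rates. A short check follows; the slide does not say how 40 bps was obtained, so the 5 characters per second at 8 bits per character used below is only one assumption that happens to reproduce it.

```python
GSM_SPEECH_BPS = 8_000  # 8 kbps speech channel, figure from the slide
TEXT_BPS = 40           # 40 bps text stream, figure from the slide


def reduction_factor(before_bps: float, after_bps: float) -> float:
    """How many times smaller the second bit rate is than the first."""
    return before_bps / after_bps


def text_rate_bps(chars_per_second: float, bits_per_char: int = 8) -> float:
    """Bit rate of uncompressed text at a given character rate."""
    return chars_per_second * bits_per_char


factor = reduction_factor(GSM_SPEECH_BPS, TEXT_BPS)
assumed_rate = text_rate_bps(chars_per_second=5)  # assumption, not from the slide

print(f"reduction factor: {factor:.0f}x")
print(f"text rate at 5 chars/s: {assumed_rate} bps")
```

Sending text instead of speech only saves bandwidth if ASR runs before transmission, which is what the question about putting S2SMT on the client is pointing at.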
37S2SMT Disadvantages
- The original voice of the speaker?
- Unnatural intonation and emotion
- Delay
38Conclusion
- At the moment, it is possible to develop S2SMT for small, special-purpose domains, e.g. reservations, helpdesk, etc.
- The main problem is ASR
- MT and TTS are reasonably acceptable
- Many challenges remain in S2SMT
39Discussion
Takezawa et al. 98
40Discussion
Takezawa et al. 98
41Discussion
- How about Bahasa Indonesia?
- Population: 240 million people
- PT TELKOM has 30 million customers (12.5%), fixed and wireless
- Other operators have millions of customers (say 20 million)
- Prospective market for S2SMT?
42Discussion
- TELKOM RisTI, ATR (www.atr.jp), and STT Telkom are developing an Indonesian text and speech corpus
- Text corpus
- - 5,000 sentences
- - Extracted from news and application domains
- Speech corpus
- - 400 people (200 male and 200 female)
- - 4 dialects: Javanese, Sundanese, Jakarta, Batak
- - 4 age categories: 18-23, 24-35, 36-50, 51-60
- We need 100 students to record utterances!
43Discussion
- Target applications (not translation)
- - E-governance (status tracking for IMB (building permit), PBB (land and building tax), KTP (identity card), etc.)
- - Billing info
- - Audio conference (reservation)
- - Tele home-security
- - Telecommunication system for deaf and speech-impaired users
- To develop S2SMT, we need experts in linguistics, computer science, electronics engineering, communications, etc.
44Thank you