Title: Bootstrapping a Language-Independent Synthesizer
1Bootstrapping a Language-Independent Synthesizer
- Craig Olinsky
- Media Lab Europe / University College Dublin
- 15 January 2002
2Introducing the Problem
- Given a set of recordings and transcriptions in
an arbitrary language, can we quickly and easily
build a speech synthesizer?
- YES, if we know something about the language.
- However, for the majority of languages, such
resources don't exist.
3Starting from Sample
- PROS
- The existing synthesizer provides a store of
linguistic knowledge we can start from.
- Analogous to speaker adaptation in speech
recognition systems.
- Overall, quality should be better.
- CONS
- Difficulty is related to the degree of difference
between sample and target language.
- Best as a gradual process: accent/dialect, not
language.
4Starting from Scratch
- PROS
- Difficulty is directly proportional to the
complexity of the language.
- Common procedure based upon machine learning
from recordings and transcripts.
- CONS
- Don't have a great deal of relevant knowledge to
apply to the task.
- If not using a principled phone set, it is
necessary to segment / label recordings cleanly.
5The Obvious Compromise
- Take what we do know from building speech
synthesis, and generalize it to an existing
framework.
- -- we're not specifically learning from scratch
- -- at the same time, we're not making linguistic
assumptions pre-coded into the source voices
6Generic Synthesis Framework/Toolkit
- Set of scripts, utilities, and definition files
to help automate the creation of
reasonable speech synthesis voices from an
arbitrary language without the need for
linguistic or language-specific information.
- Built on top of the Festival Speech Synthesis
System and FestVox toolkit (for waveform
synthesis; most of the text processing and
pronunciation handling is externalized to
locally-developed tools)
7Language-Dependent Synthesis Components
- Phone set
- Word pronunciation (lexicon and/or
letter-to-sound rules)
- Token processing rules (numbers, etc.)
- Durations
- Intonation (accents and F0 contour)
- Prosodic phrasing method
8Phoneme Sets
- If we rely on a pre-existing set of pronunciation
rules, lexicon, etc., we are automatically
limited to using the phone set used in those
resources (or something to which it can be
mapped), most likely something language-dependent.
- IPA, SAMPA: something language-universal?
- We need to generate pronunciations: how do we
create the relationship between our training
database / phonetic representation / orthography?
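The "something it can be mapped to" case can be pictured as a lookup table from a resource's own phone set into a shared one. The sketch below is illustrative only: the table name, the helper `map_phones`, and the handful of ARPAbet-style to SAMPA entries are assumptions made for this example, not part of the toolkit described in these slides.

```python
# Hypothetical, deliberately partial table from ARPAbet-style symbols
# (as used in some English lexicons) to SAMPA symbols.
ARPABET_TO_SAMPA = {
    "K": "k",
    "AE": "{",   # SAMPA "{" is the vowel in "cat"
    "T": "t",
    "D": "d",
    "G": "g",
}


def map_phones(phones, table):
    """Map each phone through the table; collect phones with no mapping."""
    mapped, unmapped = [], []
    for phone in phones:
        if phone in table:
            mapped.append(table[phone])
        else:
            unmapped.append(phone)
    return mapped, unmapped


if __name__ == "__main__":
    # A pronunciation fully covered by the table.
    print(map_phones(["K", "AE", "T"], ARPABET_TO_SAMPA))
    # A phone outside the table is reported rather than guessed.
    print(map_phones(["ZH"], ARPABET_TO_SAMPA))
```

Any phone that ends up in the unmapped list marks a point where the pre-existing resource does not transfer, which is the limitation the slide describes.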
9Multilingual Phoneme Sets: IPA, SAMPA
- We don't want to be stuck with a set of phonemes
targeted for a specific language, so we instead
use a phoneme definition designed to be inclusive
of all languages.
- But this still assumes we know the relationship
between the phone set and orthography of the
language, i.e. for any given text we can generate
a pronunciation.
- This approach still assumes linguistic knowledge!
10Orthography as Pronunciation
- cf. R. Singh, B. Raj and R.M. Stern, Automatic
Generation of Phone Sets and Lexical
Transcriptions ..
- Suppose we begin with the orthography of the
written language.
- e.g. CAT: c a t; DOG: d o g
- This implies:
- A relation between the number of characters in a
spelling and the length of the pronunciation
- The orthography of a language is consistent /
efficient
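The CAT / DOG example can be written out as a minimal sketch in which every letter of the orthography is taken as one phone-like unit. The function names and the toy transcript are assumptions made for illustration; this is not the procedure from the cited paper or from the toolkit itself.

```python
def grapheme_pronunciation(word):
    """Treat each letter of the word as one phone-like unit."""
    return [ch for ch in word.lower() if ch.isalpha()]


def build_lexicon(transcript_lines):
    """Build a word -> unit-sequence lexicon directly from transcripts."""
    lexicon = {}
    for line in transcript_lines:
        for token in line.split():
            units = grapheme_pronunciation(token)
            if units:
                # Key on the letters only, so "cat," and "cat" coincide.
                lexicon["".join(units).upper()] = units
    return lexicon


if __name__ == "__main__":
    lexicon = build_lexicon(["cat dog", "the cat,"])
    for word, units in sorted(lexicon.items()):
        print(word, " ".join(units))
    # The "phone set" is simply the set of letters seen in the data.
    unit_set = sorted({u for units in lexicon.values() for u in units})
    print(unit_set)
```

By construction the pronunciation has exactly as many units as the spelling has letters, which is the first implication listed above; whether those units behave consistently depends on the orthography.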
11Orthography as Pronunciation
12Implications for Data Labeling and Training
13Non-Roman Orthography: Questions of Transcription
14Difficulties in Machine Learning of Pronunciation
- But there is a much more fundamental
problem in that it crucially assumes that
letter-to-phoneme correspondences can in general
be determined on the basis of information local
to a particular portion of the letter string.
While this is clearly true in some languages
(e.g. Spanish), it is simply false for others.
- It is unreasonable to expect that good
results will be obtained from a system trained
with no guidance of this kind, or with data
that is simply insufficient to the task.
- Sproat et al., Multilingual Text-to-Speech
Synthesis: The Bell Labs Approach, pp. 76-77
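The locality assumption can be made concrete with a small sketch of the context windows a letter-to-sound learner typically sees. The function name, the padding symbol, and the window widths are assumptions chosen for illustration, not details of any specific system discussed here.

```python
def letter_windows(word, width=2, pad="#"):
    """Return (left context, letter, right context) for every letter."""
    letters = word.lower()
    padded = pad * width + letters + pad * width
    windows = []
    for i in range(len(letters)):
        left = padded[i:i + width]
        focus = padded[i + width]
        right = padded[i + width + 1:i + 2 * width + 1]
        windows.append((left, focus, right))
    return windows


if __name__ == "__main__":
    # With one letter of context, the "a" in "hat" and in "hate"
    # looks identical to the learner.
    print(letter_windows("hat", width=1)[1])
    print(letter_windows("hate", width=1)[1])
    # With two letters of context, the final "e" becomes visible.
    print(letter_windows("hat", width=2)[1])
    print(letter_windows("hate", width=2)[1])
```

Anything that falls outside the window is invisible to a model trained on these features, so a letter whose pronunciation depends on distant material cannot be learned from them, however much data is available.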
15Lexicon / Letter-to-Sound Rules
16Token Processing
17Duration and Stress Modeling
18Intonation and Phrasing
19Unit Selection and Waveform Synthesis
20Overview: Adaptation for Accent and Dialect
21Final Points