Bootstrapping a Language-Independent Synthesizer

Slides: 23

Provided by: Cra6170

Learn more at: http://www.cs.cmu.edu

1
Bootstrapping a Language-Independent Synthesizer

Craig Olinsky
Media Lab Europe / University College Dublin
15 January 2002

2
Introducing the Problem

Given a set of recordings and transcriptions in
an arbitrary language, can we quickly and easily
build a speech synthesizer?
YES, if we know something about the language.
However, for the majority of languages such
resources don't exist.

3
Starting from Sample

PROS
The existing synthesizer provides a store of
linguistic knowledge we can start from.
Analogous to speaker adaptation in speech
recognition systems.
Overall, quality should be better.

CONS
Difficulty is related to the degree of difference
between the sample and target language.
Best as a gradual process: accent/dialect, not
language.

4
Starting from Scratch

PROS
Difficulty is directly proportional to the
complexity of the language.
Common procedure based upon machine learning
from recordings and transcripts.

CONS
Don't have a great deal of relevant knowledge to
apply to the task.
If not using a principled phone set, it is
necessary to segment / label recordings cleanly.

5
The Obvious Compromise

Take what we do know from building speech
synthesis, and generalize it to an existing
framework.
-- we're not specifically learning from scratch
-- at the same time, we're not making linguistic
assumptions pre-coded into the source voices

6
Generic Synthesis Framework/Toolkit

Set of scripts, utilities, and definition files
to help automate the creation of
reasonable speech synthesis voices from an
arbitrary language without the need for
linguistic or language-specific information.
Built on top of the Festival Speech Synthesis
System and FestVox toolkit (for waveform
synthesis; most of the text processing and
pronunciation handling is externalized to
locally developed tools).

7
Language-Dependent Synthesis Components

Phone set
Word pronunciation (lexicon and/or
letter-to-sound rules)
Token processing rules (numbers, etc.)

Durations
Intonation (accents and F0 contour)
Prosodic phrasing method

8
Phoneme Sets

If we rely on a pre-existing set of pronunciation
rules, lexicon, etc., we are automatically
limited to using the phone set used in those
resources (or something which they can be mapped
to): most likely something language-dependent.
IPA, SAMPA: something language-universal?
We need to generate pronunciations: how do we
create the relationship between our training
database / phonetic representation / orthography?
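
As a rough illustration of "something which they can be mapped to", the sketch below maps a few SAMPA symbols onto their IPA equivalents. The table is a small English-oriented subset chosen for illustration, and the function name map_phones is hypothetical; it is not part of Festival, FestVox, or the toolkit described in these slides.

```python
# Partial SAMPA -> IPA table (a handful of English symbols only; an
# illustrative subset, not a complete or language-universal inventory).
SAMPA_TO_IPA = {
    "S": "ʃ",
    "Z": "ʒ",
    "T": "θ",
    "D": "ð",
    "N": "ŋ",
    "@": "ə",
    "{": "æ",
    "I": "ɪ",
}


def map_phones(phones, table):
    """Map each phone through `table`.

    Symbols missing from the table are passed through unchanged and also
    reported, so a person can check whether they are truly identical in
    both notations (as "p" is) or simply not covered yet.
    """
    mapped = []
    not_in_table = []
    for phone in phones:
        if phone in table:
            mapped.append(table[phone])
        else:
            mapped.append(phone)
            not_in_table.append(phone)
    return mapped, not_in_table


if __name__ == "__main__":
    # "ship" in SAMPA is S I p
    print(map_phones(["S", "I", "p"], SAMPA_TO_IPA))
```

Reporting the uncovered symbols rather than silently passing them through makes the gaps in a hand-built mapping visible, which is where the language-dependent knowledge tends to hide.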

9
Multilingual Phoneme Sets: IPA, SAMPA

We don't want to be stuck with a set of phonemes
targeted for a specific language, so we instead
use a phoneme definition designed to be inclusive
of all languages.
But this still assumes we know the relationship
between the phone set and orthography of the
language, i.e. for any given text we can generate
a pronunciation.
This approach still assumes linguistic knowledge!

10
Orthography as Pronunciation

cf. R. Singh, B. Raj and R.M. Stern, "Automatic
Generation of Phone Sets and Lexical
Transcriptions ..."
Suppose we begin with the orthography of the
written language.
e.g. CAT -> c a t, DOG -> d o g
This implies:
-- a relation between the number of characters in
a spelling and the length of the pronunciation
-- the orthography of the language is consistent /
efficient
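
A minimal sketch of the orthography-as-pronunciation idea, assuming each letter of a word's spelling is simply treated as one "phone". The function names (grapheme_pronunciation, build_lexicon, phone_set) are hypothetical, and this is not the procedure from Singh, Raj and Stern, only the starting assumption stated above.

```python
import unicodedata


def grapheme_pronunciation(word):
    """Return the 'pronunciation' of a word as its list of letters.

    The word is lowercased and NFC-normalized; anything that is not a
    letter (digits, punctuation, stray combining marks) is dropped.
    """
    normalized = unicodedata.normalize("NFC", word.lower())
    return [ch for ch in normalized if ch.isalpha()]


def build_lexicon(transcript_lines):
    """Build a word -> letter-sequence lexicon from transcript text."""
    lexicon = {}
    for line in transcript_lines:
        for token in line.split():
            phones = grapheme_pronunciation(token)
            if phones:
                lexicon["".join(phones)] = phones
    return lexicon


def phone_set(lexicon):
    """The induced phone set is just the set of letters seen."""
    return sorted({p for phones in lexicon.values() for p in phones})


if __name__ == "__main__":
    lex = build_lexicon(["The CAT", "the dog."])
    print(lex["cat"])      # the letters of "cat"
    print(phone_set(lex))  # every distinct letter in the transcripts
```

Under this assumption the pronunciation always has exactly as many symbols as the spelling has letters, which is the first implication listed above.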

11
Orthography as Pronunciation

12
Implications for Data Labeling and Training

13
Non-Roman Orthography: Questions of Transcription

14
Difficulties in Machine Learning of Pronunciation

"But there is a much more fundamental
problem in that it crucially assumes that
letter-to-phoneme correspondences can in general
be determined on the basis of information local
to a particular portion of the letter string.
While this is clearly true in some languages
(e.g. Spanish), it is simply false for others.
It is unreasonable to expect that good
results will be obtained from a system trained
with no guidance of this kind, or with data
that is simply insufficient to the task."
-- Sproat et al., Multilingual Text-to-Speech
Synthesis: The Bell Labs Approach, pp. 76-77
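
One way to make the locality assumption concrete is to count how often the same window of letters maps to more than one phone. The sketch below does this on a hand-made toy alignment; the data, the phone labels, the "_" marker for a silent letter, and the function name context_ambiguity are all hypothetical and are not taken from Sproat et al.

```python
from collections import defaultdict


def context_ambiguity(aligned, window):
    """Fraction of distinct letter contexts that map to more than one phone.

    `aligned` is a list of (letters, phones) pairs with exactly one phone
    label per letter. `window` is the number of letters of context taken
    on each side of the target letter; "#" pads the word boundaries.
    """
    outputs = defaultdict(set)
    for letters, phones in aligned:
        if len(letters) != len(phones):
            raise ValueError("need one phone label per letter")
        padded = "#" * window + letters + "#" * window
        for i, phone in enumerate(phones):
            context = padded[i : i + 2 * window + 1]
            outputs[context].add(phone)
    ambiguous = sum(1 for seen in outputs.values() if len(seen) > 1)
    return ambiguous / len(outputs)


# Toy, hand-aligned English data (hypothetical labels; "_" = silent letter).
TOY_DATA = [
    ("hat", ["h", "ae", "t"]),
    ("hate", ["h", "ey", "t", "_"]),
]

if __name__ == "__main__":
    for w in (0, 1, 2):
        print(w, context_ambiguity(TOY_DATA, w))
```

In this toy case the vowel in "hat" and "hate" is only disambiguated once the window reaches the final "e", two letters away. With real data, a score that stays high as the window grows would suggest that local information is not enough for that language.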

15
Lexicon / Letter-to-Sound Rules

16
Token Processing

17
Duration and Stress Modeling

18
Intonation and Phrasing

19
Unit Selection and Waveform Synthesis

20
Overview: Adaptation for Accent and Dialect

21
Final Points
22
(No Transcript)
