Title: EXPERIMENTS WITH UNIT SELECTION SPEECH DATABASES FOR INDIAN LANGUAGES
1 EXPERIMENTS WITH UNIT SELECTION SPEECH DATABASES FOR INDIAN LANGUAGES
- S P Kishore, Alan W Black, Rohit Kumar, Rajeev Sangal
- Language Technologies Research Center, International Institute of Information Technology, Hyderabad
- Language Technologies Institute, Carnegie Mellon University
- Institute of Software Research International, Carnegie Mellon University
2 ORGANIZATION OF THE TALK
- Role of Language Technologies
- Text to Speech Systems
- Text Processing Front End
- Speech Generation Component
- Unit Selection Approach
- Experiments
- Choice of Unit Size
- Generation of Databases: Content and Size of Database
- Evaluation of Hindi Speech Synthesis System
- Applications
- Conclusion
3 ROLE OF LANGUAGE TECHNOLOGIES
- Natural Interfaces for Information Access
- Crucial Role for Multilingual Societies
- Integration of Speech Recognition, Machine Translation and Speech Synthesis
- For interaction between two people speaking different languages
4 INDIAN LANGUAGE TEXT TO SPEECH (TTS) SYSTEMS
- A Text to Speech System converts an arbitrary given text into a corresponding spoken waveform.
- Why Text to Speech Synthesis?
[Figure: Basic Blocks of a Text to Speech System. Labels: Text, Basic Units Sequence, Prosody Information, Speech]
5 INDIAN LANGUAGE TEXT TO SPEECH (TTS) SYSTEMS: TEXT PROCESSING FRONT END
- Nature of Indian Scripts
- Basic units of Indian writing system are Aksharas
- An Akshara is typically of the form V, CV, CCV
- Common Phonetic Base
- About 35 Consonants and 18 Vowels
- Phonetic nature of languages: what is written is what is spoken
- Exception: Schwa Deletion (Inherent Vowel Suppression)
6 INDIAN LANGUAGE TEXT TO SPEECH (TTS) SYSTEMS: TEXT PROCESSING FRONT END
- Format of Input Text
- ISCII, Unicode, Various Fonts
- Can be handled by use of appropriate conversion module(s)
- Mapping Non Standard Words (NSW) to Standard Words
- NSW: symbols, digits, initials, abbreviations, punctuation, non-native words, etc.
7 INDIAN LANGUAGE TEXT TO SPEECH (TTS) SYSTEMS: TEXT PROCESSING FRONT END
- Standard Words to Phoneme Sequence
- Involves Lexicon Lookup and use of Letter to Sound rules for English
- Due to the phonetic nature of Indian scripts, simple letter to sound rules can be used
- Problems with some languages
- Inherent Vowel Suppression (schwa deletion)
- e.g. ratana (rtana) is spoken as ratan
- Presently we are using a set of Heuristic Rules
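The slides do not list the heuristic rules themselves. As a minimal sketch of what one such rule might look like, the Python below deletes a word-final inherent schwa, which is enough to turn "ratana" into "ratan". The romanized phone symbols, the vowel inventory and the function name `delete_final_schwa` are all assumptions made for illustration, not the authors' rule set.

```python
# Illustrative sketch only: the actual heuristic rules are not given in the slides.
# Assumed romanization: "a" is the inherent schwa, "aa" the long vowel, etc.
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ai", "o", "au"}
SCHWA = "a"


def delete_final_schwa(phones):
    """Drop a word-final schwa that follows a consonant,
    provided the word still contains at least one other vowel."""
    if len(phones) < 2 or phones[-1] != SCHWA:
        return list(phones)          # nothing to delete
    if phones[-2] in VOWELS:
        return list(phones)          # final schwa does not follow a consonant
    remaining = phones[:-1]
    if not any(p in VOWELS for p in remaining):
        return list(phones)          # keep the only vowel of the word, e.g. "na"
    return list(remaining)


print(delete_final_schwa(["r", "a", "t", "a", "n", "a"]))
# prints ['r', 'a', 't', 'a', 'n']
```

A single rule like this does not handle word-medial schwas, which is one reason a practical front end needs a larger rule set.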
8 INDIAN LANGUAGE TEXT TO SPEECH (TTS) SYSTEMS: SPEECH GENERATION COMPONENT
- ARTICULATORY MODEL BASED SYNTHESIS
- Involves simplistic modeling of human speech production mechanism
- Difficult to accurately model the motion of articulators
- PARAMETER BASED SYNTHESIS
- Speech segments are parameterized in terms of formant frequencies or linear prediction coefficients
- Difficult to come up with the large number of rules needed to accurately manifest coarticulation and prosody
- CONCATENATION BASED SYNTHESIS
- Inventory of recorded speech segments (units) used
- Prosodic Variations
- Intonation and duration could be acquired and incorporated in the form of rules
- Store multiple realizations of units with differing prosody
9 INDIAN LANGUAGE TEXT TO SPEECH (TTS) SYSTEMS: SPEECH GENERATION COMPONENT
- Unit Selection (Data Driven) Approach
- Multiple realizations of basic units with varying prosodic features are stored in the speech database
- Storage and retrieval of a large number of recorded units is feasible in real time due to availability of cheap memory and computation power
10 UNIT SELECTION APPROACH
- Building up of Speech Databases
- Collection of optimal text corpora
- Recording the text corpora
- Automatic labeling followed by manual correction of labels
- Extraction of unit features
- Clustering units to facilitate selection
11 UNIT SELECTION APPROACH
- ISSUES INVOLVED
- Choice of Unit Size
- Sub-word units: half phone, phone, diphone, syllable
- The larger the unit size, the fewer the joins and the fewer the discontinuities
- Wide coverage of units in various contexts is also desirable
- Generation of Speech Databases
- Approach for Optimal Selection of Utterances
- Criteria for Unit Selection
- The most suitable units are selected from the database by minimizing target and concatenation costs
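The slides do not say which search procedure performs this minimization. One common choice is a Viterbi-style dynamic programming search, sketched below in Python under stated assumptions. The unit representation (label, start pitch, end pitch), both toy cost functions and all numeric values are hypothetical and only serve to make the example runnable.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target position so that the summed
    target costs plus concatenation (join) costs are minimal."""
    if not targets:
        return [], 0.0
    # best[i][j] = (cheapest total cost of a path ending in candidates[i][j],
    #               index of the predecessor in candidates[i-1])
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            prev_cost, prev_idx = min(
                (best[i - 1][k][0] + join_cost(p, c), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(targets[i], c), prev_idx))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Hypothetical toy data: each unit is (label, start_pitch_hz, end_pitch_hz).
targets = [120, 130]  # desired mean pitch per position (assumed target feature)
candidates = [
    [("ra_1", 118, 122), ("ra_2", 100, 105)],
    [("tan_1", 150, 155), ("tan_2", 124, 134)],
]


def pitch_target_cost(target, unit):
    # Distance between the desired pitch and the unit's mean pitch.
    return abs(target - (unit[1] + unit[2]) / 2)


def pitch_join_cost(prev_unit, unit):
    # Pitch discontinuity at the join.
    return abs(prev_unit[2] - unit[1])


path, total = select_units(targets, candidates, pitch_target_cost, pitch_join_cost)
print([u[0] for u in path], total)  # prints ['ra_1', 'tan_2'] 3.0
```

Real systems use richer features than pitch alone (for example spectral distance at the join and phonetic context in the target cost), but the search structure stays the same.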
12 EXPERIMENTS: CHOICE OF UNIT SIZE
- Hindi synthesizers built using different choices of unit size
- Syllable, diphone, phone, half phone
- 24 sentences from Hindi news bulletin synthesized
- Perceptual test conducted on native Hindi-speaking subjects
- AB Test
- Results
- Syllables performed better than diphones, phones and half phones
- Half phones performed better than diphones and phones
- Ref.: S. P. Kishore, Alan W. Black, "Unit Size in Unit Selection Speech Synthesis", Eurospeech 2003, Geneva
13 EXPERIMENTS: CHOICE OF UNIT
- Example Utterances
- Half Phones
- Phones
- Diphones
- Syllables
14 GENERATION OF SPEECH DATABASES
- Selection of utterances with wide phonetic and prosodic coverage
- High Frequency Syllables
- Syllables with relatively high occurrence in a corpus
- A sentence is selected if it has at least one high frequency syllable not present in the previously selected sentences
- Utterances Recorded and Labeled
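The selection rule above can be sketched as a single greedy pass over the corpus. The Python below is a minimal illustration, assuming the sentences are already syllabified and that "high frequency" means a syllable count at or above a chosen threshold. The `min_count` parameter, the function name and the toy corpus are assumptions, since the slides do not state the threshold used.

```python
from collections import Counter


def select_sentences(sentences, min_count):
    """sentences: list of (sentence_id, list_of_syllables).
    Returns the selected sentence ids and the high frequency syllables covered."""
    # Count syllable occurrences over the whole corpus.
    counts = Counter(syl for _, syls in sentences for syl in syls)
    # Assumed definition of "high frequency": count >= min_count.
    high_freq = {syl for syl, n in counts.items() if n >= min_count}

    covered = set()
    selected = []
    for sentence_id, syls in sentences:
        # High frequency syllables in this sentence that are not yet covered.
        new = (set(syls) & high_freq) - covered
        if new:
            selected.append(sentence_id)
            covered |= new
    return selected, covered


# Hypothetical toy corpus with romanized syllables.
corpus = [
    ("s1", ["ra", "tan", "ka"]),
    ("s2", ["ka", "ra"]),
    ("s3", ["tan", "mi"]),
    ("s4", ["mi", "ka", "lo"]),
]
selected, covered = select_sentences(corpus, min_count=2)
print(selected)  # prints ['s1', 's3']
```

Because the pass is order dependent, sorting the corpus so that sentences rich in uncovered syllables come first would usually give a smaller recording script. Whether the original work did this is not stated.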
15 GENERATION OF SPEECH DATABASES
SYLLABLE COVERAGE AND DURATION OF SPEECH DATABASES
To Study Dependency of Quality on Coverage >>
16 EXPERIMENTS: GENERATION OF SPEECH DATABASES
- 6 databases with varying syllable coverage built
17 EXPERIMENTS: GENERATION OF SPEECH DATABASES
PERCEPTUAL TESTS
5 subjects were asked to listen to 5 sentences and score them on a scale of 0 (worst) to 5 (best).
18 EXPERIMENTS: GENERATION OF SPEECH DATABASES
19 EVALUATION OF HINDI SPEECH SYNTHESIS SYSTEM
- Text Processing Front End developed
- Support of Hindi text in Unicode
- Handles Non Standard Words like
- Date, Currency, Digits, Address Abbreviations, etc.
- Schwa Deletion using Heuristic Rules
- 200 Sentences Synthesized
- 9 native Hindi-speaking subjects evaluated perceptual quality of the synthesizer
- Each subject evaluated nearly 40 sentences out of the 200
- Scoring on a scale of 0 (worst) to 5 (best)
- Words not sounding natural were marked
20 EVALUATION OF HINDI SPEECH SYNTHESIS SYSTEM
21 EVALUATION OF HINDI SPEECH SYNTHESIS SYSTEM
- OBSERVATIONS
- 30% of the words marked as not sounding natural were loan words from English
- Proper nouns not being pronounced correctly
- Schwa deletion rules not successfully deleting schwa in some places
- Some punctuation characters not being handled properly
- LESSONS
- Additional phonetic coverage for proper nouns and loan words required
- Good text processing component needed for high quality speech synthesis
22 APPLICATIONS
- Talking Tourists Aid
- Limited Domain Synthesis
- Allows a person to communicate queries about city, travel, accommodation, etc.
- News Reader
- Reading news from a Hindi News Portal
- Screen Reader for Visually Impaired
23 CONCLUSION
- Syllables are better units for Indian Language Speech Synthesis
- Syllable > Half Phone > Diphone > Phone
- High coverage of units produces high quality speech, with lower variance in scores, indicating more consistent results
- Effects of loan words should be considered in the design of the speech corpus
- Good text processing front end needed for high quality synthesis