Title: Chinese Romanization for Chinese Voice Browsing
1Chinese Romanization for Chinese Voice Browsing
2Index
- Motivations and Proposals
- IPA vs. Chinese Romanization
- Chinese Romanization Standards
- Implementations of Chinese Romanization in SSML
- Extensions for other languages
3Motivations and Proposals
4IBM Speech Synthesis System
- The IBM speech synthesis system supports about 20 languages.
- For Asian languages, we cover:
- Mandarin,
- Cantonese,
- Korean,
- Japanese,
- Thai.
5Pronunciation Annotations Are Important for Chinese
- A Chinese character represents a meaning more than a pronunciation.
- Homographs (one written form with several pronunciations) are very common among Chinese characters.
- So it is very helpful if the pronunciation can be given explicitly.
6Proposals
- We propose using Chinese Romanization to annotate Chinese pronunciation in the phoneme element.
- We also propose that SSML accept the diverse, predefined, and widely used pronunciation annotation standards of different languages.
- SSML could then be more easily accepted and used around the world.
- Note: in this presentation, Chinese Romanization means Hanyu Pinyin.
7IPA vs. Chinese Romanization
8Comparison Rule: Goal of SSML
- The goal of SSML is to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications.
- To reach this goal, SSML needs more and more users, such as ordinary Web application developers, who can learn and use it easily.
- So SSML should be defined on the basis of ordinary people's knowledge and skills rather than professional linguistic knowledge.
- Otherwise, it will be a long time before SSML is widely accepted and used around the world.
9IPA is not a good fit for Chinese
- IPA tries to collect an exhaustive set of pronunciations for all kinds of languages.
- It has become very complicated and difficult to input.
- A well-educated Chinese adult cannot annotate Chinese pronunciation in IPA without special training.
- IPA is not very popular in China.
- Special linguistic phenomena in Chinese, such as tone and retroflexion, cannot be conveniently described in IPA.
10Chinese Romanization is a good fit for Chinese
- Chinese Romanization is designed specifically for Chinese rather than for all languages.
- An r is added at the end of a syllable to describe retroflexion.
- A tone annotation is added to describe the tone.
- Chinese Romanization is widely used and learnt.
- Chinese people learn Chinese Romanization in primary school.
- Many foreigners begin to learn Chinese through Chinese Romanization.
- Chinese Romanization is widely used to input Chinese characters on computers.
- The Chinese government has brought into effect a standard for Chinese Romanization.
- It is in effect for education, publishing, information processing, and other related industries in China.
11Chinese Romanization Standards
12Chinese Romanization Standard
- The writing rules of Chinese Romanization conform to the P.R.C. state standard "Basic Rules for Hanyu Pinyin Orthography" [1], published by CSBQTS in 1996.
- This orthography is based on the Hanyu Pinyin Schema published in 1958.
- Following the naming method for alphabet values, we propose x-CSBQTS-96 to represent the Chinese Romanization alphabet. We also propose x-Pinyin-96, which is easier to remember.
- CSBQTS: China State Bureau of Quality and Technical Supervision.
13Hanyu Pinyin Schema (published in 1958)
- Character Set
- 25 characters, all from a to z except ü.
- (For ease of input on a computer, ü is replaced by v.)
- Initial Set
- b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s
- Final Set
- i, u, ü, a, ia, ua, o, uo, e, ie, üe, ai, uai, ei, uei,
- ao, iao, ou, iou, an, ian, uan, üan, en, in, uen, ün,
- ang, iang, uang, eng, ing, ueng, ong, iong
- Tone Annotation
- mā, má, mǎ, mà, ma
- Separator '
- pi'ao
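The initial set listed on this slide is enough for a rough split of a toneless syllable into its initial and final. The following is a minimal Python sketch under stated assumptions: the names `INITIALS` and `split_syllable` are illustrative and not part of any standard, the final is not validated against the final set, and syllables spelled with y or w are simply treated as having no initial.

```python
# Initials from the Hanyu Pinyin Schema slide. Two-letter initials come
# first so that "zh" is matched before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]


def split_syllable(syllable: str) -> tuple:
    """Split one toneless Pinyin syllable into (initial, final).

    The keyboard substitute "v" is normalized to "ü" first. A syllable
    that starts with none of the listed initials gets an empty initial.
    """
    s = syllable.lower().replace("v", "ü")
    for initial in INITIALS:
        if s.startswith(initial):
            return initial, s[len(initial):]
    return "", s
```

For example, `split_syllable("zhang")` gives `("zh", "ang")` and `split_syllable("lv")` gives `("l", "ü")`.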
14Pinyin vs. IPA
15Basic rules for Hanyu Pinyin Orthography
(published in 1996)
- 1. Words are the basic units for spelling the Chinese common language. (A space is used to separate words.)
- rén (person/people), péngyou (friends), túshūguǎn (library/libraries)
- gōngrén hé nóngmín (workers and farmers)
- 2. Structures of two or three syllables that indicate a complete concept are linked.
- quánguó (the whole nation), duìbuqǐ (sorry)
- 3. Terms of four or more syllables are separated if they can be divided into words; otherwise all the syllables are linked.
- wúfèng gāngbǐ (seamless pen), Hóngshízìhuì (Red Cross)
16Basic rules for Hanyu Pinyin Orthography
(published in 1996)
- 4. Reduplicated monosyllabic words are linked, but reduplicated disyllabic words are separated.
- rénrén (everybody), chángshi chángshi (give it a try)
- 5. In certain situations, a hyphen can be added to make the words easier to read and understand.
- huán-bǎo (environmental protection), shíqī-bā suì (17 or 18 years old)
17Implementations of Chinese Romanization in SSML
18Implementation 1
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="zh-CN">
  <phoneme alphabet="x-CSBQTS-96" ph="duìbuqǐ">对不起</phoneme>
  <!-- This is an example of Chinese Romanization: standard tone annotation -->
</speak>
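A document of this shape can also be produced programmatically rather than typed by hand. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `phoneme_document` and its defaults are assumptions made for illustration, the schema-location attributes are left out for brevity, and `x-CSBQTS-96` is the alphabet name proposed in this deck, not a value defined by SSML itself.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def phoneme_document(text, ph, alphabet="x-CSBQTS-96", lang="zh-CN"):
    """Return an SSML string with one <phoneme> annotating `text`.

    `text` is the written form (Chinese characters) and `ph` is its
    romanized pronunciation. Both defaults are illustrative choices.
    """
    # Serialize the SSML namespace as the default namespace (no prefix).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element("{%s}speak" % SSML_NS, {"version": "1.0"})
    speak.set("{%s}lang" % XML_NS, lang)  # written out as xml:lang
    phoneme = ET.SubElement(speak, "{%s}phoneme" % SSML_NS,
                            {"alphabet": alphabet, "ph": ph})
    phoneme.text = text
    return ET.tostring(speak, encoding="unicode")
```

Letting the library write the attributes avoids hand-escaping quotes and keeps non-ASCII pronunciation strings intact.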
19Implementation 2
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="zh-CN">
  <phoneme alphabet="x-CSBQTS-96" ph="dui4bu0qi3">对不起</phoneme>
  <!-- This is an example of Chinese Romanization using numbers to describe tone -->
</speak>
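The numeric form is easy to type, and it can be converted mechanically into the tone-mark form. The following is a minimal Python sketch, not a definitive implementation. It assumes the commonly taught mark-placement convention (mark a if present, otherwise e, otherwise the o of ou, otherwise the last vowel), which this deck does not state. Digit 0 is read as the neutral tone as in the example above, and accepting 5 for the neutral tone as well is a further assumption. The names `TONE_MARKS`, `mark_syllable`, and `numbers_to_marks` are illustrative.

```python
import re

# Tone-marked forms of each vowel, in order of tones 1 to 4.
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}

# One syllable: a run of letters followed by a single tone digit.
SYLLABLE = re.compile(r"([a-zü]+)([0-5])")


def mark_syllable(letters: str, tone: int) -> str:
    """Return the syllable with its tone digit replaced by a tone mark."""
    letters = letters.replace("v", "ü")  # keyboard substitute for ü
    if tone not in (1, 2, 3, 4):
        return letters  # 0 or 5: neutral tone, written without a mark
    if "a" in letters:
        pos = letters.index("a")
    elif "e" in letters:
        pos = letters.index("e")
    elif "ou" in letters:
        pos = letters.index("o")
    else:
        vowels = [i for i, ch in enumerate(letters) if ch in TONE_MARKS]
        if not vowels:
            return letters  # no vowel to carry the mark
        pos = vowels[-1]
    marked = TONE_MARKS[letters[pos]][tone - 1]
    return letters[:pos] + marked + letters[pos + 1:]


def numbers_to_marks(ph: str) -> str:
    """Convert e.g. "dui4bu0qi3" to its tone-mark spelling (lowercased)."""
    return SYLLABLE.sub(
        lambda m: mark_syllable(m.group(1), int(m.group(2))), ph.lower())


print(numbers_to_marks("dui4bu0qi3"))  # prints "duìbuqǐ"
```

The reverse direction (marks to numbers) needs a syllable segmenter, because tone marks alone do not show where one syllable ends and the next begins.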
20Comparison between Two implementations
- Implementation 1
- <phoneme alphabet="x-CSBQTS-96" ph="duìbuqǐ">对不起</phoneme>
- Implementation 2
- <phoneme alphabet="x-CSBQTS-96" ph="dui4bu0qi3">对不起</phoneme>
- Note: "x-CSBQTS-96" may be replaced by "x-Pinyin-96".
21Extensions for other languages
22Extension for Cantonese
- The Linguistic Society of Hong Kong (LSHK) published a simple, easy-to-learn, and easy-to-use Cantonese Romanization Scheme in 1993.
- This scheme is widely adopted in various areas: education, Cantonese information processing, computer input methods, etc.
- So we also propose using the LSHK Cantonese Romanization Scheme to annotate Cantonese pronunciation.
23Extension for more languages
- Though it is possible to form a general standard to annotate the pronunciation of all languages, such a standard may become very complex to use.
- Another way is to use the predefined and widely accepted pronunciation annotation standards of the different languages.
- At the least, these diverse standards should be an important complement to the general standard.
24Thank you!
25Korean Romanization
It is used in our Korean speech synthesis system.
26Japanese Romanization
- Japanese
- まだ覚えているでしょう 波音に包まれて
- Japanese Romanization
- mada oboeteiru deshou nami oto ni tsutsumarete
- English meaning
- Do you remember being surrounded by the sound of the tide?
27Discussion of Word
- What is the definition of Word in Chinese?
- Prosodic word or grammatical word?
- 你来还是不来? nǐ lái háishi bù lái?
- Is ?? a word?
- What is the difference between Word and break?
- The misunderstanding problem can be solved by adding a break.
- Can Word information be handled by Hanyu Pinyin Orthography?
- In Hanyu Pinyin Orthography, a space is used to separate words.