Title: Automatic speech recognition of Cantonese-English code-mixing utterances
Automatic speech recognition of Cantonese-English code-mixing utterances
Joyce Y. C. Chan, P. C. Ching, Tan Lee and Houwei Cao
Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
Reference
- [11] Joyce Y. C. Chan, P. C. Ching and Tan Lee, "Development of a Cantonese-English Code-mixing Speech Corpus," in Proc. of Eurospeech 2005, pp. 1533-1536, Lisbon, 2005.
- [13] Joyce Y. C. Chan, P. C. Ching, Tan Lee and Helen M. Meng, "Detection of Language Boundary in Code-switching Utterances by Bi-phone Probabilities," in Proc. of ISCSLP 2004, pp. 293-296, Hong Kong, 2004.
- [6] Mirjam Wester, "Syllable Classification using Articulatory-Acoustic Features," in Proc. of Eurospeech 2003, pp. 233-236, Geneva, Switzerland, 2003.
- [10] W. K. Lo, Tan Lee and P. C. Ching, "Development of Cantonese spoken language corpora for speech applications," in Proc. of ISCSLP 1998, pp. 102-107, Singapore, 1998.
Outline
- 1. Definition
- 2. Introduction
- 3. Acoustic modeling
- 4. Language modeling
- 5. Language boundary detection (LBD)
- 6. Experiment
- 7. Conclusion
1. Definition
- Code-switching
- John Gumperz (1982): the juxtaposition within the same speech exchange of passages of speech belonging to two different grammatical systems or subsystems
- Code-mixing
- In Hong Kong, code-switching tends to be intra-sentential, and switching that involves linguistic units above the clause level is rare; hence many studies prefer the term "code-mixing".
- Ex. (Cantonese)
2. Introduction
- Hong Kong is an international city, and most of its people are Cantonese-English bilinguals.
- Cantonese is usually the matrix language, while English is the embedded language, often used to better describe meanings, feelings and phenomena in Hong Kong.
- However, the English words uttered by many local speakers carry a Cantonese accent, which makes automatic speech recognition difficult.
2. Introduction (cont.)
- 2.0 Phonological structure of Cantonese and English
- Cantonese
- One of the major Chinese dialects; it belongs to the Sino-Tibetan language family.
- It is monosyllabic in nature, with the general syllable structure C1VC2.
- All Cantonese syllables take one of the canonical forms V, CV, CVC or VC.
- English
- English is an Indo-European language.
- Its phonological structure is much more complicated than that of Cantonese.
- In English discourse, over 80% of the syllables share the canonical forms of Cantonese; the remainder are C, CC, CCV, VCC, CCCV and CCCVCC.
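The comparison of syllable forms above can be checked mechanically. The following is a minimal sketch, not the authors' tooling: it assumes ARPAbet-style phone symbols like those in the accent examples on the next slide, and the vowel inventory `VOWELS` and the helper names `cv_pattern` and `fits_cantonese_form` are illustrative assumptions.

```python
# Assumed ARPAbet-style vowel symbols; the slides do not list a vowel inventory.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}

# Canonical Cantonese syllable forms named on the slide.
CANTONESE_FORMS = {"V", "CV", "CVC", "VC"}


def cv_pattern(phones):
    """Map a syllable's phone list to a consonant/vowel pattern such as 'CVC'."""
    return "".join("V" if p.lower() in VOWELS else "C" for p in phones)


def fits_cantonese_form(phones):
    """Return True if the syllable has one of the canonical Cantonese forms."""
    return cv_pattern(phones) in CANTONESE_FORMS


if __name__ == "__main__":
    examples = [("check", ["ch", "eh", "k"]), ("plan", ["p", "l", "ae", "n"])]
    for word, phones in examples:
        print(word, cv_pattern(phones), fits_cantonese_form(phones))
```

Under these assumptions, "check" has the pattern CVC and fits a Cantonese form, while "plan" has the pattern CCVC and does not.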
2. Introduction (cont.)
- 2.1 Cantonese accent in the embedded English words
- This phenomenon is called borrowing. (1990)
- Cantonese speakers pronounce borrowed words with the following characteristics:
- Softening or dropping the second consonant in a CC sequence, e.g. plan /p l ae n/ is pronounced as /p ae n/.
- Softening or dropping the final stop consonant, e.g. check /ch eh k/ is pronounced as /ch eh/.
- Adapting a monosyllabic word with a fricative ending into a disyllabic one, e.g. notes /n ow t s/ is pronounced as /n ow t s iy/.
- The retroflex /r/ is read as an /l/ or /w/ sound, e.g. pressure /p r eh sh er/ is pronounced as /p l eh sh er/, and repeat /r iy p iy t/ is pronounced as /w iy p iy t/.
- A phone that exists in English but not in Cantonese is replaced by a similar Cantonese phone, e.g. /th/ becomes /f/ and /eh/ becomes /ae/.
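The accent characteristics listed above can be read as rewrite rules that generate extra pronunciation variants for an English dictionary entry. The sketch below is one possible reading under stated assumptions, not the authors' procedure: the phone classes (`VOWELS`, `STOPS`, `FRICATIVES`), the function names, and the choice to apply each rule once and independently are all assumptions, and the rules will over-generate on real data.

```python
# Assumed phone classes in ARPAbet-style notation (not given on the slides).
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}
STOPS = {"p", "b", "t", "d", "k", "g"}
FRICATIVES = {"s", "z", "f", "v", "sh", "zh", "th", "dh"}
# Substitutions named on the slide: /r/ -> /l/ or /w/, /th/ -> /f/, /eh/ -> /ae/.
SUBSTITUTIONS = {"r": ["l", "w"], "th": ["f"], "eh": ["ae"]}


def drop_second_consonant(phones):
    """Drop the second consonant of the first CC sequence, if any."""
    for i in range(len(phones) - 1):
        if phones[i] not in VOWELS and phones[i + 1] not in VOWELS:
            return phones[:i + 1] + phones[i + 2:]
    return None


def drop_final_stop(phones):
    """Drop a word-final stop consonant, if any."""
    if len(phones) > 1 and phones[-1] in STOPS:
        return phones[:-1]
    return None


def add_vowel_after_final_fricative(phones):
    """Append /iy/ to a monosyllabic word that ends in a fricative."""
    n_vowels = sum(1 for p in phones if p in VOWELS)
    if n_vowels == 1 and phones[-1] in FRICATIVES:
        return phones + ["iy"]
    return None


def substitute_phones(phones):
    """Replace one phone at a time with its assumed Cantonese-like counterpart."""
    variants = []
    for i, p in enumerate(phones):
        for repl in SUBSTITUTIONS.get(p, []):
            variants.append(phones[:i] + [repl] + phones[i + 1:])
    return variants


def accent_variants(phones):
    """Return accented variants of a pronunciation, one rule applied per variant."""
    phones = list(phones)
    candidates = [
        drop_second_consonant(phones),
        drop_final_stop(phones),
        add_vowel_after_final_fricative(phones),
    ] + substitute_phones(phones)
    variants = []
    for cand in candidates:
        if cand is not None and cand != phones and cand not in variants:
            variants.append(cand)
    return variants


if __name__ == "__main__":
    for word, pron in [("plan", "p l ae n"), ("check", "ch eh k"), ("notes", "n ow t s")]:
        print(word, [" ".join(v) for v in accent_variants(pron.split())])
```

Each variant produced this way would be added as an alternative entry for the same word, which is consistent with a dictionary that holds more than one pronunciation per English word.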
2. Introduction (cont.)
- 2.2 Phone change and syllable fusion in Cantonese
- Hong Kong people do not use romanization systems when they learn Chinese or Cantonese. As a result, speakers may not know the correct pronunciation of a word and may confuse one phoneme with another.
- In addition, syllable fusion may occur in fast speech: the second syllable of a disyllabic word may be dropped or changed. For example, in Cantonese the word 知道 /zi1 dou3/ may be pronounced as /zi1 ou3/, and 今日 /gam1 jat6/ becomes /gam1 mat6/.
- Both effects lead to phone insertion or phone deletion.
2. Introduction (cont.)
- Scenario
- 1. Preparing the monolingual and cross-lingual acoustic models
- 2. Preparing the modified pronunciation dictionary
- To handle accents in the code-switched words, the phone sequences of the English entries in the pronunciation dictionary are modified.
- 3. Preparing the language models
- Four different statistical language models are proposed to address the lack of code-mixing text data for training.
2. Introduction (cont.)
- Scenario
- 4. Code-mixing speech recognizer
- A bilingual speech recognizer that is syllable-based for Cantonese and word-based for English.
- Two-pass system
- First pass
- No language models are applied in the first pass.
- The bilingual speech recognizer generates a lattice, and language boundary (LB) information is integrated into the lattice by re-scoring the acoustic scores of the hypothesized words.
- Second pass
- Language model scores are then integrated into the lattice, and the Generalized Word Posterior Probability (GWPP) is derived.
- A character-based hypothesis is then obtained by a best-path search over the GWPP scores.
3. Acoustic modeling
- Three speech corpora are involved in this research:
- TIMIT: monolingual English corpus (native speakers)
- CUSENT: monolingual Cantonese corpus (newspaper content)
- CUMIX: Cantonese-English code-mixing corpus
[Figure: acoustic model sets; labels: CE, C, modified lexicon; no accents; cross-lingual]
- All the acoustic models are triphone models.
- The language-dependent models are monolingual and cover 39 English phones and 56 Cantonese phones.
3. Acoustic modeling (cont.)
- In model set C, similar phones of the two languages are clustered, which reduces the total number of phones from 95 (39 English plus 56 Cantonese) to 70.
- The dictionary contains an average of 2.267 different pronunciations for each English word.
4. Language modeling
- Mixing between standard Chinese and spoken Cantonese is another problem, since the two involve different lexicons and grammars.
- Instead of searching for code-mixing text data, we searched for spoken Cantonese text.
- Articles that contain selected spoken Cantonese characters (those that do not appear in standard Chinese) are selected.
- Of the collected data, 10% is code-mixing.
4. Language modeling (cont.)
- All the language models are tri-gram models, character-based for Cantonese.
- Monolingual language model (CAN_LM): treats all English words as out-of-vocabulary (OOV).
- Code-mixing language model (CS_LM): all English words share the same probability.
- Class-based language model (CLASS_LM): classifies the English words into 13 classes according to their part of speech (POS) and meaning. The classes are adjective, companies, date and time, event and activities, fashion, food, brand name, objects and tools, human name, place, sentence and phrase, shops and restaurants, software, verb and the remaining nouns. Most of the classes are nouns, since nouns make up the majority of code-switched words.
- Translation-based language model (TRANS_LM): translates the English words into their Cantonese equivalents where available; otherwise it uses the classes in CLASS_LM. The language model is still character-based, even if the corresponding Cantonese translation contains multiple characters.
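The four language models differ mainly in how an English word is represented in the training text before tri-gram counting. The sketch below illustrates that preprocessing step only. It is an assumption-laden illustration rather than the authors' pipeline: the token names `<OOV>` and `<ENG>`, the fallback class `NOUN_OTHER`, the ASCII test for English words, and the toy class and translation tables are all hypothetical.

```python
def is_english(token):
    """Treat purely ASCII alphabetic tokens as English words (a simplification)."""
    return token.isascii() and token.isalpha()


def class_token(word, word_class):
    """Map an English word to a class token; unknown words fall back to a noun class."""
    return "<" + word_class.get(word.lower(), "NOUN_OTHER") + ">"


def rewrite(tokens, scheme, word_class=None, translation=None):
    """Rewrite a mixed token sequence the way each language model would see it."""
    word_class = word_class or {}
    translation = translation or {}
    out = []
    for tok in tokens:
        if not is_english(tok):
            out.append(tok)                      # Cantonese characters pass through
        elif scheme == "CAN_LM":
            out.append("<OOV>")                  # English treated as out-of-vocabulary
        elif scheme == "CS_LM":
            out.append("<ENG>")                  # one shared token, so one shared probability
        elif scheme == "CLASS_LM":
            out.append(class_token(tok, word_class))
        elif scheme == "TRANS_LM":
            if tok.lower() in translation:
                out.extend(translation[tok.lower()])  # split translation into characters
            else:
                out.append(class_token(tok, word_class))
        else:
            raise ValueError("unknown scheme: " + scheme)
    return out


if __name__ == "__main__":
    # Toy utterance and toy lookup tables, for illustration only.
    tokens = ["今", "年", "有", "bonus", "嘅"]
    classes = {"bonus": "NOUN_OTHER", "software": "SOFTWARE"}
    translations = {"bonus": "花紅"}
    for scheme in ["CAN_LM", "CS_LM", "CLASS_LM", "TRANS_LM"]:
        print(scheme, rewrite(tokens, scheme, classes, translations))
```

Because a translation is split into single characters, the TRANS_LM output stays character-based even when the Cantonese equivalent has more than one character.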
5. Language boundary detection (LBD) (cont.)
- General equation for intra-syllable bi-phone
probability is given by
5. Language boundary detection (LBD) (cont.)
- The same character may have different phone sequences when it has different meanings.
- For example, the character 行 can be pronounced as /haang/, /hong/ and /hang/ in different phrases. The following example calculates the probability that 行 is pronounced as /haang/.
5. Language boundary detection (LBD) (cont.)
Phone based >     g-am n-in j-au B OW N AH S g-e
Intra bi-phone >  g_am n_in j_au B_OW OW_N N_AH AH_S g_e
Probability >     CAN ENG CAN ENG ENG CAN CAN CAN
                  CAN(1) ENG(1) CAN(1) ENG(2) CAN(3)
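The steps in this example (phones, then intra-unit bi-phones, then per-unit language labels, then merged runs) can be sketched as follows. This is a hedged illustration: the decision rule in `label_biphone` (pick the language with the higher bi-phone probability, with a small floor for unseen units) is an assumption, since the slide's equation is not available here. The probability tables are toy values, and the label order passed to `merge_runs` assumes the labels on the slide read left to right.

```python
from itertools import groupby


def intra_unit_biphones(units):
    """Form bi-phones from consecutive phones inside each unit (syllable or word).

    No bi-phone is formed across a unit boundary.
    """
    out = []
    for phones in units:
        out.extend(a + "_" + b for a, b in zip(phones, phones[1:]))
    return out


def label_biphone(biphone, p_can, p_eng, floor=1e-6):
    """Assumed rule: choose the language whose bi-phone probability is higher."""
    if p_can.get(biphone, floor) >= p_eng.get(biphone, floor):
        return "CAN"
    return "ENG"


def merge_runs(labels):
    """Collapse consecutive identical labels into (language, count) runs."""
    return [(lang, sum(1 for _ in group)) for lang, group in groupby(labels)]


if __name__ == "__main__":
    units = [["g", "am"], ["n", "in"], ["j", "au"],
             ["B", "OW", "N", "AH", "S"], ["g", "e"]]
    print(intra_unit_biphones(units))

    # Toy probability tables, for illustration only.
    p_can = {"g_am": 0.02}
    p_eng = {"B_OW": 0.01}
    print(label_biphone("g_am", p_can, p_eng), label_biphone("B_OW", p_can, p_eng))

    # Per-unit labels as they appear on the slide (left-to-right order assumed).
    labels = ["CAN", "ENG", "CAN", "ENG", "ENG", "CAN", "CAN", "CAN"]
    print(merge_runs(labels))
```

Short isolated runs such as a single ENG unit between CAN units are the kind of error that motivates the move to larger units in the experiment section.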
6. Experiment
6. Experiment (cont.)
- However, when the speaker has an accent, the syllable structure of the code-switched words changes, so the English words sound like Cantonese words.
- To tackle the problems caused by accents, larger units should be considered.
- Hence, we propose to use a syllable-based LBD, or to apply LBD algorithms to the lattice generated by a bilingual speech recognizer.
- The lattice-based LBD approach searches the word lattice for the English word (W_E) with the longest duration.
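The lattice-based search described above can be sketched as a scan over lattice arcs. The `Arc` structure, its field names, and the toy lattice are assumptions made for illustration; a real lattice from a recognizer would carry scores and graph connectivity that this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    """A hypothetical lattice arc: one word or syllable hypothesis with its time span."""
    label: str
    language: str   # "CAN" or "ENG"
    start: float    # seconds
    end: float      # seconds

    @property
    def duration(self):
        return self.end - self.start


def longest_english_arc(lattice):
    """Return the English hypothesis with the longest duration, or None if there is none."""
    english = [arc for arc in lattice if arc.language == "ENG"]
    return max(english, key=lambda arc: arc.duration, default=None)


def language_boundaries(lattice):
    """Use the longest English hypothesis as the embedded-language segment."""
    arc = longest_english_arc(lattice)
    return None if arc is None else (arc.start, arc.end)


if __name__ == "__main__":
    # Toy lattice with made-up time stamps.
    lattice = [
        Arc("gam", "CAN", 0.00, 0.20),
        Arc("nin", "CAN", 0.20, 0.40),
        Arc("jau", "CAN", 0.40, 0.60),
        Arc("bone", "ENG", 0.60, 0.85),
        Arc("bonus", "ENG", 0.60, 1.10),
        Arc("ge", "CAN", 1.10, 1.25),
    ]
    print(longest_english_arc(lattice).label, language_boundaries(lattice))
```

Picking the longest English hypothesis relies on the observation that English words tend to span more time than single Cantonese syllables.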
6. Experiment (cont.)
7. Conclusion
- English words last longer than Cantonese characters, since Cantonese is monosyllabic. Hence, the lattice-based LBD algorithm obtains a higher LBD accuracy.
- When the correct language boundary is obtained, the recognition accuracy of the code-switched words can be increased.
- Therefore, studies on language boundary detection are necessary for further research.