Title: CS 551/651:
1 CS 551/651 Structure of Spoken Language
Lecture 5: Characteristics of Place of Articulation; Phonetic Transcription
John-Paul Hosom, Fall 2008
2 Acoustic-Phonetic Features: Manner of Articulation
Approximately 8 manners of articulation:

Name          Sub-Types           Examples
Vowel         vowel, diphthong    aa, iy, uw, eh, ow, ...
Approximant   liquid, glide       l, r, w, y
Nasal         (none)              m, n, ng
Stop          unvoiced, voiced    p, t, k, b, d, g
Fricative     unvoiced, voiced    f, th, s, sh, v, dh, z, zh
Affricate     unvoiced, voiced    ch, jh
Aspiration    (none)              h
Flap          (none)              dx, nx

A change in manner of articulation is usually abrupt and visible, so manner provides much information about the location of phonemes.
3 Acoustic-Phonetic Features: Place of Articulation
Approximately 8 places of articulation for consonants:

Name              Examples
Labial            p, b, m, (w)
Labio-Dental      f, v
Dental            th, dh
Alveolar          t, d, s, z, n, l
Palato-Alveolar   sh, zh, ch, jh, r
Palatal           y
Velar             k, g, ng, (w)
Glottal           h

- /l/ doesn't have the same coarticulatory properties as other alveolars
- /ch/, /jh/ start as alveolar (/t/, /d/), then become palato-alveolar
- /r/ is really a retroflex, and has a complex place of articulation
- Place of articulation is more subject to coarticulation than manner
- F2 trajectory is important for identifying place of articulation.
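The manner and place tables above can be combined into a small lookup structure. The following is a minimal sketch in Python; the names `PLACE`, `MANNER`, `invert`, and `describe` are illustrative and not part of the lecture. Vowels and flaps are left out because the place table covers consonants only, and /w/ is listed under both labial and velar, as in the table.

```python
# Place of articulation for consonants, copied from the place table.
PLACE = {
    "labial": ["p", "b", "m", "w"],
    "labio-dental": ["f", "v"],
    "dental": ["th", "dh"],
    "alveolar": ["t", "d", "s", "z", "n", "l"],
    "palato-alveolar": ["sh", "zh", "ch", "jh", "r"],
    "palatal": ["y"],
    "velar": ["k", "g", "ng", "w"],
    "glottal": ["h"],
}

# Manner of articulation for the same consonants (vowels and flaps omitted).
MANNER = {
    "approximant": ["l", "r", "w", "y"],
    "nasal": ["m", "n", "ng"],
    "stop": ["p", "t", "k", "b", "d", "g"],
    "fricative": ["f", "th", "s", "sh", "v", "dh", "z", "zh"],
    "affricate": ["ch", "jh"],
    "aspiration": ["h"],
}


def invert(table):
    """Map each phoneme to the sorted list of classes that contain it."""
    result = {}
    for name, phones in table.items():
        for phone in phones:
            result.setdefault(phone, []).append(name)
    return {phone: sorted(names) for phone, names in result.items()}


PLACE_OF = invert(PLACE)
MANNER_OF = invert(MANNER)


def describe(phone):
    """Return (manner classes, place classes) for one consonant symbol."""
    return MANNER_OF[phone], PLACE_OF[phone]
```

For example, `describe("s")` returns `(["fricative"], ["alveolar"])`, while `describe("w")` reports two places because /w/ appears twice in the place table.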
4 Acoustic-Phonetic Features: Place of Articulation
- Labial (/p/, /b/, /m/, /w/)
  - constriction (or complete closure) at lips
  - the only unvoiced labial is /p/
  - the only nasal labial is /m/
  - characterized by F1, F2, and (even) F3 of adjacent vowel(s) rapidly and briefly decreasing at the border with the labial
5 Acoustic-Phonetic Features: Place of Articulation
- Labio-Dental (/f/, /v/)
  - produced by constriction between lower lip and upper teeth
  - in English, all labio-dental phonemes are fricatives
  - can be characterized by formants of adjacent vowel(s) decreasing at the border with the labio-dental (similar to the characteristics of labials)
- Dental (/th/, /dh/)
  - produced by constriction between tongue tip and upper teeth (sometimes tongue tip is closer to alveolar ridge)
  - in English, all dental phonemes are fricatives
  - may be characterized by stronger energy above 6 kHz, but weaker than /sh/, /zh/ fricatives
6 Acoustic-Phonetic Features: Place of Articulation
- Alveolar (/t/, /d/, /s/, /z/, /n/, /l/)
  - tongue tip is at or near alveolar ridge
  - a large number of English consonants are alveolar
  - primary cue to alveolars: F2 of neighboring vowel(s) is around 1800 Hz, except for /l/
  - /l/ has low F1 (≈ 400 Hz) and F2 (≈ 1000 Hz), high F3
  - /l/ before a vowel is "light" /l/, after a vowel is "dark" /l/.
7 Acoustic-Phonetic Features: Place of Articulation
- Palato-Alveolar (/sh/, /zh/, /ch/, /jh/, /r/)
  - tongue is between alveolar ridge and hard palate
  - 2 fricatives, 2 affricates, 1 retroflex
  - retroflex has depression midway along tongue
  - the palato-alveolar fricatives tend to have strong energy due to weak constriction allowing large airflow
  - /r/ (and /er/) most easily identified by F3 below 2000 Hz
- Palatal (/y/)
  - produced with tongue close to hard palate
  - extreme production of /iy/
  - F1-F2 tend to be more spread than /iy/, F1 is lower than /iy/
8 Acoustic-Phonetic Features: Place of Articulation
- Velar (/k/, /g/, /ng/)
  - produced with constriction against velum (soft palate)
  - only plosives /k/ and /g/, and nasal /ng/
  - characteristic of velars is the "velar pinch", in which F2 and F3 of the neighboring vowel become very close at the boundary with the velar. More visible with the front vowel /ih/.
9 Acoustic-Phonetic Features: Place of Articulation
- Glottal (/h/)
  - /h/ is the nominal glottal phoneme in English; in reality, the tongue can be in any vowel-like position
  - the primary cue for /h/ is formant structure without voicing, an energy dip, and/or an increase in aspiration noise in higher frequencies.
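Three of the formant cues listed on the preceding slides are concrete enough to write down as rough checks. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and the tolerance `tol_hz` and the gap `max_gap_hz` are assumed values that the slides do not give. Only the 1800 Hz and 2000 Hz figures come from the slides. Since place is strongly affected by coarticulation, such checks are cues, not classifiers.

```python
def r_cue(f3_hz):
    """/r/ (and /er/): F3 below 2000 Hz (figure taken from the slides)."""
    return f3_hz < 2000.0


def alveolar_cue(f2_hz, tol_hz=200.0):
    """Alveolars: F2 of the neighboring vowel around 1800 Hz.

    tol_hz is an assumed tolerance, not a value from the slides.
    """
    return abs(f2_hz - 1800.0) <= tol_hz


def velar_pinch_cue(f2_hz, f3_hz, max_gap_hz=300.0):
    """Velars: F2 and F3 become very close at the boundary.

    max_gap_hz is an assumed threshold for "very close".
    """
    return abs(f3_hz - f2_hz) <= max_gap_hz


def place_cues(f2_hz, f3_hz):
    """Return the names of all cues that fire for one formant measurement."""
    cues = []
    if alveolar_cue(f2_hz):
        cues.append("alveolar")
    if velar_pinch_cue(f2_hz, f3_hz):
        cues.append("velar")
    if r_cue(f3_hz):
        cues.append("r")
    return cues
```

With these assumed thresholds, `place_cues(1800.0, 2600.0)` fires only the alveolar cue and `place_cues(2200.0, 2400.0)` fires only the velar-pinch cue. More than one cue can fire at once, which reflects the point that cues are sometimes conflicting.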
10 Distinctive Phonetic Features: Summary
- Distinctive features may be used to categorize phonetic sub-classes and show relationships between phonemes
- There is often not a one-to-one correspondence between a feature value and a particular trait in the speech signal
- A variety of context-dependent and context-independent cues (sometimes conflicting, sometimes complementary) serve to identify features
- Speech is highly variable and highly context-dependent, and cues to phonemic identity are spread in both the spectral and time domains. The diffusion of features makes automatic speech recognition difficult, but human speech recognition is able to use this diffusion for robustness.
11 Redundancy
- Distinctive features are not always independent; some redundancy may be implied (especially binary features)
- Example: Spanish
    [+high] → [−low]       [+low] → [−high]
    [−back] → [−round]     [+round] → [+back]
    [+low] → [+back]       [+low] → [−round]
    [−back] → [−low]       [+round] → [−low]
- These relationships are language- and feature-set specific. (from Schane, p. 35-38)
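A redundancy rule of this kind is an implication between signed feature values, so it can be checked mechanically against a vowel inventory. The sketch below assumes the eight Spanish rules read as [+high] → [−low], [+low] → [−high], [−back] → [−round], [+round] → [+back], [+low] → [+back], [+low] → [−round], [−back] → [−low], and [+round] → [−low]. The feature values for the five Spanish vowels are the usual textbook assignments and are an assumption here, since the slide does not list them. All names are illustrative.

```python
# Assumed feature values for the five Spanish vowels (not given on the slide).
VOWELS = {
    "i": {"high": True,  "low": False, "back": False, "round": False},
    "e": {"high": False, "low": False, "back": False, "round": False},
    "a": {"high": False, "low": True,  "back": True,  "round": False},
    "o": {"high": False, "low": False, "back": True,  "round": True},
    "u": {"high": True,  "low": False, "back": True,  "round": True},
}

# Each rule is ((feature, value), (feature, value)): "if ... then ...".
RULES = [
    (("high", True), ("low", False)),
    (("low", True), ("high", False)),
    (("back", False), ("round", False)),
    (("round", True), ("back", True)),
    (("low", True), ("back", True)),
    (("low", True), ("round", False)),
    (("back", False), ("low", False)),
    (("round", True), ("low", False)),
]


def rule_holds(rule, inventory):
    """True if every segment matching the 'if' part also matches the 'then' part."""
    (if_feat, if_val), (then_feat, then_val) = rule
    return all(
        feats[then_feat] == then_val
        for feats in inventory.values()
        if feats[if_feat] == if_val
    )


def violations(rules, inventory):
    """Return the rules that fail for at least one segment in the inventory."""
    return [rule for rule in rules if not rule_holds(rule, inventory)]
```

Under these assumptions `violations(RULES, VOWELS)` is empty, while the reverse implication [+back] → [+round] fails because /a/ is back but unrounded. That asymmetry illustrates why the relationships depend on the language and the feature set.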
12 Redundancy
- Redundant information can be indicated by circling redundant features
- Some redundancies are universal (a segment can't be both high and low)
- Phonetic sequences also have constraints (redundant info.)
- English has no more than 3 word-initial consonants; in this case, the first consonant is always /s/, the next is always /p/, /t/, or /k/, and the third is always /r/ or /l/ (from Schane, p. 36-40)
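The word-initial constraint stated above translates directly into a three-position check. This is a minimal sketch; the function name is hypothetical, and it encodes only the constraint as given on the slide, so it does not try to exclude every unattested cluster.

```python
def valid_three_consonant_onset(c1, c2, c3):
    """Check the slide's constraint on three word-initial consonants.

    First must be /s/, second one of /p/, /t/, /k/, third /r/ or /l/.
    """
    return c1 == "s" and c2 in {"p", "t", "k"} and c3 in {"r", "l"}
```

For example, the onset of "string" (`"s", "t", "r"`) and of "splash" (`"s", "p", "l"`) pass the check, while any cluster that does not start with /s/ fails it.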
13 Phonetic Transcription
Given a corpus of speech data, it's often necessary to create a transcription:
- word level
- phoneme level
- time-aligned phoneme level
- time-aligned detailed phoneme level (with diacritics)
- other information: phonetic stress, emotion, syntax, repair
Most common are word-level and time-aligned phoneme level.
Time-aligned phonetic transcription example:
    0   110 .pau
    110 180 h
    180 240 eh
    240 280 l
    280 390 ow
    390 540 .pau
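A time-aligned transcription like the example above is a list of (start, end, label) records, which is straightforward to read into a program. The following sketch assumes one whitespace-separated record per line, as in the example. The names `Segment`, `parse_labels`, and `is_contiguous` are illustrative, and the time values are kept in whatever units the file uses.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: int   # start time, in the file's own units
    end: int     # end time, in the file's own units
    label: str   # phoneme symbol, e.g. "eh" or ".pau"


def parse_labels(text: str) -> List[Segment]:
    """Parse 'start end label' lines; lines without three fields are skipped."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        segments.append(Segment(int(fields[0]), int(fields[1]), fields[2]))
    return segments


def is_contiguous(segments: List[Segment]) -> bool:
    """True if each segment starts exactly where the previous one ends."""
    return all(a.end == b.start for a, b in zip(segments, segments[1:]))


# The example transcription from the slide.
EXAMPLE = """\
0 110 .pau
110 180 h
180 240 eh
240 280 l
280 390 ow
390 540 .pau
"""
```

Calling `parse_labels(EXAMPLE)` yields six segments, and `is_contiguous` confirms that the boundaries tile the utterance without gaps or overlaps.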
14 Phonetic Transcription
- Are phonemes precise quantities with exact boundaries? No: humans disagree on phonetic labels and boundary positions; disagreement may be a matter of interpretation of the utterance.
- Phonetic label agreement between humans:
  - Full, Base Label Set: 55 (English), 62 (German), 50 (Mandarin), 42 (Spanish)
  - Broad Categories: 7, corresponding to manner of articulation
From Cole, Oshika, et al., ICSLP '94
15 Phonetic Transcription
- 70% agreement on 55 phonemes, 89% agreement on 7 categories
- Best phoneme-level automatic speech recognition results on TIMIT, with a 39-phoneme symbol set: 75.8% (Antoniou and Reynolds)
- Differences:
  - Human agreement was evaluated on spontaneous speech (stories); TIMIT is read speech
  - Humans used 55 phonemes; 39 phonemes are used for evaluating TIMIT
- Phoneme agreement doesn't translate into word accuracy: human word accuracy is typically an order of magnitude better than the best automatic speech recognition system.
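The gap between phoneme-level and broad-category agreement can be illustrated by scoring the same pair of labelings at both levels. The sketch below assumes the two label sequences have already been paired one-to-one, which real comparisons have to establish first. The broad classes are the manner classes from the manner-of-articulation slide, used as a stand-in because the exact seven categories are not listed here, and the two label sequences are made-up data.

```python
# Manner classes from the manner-of-articulation slide, used as broad categories.
MANNER_CLASSES = {
    "vowel": ["aa", "iy", "uw", "eh", "ow"],
    "approximant": ["l", "r", "w", "y"],
    "nasal": ["m", "n", "ng"],
    "stop": ["p", "t", "k", "b", "d", "g"],
    "fricative": ["f", "th", "s", "sh", "v", "dh", "z", "zh"],
    "affricate": ["ch", "jh"],
    "aspiration": ["h"],
    "flap": ["dx", "nx"],
}
MANNER = {p: name for name, phones in MANNER_CLASSES.items() for p in phones}


def percent_agreement(labels_a, labels_b, key=lambda label: label):
    """Percentage of positions where key(a) == key(b).

    Assumes the two sequences are already paired position by position.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(key(a) == key(b) for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Hypothetical labels from two transcribers (illustration only).
LABELER_A = ["s", "t", "aa", "p"]
LABELER_B = ["z", "t", "aa", "b"]

phoneme_level = percent_agreement(LABELER_A, LABELER_B)
broad_level = percent_agreement(LABELER_A, LABELER_B, key=lambda p: MANNER[p])
```

In this toy case the transcribers disagree on voicing twice, so phoneme-level agreement is 50% while broad-category agreement is 100%. The direction of the difference matches the 70% versus 89% figures, although the numbers themselves are unrelated.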
16 Phonetic Transcription
- Phonetic label boundary agreement between humans
- Agreement measured by comparing two manual labelings, A and B, and computing the percentage of cases in which B labels are within some threshold (20 msec) of A labels.
[Figure: agreement (%) versus threshold (msec)]
- Average agreement of 93.8% within 20 msec threshold
- Maximum agreement of 96% within 20 msec
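The boundary agreement measure described above can be sketched in a few lines. This version assumes both labelings contain the same boundaries in the same order, so they can be paired by index, and it treats "within" as inclusive. Both are assumptions, as are the function names and the example boundary times.

```python
def boundary_agreement(bounds_a, bounds_b, threshold_ms=20.0):
    """Percentage of B boundaries within threshold_ms of the paired A boundary."""
    if len(bounds_a) != len(bounds_b) or not bounds_a:
        raise ValueError("need two non-empty boundary lists of equal length")
    within = sum(abs(a - b) <= threshold_ms for a, b in zip(bounds_a, bounds_b))
    return 100.0 * within / len(bounds_a)


def agreement_curve(bounds_a, bounds_b, thresholds_ms):
    """Agreement (%) at each threshold, i.e. the data behind an agreement plot."""
    return [boundary_agreement(bounds_a, bounds_b, t) for t in thresholds_ms]


# Hypothetical boundary times in msec from two labelers (illustration only).
LABELS_A = [110, 180, 240, 280, 390]
LABELS_B = [105, 185, 265, 282, 391]
```

Here four of the five boundaries differ by 5 msec or less and one differs by 25 msec, so `boundary_agreement(LABELS_A, LABELS_B)` gives 80.0 at the 20 msec threshold and 100.0 once the threshold reaches 25 msec.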
17 Phonetic Transcription
- Is there a "correct" answer? No: transcription is inherently subjective, although semi-arbitrary guidelines can be imposed.
- Is measuring accuracy meaningless? No: phonemes do have identity and order, although details may be subjective.
- Sometimes very precise (if semi-arbitrary) labels and boundaries are extremely important (e.g., concatenative text-to-speech databases).
- What about getting a computer to generate transcriptions, or at least phonetic boundaries?
  - Advantages: consistent, fast
  - Disadvantages: not accurate compared to human transcription; not robust to different speakers, environments
18 Phonetic Transcription
- Automatic Phonetic Alignment (assume phonetic identity is known)
- Two common methods:
  - (1) Forced Alignment: Use an existing speech recognizer, constrained to recognize only the correct phoneme sequence. The search process used by HMM recognizers returns both phoneme identity and location. Location information is boundary information.
  - (2) Dynamic Time Warping: (a) Use text-to-speech or utterance templates to generate the same speech content with known boundaries. (b) Warp the time scale of the reference (TTS or template) with the input speech to minimize spectral error. (c) Convert the known boundary locations to the original time scale.
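Steps (b) and (c) of the Dynamic Time Warping method can be sketched as follows. This is a textbook DTW with unit step costs, written in plain Python. The Euclidean frame distance stands in for "spectral error", the one-dimensional frames are toy data, and all names are illustrative. A real system would use spectral feature vectors and usually slope constraints.

```python
import math


def euclidean(a, b):
    """Distance between two feature vectors (stand-in for spectral error)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_path(ref, inp, dist=euclidean):
    """Return the minimum-cost warping path as (ref_frame, input_frame) pairs."""
    n, m = len(ref), len(inp)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], inp[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the last frame pair to the first.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return path


def map_boundaries(path, ref_boundaries):
    """Map each reference boundary frame to the first input frame aligned to it."""
    first_match = {}
    for ref_frame, inp_frame in path:
        first_match.setdefault(ref_frame, inp_frame)
    return [first_match[r] for r in ref_boundaries]


# Toy data: the reference has segments starting at frames 2 and 4.
REF = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
INP = [[0.0], [0.0], [0.0], [1.0], [1.0], [1.0], [1.0], [2.0], [2.0]]
input_boundaries = map_boundaries(dtw_path(REF, INP), [2, 4])
```

In this example the input is a stretched copy of the reference, and the known reference boundaries at frames 2 and 4 map to input frames 3 and 7.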
19 Phonetic Transcription
- Accuracy of automatic alignment: speaker-independent alignment using Forced Alignment
[Figure: agreement (%) versus threshold (msec)]
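The idea behind forced alignment, a search constrained to one known phoneme sequence, can be shown with a small dynamic program. This is a simplified sketch, not an HMM recognizer: it has one state per phoneme and no transition probabilities, and it assumes per-frame log-likelihoods are already available. The function name and the scores are hypothetical.

```python
def forced_align(scores, phones):
    """Return the start frame of each phoneme in the known sequence.

    scores[t][p] is an assumed log-likelihood of frame t under phoneme p.
    Each phoneme must cover at least one frame, in the given order.
    """
    n_frames, n_phones = len(scores), len(phones)
    if n_phones == 0 or n_frames < n_phones:
        raise ValueError("need at least one frame per phoneme")
    neg = float("-inf")
    best = [[neg] * n_phones for _ in range(n_frames)]
    stayed = [[False] * n_phones for _ in range(n_frames)]
    best[0][0] = scores[0][phones[0]]
    for t in range(1, n_frames):
        for k in range(n_phones):
            same = best[t - 1][k]                       # stay in phoneme k
            prev = best[t - 1][k - 1] if k > 0 else neg  # enter from phoneme k-1
            if same == neg and prev == neg:
                continue  # this (frame, phoneme) pair is unreachable
            stayed[t][k] = same >= prev
            best[t][k] = max(same, prev) + scores[t][phones[k]]
    # Backtrack from the last frame of the last phoneme.
    starts = [0] * n_phones
    k = n_phones - 1
    for t in range(n_frames - 1, 0, -1):
        if not stayed[t][k]:
            starts[k] = t
            k -= 1
    return starts


def frame(favored):
    """Hypothetical scores: -1 for the favored phoneme, -5 for the others."""
    return {p: (-1.0 if p == favored else -5.0) for p in ("h", "eh", "l")}


SCORES = [frame(p) for p in ("h", "h", "eh", "eh", "eh", "l")]
starts = forced_align(SCORES, ["h", "eh", "l"])
```

For these toy scores the phonemes start at frames 0, 2, and 5. Because the phoneme sequence is fixed, the search returns only location information, which is the boundary information needed for alignment.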
20 Phonetic Transcription
Comparing manual and automatic alignment of TIMIT corpus:
- Automatic method still makes "stupid" mistakes.
- Manual labeling criteria are not rigorously defined.
- Performance degrades significantly in the presence of noise.
- Assumes the correct phonetic sequence is known.