CSE 551: - PowerPoint PPT Presentation


Title: PowerPoint Presentation Author: John Paul Hosom Last modified by: jph Created Date: 9/10/2001 2:07:35 AM Document presentation format: On-screen Show – PowerPoint PPT presentation




CSE 551: Structure of Spoken Language
Lecture 6: Characteristics of Place of Articulation; Phonetic Transcription
John-Paul Hosom, Fall 2004
Acoustic-Phonetic Features: Manner of Articulation
Approximately 8 manners of articulation:

Name          Sub-Types            Examples
Vowel         vowel, diphthong     aa, iy, uw, eh, ow, ...
Approximant   liquid, glide        l, r, w, y
Nasal                              m, n, ng
Stop          unvoiced, voiced     p, t, k, b, d, g
Fricative     unvoiced, voiced     f, th, s, sh, v, dh, z, zh
Affricate     unvoiced, voiced     ch, jh
Aspiration                         h
Flap                               dx, nx

Change in manner of articulation is usually abrupt and
visible; manner provides much information about the
location of phonemes.
Acoustic-Phonetic Features: Place of Articulation
Approximately 8 places of articulation for consonants:

Name              Examples
Labial            p, b, m, (w)
Labio-Dental      f, v
Dental            th, dh
Alveolar          t, d, s, z, n, l
Palato-Alveolar   sh, zh, ch, jh, r
Palatal           y
Velar             k, g, ng, (w)
Glottal           h

The affricates /ch/ and /jh/ may start as alveolar
(/t/, /d/) followed by palato-alveolar.
/r/ is really a retroflex, and has a complex place of
articulation.
Place of articulation is more subject to coarticulation
than manner; the F2 trajectory is important for
identifying place of articulation.
  • Acoustic-Phonetic Features Place of Articulation
  • Labial (/p/, /b/, /m/, /w/)
  • constriction (or complete closure) at lips
  • the only unvoiced labial is /p/
  • the only nasal labial is /m/
  • characterized by F1, F2, (even) F3 of adjacent
    vowel(s) rapidly and briefly decreasing at border
    with labial

  • Acoustic-Phonetic Features Place of Articulation
  • Labio-Dental (/f/, /v/)
  • produced by constriction between lower lip and
    upper teeth
  • only fricatives are labio-dental in English
  • can be characterized by rising formants into
    adjacent vowels (similar to characteristics of
  • Dental (/th/, /dh/)
  • produced by constriction between tongue tip and
    upper teeth (sometimes tongue tip is closer to
    alveolar ridge)
  • only fricatives are dental in English
  • may be characterized by stronger energy above 6
    kHz, but weaker than /sh/, /zh/ fricatives

  • Acoustic-Phonetic Features Place of Articulation
  • Alveolar (/t/, /d/, /s/, /z/, /n/, /l/)
  • tongue tip is at or near alveolar ridge
  • a large number of English consonants are alveolar
  • primary cue to alveolars: F2 of neighboring
    vowel(s) is around 1800 Hz, except for /l/
  • /l/ has low F1 (? 500 Hz) and F2 (? 1000 Hz),
    high F3

  • Acoustic-Phonetic Features Place of Articulation
  • Palato-Alveolar (/sh/, /zh/, /ch/, /jh/, /r/)
  • tongue is between alveolar ridge and hard palate
  • 2 fricatives, 2 affricates, 1 retroflex
  • retroflex has depression midway along tongue
  • the palato-alveolar fricatives tend to have
    strong energy due to weak constriction allowing
    large airflow
  • /r/ (and /er/) most easily identified by F3 below
    2000 Hz
  • Palatal (/y/)
  • produced with tongue close to hard palate
  • extreme production of /iy/
  • F1-F2 tend to be more spread than /iy/, F1 is
    lower than /iy/

  • Acoustic-Phonetic Features Place of Articulation
  • Velar (/k/, /g/, /ng/)
  • produced with constriction against velum (soft
    palate)
  • only plosives /k/ and /g/, and nasal /ng/
  • characteristic of velars is the velar pinch, in
    which F2 and F3 of neighboring vowel become very
    close at boundary with velar. More visible in
    front vowel /ih/

  • Acoustic-Phonetic Features Place of Articulation
  • Glottal (/h/)
  • /h/ is the nominal glottal phoneme in English;
    in reality, the tongue can be in any vowel-like
    position
  • the primary cue for /h/ is formant structure
    without voicing, an energy dip, and/or an
    increase in aspiration noise in higher
    frequencies

  • Distinctive Phonetic Features Summary
  • Distinctive features may be used to categorize
    phonetic sub-classes and show relationships
    between phonemes
  • There is often not a one-to-one correspondence
    between a feature value and a particular trait in
    the speech signal
  • A variety of context-dependent and
    context-independent cues (sometimes conflicting,
    sometimes complementary) serve to identify
  • Speech is highly variable, highly
    context-dependent, and cues to phonemic identity
    are spread in both the spectral and time domains.
    The diffusion of features makes automatic speech
    recognition difficult, but human speech
    recognition is able to use this diffusion
    for robustness.

  • Redundancy
  • Distinctive features are not always independent;
    some redundancy may be implied (especially
    binary features)
  • Example: Spanish

        i   e   a   o   u
High    +   -   -   -   +
Low     -   -   +   -   -
Back    -   -   +   +   +
Round   -   -   -   +   +

+high → -low       +low → -high
-back → -round     +round → +back
+low → +back       +low → -round
-back → -low       +round → -low

These relationships are language and feature-set
specific. (from Schane, p. 35-38)
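
The implications in the Spanish example can be checked mechanically. The sketch below is illustrative only: it writes out the five-vowel feature table in full (True for +, False for -, as read from the slide) and tests each implication against it. The names FEATURES, RULES, rule_holds and violated_rules are hypothetical helpers, not part of the lecture.

```python
# Spanish vowel features as read from the slide (True = +, False = -).
FEATURES = {
    "i": {"high": True,  "low": False, "back": False, "round": False},
    "e": {"high": False, "low": False, "back": False, "round": False},
    "a": {"high": False, "low": True,  "back": True,  "round": False},
    "o": {"high": False, "low": False, "back": True,  "round": True},
    "u": {"high": True,  "low": False, "back": True,  "round": True},
}

# Each rule reads: (feature, value) implies (feature, value).
RULES = [
    (("high", True),   ("low", False)),
    (("low", True),    ("high", False)),
    (("back", False),  ("round", False)),
    (("round", True),  ("back", True)),
    (("low", True),    ("back", True)),
    (("low", True),    ("round", False)),
    (("back", False),  ("low", False)),
    (("round", True),  ("low", False)),
]


def rule_holds(rule, table=FEATURES):
    """Return True if every vowel matching the premise also matches the conclusion."""
    (premise_feat, premise_val), (concl_feat, concl_val) = rule
    return all(
        feats[concl_feat] == concl_val
        for feats in table.values()
        if feats[premise_feat] == premise_val
    )


def violated_rules(rules=RULES, table=FEATURES):
    """Return the rules that at least one vowel in the table contradicts."""
    return [rule for rule in rules if not rule_holds(rule, table)]
```

Calling violated_rules() on this table returns an empty list, while a non-rule such as "+back implies +round" fails because /a/ is back but unrounded. A different language or feature set would need its own table and rule list.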
  • Redundancy
  • Redundant information can be indicated by
    circling redundant features

        i   e   a   o   u
High    +   -   -   -   +
Low     -   -   +   -   -
Back    -   -   +   +   +
Round   -   -   -   +   +

  • Some redundancies are universal (can't be high
    and low)
  • Phonetic sequences also have constraints
    (redundant info.)
  • English has no more than 3 word-initial
    consonants; in this case, the first consonant
    is always /s/, the next is always /p/, /t/, or
    /k/, and the third is always /r/ or /l/ (from
    Schane, p.
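
The word-initial constraint can be written as a small predicate. This is a sketch of the rule exactly as the slide states it (/s/, then /p/, /t/ or /k/, then /r/ or /l/); the function name is hypothetical, and the rule is a simplification rather than a full account of English phonotactics.

```python
def matches_three_consonant_onset(onset):
    """Check a 3-consonant word-initial cluster against the rule as stated on the slide.

    `onset` is a sequence of three phoneme symbols, e.g. ("s", "t", "r").
    """
    if len(onset) != 3:
        return False
    first, second, third = onset
    return first == "s" and second in {"p", "t", "k"} and third in {"r", "l"}
```

For example, ("s", "t", "r") as in "street" passes, while ("s", "b", "r") does not. Passing the check is only a necessary condition under this rule; it does not guarantee that the cluster actually occurs in English.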

Phonetic Transcription
Given a corpus of speech data, it's often necessary to
create a transcription:
  • word level
  • phoneme level
  • time-aligned phoneme level
  • time-aligned detailed phoneme level (with
    diacritics)
  • other information: phonetic stress, emotion,
    syntax, repair
Most common are word-level and time-aligned phoneme
level.
Time-aligned phonetic transcription example:
0 110 .pau
110 180 h
180 240 eh
240 280 l
280 390 ow
390 540 .pau
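
A time-aligned transcription in this "start end label" layout is easy to read programmatically. The sketch below assumes one segment per line and that the times are in milliseconds (the slide does not state the unit); Segment, parse_labels and is_contiguous are hypothetical names.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start_ms: int   # assumed to be milliseconds; the slide gives no unit
    end_ms: int
    label: str


def parse_labels(text: str) -> List[Segment]:
    """Parse lines of the form 'start end label' into Segment tuples."""
    segments = []
    for line in text.strip().splitlines():
        start, end, label = line.split()
        segments.append(Segment(int(start), int(end), label))
    return segments


def is_contiguous(segments: List[Segment]) -> bool:
    """True if each segment starts exactly where the previous one ends."""
    return all(a.end_ms == b.start_ms for a, b in zip(segments, segments[1:]))


EXAMPLE = """\
0 110 .pau
110 180 h
180 240 eh
240 280 l
280 390 ow
390 540 .pau
"""
```

Parsing EXAMPLE yields six segments, and is_contiguous confirms that the boundaries tile the utterance without gaps or overlaps, which is a useful sanity check before comparing two labelings.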
Phonetic Transcription
Are phonemes precise quantities with exact boundaries?
No: humans disagree on phonetic labels and boundary
positions; disagreement may be a matter of
interpretation of the utterance.
Phonetic label agreement between humans (%):

            Full Labels   Base Labels   Broad Categories
English     70            71            89
German      61            65            81
Mandarin    66            78            87
Spanish     74            82            90

Full, Base Label Set sizes: 55 (English), 62 (German),
50 (Mandarin), 42 (Spanish)
Broad Categories: 7, corresponding to manner of
articulation
From Cole, Oshika, et al.,
  • Phonetic Transcription
  • 70% agreement on 55 phonemes, 90% agreement on 7
  • Best phoneme-level automatic speech recognition
    results on TIMIT,
  • with a 39-phoneme symbol set: 75.8% (Antoniou and
  • Differences:
  • Human agreement evaluated on spontaneous speech
    (stories); TIMIT is read speech
  • Humans used 55 phonemes; 39 phonemes for
    evaluating TIMIT
  • Phoneme agreement doesn't translate into word
  • human word accuracy is typically an order of
    magnitude better than the best automatic speech
    recognition system.

Phonetic Transcription
Phonetic label boundary agreement between humans
Agreement is measured by comparing two manual
labelings, A and B, and computing the percentage of
cases in which B labels are within some threshold
(20 msec) of A.
[Figure: agreement (%) vs. threshold (msec)]
Average agreement of 93.8% within 20 msec threshold
Maximum agreement of 96% within 20 msec
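
The boundary-agreement measure described here can be sketched in a few lines. This version assumes both labelings contain the same phoneme sequence, so that boundaries correspond one-to-one, and it treats "within" as less than or equal to the threshold; both are assumptions, and boundary_agreement is a hypothetical name.

```python
def boundary_agreement(a_ms, b_ms, threshold_ms=20.0):
    """Percentage of B boundaries within threshold_ms of the matching A boundary.

    a_ms and b_ms are equal-length sequences of boundary times in msec,
    assumed to refer to the same phoneme boundaries in the same order.
    """
    if len(a_ms) != len(b_ms):
        raise ValueError("labelings must have the same number of boundaries")
    if not a_ms:
        raise ValueError("at least one boundary is required")
    hits = sum(1 for a, b in zip(a_ms, b_ms) if abs(a - b) <= threshold_ms)
    return 100.0 * hits / len(a_ms)
```

With A = [110, 180, 240, 280, 390] and B = [112, 175, 265, 281, 400], four of the five boundaries fall within 20 msec, giving 80% agreement. Sweeping threshold_ms over a range of values produces an agreement-versus-threshold curve of the kind the slide plots.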
Phonetic Transcription
Is there a correct answer?
No: inherently subjective, although semi-arbitrary
guidelines can be imposed.
Is measuring accuracy meaningless?
No: phonemes do have identity and order, although
details may be subjective. Sometimes very precise (if
semi-arbitrary) labels and boundaries are extremely
important (e.g. concatenative text-to-speech
databases).
What about getting a computer to generate
transcriptions?
Advantages: consistent, fast
Disadvantages: not accurate, compared to human
transcription; not robust to different speakers,
environments
  • Phonetic Transcription
  • Automatic Phonetic Alignment (assume phonetic
    identity is known)
  • Two common methods:
  • (1) Forced Alignment: Use existing speech
    recognizer, constrained to recognize only the
    correct phoneme sequence. The search
    process used by HMM recognizers returns both
    phoneme identity and location. Location
    information is boundary information.
  • (2) Dynamic Time Warping: (a) Use
    text-to-speech or utterance templates to
    generate same speech content with known
    boundaries. (b) Warp time scale of reference
    (TTS or template) with input speech to
    minimize spectral error. (c) Convert known
    locations to original time scale.
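
Steps (b) and (c) of the dynamic time warping method can be sketched as follows. This is a minimal textbook DTW, not the lecture's implementation: it assumes each utterance is a list of per-frame feature vectors, uses Euclidean distance as the "spectral error", and allows only the three basic step directions. The names dtw_path and map_boundaries are hypothetical.

```python
import math


def dtw_path(ref, inp):
    """Align two sequences of feature vectors; return a list of (ref_idx, inp_idx)."""
    n, m = len(ref), len(inp)
    inf = float("inf")
    # cost[i][j] = minimum accumulated distance aligning ref[:i] with inp[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(ref[i - 1], inp[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def map_boundaries(path, ref_boundaries):
    """Map reference boundary frames to the first input frame aligned with each."""
    first_match = {}
    for ref_idx, inp_idx in path:
        first_match.setdefault(ref_idx, inp_idx)
    return [first_match[b] for b in ref_boundaries]
```

Here a boundary is the index of the first frame of a new segment in the reference. Multiplying the mapped frame indices by the frame step (for instance 10 msec, an assumed value) gives boundary times on the input's time scale. Practical systems usually add slope constraints and better spectral features, which this sketch omits.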

Phonetic Transcription
Accuracy of automatic alignment: speaker-independent
alignment using Forced Alignment
[Figure: agreement (%) vs. threshold (msec)]
Phonetic Transcription
Comparing manual and automatic alignment of TIMIT
corpus:
  • Automatic method still makes stupid mistakes.
  • Manual labeling criteria not rigorously defined.
  • Performance degrades significantly in presence
    of noise.
  • Assumes correct phonetic sequence is known