Class 2 September 5 - PowerPoint PPT Presentation

Provided by: ogar1
1
Class 2 (September 5)
  • The mechanisms of speech production
  • The vocal tract, from the glottis to the lips and
    nose.
  • The units of speech
  • Linguistic representation of speech: phonemes and
    phones of the language.
  • Phones of the English language and their
    classification (vowels and consonants, glides and
    liquids, nasals, fricatives, plosives). Voiced
    and unvoiced phones.
  • Place of articulation and manner of articulation.
  • Acoustic phonetics.
  • The tube model of the vocal tract and the wave
    equation for resonances.
  • Radiation at the lips of the acoustic waves.
  • Some aspects of the TIMIT database and others.

2
Speech production: the vocal tract, from the
glottis to the lips and nose
  • The vocal folds (there are no "chords") are where
    the fundamental component of the speech wave
    (F0) originates; it is perceived as pitch by
    the human ear (pitch is a perceptual measure).
  • It originates under air pressure from the lungs
    and tension (folding) of the vocal folds, which
    surround the glottis; the resulting waveform is
    called the glottal pulse (notice that it is not
    sinusoidal!).
  • The vibrations of the main component are
    modulated via the articulators (place of
    articulation) in the rest of the vocal tract:
    the lips and sometimes the teeth; the tongue;
    the nose (by closing the lips); the alveolus
    (the ridge behind and above the front teeth);
    the palate (the roof of the mouth); the velum
    (the soft palate); or combinations of these.
  • As the pressure wave travels along the (soft
    and nonlinear) vocal tract and is articulated,
    resonances occur that reinforce some of the
    harmonics (integer multiples of F0) already
    present in the glottal pulse, differently for
    each of the sounds.
  • The sum of the fundamental and its harmonics
    gives rise to a non-uniform frequency/amplitude
    spectrum whose characteristic peaks, at
    increasing frequencies, are the formants F1, F2,
    F3 and F4.
  • There are sounds in which the vocal fold
    excitation does not occur and they are called
    unvoiced sounds (some fricatives, affricates and
    stops)

3
Oscilloscope trace of "the goo" in the time
domain (100 ms for each of the three parts:
"the", "g", "oo")
A slight variation in F0?
4
Glottis and Vocal Folds
Figure: six views of the larynx during a voicing
cycle. Labels: ventricular (false) folds, vocal
folds, glottis (the aperture), epiglottis.
Figure: glottal pulses, with the position of the
glottis marked.
5
My Glottis and Vocal Folds
6
Articulators
  • Tongue
  • Vocal folds (excitation; the others act as the
    filter)
  • Lips
  • Teeth
  • Velum
  • Jaw

7
On one aspect of speech analysis
  • The appearance of a spectrogram of a sampled
    speech signal s(n) depends significantly on the
    width of the analysis window (at 8 kHz sampling
    we take a sample every 0.125 ms). Typical windows
    (we will see different types later) are about
    10-20 ms wide. However, we can trade frequency
    resolution for time resolution: a shorter window
    resolves time better and frequency worse. Compare
    the wideband (short window: 300 Hz, 10 ms) with
    the narrowband (long window: 45 Hz, 20 ms)
    spectrograms.
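The arithmetic behind the window trade-off can be sketched as follows. This is only an illustration: the function name `window_stats` is hypothetical, and the DFT bin spacing it prints (sampling rate divided by window length in samples) is a simple proxy for frequency resolution, not the analysis bandwidths quoted on the slide, which also depend on the window shape.

```python
def window_stats(fs_hz, window_ms):
    """Return (samples per window, DFT bin spacing in Hz) for one analysis window."""
    n_samples = round(fs_hz * window_ms / 1000.0)
    bin_spacing_hz = fs_hz / n_samples  # equals 1 / window duration
    return n_samples, bin_spacing_hz

FS_HZ = 8000                        # sampling rate quoted on the slide
sample_period_ms = 1000.0 / FS_HZ   # time between consecutive samples

print(f"sample period: {sample_period_ms} ms")
for window_ms in (10, 20):          # typical window widths from the slide
    n, df = window_stats(FS_HZ, window_ms)
    print(f"{window_ms} ms window: {n} samples, bin spacing {df:.0f} Hz")
```

Doubling the window length halves the bin spacing, which is the sense in which time resolution is exchanged for frequency resolution.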

8
When we let the vocal folds vibrate (as in a
sustained vowel sound) without changing F0
(perceived as pitch), we get a spectrum (an
amplitude/frequency distribution) at the
glottis similar to the following diagram, since
the glottal pulses contain multiples of F0
(harmonics). After those frequencies go
through the vocal tract (above the glottis) they
are filtered by H(t), whose resonances give rise
to F1, F2 and F3.
Figure labels: glottis; H(t), the vocal tract
filter function; output from the lips.
9
Spectrograms (3D) of voiced phones and formants
(2D) of some vowels (unrestricted air flow,
unlike consonants)
Identify F1, F2 and F3. What happened to F0?
What are the resonances (poles) and the
anti-resonances (zeros)? How do they originate?
10
Spectrograms of English diphthongs
11
Units of Speech
  • Speech is a continuous wave of air pressure that
    can be visualized with a microphone (plus a
    suitable amplifier) and an oscilloscope. It is
    best analyzed by breaking a sentence into smaller
    units. We must consider the continuity between
    units: coarticulation (the transition from one
    unit to another).
  • Desirable characteristics of a speech unit for
    the purpose of recognition (Kai-Fu Lee's SPHINX):
    1) Sensitivity (accounting for co-articulatory
    effects); 2) Trainability (not requiring an
    unreasonable number of training samples); 3)
    Sharability (can be shared among larger units of
    speech).
  • Using units of speech that are too general
    increases sharability at the cost of sensitivity.

12
Possible units of speech for HMM recognition
  • Words: particularly good for small vocabularies;
    however, a 20,000-word vocabulary would
    require some 400,000 training examples and 1 GB
    of memory.
  • Syllables: each consists of a nucleus (vowel or
    diphthong) plus surrounding consonants (there are
    some 20,000 syllables in English).
  • Demisyllables: syllables cut in half at the
    middle of the vowel (no co-articulation). Their
    number is reduced to about 1,000.
  • Triphones, or context-dependent phones: three
    phones (see below), with the middle one being
    modified by its neighbors. Fewer than 50^3: why?
    However, there are problems of excessive memory
    and a large number of units for natural English.
    In a 1,000-word vocabulary (SPHINX) there were
    some 2,381 triphone contexts using 24 MB. They
    had to be clustered.
  • Diphones: fewer than 50^2: why? If beginning and
    end transitions are matched they can be used for
    speech synthesis.
  • Phones: see the discussion that follows on phones
    and phonemes.
  • A 1/60 second window has been used successfully
    in audio-visual speech-driven simulation.
13
Desirable characteristics of a speech unit
  • Sensitivity: accounts for co-articulatory effects.
  • Trainability: important when we consider the
    size of the training set (abundance of the unit).
  • Sharability: when the unit is present in
    different samples of the training set it is more
    abundant and easier to train.

14
Phones and phonemes
  • If we imagine segments of speech as points
    plotted in a high-dimensional space and
    clustered according to their perceived
    differentiability along a good number of
    features, we can then call the (red) centroid of
    a differentiable cluster a phoneme, and the
    points around it allophones or just phones.

A phoneme is an abstract linguistic unit and the
smallest contrastive unit in the language. Phones
are specific instances of phonemes, which vary
because of place of articulation, co-articulation,
individual variations among speakers, etc.
Disclaimer: this is an idealized picture,
and usually there is no such clear
differentiation or contrast.
15
Phonemes as defined by the IPA
  • There are some 40 (some say 50) phonemes in the
    English language. They may be classified as
    voiced or unvoiced (according to the intervention
    of glottal pulses) and according to the manner
    of articulation (vowels: 13, diphthongs: 3,
    glides: 2, liquids: 2, nasals: 3, fricatives: 9,
    stops: 6, affricates: 2). The place of
    articulation mentioned is the most common one for
    the centroid and is not unique for a phone
    corresponding to a given phoneme. See and study
    the next table.

16
Acoustic phonetics (IPA)
Table: IPA phoneme chart, including 13 vowels and
3 diphthongs (example word: bought).
17
Because the IPA symbols are typographically
demanding, the SPHINX system used the following
36 more easily representable symbols to refer to
the phonemes used.
See if you can identify them in the chart of the
previous page
18
A model of the vocal tract
  • The production of speech may be modeled with two
    excitation sources and a transfer function (Ω is
    the angular frequency and l the length of the
    tube under consideration), as shown.

Figure labels: periodic wave generator, switched in
for voiced phonemes; random noise generator,
switched in for unvoiced phonemes; input U_G(Ω);
output U(l, Ω).
19
Implementation of the vocal tract transfer
function model
  • It is interesting to study the vocal tract
    transfer function models, but each is good only
    for one phoneme or a group of phonemes and does
    not apply to continuous speech (a sequence of
    co-articulated phones) unless the models are
    morphed one into another as needed. The text has
    two models: one is mechanical, based on the
    acoustics of air flow in pipes, and the other is
    electrical, in the form of discrete transmission
    lines, but neither is applicable to all phonemes.
    We will study the former.

20
Rigid straight uniform tube model: ¼ wavelength
(approximates the schwa, /ə/)
  • The vocal tract transfer function as derived in
    section 3.5.2.2 of O'Shaughnessy is

H(Ω) = U(l, Ω) / U_G(Ω) = 1 / cos(Ωl / c)

where Ω is the radian frequency, l the length of
the tube and c the speed of sound (340 m/s).
Assuming a length of 17 cm, where are the poles
of this transfer function, and at what frequencies
do they show in the output? (Questions like
this may show up in exams!) Other phonemes may be
simulated by concatenating tubes of different
diameters, as the constrictions affect the formant
locations.
Figure labels: ¼ wave (why?); ½ wave.
What would we have to do to the tube model to
produce continuous speech? How does the straight
model violate the actual shape of the vocal tract?
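As a numerical check on the quarter-wavelength idea, the sketch below assumes the standard lossless result for a uniform tube closed at the glottis and open at the lips, whose transfer function has the form 1 / cos(Ωl / c). Its poles fall where the cosine vanishes, that is at odd multiples of c / (4l). The function name `uniform_tube_resonances` is hypothetical.

```python
import math

def uniform_tube_resonances(length_m, c_m_per_s=340.0, count=3):
    """Resonance frequencies (Hz) of a uniform tube closed at one end:
    f_k = (2k - 1) * c / (4 * l), for k = 1..count."""
    return [(2 * k - 1) * c_m_per_s / (4.0 * length_m)
            for k in range(1, count + 1)]

LENGTH_M = 0.17  # 17 cm vocal tract, as on the slide
formants = uniform_tube_resonances(LENGTH_M)
for k, f in enumerate(formants, start=1):
    # At a pole the denominator cos(omega * l / c) should be (numerically) zero.
    denom = math.cos(2 * math.pi * f * LENGTH_M / 340.0)
    print(f"F{k} ~ {f:.0f} Hz, cos(omega*l/c) = {denom:.1e}")
```

With these values the resonances land near 500, 1500 and 2500 Hz, evenly spaced by c / (2l), which is the pattern usually associated with a neutral, schwa-like vowel.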
21
Radiation of the acoustic waves at the lips
  • Inside the vocal tract (inside the model tubes)
    we assume that the wavelength of the acoustic
    wave (which depends on frequency, of course) is
    several times larger than the diameter of the
    vocal tract (2 cm) and that the wave is a plane
    pressure wave. What is the maximum frequency for
    which you could use this assumption? At the lips,
    however, the situation is different: they act as
    a radiating antenna, and we have spherical
    pressure waves propagating almost isotropically.
    There is of course a mismatch between the
    characteristic impedance of the vocal tract and
    that of free air. While some vowels can be
    modeled with two tubes, most require three due
    to tongue constrictions, and if we include the
    rounding of the lips a fourth short tube is
    needed. The impedance at the lips may be modeled
    as a high-pass filter (6 kHz). There is similar
    radiation at the nostrils.
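The plane-wave question comes down to wavelength = c / f. A rough sketch follows; the slide does not say how many times larger than the diameter the wavelength must be, so the `ratio` parameter (and the function name) are assumptions chosen for illustration.

```python
def max_plane_wave_freq(diameter_m, ratio, c_m_per_s=340.0):
    """Highest frequency (Hz) whose wavelength is still `ratio` times the diameter."""
    min_wavelength_m = ratio * diameter_m
    return c_m_per_s / min_wavelength_m

DIAMETER_M = 0.02  # 2 cm vocal tract diameter, as on the slide
for ratio in (2, 4, 8):  # assumed readings of "several times larger"
    f_max = max_plane_wave_freq(DIAMETER_M, ratio)
    print(f"wavelength >= {ratio} x diameter  ->  f <= {f_max:.0f} Hz")
```

For example, requiring the wavelength to be at least four diameters gives a limit of about 4 kHz; a stricter ratio lowers the limit proportionally.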

22
Practical models for speech analysis and synthesis
  • The tube models are good for studying the
    relationship between the (stationary) sound
    produced and the articulators, but are not
    practical for continuous speech synthesis. Two
    models stand out as practical:
  • The articulatory model, which is based on the
    vocal tract shape, and
  • The terminal-analog model, which is based
    primarily on the behavior of the output speech
    signal and only secondarily on articulation.

23
The Articulatory Model
  • This model treats the whole vocal tract as a
    variable number (up to 12) of connected lossless
    cylinders of different lengths and
    cross-sections. We approach the speech signal in
    this model from a time-domain view of a wave
    travelling in the tube. The impedance mismatches
    that the wave meets as it travels down the tubes
    generate reflected waves travelling in the
    opposite direction. This can be analyzed with
    partial differential equations relating time and
    space, as done in section 3.6.1 of the text. The
    transfer function can be obtained via a pole/zero
    analysis (poles = formants; zeros: only a trivial
    one), as mentioned before, and can be modeled as
    a discrete-time digital model with time delays
    that are multiples of some unit of time t. The
    reflected waves also arrive at discrete
    multiples of t. This model is most useful in
    coding the speech signal.
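The reflections at the junctions between cylinders can be illustrated with the usual lossless-tube reflection coefficient, r = (A_next - A_cur) / (A_next + A_cur), where A is the cross-sectional area. This is a sketch under assumptions: the sign convention differs between texts (pressure versus volume-velocity waves), and the area values below are hypothetical, not taken from the slides.

```python
def reflection_coefficients(areas):
    """Reflection coefficient at each junction between adjacent tube sections:
    r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k)."""
    return [(nxt - cur) / (nxt + cur) for cur, nxt in zip(areas, areas[1:])]

# Hypothetical cross-sections in cm^2, ordered from glottis to lips.
areas_cm2 = [2.0, 2.0, 1.0, 4.0]
junction_r = reflection_coefficients(areas_cm2)
for k, r in enumerate(junction_r, start=1):
    print(f"junction {k}: r = {r:+.3f}")

# A uniform tube has no impedance mismatch, hence no internal reflections.
uniform_r = reflection_coefficients([3.0, 3.0, 3.0])
```

Equal neighbouring areas give r = 0 (no reflection), while a large change in area pushes |r| toward 1, which is how constrictions shape the resonances.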

24
Terminal-Analog Model
  • This model tries to reproduce, or often code,
    the speech signal practically and accurately,
    based on a vocal tract transfer function H(z)
    and a radiation function R(z) excited by either
    a voiced or an unvoiced source (the mutual
    exclusion in this model is detrimental to the
    reproduction of voiced fricatives). It uses the
    linear predictive coding (LPC) formula to
    predict the next output, as suggested in this
    diagram from the text.
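The core of linear prediction is that the next sample is estimated as a weighted sum of previous samples, s_hat(n) = a_1 s(n-1) + ... + a_p s(n-p). The sketch below is not the text's diagram: the coefficients are chosen by hand for a pure sinusoid (where a two-tap predictor is exact) rather than estimated from speech, and the helper name `predict_next` is hypothetical.

```python
import math

def predict_next(history, coeffs):
    """Linear prediction: s_hat(n) = sum over k of a_k * s(n - k), k = 1..p."""
    return sum(a * history[-k] for k, a in enumerate(coeffs, start=1))

FS_HZ = 8000
FREQ_HZ = 500
omega = 2 * math.pi * FREQ_HZ / FS_HZ
signal = [math.sin(omega * n) for n in range(50)]

# For a sinusoid, s(n) = 2*cos(omega)*s(n-1) - s(n-2) holds exactly,
# so these two hand-picked coefficients predict it perfectly.
lpc_coeffs = [2 * math.cos(omega), -1.0]

errors = [signal[n] - predict_next(signal[:n], lpc_coeffs)
          for n in range(2, len(signal))]
max_error = max(abs(e) for e in errors)
print(f"largest prediction error: {max_error:.2e}")
```

In a real coder the coefficients would be estimated per frame from the speech itself, and the small residual left after prediction is what gets modeled by the voiced or unvoiced excitation.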

25
Co-articulation
  • The articulatory motions of a phone are strongly
    influenced by the phone that precedes it and the
    phone that follows it, that is, its context.
    Co-articulation may extend across syllabic and
    syntactic boundaries. In general, a phone's
    articulatory period exceeds its acoustic period.
    The result is that classical steady state
    positions for many phonemes are not often
    achieved in normal speech.

26
Some aspects of the TIMIT database
  • See (for specific details)
    http://www.ldc.upenn.edu/Catalog/readme_files/timit.readme.html
  • The TIMIT corpus of read speech is designed to
    provide speech data for acoustic-phonetic studies
    and for the development and evaluation of
    automatic speech recognition systems. TIMIT
    contains broadband recordings of 630 speakers of
    eight major dialects of American English, each
    reading ten phonetically rich sentences. The
    TIMIT corpus includes time-aligned orthographic,
    phonetic and word transcriptions as well as a
    16-bit, 16 kHz speech waveform file for each
    utterance. Corpus design was a joint effort among
    the Massachusetts Institute of Technology (MIT),
    SRI International (SRI) and Texas Instruments,
    Inc. (TI). The speech was recorded at TI,
    transcribed at MIT and verified and prepared for
    CD-ROM production by the National Institute of
    Standards and Technology (NIST). From the
    Linguistic Data Consortium (LDC) web site.
  • This corpus is extremely useful in assessing the
    training properties of learning systems for
    speech (such as HMMs and others).

27
Oscilloscope traces, spectrograms (of vowels) and
formants summary
Figure labels: 100 ms segments; 3 kHz.
Using the table of page 12, guess at the most
likely phonemes of "the goo".
28
Homework 1
  • a) Decompose a unit-amplitude half sine wave at
    500 Hz into its first three Fourier components
    (fundamental and first two harmonics).
  • b) How is this similar to a glottal pulse? How
    is it different?
  • Problem P3.3 in the text, part a) only.
  • Relate the corresponding representations of
    phonemes in slides 16 and 17.