Title: Class 2 (September 5)
1. Class 2 (September 5)
- The mechanisms of speech production
- The vocal tract, from the glottis to the lips and nose
- The units of speech
- Linguistic representation of speech: phonemes and phones of the language
- Phones of the English language and their classification (vowels and consonants, glides and liquids, nasals, fricatives, plosives); voiced and unvoiced phones
- Place of articulation and mode of articulation
- Acoustic phonetics
- The tube model of the vocal tract and the wave equation for resonances
- Radiation of the acoustic waves at the lips
- Some aspects of the TIMIT database and others
2. Speech production: the vocal tract, from the glottis to the lips and nose
- The vocal folds (there are no "cords") are where the fundamental component of the speech wave originates (F0), which is perceived as pitch by the human ear (a perceptual measure).
- It originates under air pressure from the lungs and tension (folding) of the vocal folds located in the glottis; it is called the glottal pulse (notice that it is not sinusoidal!).
- The vibrations of the main component are modulated by the articulators (place of articulation) in the rest of the vocal tract: the lips and sometimes the teeth, the tongue, the nose (by closing the lips), the alveolus (the ridge behind and above the front teeth), the palate (the roof of the mouth), the velum (the soft palate), or combinations of these.
- As the pressure wave at F0 travels down the (soft and nonlinear) vocal tract (voicing) and is articulated, resonances occur, giving rise to additive multiples of the fundamental (its harmonics) for each of the sounds.
- The sum of the fundamental and its harmonics gives rise to a non-uniform frequency/amplitude spectrum which peaks characteristically at increasing frequencies: the formants F1, F2, F3 and F4.
- There are sounds in which the vocal-fold excitation does not occur; they are called unvoiced sounds (some fricatives, affricates and stops).
3. Oscilloscope trace of "the goo" in the time domain (100 ms for each of the three parts)
[Figure: traces of the segments "the", "g" and "oo"; annotation: "A slight variation in F0?"]
4. Six views of the larynx during a voicing cycle
[Figure: glottis and vocal folds, with labels: ventricular or "false" folds, vocal folds, glottis (the aperture), epiglottis; below, the glottal pulses]
5. My Glottis and Vocal Folds
6. Articulators
- Tongue
- Vocal folds (excitation; the others filter)
- Lips
- Teeth
- Velum
- Jaw
7. On one aspect of speech analysis
- The appearance of a spectrogram of a sampled speech signal s(n) depends significantly on the window width used in the analysis (at 8 kHz sampling we sample every 0.125 ms). Typically used windows (we will see the different types later) are about 10-20 ms wide. However, we can trade frequency resolution for time resolution: compare the wideband (short window: 300 Hz, 10 ms) with the narrowband (45 Hz, 20 ms) spectrograms.
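The following is a minimal sketch of this trade-off (the toy signal and parameters are assumed, not from the slides): the same 100 Hz pulse train analyzed with a ~10 ms window (wideband: fine in time, coarse in frequency) and a ~20 ms window (narrowband: fine in frequency, coarse in time).

import numpy as np
from scipy.signal import spectrogram

fs = 8000                                      # 8 kHz sampling, 0.125 ms/sample
n = np.arange(fs)                              # one second of signal
s = (n % (fs // 100) == 0).astype(float)       # toy 100 Hz "glottal" pulse train

for name, width_ms in [("wideband", 10), ("narrowband", 20)]:
    nperseg = int(fs * width_ms / 1000)        # window length in samples
    f, t, Sxx = spectrogram(s, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
    # Frequency resolution is roughly fs / nperseg: 100 Hz for the 10 ms
    # window (harmonics smear together, pitch pulses resolve in time),
    # 50 Hz for the 20 ms window (individual harmonics of F0 resolve).
    print(name, "bin spacing:", fs / nperseg, "Hz,", "time frames:", len(t))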
8. The glottis
When we let the glottis vibrate (as in a sustained vowel sound) without changing F0 (perceived as pitch), we get a spectrum (an amplitude/frequency distribution) at the glottis similar to the following diagram, since there are multiples of F0 (harmonics) in the glottal pulses. After those frequencies go through the vocal tract (above the glottis) they are filtered by the vocal tract filter function H(Ω), whose resonances give rise to F1, F2 and F3.
[Diagram: glottal spectrum → H(Ω), the vocal tract filter function → output from the lips]
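A minimal sketch of this source-filter picture (the F0, formant and bandwidth values are assumed for illustration, not taken from the slide): an impulse train at F0 is passed through a cascade of second-order resonators standing in for the vocal tract resonances F1, F2 and F3.

import numpy as np
from scipy.signal import lfilter

fs = 16000
f0 = 120                                        # fundamental, perceived as pitch
n = np.arange(fs // 2)                          # half a second
x = (n % (fs // f0) == 0).astype(float)         # glottal impulse train

def resonator(freq, bw):
    """Second-order all-pole section with a resonance at freq, bandwidth bw."""
    r = np.exp(-np.pi * bw / fs)                # pole radius from bandwidth
    theta = 2 * np.pi * freq / fs               # pole angle from center frequency
    return [1.0], [1.0, -2 * r * np.cos(theta), r * r]

for formant, bw in [(500, 80), (1500, 100), (2500, 120)]:   # F1, F2, F3
    b, a = resonator(formant, bw)
    x = lfilter(b, a, x)                        # cascade the formant filters
# x now carries the harmonics of F0 with spectral peaks near F1, F2 and F3.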
9. Spectrograms (3D) of voiced phones and formants (2D) of some vowels (unrestricted air flow, unlike consonants)
Identify F1, F2, F3; what happened to F0? What are the resonances (poles) and the anti-resonances (zeros)? How do they originate?
[Figure: spectrograms with the formants F1, F2 and F3 labeled]
10. Spectrograms of English diphthongs
11. Units of Speech
- Speech is a continuous wave of air pressure that can be visualized with a microphone (plus a suitable amplifier) and an oscilloscope. It is best analyzed by breaking a sentence into smaller units. We must consider the continuity between the units: coarticulation (the transition from one unit to another).
- Desirable characteristics of a speech unit for the purpose of recognition (Kai-Fu Lee's SPHINX):
1) Sensitivity (accounting for co-articulatory effects)
2) Trainability (not requiring an unreasonable number of training samples)
3) Sharability (can be shared among larger units of speech)
- Using too general units of speech increases sharability at the cost of insensitivity.
12. Possible units of speech for HMM recognition
- Words (particularly good in small vocabularies; however, a 20,000-word vocabulary would require some 400,000 training examples and 1 GB of memory)
- Syllables: a syllable consists of a nucleus (vowel or diphthong) plus surrounding consonants (there are some 20,000 syllables in English)
- Demisyllables: syllables cut in half at the middle of the vowel (no co-articulation); their number is reduced to about 1,000
- Triphones, or context-dependent phones: three phones, with the middle one being modified (see below). Fewer than 50³ (why?). However, there are problems of excessive memory and of large numbers for natural English: in a 1,000-word vocabulary (SPHINX) there were some 2,381 triphone contexts using 24 MB, and they had to be clustered.
- Diphones: fewer than 50² (why?). If beginning and end transitions are matched they can be used for speech synthesis.
- Phones: see the discussion that follows on phones and phonemes.
- A 1/60-second window has been used successfully in audio-visual speech-driven simulation.
13. Desirable characteristics of a speech unit
- Sensitivity: accounts for co-articulatory effects.
- Trainability: important when we consider the size of the training set (abundance of the unit).
- Sharability: when the unit is present in different samples of the training set it is more abundant and easier to train.
14. Phones and phonemes
- If we imagine segments of speech as points plotted in a high-dimensional space, clustered according to their perceived differentiability along a good number of features, we can then call the (red) centroid of a differentiable cluster a phoneme, and the points around it allophones, or just phones. A phoneme is an abstract linguistic unit and the smallest contrastive unit in the language. Phones are specific instances of phonemes, which vary because of place of articulation, co-articulation, individual variations of speakers, etc.
Disclaimer: this is an idealized picture, and usually there is no such clear differentiation or contrast.
15. Phonemes as defined by the IPA
- There are some 40 (some say 50) phonemes in the English language. They may be classified, according to the intervention of glottal pulses, into voiced and unvoiced, and according to the manner of articulation (vowels: 13, diphthongs: 3, glides: 2, liquids: 2, nasals: 3, fricatives: 9, stops: 6, affricates: 2). The place of articulation mentioned is the most common one for the centroid and is not unique for a phone corresponding to a given phoneme. See and study the next table.
16. Acoustic phonetics (IPA)
[Chart: the IPA symbols for the English phonemes, 13 vowels and 3 diphthongs; example word "bought"]
17. Because the IPA symbols are typographically demanding, the SPHINX system used the following 36 more easily representable symbols to refer to the phonemes used.
See if you can identify them in the chart of the previous page.
18. A model of the vocal tract
- The production of speech may be modeled with two excitation sources and a transfer function (Ω is the angular frequency and l the length of the tube under consideration), as shown:
[Diagram: a periodic wave generator (in, for voiced phonemes) and a random noise generator (in, for unvoiced phonemes) feed the input U_G(Ω); the tube produces the output U(l, Ω)]
19. Implementation of the vocal tract transfer function model
- It is interesting to study vocal tract transfer function models, but they are good only for one phoneme or a group of phonemes, and they do not apply to continuous speech (a sequence of co-articulated phones) unless they are morphed one into another as needed. The text has two models: one is mechanical, based on the acoustics of air flow in pipes; the other is electrical, in the form of discrete transmission lines. Neither is applicable to all phonemes. We will study the former.
20. Rigid straight uniform tube model, ¼ wavelength (approximates the schwa /ə/)
- The vocal tract transfer function as derived in section 3.5.2.2 of O'Shaughnessy is
V(Ω) = U(l, Ω) / U_G(Ω) = 1 / cos(Ωl / c)
where Ω is the radian frequency, l the length of the tube and c the speed of sound (340 m/s). Assuming a length of 17 cm, where are the poles of this transfer function, and at what frequencies do they show in the output? (Questions like this may show up in exams!) Other phonemes may be simulated by concatenating tubes of different diameters, as the constrictions affect the formant locations.
[Figure annotations: ¼ wave (why?); ½ wave]
What would we have to do to the tube model to produce continuous speech? How does the straight model violate the actual shape of the vocal tract?
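A worked sketch of the pole question (this step is reconstructed from the formula above, not spelled out on the slide): the poles occur where the cosine vanishes,
cos(Ωl / c) = 0, i.e. Ωl / c = (2n - 1)π/2, n = 1, 2, 3, ...
With Ω = 2πF this gives F_n = (2n - 1)c / (4l). For l = 17 cm and c = 340 m/s, F_n = (2n - 1) × 500 Hz, so the output peaks at roughly 500, 1500 and 2500 Hz (the formants of the schwa-like tube). The lowest resonance has wavelength 4l, which is why this is a ¼-wavelength model.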
21. Radiation of the acoustic waves at the lips
- Inside the vocal tract (inside the model tubes) we assume that the wavelength of the acoustic wave (depending on frequency, of course) is several times larger than the diameter of the vocal tract (2 cm), and that the wave is a flat (plane) pressure wave. What is the maximum frequency for which you could use this assumption? (A rough estimate follows below.) At the lips, however, the situation is different: we have spherical pressure waves propagating almost isotropically, as if the lips were a radiating antenna. There is, of course, a mismatch between the characteristic impedance of the vocal tract and that of free air. While some vowels can be modeled with two tubes, most will require three due to tongue constrictions, and if we include the rounding of the lips a fourth, short tube is needed. The impedance at the lips may be modeled as a high-pass filter (about +6 dB per octave). There is similar radiation at the nostrils.
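A rough worked estimate (taking "several times larger" to mean about four diameters, an assumption not stated in the slide): requiring λ = c/f ≥ 4 × 2 cm = 8 cm gives f ≤ 340 m/s ÷ 0.08 m ≈ 4.3 kHz, so the plane-wave assumption is reasonable up to roughly 4 kHz.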
22. Practical models for speech analysis and synthesis
- The tube models are good for studying the relationship between the (stationary) sound produced and the articulators, but they are not practical for continuous speech synthesis. Two models stand out as practical:
- The articulatory model, which is based on the vocal tract shape, and
- The terminal-analog model, which is based primarily on the behavior of the output speech signal and only secondarily on articulation.
23. The Articulatory Model
- This model considers the total vocal tract as a variable number (up to 12) of connected lossless cylinders of different lengths and cross-sections. We approach the speech signal in this model from a time-domain view of a wave travelling in the tube. The impedance mismatches that occur as the wave travels down the tubes generate reflected waves travelling in the opposite direction. This can be analyzed by partial differential equations relating time and space, as done in section 3.6.1 of the text. The transfer function can be obtained via a pole/zero analysis (poles: the formants; zeros: only the trivial one), as mentioned before, and can be modeled as a discrete-time digital model with time delays in multiples of some unit of time t. The reflected waves are also in discrete-time multiples of t. This model is most useful in coding the speech signal (a sketch of the junction computation follows below).
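As a minimal sketch of the quantity driving those reflected waves (the area profile below is hypothetical, and the sign convention varies between texts): at the junction between two lossless cylinders of cross-sectional areas A_k and A_{k+1}, part of the forward wave is reflected with coefficient r_k.

import numpy as np

# Hypothetical cross-sectional areas (cm^2) of some connected cylinders
areas = np.array([2.6, 8.0, 10.5, 5.0, 0.65, 3.2])

def reflection_coefficients(areas):
    """r_k = (A_k - A_{k+1}) / (A_k + A_{k+1}) at each junction.
    |r_k| < 1 for positive areas; r_k > 0 where the tube narrows,
    sending part of the forward wave back toward the glottis."""
    a0, a1 = areas[:-1], areas[1:]
    return (a0 - a1) / (a0 + a1)

print(reflection_coefficients(areas))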
24. Terminal-Analog Model
- This model tries to reproduce, practically and accurately, or often to code, the speech signal based on a vocal tract transfer function H(z) and a radiation function R(z), excited by either a voiced or an unvoiced source (the mutual exclusion of the two sources in this model is detrimental to the reproduction of voiced fricatives). It uses the Linear Predictive Coding (LPC) formula to predict the next output, as suggested in this diagram from the text.
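A minimal sketch of the LPC prediction idea named above (not the text's diagram; the test signal and the order p = 8 are assumptions): the next sample is predicted as a weighted sum of the previous p samples, s(n) ≈ a_1 s(n-1) + ... + a_p s(n-p), with the a_k obtained from the autocorrelation normal equations.

import numpy as np

def lpc(signal, p):
    """Predictor coefficients a_1..a_p via the autocorrelation method."""
    s = signal - np.mean(signal)
    r = np.correlate(s, s, mode="full")[len(s) - 1 : len(s) + p]  # lags 0..p
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1 : p + 1])    # solve R a = [r_1 .. r_p]

fs = 8000
n = np.arange(fs // 4)
# Toy "voiced" signal: two formant-like sinusoids plus a little noise
# (the noise keeps the normal equations well-conditioned).
s = (np.sin(2 * np.pi * 500 * n / fs)
     + 0.5 * np.sin(2 * np.pi * 1500 * n / fs)
     + 0.01 * np.random.randn(len(n)))
a = lpc(s, p=8)                                # a[k-1] multiplies s(n-k)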
25. Co-articulation
- The articulatory motions of a phone are strongly influenced by the phone that precedes it and the phone that follows it, that is, by its context. Co-articulation may extend across syllabic and syntactic boundaries. In general, a phone's articulatory period exceeds its acoustic period. The result is that the classical steady-state positions for many phonemes are not often achieved in normal speech.
26. Some aspects of the TIMIT database
- See (for specific details) http://www.ldc.upenn.edu/Catalog/readme_files/timit.readme.html
- "The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI). The speech was recorded at TI, transcribed at MIT and verified and prepared for CD-ROM production by the National Institute of Standards and Technology (NIST)." (From the Linguistic Data Consortium (LDC) web site.)
- This corpus is extremely useful in assessing the speech-training properties of learning systems for speech (such as HMMs and others).
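A minimal sketch of reading TIMIT's time-aligned phonetic transcriptions (the path and the printed values are illustrative, not from the slide): each line of a .phn file gives a start sample, an end sample and a phone label, at the 16 kHz rate mentioned above.

def read_phn(path, fs=16000):
    """Parse a TIMIT .phn file into (start_sec, end_sec, phone) tuples."""
    segments = []
    with open(path) as f:
        for line in f:
            begin, end, phone = line.split()
            segments.append((int(begin) / fs, int(end) / fs, phone))
    return segments

# e.g. read_phn("TRAIN/DR1/FCJF0/SA1.PHN") might yield
# [(0.0, 0.21, 'h#'), (0.21, 0.25, 'sh'), ...]  (values illustrative)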
27. Oscilloscope traces, spectrograms (of vowels) and formants: summary
[Figure: 100 ms segments; frequency axis up to 3 kHz]
Using the table of page 12, guess at the most likely phonemes of "The goo".
28. Homework 1
- a) Decompose a unit-amplitude half sine wave at 500 Hz into its first three Fourier components (fundamental and first two harmonics).
b) How does this relate to a glottal pulse in similarity? How is it different?
- Problem P3.3 in the text, part a) only.
- Relate the corresponding representations of phonemes in slides 16 and 17.