1
CS 551/651: Structure of Spoken Language
Lecture 1: Visualization of the Speech Signal, Introductory Phonetics
John-Paul Hosom
Fall 2008
2
  • Visualization of the Speech Signal
  • Most common representations
  • Time-domain waveform
  • Energy
  • Pitch contour
  • Spectrogram (power spectrum)

3
Visualization of the Speech Signal: Time-Domain Waveform
The time-domain waveform is the signal recorded directly from a microphone, with time on the horizontal axis and amplitude on the vertical axis.
Variations in air pressure in the form of sound waves move through the air somewhat like ripples on a pond. A graph of a sound wave is very similar to a graph of the movements of the eardrum. (Ladefoged, p. 184)
Sound originates from the motion or vibration of an object. This motion is impressed upon the surrounding medium (usually air) as a pattern of changes in pressure. The sound generally weakens as it moves away from the source and also may be subject to reflections and refractions. (Moore, p. 2)
4
Visualization of the Speech Signal: Time-Domain Waveform
Vertical axis: amplitude, relative sound pressure. Typical unit: µPa (micro-pascals); a digital signal is usually unitless. Quantization: -32768 to 32767.
Horizontal axis: time. Typical unit: msec (milliseconds). Sampling: 8000, 16000, or 44.1K samp/sec.
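The two axes can be illustrated with a small sketch: converting a sample index to a time in milliseconds, and clipping an amplitude to the quantization range listed above. The function names are hypothetical, chosen only for this example.

```python
def sample_to_msec(n, fs):
    """Time in milliseconds of sample index n at sampling rate fs (samples/sec)."""
    return 1000.0 * n / fs


def quantize_16bit(x):
    """Round a real-valued amplitude and clip it to the range -32768..32767."""
    return max(-32768, min(32767, int(round(x))))


# Sample 80 of a signal recorded at 8000 samp/sec, in msec.
print(sample_to_msec(80, 8000))
# An amplitude outside the range is clipped to the nearest limit.
print(quantize_16bit(40000.0))
```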
5
Visualization of the Speech Signal: Energy
Energy or Intensity: intensity is sound energy transmitted per second (power) through a unit area in a sound field (Moore, p. 9). Intensity is proportional to the square of the pressure variation (Moore, p. 9).
normalized energy = intensity:
$$E = \frac{1}{N} \sum_{n=0}^{N-1} x_n^2$$
where x_n is the signal x at time sample n, and N is the number of time samples.
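As a minimal sketch of the normalized energy, assuming the usual mean-square definition (the sum of squared samples divided by N); the function name is hypothetical:

```python
def normalized_energy(x):
    """Mean of the squared samples of x (assumed definition of normalized energy)."""
    n = len(x)
    if n == 0:
        raise ValueError("need at least one sample")
    return sum(s * s for s in x) / n


# A square wave of amplitude 1 and a short two-sample signal.
print(normalized_energy([1.0, -1.0, 1.0, -1.0]))
print(normalized_energy([3.0, 4.0]))
```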
6
Visualization of the Speech Signal: Energy
Energy or Intensity: the human auditory system is better suited to relative scales:
energy (bels) = $\log_{10}(I_1 / I_0)$
energy (decibels, dB) = $10 \log_{10}(I_1 / I_0)$
I0 is a reference intensity. If the signal becomes twice as powerful (I1/I0 = 2), then the energy level is 3 dB (3.0103 dB to be more precise).
A typical value for I0 corresponds to a sound pressure of 20 µPa, which is close to the average human absolute threshold for a 1000-Hz sinusoid.
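The relative scales can be checked numerically. This sketch assumes the standard definitions (bels as the base-10 logarithm of the intensity ratio, decibels as ten times that); the function names and the default reference of 1.0 are choices made for the example.

```python
import math


def bels(i1, i0=1.0):
    """Intensity ratio i1/i0 expressed in bels."""
    return math.log10(i1 / i0)


def decibels(i1, i0=1.0):
    """Intensity ratio i1/i0 expressed in decibels (dB)."""
    return 10.0 * math.log10(i1 / i0)


# Doubling the power raises the level by about 3 dB.
print(decibels(2.0))
```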
7
Visualization of the Speech Signal: Energy
What is a good value of N? It depends on the information of interest.
[Figure: energy contours computed with N = 1 msec, N = 5 msec, N = 20 msec, and N = 80 msec]
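The effect of N can be explored with a rough sketch that computes one energy value per window. It assumes non-overlapping windows and the mean-square energy definition; the 100 Hz test tone and the 8000 samp/sec rate are made up for illustration.

```python
import math


def energy_contour(x, fs, window_msec):
    """Mean-square energy of consecutive, non-overlapping windows of x."""
    n = max(1, int(round(fs * window_msec / 1000.0)))  # window length in samples
    contour = []
    for start in range(0, len(x) - n + 1, n):
        frame = x[start:start + n]
        contour.append(sum(s * s for s in frame) / n)
    return contour


fs = 8000
# 100 msec of a 100 Hz sine wave (hypothetical test signal).
x = [math.sin(2.0 * math.pi * 100.0 * k / fs) for k in range(800)]

short = energy_contour(x, fs, 1)    # follows individual cycles of the wave
longer = energy_contour(x, fs, 20)  # averages over whole cycles
print(len(short), len(longer))
```

With a short window the contour tracks the oscillation of the waveform itself; with a window spanning whole cycles it settles to a steady value.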
8
Visualization of the Speech Signal: Power Spectrum
What makes one phoneme, /aa/, sound different from another phoneme, /iy/? Different shapes of the vocal tract: /aa/ is produced with the tongue low and in the back of the mouth; /iy/ is produced with the tongue high and toward the front.
The different shapes of the vocal tract produce different resonant frequencies, or frequencies at which energy in the signal is concentrated. (Simple example of resonant energy: a tuning fork may have a resonant frequency equal to 440 Hz, or "A".)
Resonant frequencies in speech (or other sounds) can be displayed by computing a power spectrum or spectrogram, showing the energy in the signal at different frequencies.
9
Visualization of the Speech Signal: Power Spectrum
A time-domain signal can be expressed in terms of sinusoids at a range of frequencies using the Fourier transform:
$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2 \pi f t}\, dt$$
where x(t) is the time-domain signal at time t, f is a frequency value from 0 to 1, and X(f) is the spectral-domain representation.
Note: one useful property of the Fourier transform is that it is time-invariant (actually, linear time invariant). While a periodic signal x(t) changes from one time instant to the next, the Fourier transform of this signal is constant, making analysis of periodic signals easier.
10
Visualization of the Speech Signal: Power Spectrum
Since samples are obtained at discrete time steps, and since only a finite section of the signal is of interest, the discrete Fourier transform is more useful:
$$X(n) = \sum_{k=0}^{N-1} x(k)\, e^{-j 2 \pi k n / N}$$
in which x(k) is the amplitude at time sample k, n is a frequency value from 0 to N-1, N is the number of samples or frequency points of interest, and X(n) is the spectral-domain representation of x(k).
Note that we assume that the series outside the range [0, N-1] is extended N-periodically, that is, x_k = x_{k+N} for all k.
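A direct, unoptimized sketch of the discrete Fourier transform follows. It assumes the common sign convention with a negative exponent and no scaling factor; in practice an FFT routine would be used, and the eight-sample cosine is only a test signal.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier transform of the sequence x, computed directly in O(N^2)."""
    N = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * math.pi * k * n / N) for k in range(N))
        for n in range(N)
    ]


# A cosine that completes exactly 2 cycles in N = 8 samples.
x = [math.cos(2.0 * math.pi * 2 * k / 8) for k in range(8)]
X = dft(x)
# Energy shows up at frequency index 2 and its mirror image, index 6.
print([round(abs(v), 6) for v in X])
```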
11
  • Visualization of the Speech Signal: Power
    Spectrum
  • The sampling frequency is the rate at which
    samples are recorded, e.g. 8000 Hz = 8000
    samples per second.
  • Shannon's Sampling Theorem states that a
    continuous signal must be discretely sampled
    at a rate of at least twice the highest
    frequency present in the signal. So, the signal
    must not contain any data above Fsamp/2 (the
    Nyquist frequency). If it does, use a low-pass
    filter to remove these higher frequencies.
  • Because the signal is assumed to be periodic
    over length N, but this assumption is usually
    false, the signal is weighted with a
    window so that both edges of the signal taper
    toward zero.
  • Example: Hamming window
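The windowing step can be sketched as follows, assuming the standard Hamming window definition, w(n) = 0.54 - 0.46 cos(2πn/(N-1)). The function names are hypothetical.

```python
import math


def hamming(N):
    """Hamming window of length N (standard 0.54 / 0.46 definition)."""
    if N == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]


def apply_window(frame):
    """Weight a frame of samples so that both edges taper toward zero."""
    w = hamming(len(frame))
    return [s * wn for s, wn in zip(frame, w)]


# The window is small at the edges and largest in the middle.
print(hamming(5))
print(apply_window([1.0, 1.0, 1.0, 1.0, 1.0]))
```

Note that the Hamming window tapers to 0.08 rather than exactly zero at the edges.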

12
Visualization of the Speech Signal: Power Spectrum
The magnitude and phase of the spectral representation are:
$$|X(n)| = \sqrt{\mathrm{Re}(X(n))^2 + \mathrm{Im}(X(n))^2}$$ (absolute value of complex number)
$$\angle X(n) = \tan^{-1}\!\left(\frac{\mathrm{Im}(X(n))}{\mathrm{Re}(X(n))}\right)$$
Phase information is generally considered not important in understanding speech, and the energy (or power) of the magnitude of X(n) on the decibel scale provides most relevant information.
Note: usually we don't worry about the reference intensity I0 (assume a value of 1.0); the signal strength (in µPa) is unknown anyway.
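For a single complex spectral value, magnitude, phase, and power in dB can be sketched like this. The reference intensity is taken as 1.0, as the note above suggests; the small floor that avoids taking the logarithm of zero is an assumption added for the example.

```python
import cmath
import math


def magnitude(X):
    """Absolute value of a complex spectral value."""
    return abs(X)


def phase(X):
    """Phase angle of a complex spectral value, in radians."""
    return cmath.phase(X)


def power_db(X, floor=1e-12):
    """Power of a spectral value in dB, with reference intensity 1.0."""
    return 10.0 * math.log10(max(abs(X) ** 2, floor))


X = 3.0 + 4.0j
print(magnitude(X), phase(X), power_db(X))
```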
13
Visualization of the Speech Signal: Power Spectrum
The power spectrum can be plotted like this (vowel /aa/):
[Figure: time-domain amplitude; spectral power (dB) (512 samp) versus frequency (Hz), 0 Hz to 4000 Hz, with a 73 dB axis mark]
14
Visualization of the Speech Signal: Power Spectrum
If the speech signal is periodic and the number of samples in the window is large enough, then harmonics are seen.
[Figure: power spectra of periodic signal /aa/ (128 samples), periodic signal /aa/ (2048 samples), and aperiodic signal /sh/ (2048 samples); frequency range is 0 to 4000 Hz in all plots]
A harmonic is a strong energy component at an integer multiple of the fundamental frequency (pitch), F0.
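One way to see why the window must be large enough is to compare the spacing between spectral points with the spacing between harmonics. This sketch assumes a sampling rate of 8000 samp/sec (consistent with plots that end at 4000 Hz) and a hypothetical F0 of 100 Hz.

```python
def bin_spacing_hz(fs, n_samples):
    """Frequency spacing between adjacent spectral points for an n_samples window."""
    return fs / n_samples


def harmonic_frequencies(f0, max_hz):
    """Integer multiples of f0 up to and including max_hz."""
    return [k * f0 for k in range(1, int(max_hz // f0) + 1)]


fs = 8000  # assumed sampling rate
print(bin_spacing_hz(fs, 128))    # coarse: comparable to the harmonic spacing
print(bin_spacing_hz(fs, 2048))   # fine: many points between adjacent harmonics
print(harmonic_frequencies(100, 450))
```

With 128 samples the spectral points are spaced almost as widely as the harmonics themselves, so individual harmonics blur together; with 2048 samples they are resolved.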
15
Visualization of the Speech Signal: Formants
Note that the resonant frequencies, or formants, for the two vowels /aa/ and /iy/ can be identified in the spectra. For recognition of phonemes, the spectral envelope is important (envelope = shape of spectrum without harmonics).
[Figure: spectra of /aa/ (2048 samples) and /iy/ (2048 samples), with the envelope marked]
16
Visualization of the Speech Signal: Formants
The harmonics, which are dependent on F0, are not, in theory, significantly related to the resonant frequencies, which are dependent on the vocal tract shape (or phoneme).
[Figure: spectra of /aa/ with F0 = 80 Hz and /aa/ with F0 = 164 Hz]
17
Visualization of the Speech Signal: Formants
These formants can be modeled by a damped sinusoid, which has the following representations, where S(f) is the spectrum at frequency value f, A is overall amplitude, fc is the center frequency of the damped sine wave, and the remaining parameter is a damping factor. (Olive, p. 48, 58)
[Figure: damped sinusoid, amplitude versus time (msec), and its spectrum, power (dB) versus frequency (Hz), with the center freq. fc marked]
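A time-domain damped sinusoid can be sketched as below. The specific form, an exponentially decaying sine A·exp(-damping·t)·sin(2π·fc·t), is an assumption made for illustration and may not match the slide's exact equations; the parameter names and values are hypothetical.

```python
import math


def damped_sinusoid(amplitude, fc, damping, fs, n_samples):
    """Samples of amplitude * exp(-damping * t) * sin(2*pi*fc*t), with t = n / fs."""
    return [
        amplitude * math.exp(-damping * n / fs) * math.sin(2.0 * math.pi * fc * n / fs)
        for n in range(n_samples)
    ]


# Hypothetical values: 500 Hz center frequency, damping of 200 per second.
s = damped_sinusoid(1.0, 500.0, 200.0, 8000, 480)
early_peak = max(abs(v) for v in s[:80])
late_peak = max(abs(v) for v in s[400:480])
print(early_peak, late_peak)  # the oscillation dies away over time
```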
18
Visualization of the Speech Signal: Formants
The bandwidth is defined as the width of the spectral peak measured at the point where the linear spectral power is ½ the maximum value. A reduction of the signal power by a factor of 2 is equivalent to a 3 dB change.
[Figure: power (dB) versus frequency (Hz); the bandwidth is the width of the peak 3 dB below the 0 dB maximum]
Also, the resonator must have a value of 0 dB at 0 Hz.
19
  • Visualization of the Speech Signal: Formants
  • Formants are specified by a frequency, F, and
    bandwidth, B.
  • A neutral vowel (/ax/) theoretically has
    formants at 500 Hz, 1500 Hz, 2500 Hz, 3500 Hz,
    etc. The first formant is called F1, the
    second is called F2, etc. (The fundamental
    frequency, or pitch, is F0.)
  • F1, F2, and sometimes F3 are usually sufficient
    for identifying vowels.
  • Formants can be thought of as filters, which act
    on the source waveform. For vowels, the
    source waveform is air pushed through the
    vibrating vocal folds. Energy is lost (hence a
    damped sinusoid model) by sound absorption in
    the mouth.
  • A digital model of a formant can be implemented
    using an infinite impulse response (IIR)
    filter.
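As one possible illustration of an IIR formant model, the sketch below uses a second-order resonator of the kind found in Klatt-style formant synthesizers. This particular design is an assumption, not necessarily the one used in the course; its coefficients are chosen so that the gain at 0 Hz is 1 (0 dB).

```python
import math


def resonator_coefficients(f, b, fs):
    """Coefficients of y[n] = a*x[n] + b1*y[n-1] + b2*y[n-2]
    for center frequency f (Hz), bandwidth b (Hz), sampling rate fs."""
    t = 1.0 / fs
    b2 = -math.exp(-2.0 * math.pi * b * t)
    b1 = 2.0 * math.exp(-math.pi * b * t) * math.cos(2.0 * math.pi * f * t)
    a = 1.0 - b1 - b2  # makes the gain at 0 Hz equal to 1
    return a, b1, b2


def resonate(x, f, b, fs):
    """Filter the sequence x through the second-order resonator."""
    a, b1, b2 = resonator_coefficients(f, b, fs)
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = a * s + b1 * y1 + b2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


# A constant (0 Hz) input settles to the same constant at the output.
y = resonate([1.0] * 4000, 500.0, 100.0, 8000.0)
print(y[-1])
```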

20
Visualization of the Speech Signal: Excitation/Source
The vocal-fold vibration source looks like this:
[Figure: amplitude versus time (msec), and power (dB) versus frequency (Hz) with a slope of -6 dB/octave]
(Note: there are some gross simplifications here; we'll go into more detail later in the course.)
In fricatives and other unvoiced speech, the source is turbulent air:
[Figure: amplitude versus time (msec), and power (dB) versus frequency (Hz) with a flat slope]
21
Visualization of the Speech Signal: Pre-Emphasis
Because the source for voiced sounds decreases at 6 dB/octave, a simple filter can be used to increase the spectral tilt by 6 dB/octave, thereby making voiced sounds spectrally flat and easier to visualize. (NOTE: unvoiced sounds then have a spectral slope of +6 dB/octave.)
$$x'(n) = x(n) - 0.97\, x(n-1)$$
where x(n) is the time-domain speech signal at sample number n, and x'(n) is the pre-emphasized speech signal at sample n.
[Figure: power (dB) versus frequency (Hz), with slopes of 0 dB/octave and -6 dB/octave]
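A minimal sketch of a pre-emphasis filter, assuming the common first-order difference form (each sample minus a scaled copy of the previous one) with a coefficient of 0.97, and treating the sample before the start of the signal as zero:

```python
def pre_emphasize(x, coeff=0.97):
    """Return x(n) - coeff * x(n-1) for each sample, with x(-1) taken as 0."""
    out = []
    prev = 0.0
    for s in x:
        out.append(s - coeff * prev)
        prev = s
    return out


# A constant (low-frequency) signal is strongly attenuated after the first sample,
# while a rapidly alternating (high-frequency) signal is boosted.
print(pre_emphasize([1.0, 1.0, 1.0]))
print(pre_emphasize([1.0, -1.0, 1.0, -1.0]))
```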
22
Visualization of the Speech Signal: Spectrograms
Many power spectra can be plotted over time, creating a spectrogram or spectrograph.
[Figure: spectrograms of /aa/ and /iy/, frequency (Hz) and amplitude versus time (msec); pre-emphasis 0.97, FFT size 10 msec]
23
Visualization of the Speech Signal: Spectrograms
The FFT window size has a large impact on visual properties.
[Figure: /aa/, frequency (Hz) and amplitude, FFT size 5 msec. Wideband: small time window, small FFT size]
[Figure: /aa/, frequency (Hz), FFT size 33 msec. Narrowband: large time window, large FFT size]
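The trade-off between window sizes can be sketched with a small, unoptimized spectrogram routine: each frame is Hamming-windowed, transformed with a direct DFT, and converted to dB. The 1000 Hz test tone, the 8000 samp/sec rate, and the 5 msec step are assumptions for the example; a real implementation would use an FFT library.

```python
import cmath
import math


def power_spectrum_db(frame):
    """Power spectrum in dB (frequency points 0..N/2) of one Hamming-windowed frame."""
    n = len(frame)
    w = [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]
    xw = [s * wk for s, wk in zip(frame, w)]
    spec = []
    for b in range(n // 2 + 1):
        X = sum(xw[k] * cmath.exp(-2j * math.pi * k * b / n) for k in range(n))
        spec.append(10.0 * math.log10(max(abs(X) ** 2, 1e-12)))
    return spec


def spectrogram(x, fs, window_msec, step_msec):
    """List of per-frame power spectra; one frame every step_msec."""
    n = int(round(fs * window_msec / 1000.0))
    step = int(round(fs * step_msec / 1000.0))
    return [power_spectrum_db(x[i:i + n]) for i in range(0, len(x) - n + 1, step)]


fs = 8000
x = [math.sin(2.0 * math.pi * 1000.0 * k / fs) for k in range(400)]  # 50 msec tone

wide = spectrogram(x, fs, 5, 5)     # 40-sample frames: 200 Hz between points
narrow = spectrogram(x, fs, 33, 5)  # 264-sample frames: about 30 Hz between points
print(len(wide), len(wide[0]), len(narrow), len(narrow[0]))
```

The short window yields many frames with few frequency points each (good time resolution); the long window yields fewer frames with many frequency points each (good frequency resolution).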
24
Spectrogram Reading: Vowels
12 English vowels (not all are phonemic), 8 or 9 phonemic vowels
  • /iy/ beet (front, high, unrounded, tense)
  • /ih/ bit (front, high, unrounded, lax)
  • /eh/ bet (front, mid, unrounded, lax)
  • /ae/ bat (front, low, unrounded, lax)
  • /ix/ roses (back, high, unrounded, lax) (subst. /ih/)
  • /ux/ suit (back, high, rounded, lax) (subst. /uw/)
  • /ax/ above (back/central, mid, unrounded, lax) (subst. /ah/)
  • /uw/ boot (back, high, rounded, tense)
  • /uh/ book (back, high, rounded, lax)
  • /ah/ above (back/central, mid, unrounded, lax)
  • /ao/ caught (back, low, rounded, tense) (subst. /aa/)
  • /aa/ father (back, low, unrounded, tense)
The vowels /ix/, /ux/, and /ax/ are more centralized and shorter in duration.
25
Spectrogram Reading: Vowels
6 English diphthongs
  • /ey/ bay (front, mid→high, unrounded, tense)
  • /ay/ bye (back→front, low→high, unrounded, tense)
  • /oy/ boy (back→front, mid→high, rounded→unrounded, tense)
  • /yu/ beauty (front→back, high, unrounded→rounded, tense)
  • /aw/ about (back, mid→high, unrounded→rounded, tense)
  • /ow/ boat (back, mid, unrounded→rounded, tense)
26
Spectrogram Reading: Vowels
Vowel formant frequencies (averages for English, males only), from Ladefoged, p. 193
27
Spectrogram Reading: Vowels
Vowel formant frequencies
28
Spectrogram Reading: Vowels
Vowel formants (averages for English, male vs. female), from Peterson, G.E., and Barney, H.L. (1952). "Control methods used in the study of vowels", Journal of the Acoustical Society of America, 24, 175-184.
29
Spectrogram Reading: Vowels
Vowel formants, Peterson and Barney data
30
Spectrogram Reading: Vowels
Ratios of 1st and 2nd formant, from Miller (1989), based on Peterson and Barney (1952) data
31
Spectrogram Reading: Diphthongs
Diphthongs have characteristic formant movements.
[Figure: formant movements for /oy/, /ay/, /aw/, /yu/, /ow/, and /ey/]
32-39
Spectrogram Reading: Vowels [figures only]
40-45
Spectrogram Reading: Diphthongs [figures only]