Title: Acoustics of Speech
1. Acoustics of Speech
2. Goal 1: Distinguishing One Phoneme from Another, Automatically
- ASR: Did the caller say "I want to fly to Newark" or "I want to fly to New York"?
- Forensic Linguistics: Did the accused say "Kill him" or "Bill him"?
- What evidence is there in the speech signal?
- How accurately and reliably can we extract it?
3. Goal 2: Determining How Things Are Said Is Sometimes Critical to Understanding
- Forensic Linguistics: "Kill him!" or "Kill him?"
- Call Center: "That amount is incorrect."
- What information do we need to extract from the speech signal?
- What tools do we have to do this?
4. Today and Next Class
- Acoustic features to extract
  - Fundamental frequency (pitch)
  - Amplitude/energy (loudness)
  - Spectrum
  - Timing (pauses, rate)
- Tools for extraction
  - Praat
  - Wavesurfer
  - Xwaves
5. Sound Production
- Pressure fluctuations in the air caused by a musical instrument, a car horn, a voice
  - Cause eardrum to move
  - Auditory system translates into neural impulses
  - Brain interprets as sound
- Plot sound as change in air pressure over time
- From a speech-centric point of view, when sound is not produced by the human voice, we may term it noise
  - Ratio of speech-generated sound to other simultaneous sound: signal-to-noise ratio (SNR)
  - Higher SNRs are better
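The ratio above can be sketched numerically. This is a minimal illustration, not the course's method: it assumes the ratio is taken over mean signal power and that it is reported in decibels, a common convention the slide does not state. The sample values and the function names `mean_power`, `snr`, and `snr_db` are made up for the example.

```python
import math

def mean_power(samples):
    """Mean squared amplitude of a list of pressure samples."""
    return sum(s * s for s in samples) / len(samples)

def snr(speech, noise):
    """Power ratio of the speech signal to the simultaneous noise (assumed definition)."""
    return mean_power(speech) / mean_power(noise)

def snr_db(speech, noise):
    """The same ratio expressed in decibels (assumed convention)."""
    return 10 * math.log10(snr(speech, noise))

speech = [0.5, -0.5, 0.5, -0.5]      # toy "speech" samples
noise = [0.05, -0.05, 0.05, -0.05]   # toy background samples, ten times smaller
print(snr(speech, noise), snr_db(speech, noise))
```

A tenfold difference in amplitude gives a hundredfold difference in power, which is why the ratio looks so large for modest-seeming noise.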
6. How Loud are Common Sounds? How Much Pressure Is Generated?
Event                 Pressure (µPa)   dB
Absolute threshold    20               0
Whisper               200              20
Quiet office          2K               40
Conversation          20K              60
Bus                   200K             80
Subway                2M               100
Thunder               20M              120
DAMAGE                200M             140
7. Some Sounds are Periodic
- Simple Periodic Waves (sine waves) defined by
  - Frequency: how often the pattern repeats per time unit
    - Cycle: one repetition
    - Period: duration of one cycle
    - Frequency: cycles per time unit
    - Frequency in Hz: cycles per second, or 1/period
    - E.g. 400 Hz pitch = 1/0.0025 (one cycle has a period of 0.0025 s; 400 cycles complete in 1 s)
  - Amplitude: peak deviation of pressure from normal atmospheric pressure
8. Some Sounds are Periodic (continued)
  - Phase: timing of the waveform relative to a reference point
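The three defining properties can be made concrete by sampling a sine wave. The sketch below reuses the 400 Hz example; the 16000 samples-per-second rate and the helper name `sine_wave` are assumptions chosen for illustration.

```python
import math

def sine_wave(freq_hz, amplitude, phase_rad, duration_s, sample_rate):
    """Sample amplitude * sin(2*pi*f*t + phase) at evenly spaced times."""
    n = int(round(duration_s * sample_rate))
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate + phase_rad)
            for i in range(n)]

sample_rate = 16000                 # assumed rate for the example
freq_hz = 400.0
period_s = 1.0 / freq_hz            # duration of one cycle, in seconds
samples_per_cycle = round(period_s * sample_rate)

wave = sine_wave(freq_hz, 1.0, 0.0, 0.01, sample_rate)             # starts at zero
shifted = sine_wave(freq_hz, 1.0, math.pi / 2, 0.01, sample_rate)  # starts at its peak
print(period_s, samples_per_cycle, len(wave))
```

The two waves have the same frequency and amplitude and differ only in phase: `shifted` reaches each peak a quarter of a cycle earlier than `wave`.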
10. Complex Periodic Waves
- Cyclic but composed of multiple sine waves
- Fundamental frequency (F0): rate at which the largest pattern repeats (also the GCD of the component frequencies)
- Components not always easily identifiable: a power spectrum graphs amplitude vs. frequency
- Any complex waveform can be analyzed into a set of sine waves with their own frequencies, amplitudes, and phases (Fourier's theorem)
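A small numerical sketch of this analysis: build a complex wave from two sine components, recover them with a naive discrete Fourier transform, and take the GCD of the component frequencies as F0. The component frequencies (200 and 300 Hz), the 800 samples-per-second rate, and the helper `dft_magnitudes` are assumptions picked so that each DFT bin is exactly 1 Hz wide.

```python
import cmath
import math
from functools import reduce

def dft_magnitudes(samples):
    """Naive DFT; returns the sine amplitude found in each bin below half the sample count."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        total = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(2 * abs(total) / n)
    return mags

sample_rate = 800                      # 1 second of signal, so bin index == frequency in Hz
components = {200: 1.0, 300: 0.5}      # frequency (Hz) -> amplitude
wave = [sum(a * math.sin(2 * math.pi * f * t / sample_rate) for f, a in components.items())
        for t in range(sample_rate)]

spectrum = dft_magnitudes(wave)
peaks = [k for k, m in enumerate(spectrum) if m > 0.1]
f0 = reduce(math.gcd, peaks)           # rate at which the whole pattern repeats
print(peaks, f0)
```

In this toy signal the overall pattern repeats 100 times per second even though no 100 Hz component is present, which is what the GCD definition predicts.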
13. Some Sounds are Aperiodic
- Waveforms with random or non-repeating patterns
  - Random aperiodic waveforms: white noise
    - Flat spectrum: equal amplitude for all frequency components
  - Transients: sudden bursts of pressure (clicks, pops, door slams)
    - Waveform shows a single impulse (click.wav)
    - Fourier analysis shows a flat spectrum
- Some speech sounds are aperiodic, e.g. many consonants (e.g. cat.wav)
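The flat spectrum of a transient can be checked directly. The sketch below uses a synthetic one-sample impulse as a stand-in for a click (it does not read click.wav) and contrasts it with a pure tone; the 64-sample length and the helper `dft_magnitudes` are assumptions for the example.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive DFT magnitude for each bin from 0 up to half the sample count."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

click = [0.0] * 64
click[5] = 1.0                                   # a single impulse
click_spectrum = dft_magnitudes(click)

tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]   # periodic contrast case
tone_spectrum = dft_magnitudes(tone)

print(min(click_spectrum), max(click_spectrum))  # every bin has the same magnitude
print(tone_spectrum.index(max(tone_spectrum)))   # the tone's energy sits in one bin
```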
14. Speech Production
- Voiced and voiceless sounds
- Vocal fold vibration filtered by the vocal tract produces a complex periodic waveform
- Cycles per sec of the lowest frequency component of the signal: fundamental frequency (F0)
- Fourier analysis yields a power spectrum with component frequencies and amplitudes
- F0 is the first (lowest frequency) peak
- Harmonics are the component frequencies at integer multiples of F0; those near the vocal tract's resonances are amplified by the vocal tract
15. Vocal fold vibration
UCLA Phonetics Lab demo
16. Places of articulation
http://www.chass.utoronto.ca/danhall/phonetics/sammy.html
17. How do we capture speech for analysis?
- Recording conditions
  - A quiet office, a sound booth, an anechoic chamber
- Microphones
  - Analog devices (e.g. tape recorders) store and analyze continuous air pressure variations (speech) as a continuous signal
  - Digital devices (e.g. computers, DAT) first convert continuous signals into discrete signals (A-to-D conversion)
18. How do we capture speech for analysis? (continued)
- File format
  - .wav, .aiff, .ds, .au, .sph, ...
  - Conversion programs, e.g. sox
- Storage
  - Function of how much information we store about speech in digitization
    - Higher quality, closer to original
    - More space (1000s of hours of speech take up a lot of space)
19. Sampling
- Sampling rate: how often do we need to sample?
  - At least 2 samples per cycle to capture the periodicity of a waveform component at a given frequency
    - A 100 Hz waveform needs 200 samples per sec
  - Nyquist frequency: highest-frequency component captured with a given sampling rate (half the sampling rate)
20. Sampling/storage tradeoff
- Human hearing: 20K Hz top frequency
- Do we really need to store 40K samples per second of speech?
- Telephone speech: 300-4K Hz (8K sampling)
  - But some speech sounds (e.g. the fricatives /f/, /s/ and the stops /p/, /t/, /d/) have energy above 4K!
  - Peter/teeter/Dieter
- 44K (CD quality audio) vs. 16-22K (usually good enough to study pitch, amplitude, duration, ...)
21. Sampling Errors
- Aliasing
  - Occurs when the signal's frequency is higher than half the sampling rate
- Solutions
  - Increase the sampling rate
  - Filter out frequencies above half the sampling rate (anti-aliasing filter)
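Aliasing is easy to reproduce with two pure tones. The sketch below assumes an 8000 samples-per-second rate (Nyquist frequency 4000 Hz) and shows that a 7000 Hz tone yields the same samples as an inverted 1000 Hz tone. The function names are illustrative, not from any library.

```python
import math

def nyquist_frequency(sample_rate):
    """Highest frequency that a given sampling rate can capture."""
    return sample_rate / 2

def apparent_frequency(freq_hz, sample_rate):
    """Frequency at which a sampled pure tone shows up after folding around Nyquist."""
    folded = freq_hz % sample_rate
    if folded > sample_rate / 2:
        folded = sample_rate - folded
    return folded

def sample_tone(freq_hz, sample_rate, n):
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

rate = 8000
high = sample_tone(7000, rate, 16)   # above the Nyquist frequency
low = sample_tone(1000, rate, 16)    # below it

# Sample by sample, the 7000 Hz tone equals the 1000 Hz tone with its sign flipped.
indistinguishable = all(abs(h + l) < 1e-9 for h, l in zip(high, low))
print(nyquist_frequency(rate), apparent_frequency(7000, rate), indistinguishable)
```

Once sampled, nothing distinguishes the two tones, which is why the anti-aliasing filter has to be applied before A-to-D conversion rather than after.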
22. Quantization
- Measuring the amplitude at sampling points: what resolution to choose?
  - Integer representation
  - 8, 12 or 16 bits per sample
- Noise due to quantization steps is reduced by higher resolution, but that requires more storage
  - How many different amplitude levels do we need to distinguish?
  - Choice depends on data and application (44K 16-bit stereo requires about 10 MB of storage per minute)
23. Quantization (continued)
- But clipping occurs when input volume is greater than the range representable in the digitized waveform
  - Increase the resolution
  - Decrease the amplitude
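Quantization, clipping, and the storage cost can be sketched together. The example assumes samples normalized to the range -1.0 to 1.0, signed integer levels, and a "44K" rate of exactly 44100 samples per second; `quantize` and `storage_bytes` are illustrative names.

```python
def quantize(sample, bits):
    """Map a sample in [-1.0, 1.0] to a signed integer level, clipping anything out of range."""
    max_level = 2 ** (bits - 1) - 1
    min_level = -(2 ** (bits - 1))
    level = round(sample * max_level)
    return max(min_level, min(max_level, level))

def storage_bytes(sample_rate, bits, channels, seconds):
    """Uncompressed size of a recording, ignoring any file header."""
    return sample_rate * (bits // 8) * channels * seconds

print(quantize(0.25, 8), quantize(0.25, 16))   # coarse vs. fine amplitude steps
print(quantize(1.7, 16), quantize(-1.7, 16))   # input too loud: both values are clipped
print(storage_bytes(44100, 16, 2, 60))         # one minute of 16-bit stereo
```

Halving the input amplitude before quantizing (`quantize(1.7 / 2, 16)`) keeps the sample inside the representable range.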
24. What can we do if our data is noisy?
- Acoustic filters block out certain frequencies of sounds
  - Low-pass filter blocks high frequency components of a waveform
  - High-pass filter blocks low frequencies
  - Reject band (what to block) vs. pass band (what to let through)
- But if the frequencies of two sounds overlap: source separation
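A moving average is one of the simplest low-pass filters, and it is enough to illustrate the idea. This is a sketch under assumed values: a 100 Hz tone standing in for the wanted signal, a 2000 Hz tone standing in for the noise, and an 8000 samples-per-second rate. It is not how Praat or the other tools filter.

```python
import math

def moving_average(samples, width):
    """Simple low-pass filter: each output is the mean of `width` consecutive inputs."""
    return [sum(samples[i:i + width]) / width for i in range(len(samples) - width + 1)]

def peak(samples):
    return max(abs(s) for s in samples)

rate = 8000
low = [math.sin(2 * math.pi * 100 * i / rate) for i in range(800)]    # wanted component
high = [math.sin(2 * math.pi * 2000 * i / rate) for i in range(800)]  # unwanted component
noisy = [l + h for l, h in zip(low, high)]

# A 4-sample window spans exactly one 2000 Hz cycle at this rate, so that tone averages out.
cleaned = moving_average(noisy, 4)
print(peak(moving_average(high, 4)), peak(moving_average(low, 4)), peak(cleaned))
```

The high tone is removed almost entirely while the low tone passes nearly unchanged. When the two sounds share frequencies no such window exists, which is the source separation problem the last bullet points to.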
25. How can we capture pitch contours, pitch range?
- What is the pitch contour of this utterance? Is the pitch range of X greater than that of Y?
- Pitch tracking: estimate F0 over time as a function of vocal fold vibration
- A periodic waveform is correlated with itself
  - One period looks much like another (cat.wav)
  - Find the period by finding the lag (offset) between two windows on the signal for which the correlation of the windows is highest
  - Lag duration (T) is 1 period of the waveform
  - Inverse is F0 (1/T)
26. How can we capture pitch contours, pitch range? (continued)
- Errors to watch for
  - Halving: the shortest lag calculated is too long (underestimate pitch)
  - Doubling: the shortest lag calculated is too short (overestimate pitch)
  - Microprosody effects (e.g. /v/)
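The lag search can be sketched on a synthetic voiced signal. This is a minimal illustration, not the algorithm of any particular pitch tracker: the 120 to 400 Hz search range, the 320-sample window, and the names `correlation` and `estimate_f0` are all assumptions.

```python
import math

def correlation(a, b):
    """Normalized correlation of two equal-length windows."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0

def estimate_f0(samples, sample_rate, f0_min=120.0, f0_max=400.0, window=320):
    """Return (F0 in Hz, best lag in samples) by maximizing window correlation over lags."""
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    reference = samples[:window]
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: correlation(reference, samples[lag:lag + window]))
    return sample_rate / best_lag, best_lag

rate = 8000
# Synthetic "voiced" signal: 200 Hz fundamental plus a weaker harmonic at 400 Hz.
voiced = [math.sin(2 * math.pi * 200 * i / rate) + 0.5 * math.sin(2 * math.pi * 400 * i / rate)
          for i in range(800)]
f0, lag = estimate_f0(voiced, rate)
print(f0, lag)
```

The search range matters. A lag of two periods (80 samples here) correlates just as well as one period, so widening the range to include it invites the halving error listed above.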
27. Sample Analysis File: Pitch Track Header
- version 1
- type_code 4
- frequency 12000.000000
- samples 160768
- start_time 0.000000
- end_time 13.397333
- bandwidth 6000.000000
- dimensions 1
- maximum 9660.000000
- minimum -17384.000000
- time Sat Nov 2 15:55:50 1991
- operation record padding xxxxxxxxxxxx
28. Sample Analysis File: Pitch Track Data
- Columns: F0, Pvoicing, Energy, A/C Score
- 147.896 1 2154.07 0.902643
- 140.894 1 1544.93 0.967008
- 138.05 1 1080.55 0.92588
- 130.399 1 745.262 0.595265
- 0 0 567.153 0.504029
- 0 0 638.037 0.222939
- 0 0 670.936 0.370024
- 0 0 790.751 0.357141
- 141.215 1 1281.1 0.904345
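Rows like these can be summarized with a few lines of code. The sketch below copies the nine rows shown above and assumes the second column is a voicing flag (1 = voiced, 0 = unvoiced), so that frames with F0 = 0 are skipped. The variable names are illustrative.

```python
rows = """\
147.896 1 2154.07 0.902643
140.894 1 1544.93 0.967008
138.05 1 1080.55 0.92588
130.399 1 745.262 0.595265
0 0 567.153 0.504029
0 0 638.037 0.222939
0 0 670.936 0.370024
0 0 790.751 0.357141
141.215 1 1281.1 0.904345
"""

frames = []
for line in rows.splitlines():
    f0, voiced, energy, ac_score = line.split()
    frames.append((float(f0), int(voiced), float(energy), float(ac_score)))

# Only voiced frames carry a pitch value (assumed meaning of the second column).
voiced_f0 = [f0 for f0, voiced, _, _ in frames if voiced == 1]
mean_f0 = sum(voiced_f0) / len(voiced_f0)
pitch_range = max(voiced_f0) - min(voiced_f0)
print(len(voiced_f0), round(mean_f0, 2), round(pitch_range, 3))
```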
29. Pitch Perception
- But do pitch trackers capture what humans perceive?
- The auditory system's perception of pitch is non-linear
  - Sounds at lower frequencies with the same difference in absolute frequency sound more different than those at higher frequencies (male vs. female speech)
  - Bark scale (Zwicker) and other models of perceived difference
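One way to see the non-linearity is to convert Hz to Bark and compare equal steps in Hz. The sketch assumes the analytic approximation usually attributed to Zwicker and Terhardt; the slide does not say which formula it has in mind, and `hz_to_bark` is an illustrative name.

```python
import math

def hz_to_bark(freq_hz):
    """Zwicker and Terhardt's analytic approximation of the Bark scale (assumed here)."""
    return 13 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)

low_gap = hz_to_bark(200) - hz_to_bark(100)      # a 100 Hz step low in the range
high_gap = hz_to_bark(4100) - hz_to_bark(4000)   # the same 100 Hz step much higher up
print(round(low_gap, 2), round(high_gap, 2))
```

The same 100 Hz difference spans roughly one Bark near 100 Hz but only a small fraction of a Bark near 4000 Hz.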
30. How do we capture loudness/intensity?
- Is one utterance louder than another?
- Energy is closely correlated experimentally with perceived loudness
- For each window, square the amplitude values of the samples, take their mean, and take the root of that mean (RMS energy)
  - What size window?
  - Longer windows produce smoother amplitude traces but miss sudden acoustic events
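The RMS recipe and the window-size tradeoff can be sketched in a few lines. The toy samples and the choice of non-overlapping windows are assumptions for the example; `rms_frames` is an illustrative name.

```python
import math

def rms_frames(samples, window):
    """RMS energy of each consecutive non-overlapping window."""
    return [math.sqrt(sum(s * s for s in samples[i:i + window]) / window)
            for i in range(0, len(samples) - window + 1, window)]

quiet_then_loud = [1, -1, 1, -1, 2, -2, 2, -2]
short = rms_frames(quiet_then_loud, 4)   # two frames: the jump in level is visible
long = rms_frames(quiet_then_loud, 8)    # one frame: the jump is averaged away
print(short, long)
```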
31. Perception of Loudness
- But the relation is non-linear: sones or decibels (dB)
  - Differences in soft sounds are more salient than differences in loud ones
  - Intensity is proportional to the square of amplitude, so the intensity of a sound with pressure x vs. a reference sound with pressure r is $x^2/r^2$
  - bel: base 10 log of the ratio
  - decibel: one tenth of a bel (1 bel = 10 dB)
  - $\mathrm{dB} = 10 \log_{10}(x^2/r^2)$
  - Reference pressure r: the absolute threshold (20 µPa, the lowest audible pressure fluctuation of a 1000 Hz tone), the typical threshold level for a tone at that frequency
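The decibel formula can be checked against the pressure table earlier in the deck. The sketch takes pressures in micropascals with the 20 micropascal reference; the names `REFERENCE_UPA` and `spl_db` are assumptions for the example.

```python
import math

REFERENCE_UPA = 20.0   # reference pressure in micropascals (absolute threshold)

def spl_db(pressure_upa, reference_upa=REFERENCE_UPA):
    """Decibels relative to the reference pressure: 10 * log10(x^2 / r^2)."""
    return 10 * math.log10(pressure_upa ** 2 / reference_upa ** 2)

for event, pressure in [("Absolute", 20), ("Whisper", 200), ("Conversation", 20_000),
                        ("Thunder", 20_000_000), ("DAMAGE", 200_000_000)]:
    print(event, round(spl_db(pressure)))
```

Each tenfold increase in pressure adds 20 dB, which matches the steps in the table.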
32. How do we capture ...
- For utterances X and Y
  - Pitch contour: Same or different?
  - Pitch range: Is X larger than Y?
  - Duration: Is utterance X longer than utterance Y?
  - Speaker rate: Is the speaker of X speaking faster than the speaker of Y?
  - Voice quality ...
33. Next Class
- Tools for the Masses: Read the Praat tutorial
- Download Praat from the course syllabus page and play with a speech file (e.g. http://www.cs.columbia.edu/julia/cs4706/cc_001_sadness_1669.04_August-second-.wav or record your own)
- Bring a laptop and headphones to class if you have them