Title: Conversion
1 Conversion Issues in Hearing, Sampling,
Quantization, and Implementation
- James D. Johnston
- Microsoft Corporation
- Audio Architect
2 Basic Hearing Issues
- Parts of the ear.
- Their contribution to the hearing process.
- What is loudness?
- What is intensity?
- How loud can you hear?
- How quiet can you hear?
- How high is high?
- How low is low?
- What do two ears do for us?
3 Fundamental Divisions of the Ear
- Outer ear
  - Head and shoulders
  - Pinna
  - Ear Canal
- Middle Ear
  - Eardrum
  - 3 bones and spaces
- Inner ear
  - Cochlea
  - Organ of Corti
  - Basilar Membrane
  - Tectorial Membrane
  - Inner Hair Cells
  - Outer Hair Cells
4 Functions of the Outer Ear
- In a word, HRTFs. HRTF means head-related transfer functions, which are defined as the transfer functions of the body, head, and pinna as a function of direction. Sometimes people refer to HRIRs, or head-related impulse responses, which are the same information expressed in the time domain, as opposed to the frequency domain.
- HRTFs provide information on directionality above and beyond that of binaural hearing.
- HRTFs also disambiguate front/back and up/down sensations along the cone of confusion.
- The ear canal resonates somewhere between 1 and 4 kHz, resulting in some increased sensitivity at the point of resonance. This resonance is about 1 octave wide, give or take.
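One rough way to see where a resonance in that range could come from is to treat the ear canal as a quarter-wave tube, open at the pinna and closed at the eardrum. This is a minimal sketch under stated assumptions that are not from the talk: rigid walls, a rigid eardrum, a speed of sound of 343 m/s, and hypothetical canal lengths of 2.5 to 3.5 cm. A real canal, with a compliant eardrum and the pinna attached, will differ.

```python
# Quarter-wave model of the ear canal (an assumption, not a measured model):
# a tube closed at one end resonates first at f = c / (4 * L).
SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in air


def quarter_wave_resonance_hz(length_m: float, c: float = SPEED_OF_SOUND_M_S) -> float:
    """Lowest resonance of a tube of length length_m closed at one end."""
    return c / (4.0 * length_m)


for length_cm in (2.5, 3.0, 3.5):  # hypothetical canal lengths
    f = quarter_wave_resonance_hz(length_cm / 100.0)
    print(f"{length_cm:.1f} cm canal -> {f:.0f} Hz")
```

Longer canals resonate lower, and all three lengths land inside the 1 to 4 kHz range quoted above.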
5 Middle Ear
- The middle ear acts primarily as a highpass filter (at about 700 Hz) followed by an impedance-matching mechanical transformer.
- The middle ear is affected by muscle activity, and can also provide some level clamping and protection at high levels.
- You don't want to be exposed to sound at that kind of level.
6 Inner Ear (Cochlea)
- In addition to housing the balance mechanism, the inner ear is where most sound is transduced into neural information.
- The inner ear is a mechanical filterbank, implementing a filter whose center-frequency tuning goes from high to low as one goes farther into the cochlea.
- The bandpass nature is actually due to coupled tuning of two highpass filters, along with detectors (inner hair cells) that detect the difference between the two highpass (HP) filters.
7 Some Web Views of the Basilar Membrane and
Cochlear Mechanics.
- http://www.medizin.fu-berlin.de/klinphys/topics/ear2_anatomy_e.htm
- http://www.enchantedlearning.com/subjects/anatomy/ear/
Thanks to Brian Weisbrod for sending me a phenomenal list of these resources.
8 One point along the Membranes
[Figure: cross-section at one point along the cochlea, labeling the tectorial membrane, outer hair cells, inner hair cell, and basilar membrane.]
9 Example HP filter (This filter is synthetic, NOT
real)
10 Features of a HP filter
- At the frequency where the amplitude is greatest, the phase is changing rapidly.
- This means that two filters, slightly offset in frequency, will show a large difference in their outputs near their center frequencies, providing a very big difference signal in that region.
- When two nearly identical filters are coupled, their resonances split into two peaks, slightly offset in frequency.
- As the coupling decreases, the two resonances move back toward the same point.
- The ear takes the difference between two filters.
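The difference-of-highpass idea can be sketched numerically. The filters below are synthetic second-order analog highpass sections; like the example filter on slide 9, they are not a cochlear model. The 1 kHz tuning, the quality factor q = 10, and the helper names are hypothetical choices for illustration only.

```python
import numpy as np


def highpass2(f_hz, f0_hz, q):
    """Complex response of a 2nd-order analog highpass: H(s) = s^2 / (s^2 + s*w0/q + w0^2)."""
    s = 2j * np.pi * np.asarray(f_hz, dtype=float)
    w0 = 2.0 * np.pi * f0_hz
    return s**2 / (s**2 + s * w0 / q + w0**2)


def difference_magnitude(f_hz, f0_hz, offset, q=10.0):
    """|H1 - H2| for two otherwise identical highpass filters whose tunings differ by `offset`."""
    return np.abs(highpass2(f_hz, f0_hz, q) - highpass2(f_hz, f0_hz * offset, q))


freqs = np.logspace(1, 4, 3000)  # 10 Hz .. 10 kHz
for offset in (1.1, 1.001, 1.00001):
    d = difference_magnitude(freqs, 1000.0, offset)
    print(f"offset {offset}: peak {d.max():.3g} at {freqs[np.argmax(d)]:.0f} Hz")
```

Far below or far above the tuning, both filters agree and the difference is tiny; near the tuning, where the phase moves fastest, even a very small offset produces a clear bandpass-shaped difference. Shrinking the offset shrinks the size of that difference.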
11 Filter split vs. Frequency Response
[Figure: frequency responses for filter offsets of 1.1, 1.001, 1.00001, and 1.000001.]
12 OK, what couples these two masses?
- The outer hair cells couple the basilar and tectorial membranes.
- At high levels, the outer hair cells are depolarized by feedback from the inner hair cells. Depolarized cells are considerably less stiff than polarized cells.
- At low levels, the outer hair cells remain (mostly) polarized. This maximizes the coupling, and therefore the system sensitivity.
13 But, but... Inner hair cells
- There are calcium channels at the base of the hairs (really more like villi in biochemical terms) that open and close as the hairs are flexed sideways.
- A very VERY small amount of flexure can therefore trigger an inner hair cell.
- The inner hair cells are the detectors.
- At low frequencies (up to 500 Hz), they detect the leading edge of the waveform (the membranes moving closer to each other).
- At high frequencies (above 2 kHz or so), they detect the leading edge of the envelope.
- At frequencies between 500 Hz and 2 kHz or so, the detection is mixed.
- The higher the level of motion on the basilar membrane, the more often the nerve cells fire after the leading edge.
14 An example of Compression via tuning.
This is not data from a human subject. From Dr.
J. B. Allen
15 A set of filters at different points along the
basilar membrane.
From Dr. J. B. Allen
16 Critical Bandwidths
- The bandwidth of a filter is referred to as the Critical Band or Equivalent Rectangular Bandwidth (ERB).
- ERBs and Critical Bands (measured in units of Barks, after Barkhausen) are reported as slightly different.
- ERBs are narrower at all frequencies.
- ERBs are probably closer to the right bandwidths; note the narrowing of the filters on the Bark scale in the previous slide at high Barks (i.e. high frequencies).
- I will use the term Critical Band in this talk, by habit. Nonetheless, I encourage the use of a decent ERB scale.
- Bear in mind that both Critical Band(widths) and ERBs are useful, valid measures, and that you may wish to use one or the other, depending on your task.
- There is no established ERB scale to date; rather, researchers disagree quite strongly, especially at low frequencies. It is likely that leading-edge effects as well as filter bandwidths lead to these differences. The physics suggests that the lowest critical bands or ERBs are not as narrow as the literature suggests.
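Since no single scale is established, the sketch below simply compares two widely used published approximations: the Glasberg and Moore (1990) ERB formula and the Zwicker and Terhardt (1980) critical-bandwidth formula. Choosing these two is an assumption for illustration; they are not necessarily the scales used in this talk.

```python
import numpy as np


def erb_glasberg_moore(f_hz):
    """ERB in Hz, Glasberg & Moore (1990): 24.7 * (4.37 * f/1000 + 1)."""
    f = np.asarray(f_hz, dtype=float)
    return 24.7 * (4.37 * f / 1000.0 + 1.0)


def critical_bandwidth_zwicker(f_hz):
    """Critical bandwidth in Hz, Zwicker & Terhardt (1980): 25 + 75 * (1 + 1.4 * (f/1000)^2)^0.69."""
    f = np.asarray(f_hz, dtype=float)
    return 25.0 + 75.0 * (1.0 + 1.4 * (f / 1000.0) ** 2) ** 0.69


for f in (100, 500, 1000, 4000, 10000):
    print(f"{f:>6} Hz  ERB {erb_glasberg_moore(f):7.1f} Hz   CB {critical_bandwidth_zwicker(f):7.1f} Hz")
```

With these two formulas the ERB comes out narrower than the critical band at every frequency, with the largest relative gap at low frequencies, which is also where the slide notes the strongest disagreement.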
17 What are the main points?
- The cochlea functions as a mechanical time/frequency analyzer.
- Center frequency changes as a function of the distance from the entrance end of the cochlea. High frequencies are closest to the entrance.
- At higher center frequencies, the filters are roughly a constant fraction of an octave in bandwidth.
- At lower center frequencies, the filters are close to uniform bandwidth.
- The filter bandwidth, and therefore the filter time response length, varies by a factor of about 40:1.
18 What happens as a function of Level?
- As level rises, the ear desensitizes itself by many dB.
- As level rises, the filter responses in the ear shift slightly in position vs. frequency.
- The ear, with a basic 30 dB SNR in the detector (a power ratio of 1000, or an amplitude ratio of 1000^0.5, about 31.6:1), can range over at least 120 dB of level.
19 What does this mean?
- The internal experience, called Loudness, is a highly nonlinear function of level, spectrum, and signal timing.
- The external measurement in the atmosphere, called Intensity, behaves according to physics, and is somewhat close to linear.
- The moral? There's nothing linear involved.
20 Some points on the SPL Scale
(x is a very loud pipe organ, o is the threshold of pain/damage, and a third marker is a moderate barometric pressure change)
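Points on the SPL scale come from the standard definition SPL = 20 log10(p / 20 µPa), with p the RMS pressure. The sketch below converts in both directions. The 100 Pa (1 millibar) "barometric change" is a hypothetical figure for illustration, and such a slow change is far below the audible band, so its SPL number says nothing about loudness.

```python
import math

P_REF_PA = 20e-6  # standard reference pressure for SPL in air: 20 micropascals


def spl_db(p_rms_pa: float) -> float:
    """Sound pressure level in dB re 20 uPa for an RMS pressure in pascals."""
    return 20.0 * math.log10(p_rms_pa / P_REF_PA)


def pressure_pa(spl: float) -> float:
    """Inverse of spl_db: RMS pressure in pascals for a given SPL."""
    return P_REF_PA * 10.0 ** (spl / 20.0)


print(f"120 dB SPL is {pressure_pa(120.0):.1f} Pa RMS")
print(f"a 100 Pa (1 mbar) change corresponds to {spl_db(100.0):.1f} dB SPL")
```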
21 Edge effects and the Eardrum
- The eardrum's HP filter desensitizes the ear below 700 Hz or so. The exact frequency varies by individual. This means that we are not deafened by the loudness of weather patterns, for instance.
- At both ends of the cochlea, edge effects lessen the compression mechanisms in the ear.
- These two effects, coupled with the ear's compression characteristics, result in the kind of response shown in the next slide.
22 Fletcher and Munson's famous equal loudness
curves.
The best one-picture summary of hearing in
existence.
23 What's quiet, and what's loud?
- As seen from the previous graph, the ear can hear to something below 0 dB SPL at the ear canal resonance.
- As we approach 120 dB, the filter responses of the ear start to broaden, and precise frequency analysis becomes difficult. As we approach 120 dB SPL, we also approach the level at which near-instantaneous injury to the cochlea occurs.
- Air is made of discrete molecules. As a result, the noise floor of the atmosphere at STP approximates 6 dB SPL white noise in the range of 20 Hz-20 kHz. This noise may JUST be audible at the point of ear canal resonance. Remember that the audibility of such noise must be calculated inside an ERB or critical band, not broadband.
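The "inside an ERB, not broadband" point is simple arithmetic: for white noise, the level in one band is the broadband level plus 10 log10(band width / total width). The sketch below assumes the Glasberg and Moore ERB formula and picks 3 kHz as a stand-in for the canal resonance; both are assumptions, not values from the talk.

```python
import math


def erb_hz(f_hz: float) -> float:
    """Glasberg & Moore (1990) ERB approximation (an assumption; the talk uses no specific scale)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)


def level_in_band_db(broadband_db: float, band_hz: float, total_bw_hz: float = 19980.0) -> float:
    """Level of flat (white) noise inside one band, given its level over total_bw_hz (20 Hz - 20 kHz)."""
    return broadband_db + 10.0 * math.log10(band_hz / total_bw_hz)


# 6 dB SPL of white noise spread over 20 Hz - 20 kHz, evaluated in the ERB centred at 3 kHz
band_level = level_in_band_db(6.0, erb_hz(3000.0))
print(f"{band_level:.1f} dB SPL in one ERB")  # about -11.6 dB SPL
```

Roughly 17 to 18 dB of the broadband figure disappears once only one auditory filter's worth of noise is counted, which is why a 6 dB SPL broadband floor can sit near, rather than far above, threshold.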
24 So, what's high and what's low in frequency?
- First, the ear is increasingly insensitive to low
frequencies, as shown in the Fletcher curve set.
This is due to both basilar membrane and eardrum
effects. 20Hz is usually mentioned as the lowest
frequency detected by the hearing apparatus. Low
frequencies at high levels are easily perceived
by skin, chest, and abdominal areas as well as
the hearing apparatus.
25 (continued)
- At higher frequencies, all of the detection ability above 15-16 kHz lies in the very first small section of the basilar membrane. While some young folks have been said to show supra-20 kHz hearing ability (and this is likely true due to their smaller ear, ear canal, and lack of exposure damage), in the modern world this first section of the basilar membrane appears to be damaged very quickly by environmental noise.
- At very high levels, high (and ultrasonic) signals can be perceived by the skin. You probably don't want to be exposed to that kind of level.
26 What about binaural issues?
- Binaurally, with broadband signals, we can distinguish 10 microsecond shifts in left vs. right stimuli with the right characteristics. While this has implications in block-processed algorithms with pre-echo, it does not generally relate substantially to ADC and DAC hardware that is properly clocked.
27 The results?
- For presentation (NOT capture, certainly not processing), a range of 6 dB SPL (flat noise floor) to 120 dB (maximum you should hear, also maximum most systems can achieve) should be sufficient. This is about 19 bits.
- An input signal range of 20 Hz to 20 kHz is probably enough; there are, however, some filter issues, raised later, that may affect your choice of sampling frequency.
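The "about 19 bits" figure follows from each bit being worth 20 log10(2), about 6.02 dB, of range. A minimal sketch, assuming a flat noise floor and ignoring the small extra term that applies to full-scale sine waves:

```python
import math

DB_PER_BIT = 20.0 * math.log10(2.0)  # ~6.02 dB of range per bit


def bits_for_range(floor_db: float, ceiling_db: float) -> float:
    """Bits needed to span ceiling_db - floor_db with a flat (white) noise floor."""
    return (ceiling_db - floor_db) / DB_PER_BIT


bits = bits_for_range(6.0, 120.0)  # 6 dB SPL floor to 120 dB SPL peak
print(f"{bits:.2f} bits -> round up to {math.ceil(bits)}")
```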
28 Sampling and Quantization
29 Sampling and Quantization
- Continuous domain vs. sampled domain.
  - Sampling
  - Aliasing
- Discrete level (quantized) vs. noisy continuous domain
  - Quantization
  - Dithering
- Time/frequency duality
  - FFTs (DFTs too)
30 What do Analog and Digital really mean?
- The domain we commonly refer to as analog is a time-continuous (at least to mortal eyes) domain, with continuous level resolution limited by physical properties of material. The level resolution and time resolution are never exact due to basic physics.
- The digital domain is a sampled, quantized domain. That means that we only know the value of the signal at specified times, and that the level of the signal occupies one of a set of discrete levels. The set of levels, and the times at which the signal has a value, however, are exact in the digital framework. (Although there may, of course, be errors in acquisition.)
31 Properties of the analog domain
- The time domain is continuous. This means that any frequency limits come from physical processes, not from mathematical restrictions. However, physics places some very strong constraints on such signals:
  - All signals have finite energy
  - All signals have finite bandwidth
  - All signals have finite duration
  - All signals have a finite noise floor
- All four of the points above are very important!
32 A reminder about Duality in the Fourier domain
- Multiplying two signals means that you convolve their (full, complex) spectra.
- Convolving two signals means that you multiply their (full, complex) spectra.
- These two properties of Fourier Analysis (other commonly used transforms obey them as well) are very important. Remember them even if you don't know anything about Fourier Analysis.
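Both duality properties can be checked numerically with the DFT. One caveat: the DFT works on finite blocks, so the convolution it corresponds to is circular, and the time-multiplication case picks up a 1/N scale factor with NumPy's unnormalized forward FFT. The random test signals are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
x = rng.standard_normal(n)
y = rng.standard_normal(n)


def circular_convolve(a, b):
    """Direct circular convolution: c[k] = sum_m a[m] * b[(k - m) mod N]."""
    N = len(a)
    return np.array([sum(a[m] * b[(k - m) % N] for m in range(N)) for k in range(N)])


X, Y = np.fft.fft(x), np.fft.fft(y)

# Convolving in time <-> multiplying spectra
print(np.allclose(np.fft.fft(circular_convolve(x, y)), X * Y))

# Multiplying in time <-> convolving spectra (scaled by 1/N for the DFT)
print(np.allclose(np.fft.fft(x * y), circular_convolve(X, Y) / n))
```

Both lines print True. Sampling is the second case: multiplying by an impulse train in time convolves the signal spectrum with the impulse train's spectrum.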
33 Fourier Domain Properties
- Please remember the properties. I don't want or expect anyone to understand all the details, but please remember the PROPERTIES.
- Fourier analysis is valid on all finite-energy, finite-bandwidth signals. That describes all real-world audio signals that we care to deal with. (The only counterexamples occur in astrophysics and particle physics, neither of which a listener can be comfortably seated near.)
34 So, let's sample that analog signal.
- Sampling means capturing the value of the signal at a periodic rate.
- This means that we MULTIPLY the signal by a train of impulses at a regular interval.
- Quite obviously, that's not what actual hardware does. Most use a track/hold, or other capture method. The result, in the sampled domain, is the same.
- That means that we CONVOLVE the signal spectrum and the sampling spectrum.
- Hence, we have the Nyquist criterion, later proven by Shannon.
35 A Graphical Example
Top to bottom: sampling train; spectrum of sampling train; sine wave below half the sampling rate and resulting samples; their spectra; sine wave equally far above half the sampling rate and resulting samples; their spectra.
36 What would I hear if there was a demo of aliasing?
- Aliasing and imaging (imaging, as we will see shortly, is the reconstruction version of aliasing) sound awful. Aliasing in general is anharmonic, and remarkably annoying.
- Filtering is a requirement. It's not an option.
- The presence of a filter has consequences that we will discuss later.
- This leads us to the sampling theorem.
37 Hence, the Sampling Theorem
- We must limit the bandwidth of the signal to fs/2, where fs is the sampling frequency.
- While that band does not have to run from dc to fs/2, that's what we do in audio, since we want signals close to dc. (There are other sampling methods that sample other regions of frequency.)
- This means that we must band limit the signal into the sampler. An anti-alias filter is not just a good idea, it's mathematically necessary.
38 Consequences of the Sampling Theorem
- We must band limit the signal in order to avoid aliasing.
- Any out-of-band signals will alias back into the base band.
- That has consequences far beyond the initial sampling of the material. We'll get to that later when we talk about things like clipping and nonlinearities, or jitter.
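Where an out-of-band component lands can be computed directly: the sampled spectrum repeats every fs, and the upper half of each repeat folds back onto the baseband. The sketch uses cosines because a sine above fs/2 aliases to the folded frequency with inverted sign; the 30 kHz tone and 48 kHz rate are arbitrary examples.

```python
import numpy as np


def alias_frequency(f_hz: float, fs_hz: float) -> float:
    """Frequency in 0..fs/2 at which a real sinusoid at f_hz appears after sampling at fs_hz."""
    f = f_hz % fs_hz  # sampling makes the spectrum periodic in fs
    return fs_hz - f if f > fs_hz / 2 else f  # fold the upper half back down


fs = 48000.0
n = np.arange(32)
above = np.cos(2 * np.pi * 30000.0 * n / fs)  # 30 kHz, above fs/2
below = np.cos(2 * np.pi * alias_frequency(30000.0, fs) * n / fs)
print(alias_frequency(30000.0, fs), np.allclose(above, below))
```

The two sample sequences are identical, so once sampled there is no way to tell the 30 kHz tone from an 18 kHz one; only a filter ahead of the sampler can prevent it.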
39 OK, but we represent those samples as binary
values, right?
Yes, we do. That's called quantization. That's the other necessary process in digitization of a signal. Quantization is why digital signals can be saved and re-saved without degradation. It's also why digital PCM signals have a fixed, unchanging noise floor. (There are other possibilities; we'll talk about those later.)
40 So, quantization is like rounding, right? Well, let's see!
Using rounding only:
Original, quantized, error, and spectrum of original and error.
41 To drive the point even further home
That's right, we have to dither quantizers. It's not just a good idea.
42 Dither? What's that?
- As the spectra (and the error waveform of the second slide) show, the error of an undithered quantizer is highly correlated to the original signal.
- Dither consists of adding some random function BEFORE the quantization so that the noise is decorrelated.
- The first kind of dither people tried was called uniform. Let's see how that works out.
43 Uniform ±½ step-size dither.
Not bad, but notice the noise level coming and going around the zero crossing?
44 Hence, Triangular PDF Dither
Now, the noise stays constant over all amplitudes.
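The difference between no dither, uniform (RPDF, ±½ LSB) dither, and triangular (TPDF, ±1 LSB) dither shows up if you hold the input at a fixed value between two levels and measure the total error power over many dither draws. This is a minimal Monte Carlo sketch with a 1 LSB step; the test values and trial count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)


def quantize(v):
    """Round to the nearest step (step size = 1 LSB); floor(v + 0.5) avoids tie ambiguity."""
    return np.floor(v + 0.5)


def total_error_power(x, dither, trials=200_000):
    """Mean-square of (output - input) for a constant input x, averaged over dither draws."""
    if dither == "none":
        d = np.zeros(trials)
    elif dither == "rpdf":  # uniform, +/- 1/2 LSB
        d = rng.uniform(-0.5, 0.5, trials)
    elif dither == "tpdf":  # triangular, +/- 1 LSB (sum of two independent uniforms)
        d = rng.uniform(-0.5, 0.5, trials) + rng.uniform(-0.5, 0.5, trials)
    else:
        raise ValueError(dither)
    e = quantize(x + d) - x
    return np.mean(e**2)


for x in (0.0, 0.25, 0.5):
    row = {kind: round(float(total_error_power(x, kind)), 3) for kind in ("none", "rpdf", "tpdf")}
    print(x, row)
```

Without dither the error is a fixed function of the input. With RPDF dither the noise power still depends on where the input sits between levels (it is x(1 - x) LSB², zero on a level and 0.25 LSB² halfway between), which is the noise modulation visible around the zero crossing. With TPDF dither it stays at 0.25 LSB² for every input value: constant, but higher than the roughly 1/12 LSB² of plain rounding.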
45 To recap
Notice, even in this very mild case, where harmonics do not alias over each other, that in addition to eliminating tonal components and preserving information, dither RAISES the noise floor, and lowers the PERCEIVED noise floor.
46 To summarize
- A digital signal is sampled and quantized.
- Sampling requires anti-aliasing filters.
- Quantization requires TPDF dither.
- Dither and anti-aliasing are not options!
47 What about reconstruction?
- Yes, that convolution theorem applies again, this time usually convolving a square pulse with the digital signal.
- This produces images of the signal spectrum. While images and aliases come about by mathematically similar processes, people persist in having different names for them.
- Some (many in the high end) omit the anti-imaging filter, and imagine that there is a beating problem. If you haven't heard this yet, you will at some point. The next few graphs show why it isn't so.
48 Basically, the same thing happens.
Top line: sine wave below fs/2. Next line: sine wave plus first alias pair (blue), and just the aliases (red). Each line adds another pair, except the last, which adds the first 100 pairs. The gain of the red waveform is greatly increased in order to make it visible. Notice that after 100 alias pairs are added in, the original waveform has the familiar stairstep shape.
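The stairstep can be rebuilt the same way in a few lines. This sketch assumes a zero-order hold (one-sample step) output, for which the component at each image frequency F = k·fs + f has amplitude sinc(F/fs) and a half-sample delay; the 5 kHz tone, 48 kHz rate, and helper names are illustrative assumptions. The check is made at the middle of each step, where a true stairstep equals the sample value.

```python
import numpy as np

fs = 48000.0
T = 1.0 / fs
f = 5000.0  # test tone below fs/2 (an arbitrary choice)
n = np.arange(16)
x = np.sin(2 * np.pi * f * n * T)  # the samples


def zoh_from_images(t, n_pairs):
    """Sum the baseband tone and n_pairs image pairs at k*fs + f, k = -n_pairs..n_pairs.

    For a zero-order-hold output each component is
    sinc(F*T) * sin(2*pi*F*(t - T/2)) with F = k*fs + f.
    """
    k = np.arange(-n_pairs, n_pairs + 1)[:, None]
    F = k * fs + f
    return np.sum(np.sinc(F * T) * np.sin(2 * np.pi * F * (t[None, :] - T / 2)), axis=0)


t_mid = (n + 0.5) * T  # middle of each step
for pairs in (0, 1, 10, 1000):
    err = np.max(np.abs(zoh_from_images(t_mid, pairs) - x))
    print(f"{pairs:5d} image pairs: max error at step centres {err:.2e}")
```

The baseband tone alone is slightly too small (that is the sin(x)/x droop), and the error falls steadily as image pairs are added: the stairstep is the tone plus its images, not something separate from them.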
49 Notice, at the bottom, the beating that some
audio enthusiasts complain about. Notice that
beating only happens when aliases are added.
50 Low frequency reconstruction example.
51 Elements of reconstruction
- In reconstruction, filtering is also necessary to remove the image signals that originate in the same fashion as aliases arise in sampling.
- In reconstruction, the waveform is sometimes a step rather than an impulse, so other compensation is sometimes necessary to get a flat frequency response. Why?
- Again, using the step in time (convolving) means that you multiply the signal spectrum by the frequency response of the step, leading to a rolloff like sin(x)/x.
- This rolloff can be as much as -3.92 dB at fs/2, and can cause audible rolloff if not corrected somehow.
- Modern convertors of the delta-sigma variety do not use a step at the final sampling frequency (although they certainly use a step, it's at a much higher frequency). Their design, however, introduces other issues. That discussion comes later.
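The -3.92 dB figure is the magnitude of a one-sample step's frequency response, |sinc(f/fs)| with sinc(u) = sin(πu)/(πu), evaluated at f = fs/2, where it equals 2/π. A short sketch; the 44.1 kHz rate and test frequencies are just examples.

```python
import numpy as np


def zoh_droop_db(f_hz, fs_hz):
    """Magnitude of a zero-order hold (one-sample step) at f, in dB: 20*log10(|sinc(f/fs)|)."""
    return 20.0 * np.log10(np.abs(np.sinc(np.asarray(f_hz, dtype=float) / fs_hz)))


fs = 44100.0
for f in (1000.0, 10000.0, 20000.0, fs / 2):
    print(f"{f:8.0f} Hz: {zoh_droop_db(f, fs):6.2f} dB")
```

At a 44.1 kHz step rate the loss is already over 3 dB at 20 kHz, which is why uncorrected step reconstruction can be audible.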
52 How to Build Convertors
- A quick survey of methods. Much more information
will be forthcoming this afternoon.
53 Baseband Conversion
[Block diagram: Audio Input -> Filter -> A to D Convertor, driven by a Sampling Clock]
- This is the basic block diagram for any PCM convertor.
- In this convertor, the filter is outside the convertor, and the quantizer is part of the sampling mechanism.
- This method is not very common any more, but we will discuss its properties before moving on to oversampling convertors.
54 Spectrum of Signal and Noise
[Figure panels: Original; Spectrum of Original; Quantized and Dithered; Spectrum of Quantized, Dithered Signal (red: 8 bits, green: 9 bits).]
(Note: to make scaling easier, I will use 7/8/9-bit quantization.)
55 Things to notice.
- The noise floor is flat. If you sum up all of the energy in the noise floor, you will wind up with the SNR you expect. Notice that practically all of the noise is IN BAND when there is no oversampling.
- It's really hard to see quantization at even very noisy levels in a waveform plot.
- Each bit of quantization is worth 6.02 dB of signal-to-noise ratio. One more bit will drop the noise floor by 6 dB; one less bit raises the noise floor by 6 dB.
- NOTICE THAT NOISE SPREADS OVER THE ENTIRE OUTPUT SPECTRUM.
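The 6.02 dB per bit can be measured rather than taken on faith. This sketch rounds a full-scale sine to N bits (no dither) and compares the measured SNR with the familiar full-scale-sine formula 6.02·N + 1.76 dB; the 1.76 dB term comes from a sine's power relative to its peak. The block length and cycle count are arbitrary choices that spread the sample phases evenly.

```python
import numpy as np


def quantized_sine_snr_db(bits: int, n: int = 1 << 16, cycles: int = 997) -> float:
    """SNR of a full-scale sine after rounding to `bits` bits (no dither), measured directly."""
    step = 2.0 / 2**bits  # full scale spans [-1, 1)
    x = np.sin(2 * np.pi * cycles * np.arange(n) / n)
    q = step * np.floor(x / step + 0.5)  # round to nearest level
    err = q - x
    return 10.0 * np.log10(np.mean(x**2) / np.mean(err**2))


for b in (7, 8, 9):
    print(b, round(quantized_sine_snr_db(b), 2), round(6.02 * b + 1.76, 2))
```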
56 Why so much emphasis on how the noise spectrum
spreads out?
- Therein lies the beginnings of oversampling.
57 [Figure panels: sine wave, original sampling rate; spectrum of 8-bit quantization; sine wave, 4x sampling rate; spectrum of 7-bit quantization at 4x oversampling, shown in the same bandwidth; full spectrum of the 7-bit quantized signal at 4x sample rate.]
Notice that the noise has 4x the bandwidth, but only ¼ of it falls in the original passband.
58 That demonstrates the most trivial form of
oversampling.
- This trivial form of oversampling provides the equivalent of 1 bit, in-band, for every 4x increase in the sampling frequency, i.e. 3 dB per doubling.
- Now we move on to more sophisticated forms of oversampling, with the noise spectrum shaped as well.
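The arithmetic behind "1 bit per 4x": with a flat noise floor the total noise power is fixed, so only 1/OSR of it stays in the original band, a gain of 10 log10(OSR) dB. A minimal sketch:

```python
import math


def flat_oversampling_gain_db(osr: float) -> float:
    """In-band SNR gain from oversampling by osr when the quantization noise is flat (white)."""
    return 10.0 * math.log10(osr)


for osr in (2, 4, 16, 64):
    g = flat_oversampling_gain_db(osr)
    print(f"{osr:3d}x: {g:5.2f} dB (~{g / 6.02:.2f} bits)")
```

Each doubling buys about 3.01 dB, so 4x buys one bit and 64x only three, which is why flat oversampling alone is not enough.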
59 Noise Shaping
[Block diagram of a noise shaper, with blocks and signals labeled Quantizer, H(s), Error Signal, Quantized Signal, Output Bits, and a subtraction (-).]
This is the basic form of a noise shaper. I'm not going to do a full mathematical analysis for the sake of time. What H(s) does is shape the noise floor. This can be done with or without oversampling. Two examples will follow. The output bits of this system are PCM. ALL OVERSAMPLED SYSTEMS ARE PCM SYSTEMS AT THEIR HEART!!!!
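A discrete-time error-feedback quantizer is one simple way to build this kind of loop. In the sketch below, H is taken to be a single sample of delay (a hypothetical choice), so the output error is the first difference of the quantization error, giving a noise transfer function of 1 - z^-1. The step size, the busy random test signal, and the "lowest eighth of the spectrum" band are assumptions for illustration, with no oversampling involved.

```python
import numpy as np

rng = np.random.default_rng(2)
STEP = 1.0 / 16  # quantizer step size (an assumption)


def quantize(v):
    return STEP * np.floor(v / STEP + 0.5)


def first_order_error_feedback(x):
    """Quantize x, feeding the previous quantization error back.

    v[n] = x[n] - e[n-1];  y[n] = Q(v[n]);  e[n] = y[n] - v[n]
    so y[n] - x[n] = e[n] - e[n-1]: the error is first-differenced (pushed toward high frequencies).
    """
    y = np.empty_like(x)
    e = np.empty_like(x)
    prev = 0.0
    for i, xi in enumerate(x):
        v = xi - prev
        y[i] = quantize(v)
        e[i] = y[i] - v
        prev = e[i]
    return y, e


def low_band_power(noise, fraction=1 / 8):
    """Noise power in the lowest `fraction` of the spectrum (0 .. fraction * fs/2)."""
    spec = np.abs(np.fft.rfft(noise)) ** 2
    return spec[: int(len(spec) * fraction)].sum()


x = 0.5 * rng.uniform(-1, 1, 1 << 14)  # busy test signal so the errors look white
y_plain = quantize(x)
y_shaped, e = first_order_error_feedback(x)

print("low-band noise, shaped / plain:", low_band_power(y_shaped - x) / low_band_power(y_plain - x))
print("total noise, shaped / plain:", np.mean((y_shaped - x) ** 2) / np.mean((y_plain - x) ** 2))
```

The low-band noise drops by roughly 13 dB, while the total noise roughly doubles. That doubling is the penalty from storing quantization noise in the loop.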
60 Adding this H(s) introduces some Issues, of
course.
- The values of H(s) must be carefully controlled in order to ensure stability.
- H(s) has storage in it, so quantization noise gets stored. This means that you have more noise than just the basic quantization noise. So, there is a penalty, especially if there is a lot of storage or memory in the noise shaper.
- The shape of the noise is closely related to the inverse of H(s).
- I won't try to present a full analysis; the results of one are forthcoming in the afternoon.
61 An example of noise shaping with no oversampling
NOTE: This is an example; nobody uses this particular H(z), and in fact I've not even tested it for stability!!!! The point is simple: you CAN do noise shaping even with no oversampling, and some DACs do it, to attempt to match zero-loudness curves. We can discuss the utility of that in Q/A.
62 What's the point?
- Within limits, using a noise-shaping system, you can move the noise around in frequency.
- You can, for instance, push lots of the noise up to high frequencies.
- That is one of the reasons for oversampling.
63 What's another reason for Oversampling?
- You get to control the response of the initial anti-aliasing/anti-imaging filter digitally.
- As most everyone knows by now, high-order analog filters have a variety of problems:
  - They are hard to manufacture
  - They are expensive
  - Since they are IIR filters, they have startling phase problems near the stop band.
64 What sort of oversampling does the analog filter
issue lead us to?
[Figure: 4x oversampling with a digital FIR filter, plus a 5th-order analog filter. This is shown as an example.]
65 The results?
- The 13th-order analog filter (with horrible phase response) is replaced by a 5th-order analog filter.
- The first, sharp anti-aliasing filter is now a digital filter, with deterministic behavior and performance. All it takes is MIPS.
- Nowadays, MIPS are cheap.
- The filter is trustworthy. It won't drift, oscillate, distort, etc., if it's designed and implemented properly.
66 Remember the Filters in the ear?
- Your ear is a time analyzer. It will analyze the filter taps, etc., AS THEY ARRIVE; it doesn't wait for the whole filter to arrive.
- If the filter has substantial energy that leads the main peak, this may be able to affect the auditory system.
- In codecs this is a known, classic problem, and one that is hard to solve.
- In some older rate convertors, the pre-echo was quite audible.
- The point? We have to consider how the ear will analyze any anti-aliasing filter. Two examples follow.
67 An example of a filter with passband ripple and
barely enough stop band rejection.
68 An example of a good, longer filter, with less
passband ripple.
69 An interesting result
- Trying to use the shortest possible filter (i.e. minimizing MIPS) results in a worse time response from the point of view of the auditory system.
- Passband ripple means that there are tails on the filter.
70 Another interesting result
- Sharper filters have more ringing, and may have more auditory problems.
- A filter cutting off over 2.05 kHz must necessarily have a wider main lobe than the narrowest (in time) cochlear filter.
- A filter cutting off over 4 kHz will have a main lobe a bit smaller than the narrowest cochlear filter.
- This suggests that for higher sampling rates, we do not want the fastest filter, but rather a filter with a wider transition band and a narrower time response.
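One way to put numbers on "sharper filters ring longer" is Kaiser's empirical estimate of FIR length for a windowed design, N ≈ (A - 7.95) / (14.36 · Δf/fs) + 1, where A is the stopband attenuation in dB and Δf the transition width. This is a proxy for overall impulse-response duration, not a measure of the main lobe itself. The 100 dB attenuation and the 96 kHz rate are assumptions; the duration in milliseconds depends only on Δf, not on fs.

```python
import math


def kaiser_fir_length(atten_db: float, transition_hz: float, fs_hz: float) -> int:
    """Kaiser's estimate of FIR length for a given stopband attenuation and transition width.

    N ~ (A - 7.95) / (14.36 * df / fs) + 1   (valid for A above about 21 dB)
    """
    return math.ceil((atten_db - 7.95) / (14.36 * transition_hz / fs_hz) + 1)


fs = 96000.0  # same hardware rate for both filters (an assumption)
for width in (2050.0, 4000.0):
    n = kaiser_fir_length(100.0, width, fs)
    print(f"transition {width:6.0f} Hz: {n} taps = {1000 * n / fs:.2f} ms")
```

Roughly doubling the transition width roughly halves the impulse response length, from about 3.1 ms to about 1.6 ms here.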
71 Two examples
73 Is this audible?
- That's a good question. Since we are stuck, in general, with the filters our ADCs and DACs use, it's dreadfully hard to actually run this listening test.
- How would I do that?
  - Get a DAC with a SLOW rolloff running at 4x (192 kHz).
  - Make a DC to 20 kHz Gaussian pulse at 192 kHz.
  - Downsample by zeroing 3 of every 4 samples and multiplying the others by 4.
  - Generate a third signal with a TIGHT filter.
  - Compare the three signals in a listening test.
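The first two test signals can be generated in a few lines. This sketch assumes the Gaussian pulse should be about 60 dB down at 20 kHz (the slide does not give a number), and it does not build the third, tightly filtered signal. Zeroing 3 of every 4 samples and scaling by 4 leaves a 48 kHz signal playing at 192 kHz with no anti-imaging filter, so the pulse spectrum reappears around 48 kHz.

```python
import numpy as np

FS = 192_000  # playback rate from the slide
N = 4096
# Gaussian pulse whose spectrum is ~60 dB down at 20 kHz (the -60 dB target is an assumption):
# |G(f)| ~ exp(-(2*pi*f*sigma)^2 / 2), so sigma = sqrt(2*ln(1000)) / (2*pi*20 kHz).
sigma_s = np.sqrt(2 * np.log(1000.0)) / (2 * np.pi * 20_000.0)
t = (np.arange(N) - N // 2) / FS
pulse = np.exp(-0.5 * (t / sigma_s) ** 2)

# Keep every 4th sample (scaled by 4), zero the rest: a 48 kHz signal replayed
# at 192 kHz with no anti-imaging filter.
stuffed = np.zeros_like(pulse)
stuffed[::4] = 4.0 * pulse[::4]

P = np.abs(np.fft.rfft(pulse))
S = np.abs(np.fft.rfft(stuffed))
bin_48k = 48_000 * N // FS  # FFT bin index of 48 kHz
print("pulse at 48 kHz (re DC):  ", P[bin_48k] / P[0])
print("stuffed at 48 kHz (re DC):", S[bin_48k] / S[0])
```

The original pulse has essentially nothing at 48 kHz, while the zero-stuffed version has a full-strength image there, exactly the energy a slow-rolloff DAC would pass.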
74 Is this 4x oversampling what people do?
- Not generally. That's what they did for a while, until MIPS got even cheaper.
- What they did was go to more oversampling, a LOT more oversampling.
- This uses more digital circuitry and less analog.
- There are a whole variety of circuitry and linearity reasons; Steve will talk about them in the afternoon. Almost all of them point toward much more oversampling.
75 Massive oversampling
- Remember: one gets 3 dB per doubling of Fs from oversampling with a flat noise floor.
- If we also put a single integrator with its zero at 20 kHz into H(s), we will see that the increased SNR available is 3 + 6 = 9 dB per doubling of Fs. There will be some cost in the form of a constant negative term to this SNR, which is overcome by very moderate levels of oversampling.
- Each additional order of integration adds another 6 dB per doubling of the sampling frequency.
- On the next page are some curves.
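The 3 + 6·L dB per doubling can be checked with an idealized model. This sketch assumes white quantization noise and a pure differentiator noise transfer function (1 - z^-1)^L rather than the integrators with a 20 kHz knee used in the talk, so the numbers are illustrative, not the talk's curves. It integrates |NTF|² over the band 0..π/OSR.

```python
import numpy as np


def inband_noise_db(order: int, osr: float, points: int = 20_000) -> float:
    """In-band noise (dB, relative to the total un-shaped noise) for an ideal NTF (1 - z^-1)^order.

    Quantization noise is assumed white; |NTF(w)|^2 = (2*sin(w/2))^(2*order).
    The band of interest is 0 .. pi/osr (midpoint-rule integral).
    """
    w_max = np.pi / osr
    w = (np.arange(points) + 0.5) * w_max / points
    ntf_sq = (2.0 * np.sin(w / 2.0)) ** (2 * order)
    fraction = ntf_sq.sum() * (w_max / points) / np.pi
    return 10.0 * np.log10(fraction)


for order in range(4):
    slope = inband_noise_db(order, 64) - inband_noise_db(order, 128)
    print(f"order {order}: {slope:.2f} dB per doubling (expect ~{3.01 + 6.02 * order:.2f})")
```

The constant penalty also appears: at OSR = 1, first-order shaping makes the total noise about 3 dB worse than no shaping, which a few doublings quickly recover.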
76 Noise shape vs. Order, integrator pole at w1
[Figure: noise shape curves for orders 1 through 7.]
Note: These examples do not include the 3 dB/octave term from noise spreading.
(Note: Curves are examples only. Real-world circuit considerations limit these curves.)
77 Right. So what does that do for me, anyhow?
- Remember, nearly the same amount of noise is being shaped in each case.
- As there is more space under the curve at high frequencies, more of the noise moves to high frequencies.
- That means there is LESS noise at low frequencies.
- Therefore, if we FILTER OUT the high frequencies, we wind up with a lower sampling rate signal with a higher SNR.
78 Some Examples (low order)
[Figure: Order 1 and Order 2 noise shaping at oversampling ratios of 1x, 2x, 4x, 8x, 16x, 32x, 64x, and 128x.]
Original SNR 0 dB. 128x downsampling, 48 kHz.
79 SNR vs. order vs. downsampling rate for a
hypothetical system.
80 Details about those examples
- All of the examples have n integrators with a knee at 20 kHz. This is not necessarily the optimum solution; it is used as an example.
- The examples are theoretically calculated; there is no component or electrical error involved.
81 SUMMARY
82 Auditory system characteristics
- Everything must be considered within the relevant cochlear filter bandwidth.
- 0 dB SPL is slightly below atmospheric noise level.
- 120 dB SPL is a good maximum; even that level is very dangerous for hearing.
- High frequency issues may be due to actual hearing, to filter time response issues, or both.
- Gradual filters are safer than steep filters.
83 Quantization and Sampling
- Antialiasing and anti-imaging filters are not just a good idea, they are a requirement.
- Dithering is not just a good idea, it's a requirement.
- There are many ways to quantize and sample.
84 Convertor Technology
- Converter technology exists to do proper, clean conversions.
- There is no basic mathematical difference in the result of SSR vs. a Delta-Sigma converter in terms of what it delivers to the PCM system; the differences are due to circuitry and cost issues.
- Capturing the direct delta-sigma waveform (single or multibit) can be done. One group of high-rate proponents proposes such a system. The only things this results in, practically, are the removal of the sharp anti-aliasing filter and the retention of the high-frequency noise.