Title: Audio, Images, and Video
1. Audio, Images, and Video
- Efficient representation of media
- Larry Denenberg
2. Epitaph
John Backus, inventor of FORTRAN
1924-2007
3. Epigraph
- Prof. Denenberg showed early promise as a
riveting lecturer who had done a lot of thinking
about higher education and would be a great
lecturer. Apparently, however, Prof. Denenberg
failed to realize that the CS121 style of
teaching -- read off your own notes that you
simultaneously put up on the overhead projector
and hand out to the students -- is the absolute
hands-down BEST way to reduce students to
stupefied, uninterested, uncaring blobs, as well
as the world's least effective way of encouraging
students to come to class.
4. The Problem
- People want to listen to music, look at pictures, and watch videos
- Straightforward high-quality representation of audio/images/video uses LOTS of space
- Cellphones, iPods, etc., have limited space
- Networks have limited bandwidth
- Hence: we need to represent media efficiently
- (A small slice of an immense topic)
5. Review: Lossy vs. Lossless Compression
- Lossless compression: the original can be retrieved perfectly, bit for bit
- Lossy compression: information is lost; the original cannot be recovered precisely
- When is loss acceptable?
- Parametrizable loss: how bad can it be?
- Rest of this talk: lossy compression
- Info on lossless compression: LD ch. 5
6. Transforms: What is a transform?
- A transform is a way of changing the representation of data, typically losslessly
- We use transforms to put data into a more convenient form
- For example, a transform might rearrange data to group together "useful" components
- Example: the Fourier transform, which decomposes a signal into sine waves of various frequencies, some not so useful
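As a minimal sketch of the idea, the snippet below (Python with numpy; the tone frequencies and amplitudes are made up for illustration) builds a two-tone signal and reads the tones back out of its Fourier transform:

```python
import numpy as np

fs = 1000                      # sampling rate in Hz (assumed for the example)
t = np.arange(0, 1, 1 / fs)    # one second of samples
signal = 2.0 * np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.fft.rfft(signal)                  # transform to the frequency domain
freqs = np.fft.rfftfreq(len(signal), 1 / fs)
amplitudes = 2 * np.abs(spectrum) / len(signal)

# The two input tones show up as peaks at 50 Hz and 120 Hz.
for f, a in zip(freqs, amplitudes):
    if a > 0.1:
        print(f"{f:.0f} Hz: amplitude {a:.2f}")
```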
7. Audio
- The gold standard is CD quality: 2-channel 16-bit PCM encoding at 44.1 kHz sampling
- Multiply it out to get about 10.6 MB per minute (worked out below)
- Recall that even this is an approximation!
- Digression on CDs: where is the music?
- Key to (lossy) compression: limitations of human hearing
- E.g., the ear hears nothing above about 21 kHz
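A quick back-of-the-envelope check of the CD figure, using only the parameters on the slide:

```python
# 2 channels, 16 bits per sample, 44,100 samples per second.
channels, bits_per_sample, sample_rate = 2, 16, 44_100

bytes_per_second = channels * (bits_per_sample // 8) * sample_rate
bytes_per_minute = bytes_per_second * 60
print(bytes_per_minute / 1_000_000)   # ~10.6 MB per minute, uncompressed
```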
8. The mp3 strategy
- Transform to the frequency domain: filter the music into twelve bands chosen based on the frequency response of the human ear
- Use psychoacoustic modelling to determine how much each band contributes to sound perception
- Allocate bits to bands according to importance, i.e., encode important bands more accurately (more bits) than others
9. Filter input into frequency bands
- Recall that any waveform can be written as a combination of sine waves (pure tones) of various frequencies and amplitudes
- We can filter the input music into components, each of which is made up of a narrow range (band) of frequencies (see the sketch below)
- Each band is encoded separately; they are combined when the music is played
- Digression: the mp3 standard specifies only decoding, not encoding
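A toy sketch of the band-splitting idea, using an FFT mask rather than the polyphase filter bank a real mp3 encoder uses; the test signal and band edges are invented for illustration:

```python
import numpy as np

def split_into_bands(signal, sample_rate, band_edges_hz):
    """Toy band splitter: zero out everything outside each band in the
    frequency domain and transform back."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), 1 / sample_rate)
    bands = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        masked = np.where((freqs >= lo) & (freqs < hi), spectrum, 0)
        bands.append(np.fft.irfft(masked, n=len(signal)))
    return bands

fs = 8000
t = np.arange(0, 0.5, 1 / fs)
music = np.sin(2 * np.pi * 220 * t) + np.sin(2 * np.pi * 1500 * t)

bands = split_into_bands(music, fs, [0, 500, 2000, 4000])
# Summing the bands reconstructs the original: the split itself is lossless.
# Compression happens later, when each band is quantized separately.
print(np.allclose(sum(bands), music, atol=1e-8))
```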
10. Psychoacoustic modelling
- When two tones are close in frequency, the louder masks the softer; that is, the softer is much less audible than a tone of the same volume at a very different frequency
- A pure tone is not as effective at masking as noise is, and is masked more easily
- Soft sounds just before or just after loud sounds are masked
- All these factors are considered to calculate the signal-to-mask ratio (SMR) for each band; a higher SMR means a more important band
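A toy illustration of the SMR bookkeeping; a real psychoacoustic model derives the per-band signal levels and masking thresholds from the spectrum, whereas the numbers here are invented:

```python
# Per-band signal level and masking threshold, both in dB (made-up values).
signal_db = {"0-500 Hz": 70, "500-2000 Hz": 55, "2000-4000 Hz": 40}
mask_db   = {"0-500 Hz": 45, "500-2000 Hz": 50, "2000-4000 Hz": 38}

smr = {band: signal_db[band] - mask_db[band] for band in signal_db}
print(smr)   # higher SMR => band is more audible => it deserves more bits
```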
11. Bit allocation
- With mp3, the bit rate is fixed in advance (e.g., 128 Kbps or 192 Kbps); a lower bit rate means more compression and lower quality
- So we know a priori how many bits we can use (contrast with, e.g., Huffman encoding)
- We use the results of psychoacoustic modelling to apportion these bits among the frequency bands; more bits mean more accurate coding and more faithful reproduction (sketched below)
- VBR (variable bit rate) encoding tries to improve on this
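A minimal sketch of fixed-budget allocation, assuming an SMR per band is already known; the greedy loop and the rough 6-dB-per-bit rule stand in for mp3's actual rate/distortion loops, and the numbers are illustrative:

```python
def allocate_bits(smr_per_band, total_bits):
    bits = {band: 0 for band in smr_per_band}
    # The "benefit" of the next bit for a band drops as it accumulates bits.
    for _ in range(total_bits):
        best = max(smr_per_band, key=lambda b: smr_per_band[b] - 6 * bits[b])
        bits[best] += 1          # each extra bit buys roughly 6 dB of accuracy
    return bits

smr = {"0-500 Hz": 25, "500-2000 Hz": 5, "2000-4000 Hz": 2}
print(allocate_bits(smr, 8))     # most bits go to the most audible band
```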
12. Images
- RGB representation: describe each color with 24 bits, 8 each for red, green, and blue
- Write a color as 6 hex digits: 0xRRGGBB
- An image is represented by pixels (picture elements); each encodes a tiny colored dot
- So a 4-megapixel image (2272 x 1704) takes about 12 MB of storage (checked below)
- Recall that even this is an approximation!
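A quick check of both claims, the hex notation and the storage figure (plain Python, nothing assumed beyond the slide's numbers):

```python
# Write a 24-bit RGB color as 6 hex digits, and size an uncompressed image.
def rgb_to_hex(r, g, b):
    return f"0x{r:02X}{g:02X}{b:02X}"

print(rgb_to_hex(255, 128, 0))        # 0xFF8000, a shade of orange

width, height, bytes_per_pixel = 2272, 1704, 3
print(width * height * bytes_per_pixel / 1_000_000)  # ~11.6 MB uncompressed
```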
13. (No transcript)
14. Simple compression techniques
- Reduce resolution by subsampling
- Fewer bits per color: use 5 bits for R and B, 6 bits for G, to save 1/3 of the space (our eyes perceive G better than R or B; sketched below)
- Fewer colors: use 8 bits to encode a color palette of 256 carefully chosen colors
- Make a table of 256 colors, image-specific
- Encode each color as an index into this table
- GIF and PNG format images can work this way
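A minimal sketch of the 5-6-5 idea: packing a 24-bit color into 16 bits and seeing what precision is lost (the helper names are just for illustration):

```python
def pack_rgb565(r, g, b):
    # Keep 5 bits of red, 6 of green, 5 of blue: 16 bits instead of 24.
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def unpack_rgb565(p):
    # The low bits are gone: this is where the 1/3 space saving costs accuracy.
    return ((p >> 11) << 3, ((p >> 5) & 0x3F) << 2, (p & 0x1F) << 3)

print(hex(pack_rgb565(200, 100, 50)))
print(unpack_rgb565(pack_rgb565(200, 100, 50)))  # (200, 100, 48) -- close, not exact
```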
15. Dithering
- A region colored with sufficiently small pixels of two different colors is perceived by the eye as uniformly colored
- With careful selection of the colors and the densities of each, it is possible to imitate colors not in the palette
- Black-and-white dithering to get shades of grey is called halftoning
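A minimal sketch of error-diffusion dithering, simplified to a single row of pixels quantized to pure black and white (i.e., halftoning); real dithering spreads the error in two dimensions:

```python
def dither_row(greys):                 # greys: values in 0..255
    out, error = [], 0.0
    for g in greys:
        value = g + error
        pixel = 255 if value >= 128 else 0
        error = value - pixel          # carry the quantization error forward
        out.append(pixel)
    return out

row = [64] * 16                        # a uniform 25%-grey row
print(dither_row(row))                 # roughly one white pixel in four
```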
16. Dithering examples (from Wikipedia)
17. JPEG step 1: Chroma subsampling
- Transform the color space from RGB to YCbCr
- Y: brightness, a measure of black vs. white
- Cb: blue chroma, a measure of blueness
- Cr: red chroma, a measure of you-guess-what
- The eye is more sensitive to brightness than to color, so we reduce the resolution of Cb and Cr while maintaining the resolution of Y
- JPEG subsampling is 4:2:0, meaning half as many Cb/Cr samples in each dimension
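A minimal sketch of this step, assuming the standard JPEG (BT.601-style, full-range) RGB-to-YCbCr conversion and 2x2 averaging for the 4:2:0 subsampling:

```python
import numpy as np

def rgb_to_ycbcr(r, g, b):
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

print(rgb_to_ycbcr(200, 80, 40))   # a reddish color: low Cb, high Cr

# 4:2:0 subsampling: keep Y at full resolution, average Cb (and Cr) over
# each 2x2 block, so chroma has half the samples in each dimension.
cb = np.arange(16, dtype=float).reshape(4, 4)
cb_420 = cb.reshape(2, 2, 2, 2).mean(axis=(1, 3))
print(cb_420)                      # 2x2 chroma plane for a 4x4 image
```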
18. JPEG step 2: Brightness transform
- Differences in brightness are easy to see over large regions, but rapidly varying brightness over small areas is imperceptible
- Therefore: transform the brightness information to separate variations at "high frequency" from those at "low frequency" (sketched below)
- This is analogous to separating music into frequency components and throwing away frequencies too high for the ear to hear
- The resulting data is losslessly encoded
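A minimal sketch of the idea using an 8x8 DCT built with numpy: a smooth block has almost no high-frequency content, so nearly all of its 64 coefficients quantize to zero. Real JPEG uses standard per-frequency quantization tables; the single step size of 16 here is a simplification.

```python
import numpy as np

N = 8
u, x = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
D = np.sqrt(2 / N) * np.cos(np.pi * (2 * x + 1) * u / (2 * N))
D[0, :] = np.sqrt(1 / N)                       # orthonormal DCT-II matrix

block = 128 + np.outer(np.linspace(-20, 20, N), np.ones(N))  # smooth gradient
coeffs = D @ (block - 128) @ D.T               # to the "frequency" domain
quantized = np.round(coeffs / 16)              # throw away fine detail
print(np.count_nonzero(quantized), "of", N * N, "coefficients survive")
```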
19. JPEG examples (Wikipedia again)
- Image file sizes: 83K, 15K, 5K, 1.5K
20. Video
- A video stream is a sequence of frames
- HDTV is 1920 pixels horizontally by 1080 lines vertically at 30 frames per second
- At 24 bits/pixel, this is over 11 GB/minute, and that doesn't even include the audio! (checked below)
- Recall that even this is an approximation!
- We can cut this to about 30 MB/minute including audio and other data
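A quick check of the raw-HDTV figure, using only the slide's numbers:

```python
# 1920 x 1080 pixels, 24 bits (3 bytes) per pixel, 30 frames per second.
width, height, bytes_per_pixel, fps = 1920, 1080, 3, 30

bytes_per_minute = width * height * bytes_per_pixel * fps * 60
print(bytes_per_minute / 1_000_000_000)   # ~11.2 GB per minute, video only
```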
21. Key to (lossy) compression: Frames are similar
- Often only a small portion of the screen changes during 1/30 of a second
- So: in each frame, encode only the pieces that have changed since the preceding frame
22. MPEG improvement: Handling motion efficiently
- Encode frames in 16x16 blocks of pixels
- Each block is either:
- explicitly specified
- unchanged from the preceding frame
- set equal to some block somewhere in a preceding frame (i.e., with a motion vector)
- An error correction can also be specified when a new block is similar to an old one
- Quickly finding the best encoding of a block is the hard problem of MPEG encoding (sketched below)
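A minimal sketch of motion search for one block: an exhaustive search over a small window in the previous frame, where the offset of the best match is the motion vector and the remaining difference is the error-correction term. Real encoders use much faster search strategies; the frames here are synthetic.

```python
import numpy as np

def find_motion_vector(prev_frame, new_frame, bx, by, search=8, size=16):
    target = new_frame[by:by + size, bx:bx + size]
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if 0 <= y and 0 <= x and y + size <= prev_frame.shape[0] \
                    and x + size <= prev_frame.shape[1]:
                candidate = prev_frame[y:y + size, x:x + size]
                cost = float(np.abs(candidate - target).sum())  # sum of abs differences
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best   # (cost, dx, dy): dx, dy is the motion vector

rng = np.random.default_rng(0)
prev_frame = rng.integers(0, 256, (64, 64)).astype(float)
new_frame = np.roll(prev_frame, shift=(0, 4), axis=(0, 1))  # scene shifted right by 4
print(find_motion_vector(prev_frame, new_frame, bx=24, by=24))  # cost 0, dx = -4, dy = 0
```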
23. Example
- All we need to encode in the second frame is that the square has moved and that its old location is taken from some other blank area; the error-correction term can even handle the rotation
24. How frames are put into video files
- First idea: encode the first frame explicitly, encode the second based on the first, the third based on the second, etc., indefinitely
- Problem: with this scheme, all playback must start at the beginning of the file!
- So we periodically insert I-frames, frames fully specified without reference to any other frame (an I-frame is essentially a JPEG)
- Between the I-frames are the P-frames, which are encoded w/r/t the preceding frame
25. An Infelicity
- There's no good way to encode the new frame from the old one. But if the square continues moving to the right, it's easy to encode the new frame based on the upcoming frame.
26. Next improvement: To know the present, use the future
- Some situations are not handled well by this encoding, e.g., when objects appear, are partially obscured, or change background
- We cope by introducing B-frames: frames each of whose blocks can be specified using either the preceding or the upcoming frame
- "Preceding" and "upcoming" frames refer only to I- or P-frames; B-frames are not specified w/r/t other B-frames
- (Newer standards are even more general)
27. Frame sequencing
- Consider the following (typical) frame sequence: I1 B1 B2 P1 B3 B4 P2 B5 B6 I2 . . .
- Here:
- I1 and I2 are fully specified
- P1 depends on I1; P2 depends on P1
- B1 and B2 depend on I1 and P1
- B3 and B4 depend on P1 and P2
- B5 and B6 depend on P2 and I2
28. Frame sequencing, cont.
- Display order: I1 B1 B2 P1 B3 B4 P2 B5 B6 I2 . . .
- When it's time to display B1, we've seen only I1. But B1 may also depend on P1!
- To cope, the frames are arranged in a different order in the actual video file: I1 P1 B1 B2 P2 B3 B4 I2 B5 B6 . . .
- Each frame follows all frames it depends on (see the sketch below)
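A minimal sketch of how a decoder turns file (decode) order back into display order, assuming the simple I/P/B pattern above; newer standards allow more elaborate dependencies:

```python
def display_order(decode_order):
    # Hold back each I- or P-frame until the B-frames that depend on it
    # (which follow it in the file) have been emitted.
    shown, held = [], None
    for frame in decode_order:
        if frame.startswith("B"):
            shown.append(frame)          # B-frames are shown immediately
        else:
            if held is not None:
                shown.append(held)       # the previous reference frame is now due
            held = frame                 # hold the new I/P frame back
    if held is not None:
        shown.append(held)
    return shown

file_order = ["I1", "P1", "B1", "B2", "P2", "B3", "B4", "I2", "B5", "B6"]
print(display_order(file_order))
# ['I1', 'B1', 'B2', 'P1', 'B3', 'B4', 'P2', 'B5', 'B6', 'I2']
```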