Title: CS598kn Basic Concepts of Audio, Video and Compression
Slide 1: CS598kn Basic Concepts of Audio, Video and Compression
- Klara Nahrstedt
- 01/21/2005
Slide 2: Content
- Introduction to Multimedia
- Audio encoding
- Video encoding
- Compression
Slide 3: Video on Demand
- Video on Demand: (a) ADSL vs. (b) cable
Slide 4: Multimedia Files
- A movie may consist of several files
Slide 5: Multimedia Issues
- Analog-to-digital conversion
- Problem: the result must be acceptable to ears and eyes
- Jitter
- High data rates required
- Large storage
- Compression
- Real-time playback required
- Scheduling
- Quality of service
- Resource reservation
Slide 6: Audio
- Sound is a continuous wave that travels through the air.
- The wave is made up of pressure differences.
Slide 7: How do we hear sound?
- Sound is detected by measuring the pressure level at a point.
- When an acoustic signal reaches the outer ear (pinna), the generated wave is transformed into energy and filtered through the middle ear. The inner ear (cochlea) transforms the energy into nerve activity.
- In a similar way, when an acoustic wave strikes a microphone, the microphone generates an electrical signal representing the sound amplitude as a function of time.
Slide 8: Basic Sound Concepts
- Frequency is the number of periods per second (measured in hertz, cycles/second).
- The human hearing range is 20 Hz - 20 kHz (audio); voice is about 500 Hz to 2 kHz.
- The amplitude of a sound is the measure of the displacement of the air pressure wave from its mean.
Slide 9: Computer Representation of Audio
- Speech is analog in nature and is converted to digital form by an analog-to-digital converter (ADC).
- A transducer converts pressure to voltage levels.
- The analog signal is converted into a digital stream by discrete sampling.
- Discretization happens both in time (sampling) and in amplitude (quantization).
Slide 10: Audio Encoding (1)
- Audio waves are converted to digital form: electrical voltage is the input, a binary number is the output.
- Voltage levels are sampled at intervals to get a vector of values (0, 0.2, 0.5, 1.1, 1.5, 2.3, 2.5, 3.1, 3.0, 2.4, ...).
- A computer measures the amplitude of the waveform at regular time intervals to produce a series of numbers (samples).
- The ADC process is governed by four factors: sampling rate, quantization, linearity, and conversion speed.
Slide 11: Audio Encoding (2)
- Sampling rate: the rate at which a continuous wave is sampled (measured in hertz).
- Examples: CD standard - 44100 Hz; telephone quality - 8000 Hz.
- The audio industry uses 5.5125 kHz, 11.025 kHz, 22.05 kHz, and 44.1 kHz as the standard sampling frequencies. These frequencies are supported by most sound cards.
- Question: How often do you need to sample a signal to avoid losing information?
Slide 12: Audio Encoding (3)
- Answer: It depends on how fast the signal is changing. Real answer: twice per cycle (this follows from the Nyquist sampling theorem).
- Nyquist Sampling Theorem: If a signal f(t) is sampled at regular intervals of time and at a rate higher than twice the highest significant signal frequency, then the samples contain all the information of the original signal.
- Example: The highest frequency on a CD is 22050 Hz; by Nyquist's theorem we must sample at twice that frequency, so the sampling frequency is 44100 Hz.
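The aliasing that the Nyquist theorem guards against can be shown in a few lines of Python (a toy illustration, not part of the slides; the function name `sample` is ours):

```python
import math

def sample(freq_hz, fs_hz, n_samples):
    """Sample sin(2*pi*f*t) at rate fs, returning the first n samples."""
    return [math.sin(2 * math.pi * freq_hz * n / fs_hz) for n in range(n_samples)]

# A 3 Hz tone sampled at only 4 Hz (below its 6 Hz Nyquist rate) produces
# exactly the same samples as a -1 Hz tone (3 - 4 = -1): the alias.
below = sample(3, 4, 8)
alias = sample(-1, 4, 8)
assert all(abs(a - b) < 1e-9 for a, b in zip(below, alias))
```

Sampled at 8 Hz or above, the 3 Hz tone is no longer confusable with any lower frequency, which is the content of the theorem.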
Slide 13: Audio Encoding (4)
- The best-known technique for voice digitization is Pulse-Code Modulation (PCM).
- PCM is based on the sampling theorem.
- If voice data are limited to 4000 Hz, then PCM takes 8000 samples/second, which is sufficient for the input voice signal.
- PCM provides analog samples which must be converted to digital representation. Each of these analog samples must be assigned a binary code; each sample is approximated by being quantized as explained above.
Slide 14: Audio Encoding (5)
- Quantization (sample precision): the resolution of a sample value.
- Samples are typically stored as raw numbers (linear PCM format) or as logarithms (u-law or A-law).
- Quantization depends on the number of bits used to measure the height of the waveform.
- Example: 16-bit CD-quality quantization yields 65536 distinct values.
- Audio formats are described by the sample rate and quantization:
- Voice quality: 8-bit quantization, 8000 Hz, u-law, mono (8 kBytes/s)
- 22 kHz, 8-bit linear: mono (22 kBytes/s) and stereo (44 kBytes/s)
- CD quality: 16-bit quantization, 44100 Hz, linear, stereo (176.4 kBytes/s = 44100 samples/s x 16 bits/sample x 2 channels / 8 bits/byte)
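The data rates above all come from the same formula, which can be sketched as a one-line helper (the function name is ours):

```python
def data_rate_bytes_per_s(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM data rate in bytes/second."""
    return sample_rate_hz * bits_per_sample * channels // 8

voice = data_rate_bytes_per_s(8000, 8, 1)     # voice quality: 8000 bytes/s
cd    = data_rate_bytes_per_s(44100, 16, 2)   # CD quality: 176400 bytes/s
```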
Slide 15: Audio Formats
- Audio formats are characterized by four parameters:
- Sample rate: sampling frequency per second
- Encoding: audio data representation
- u-law: CCITT G.711 standard for voice data in telephone companies (USA, Canada, Japan)
- A-law: CCITT G.711 standard for voice data in telephony elsewhere (Europe)
- A-law and u-law are sampled at 8000 samples/second with a precision of 12 bits, compressed to 8-bit samples.
- Linear PCM: uncompressed audio where samples are proportional to audio signal voltage
- Precision: number of bits used to store each audio sample
- Channel: multiple channels of audio may be interleaved at sample boundaries
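The logarithmic companding behind u-law can be sketched with the continuous mu-law formula (a simplification - real G.711 codecs use a segmented piecewise-linear approximation of this curve; the function names are ours):

```python
import math

MU = 255  # mu-law parameter used in North American/Japanese telephony

def mu_law_compress(x):
    """Map a normalized sample x in [-1, 1] through the mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
```

The curve spends most of its output range on quiet samples, which is why 8 companded bits can carry roughly the perceived quality of 12 linear bits.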
Slide 16: Basic Concepts of Video
- Visual representation / video encoding
- The objective is to offer the viewer a sense of presence in the scene and of participation in the events portrayed.
- Transmission
- Video signals are transmitted to a receiver through a single television channel.
- Digitization
- Analog-to-digital conversion, sampling of gray/color levels
- Quantization
Slide 17: Visual Representation (1)
- Video signals are generated at the output of a camera by scanning a two-dimensional moving scene and converting it into a one-dimensional electrical signal.
- A moving scene is a collection of individual images, where each scanned picture generates a frame of the picture.
- Scanning starts at the top-left corner of the picture and ends at the bottom-right.
- Aspect ratio: ratio of picture width to height
- Pixel: discrete picture element; a digitized light point in a frame
- Vertical frame resolution: number of pixels in picture height
- Horizontal frame resolution: number of pixels in picture width
- Spatial resolution: vertical x horizontal resolution
- Temporal resolution: rapid succession of different frames
Slide 18: Visual Representation (2)
- Continuity of motion
- Minimum 15 frames per second
- NTSC: 29.97 Hz repetition rate, ~30 frames/sec
- PAL: 25 Hz, 25 frames/sec
- HDTV: 59.94 Hz, ~60 frames/sec
- Flicker effect
- A periodic fluctuation of brightness perception. To avoid this effect, we need at least 50 refresh cycles per second; display devices achieve this using a display refresh buffer.
- Picture scanning
- Progressive scanning: single scanning of a picture
- Interlaced scanning: the frame is formed by scanning two pictures (fields) at different times, with the lines interleaved, such that two consecutive lines of a frame belong to alternate fields (odd and even lines are scanned separately)
- NTSC TV uses interlaced scanning to trade off vertical resolution against temporal resolution.
- HDTV and computer displays are high spatio-temporal video and use progressive scanning.
Slide 19: Video Color Encoding (3)
- During scanning, a camera creates three signals: RGB (red, green, and blue).
- For compatibility with black-and-white video, and because the three color signals are highly correlated, a new set of signals in a different color space is generated.
- The new color systems correspond to standards such as NTSC, PAL, and SECAM.
- For transmission of the visual signal we use three signals: 1 luminance (brightness - the basic signal) and 2 chrominance (color) signals.
- YUV signal: Y = 0.30R + 0.59G + 0.11B, U = (B - Y) x 0.493, V = (R - Y) x 0.877
- Coding ratio between components Y:U:V is 4:2:2.
- In the NTSC signal, the luminance and chrominance signals are interleaved.
- The goal at the receiver is to (1) separate the luminance from the chrominance components and (2) avoid interference between them (cross-color, cross-luminance).
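The YUV equations on this slide translate directly into code (a minimal sketch using the slide's coefficients; the function name is ours):

```python
def rgb_to_yuv(r, g, b):
    """Convert RGB components (0..1 floats) to YUV using the slide's weights."""
    y = 0.30 * r + 0.59 * g + 0.11 * b   # luminance
    u = (b - y) * 0.493                   # blue chrominance
    v = (r - y) * 0.877                   # red chrominance
    return y, u, v

# White has full luminance and zero chrominance - exactly what
# black-and-white compatibility requires.
y, u, v = rgb_to_yuv(1.0, 1.0, 1.0)
```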
Slide 20: Basic Concepts of Image Formats
- Important parameters for captured image formats:
- Spatial resolution (pixels x pixels)
- Color encoding (quantization level of a pixel, e.g., 8-bit, 24-bit)
- Example: SunVideo's video digitizer board allows pictures of 320 x 240 pixels with 8-bit gray-scale or color resolution.
- For a hands-on demonstration of basic image concepts, try the program xv, which displays images and allows you to show, edit, and manipulate image characteristics.
- Important parameters for stored image formats:
- Images are stored as a 2D array of values, where each value represents the data associated with a pixel in the image (a bitmap or a color image).
- Stored images can use flexible formats such as RIFF (Resource Interchange File Format). RIFF includes formats such as bitmaps, vector representations, animations, audio, and video.
- Currently, the most widely used image storage formats are GIF (Graphics Interchange Format), XBM (X11 Bitmap), PostScript, JPEG (see the compression chapter), TIFF (Tagged Image File Format), PBM (Portable Bitmap), and BMP (Bitmap).
Slide 21: Digital Video
- The process of digitizing analog video involves filtering, sampling, and quantization.
- Filtering is employed to avoid aliasing artifacts in the follow-up sampling process.
- The filtered luminance and chrominance signals are sampled to generate a discrete-time signal.
- Digitization means sampling the gray/color levels in the frame at an M x N array of points.
- The minimum rate at which each component (YUV) can be sampled is the Nyquist rate, which corresponds to twice the signal bandwidth.
- Once the points are sampled, they are quantized into pixels, i.e., each sampled value is mapped to an integer. The quantization level depends on how many bits we allocate to represent the resulting integer (e.g., 8 bits per pixel or 24 bits per pixel).
Slide 22: Digital Transmission Bandwidth
- Bandwidth requirements for images:
- Raw image transmission bandwidth = size of the image = spatial resolution x pixel resolution
- Compressed image transmission bandwidth depends on the compression scheme (e.g., JPEG) and the content of the image.
- Symbolic image transmission bandwidth = size of the instructions and variables carrying graphics primitives and attributes
- Bandwidth requirements for video:
- Uncompressed video bandwidth = image size x frame rate
- Compressed video bandwidth depends on the compression scheme (e.g., Motion JPEG, MPEG) and the content of the video (scene changes).
- Example: Assume the following video characteristics - 720,000 pixels per image (frame), 8 bits per pixel quantization, and a frame rate of 60 frames per second. The video bandwidth = 720,000 pixels per frame x 8 bits per pixel x 60 fps, which results in an HDTV data rate of 43,200,000 bytes per second = 345.6 Mbps. When we use MPEG compression, the bandwidth drops to about 34 Mbps with some loss in image/video quality.
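The HDTV arithmetic above can be checked with a tiny helper (names are ours):

```python
def raw_video_bandwidth_bps(pixels_per_frame, bits_per_pixel, fps):
    """Uncompressed video bandwidth = image size x frame rate, in bits/s."""
    return pixels_per_frame * bits_per_pixel * fps

hdtv_bps = raw_video_bandwidth_bps(720_000, 8, 60)
# 345,600,000 bits/s = 345.6 Mbps = 43,200,000 bytes/s, as on the slide
```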
Slide 23: Compression Classification
- Compression is important due to limited bandwidth.
- All compression systems require two algorithms:
- Encoding at the source
- Decoding at the destination
- Entropy coding
- Lossless encoding
- Used regardless of the medium's specific characteristics
- Data are taken as a simple digital sequence
- The decompression process regenerates the data completely
- Examples: run-length coding, Huffman coding, arithmetic coding
- Source coding
- Lossy encoding
- Takes into account the semantics of the data
- Degree of compression depends on data content
- Examples: DPCM, delta modulation
- Hybrid coding
- Combines entropy coding with source coding
- Examples: JPEG, MPEG, H.263
Slide 24: Compression (1)
- Compression pipeline: Uncompressed Picture -> Picture Preparation -> Picture Processing -> Quantization -> Entropy Encoding (with adaptive feedback from later stages to earlier ones)
Slide 25: Compression (2)
- Picture preparation
- Analog-to-digital conversion
- Generation of an appropriate digital representation
- Division of the image into 8x8 blocks
- Fixing the number of bits per pixel
- Picture processing (compression algorithm)
- Transformation from the time domain to the frequency domain (e.g., Discrete Cosine Transform, DCT)
- Motion vector computation for motion video
- Quantization
- Mapping real numbers to integers (reduction in precision)
- Entropy coding
- Compressing a sequential digital stream without loss
Slide 26: Compression (3) (Entropy Encoding)
- A simple lossless compression algorithm is run-length coding, where runs of repeated bytes are grouped together as Count-SpecialCharacter-Byte. For example, 'AAAAAABBBBBDDDDDAAAAAAAA' can be encoded as '6!A5!B5!D8!A', where '!' is the special character. The compression ratio is 50% (12/24 x 100).
- Fixed-length coding: each symbol is allocated the same number of bits independent of its frequency (L = log2(N), where N is the number of symbols).
- Statistical encoding: each symbol has a probability of occurrence (e.g., P(A) = 0.16, P(B) = 0.51, P(C) = 0.33).
- The theoretical minimum average number of bits per codeword is known as the entropy H. According to Shannon:
- H = -SUM_i P_i log2 P_i bits per codeword
Slide 27: Huffman Coding
- Symbol probabilities: P(A) = 0.16, P(C) = 0.33, P(B) = 0.51
- Merge the two least probable symbols: P(AC) = 0.16 + 0.33 = 0.49; then merge with B: P(ACB) = 1
- Labeling the branches 0/1 at each merge yields the codes: A = 00, C = 01, B = 1
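The tree construction on this slide can be sketched with a heap-based Huffman builder (a toy implementation; function names are ours):

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Build a Huffman code from {symbol: probability}."""
    counter = itertools.count()             # tie-breaker so heap tuples compare
    heap = [(p, next(counter), {s: ''}) for s, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)     # two least-probable subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: '0' + code for s, code in c0.items()}
        merged.update({s: '1' + code for s, code in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(counter), merged))
    return heap[0][2]

codes = huffman_codes({'A': 0.16, 'B': 0.51, 'C': 0.33})
```

With the slide's probabilities this reproduces the slide's code table: A = 00, C = 01, B = 1.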
Slide 28: JPEG - Joint Photographic Experts Group
- Six major steps to compress an image: (1) block preparation, (2) DCT (Discrete Cosine Transform) transformation, (3) quantization, (4) further compression via differential compression, (5) zig-zag scanning and run-length coding, (6) Huffman coding.
- The quantization step is the lossy step, where we lose data in a non-invertible fashion.
- Differential compression means that we consider similar blocks in the image and encode only the first block; for the rest of the similar blocks, we encode only the differences between the previous block and the current block. The hope is that the difference is a much smaller value, hence we need fewer bits to represent it. Also, the differences often end up close to 0 and can be compressed very well by the next compression step, run-length coding.
- Huffman compression is a lossless statistical encoding algorithm which takes into account frequency of occurrence (not every byte has the same weight).
Slide 29: JPEG Block Preparation
- RGB input data and block preparation
- The eye responds to luminance (Y) more than to chrominance (I and Q).
Slide 30: Image Processing
- After image preparation we have:
- Uncompressed image samples grouped into data units of 8x8 pixels
- Precision: 8 bits/pixel
- Values in the range [0, 255]
- Steps in image processing:
- Pixel values are shifted into the range [-128, 127] with center 0.
- The DCT maps values from the time domain to the frequency domain:
- S(u,v) = 1/4 C(u) C(v) SUM_x SUM_y s(x,y) cos((2x+1)u*pi/16) cos((2y+1)v*pi/16), where C(k) = 1/sqrt(2) for k = 0 and C(k) = 1 otherwise
- S(0,0): lowest frequency in both directions; this DC coefficient determines the fundamental color of the block.
- S(0,1), ..., S(7,7): AC coefficients
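The DCT formula above can be transcribed directly into (slow but readable) Python; a constant-valued block shows the DC/AC split, since all its energy lands in S(0,0):

```python
import math

def dct_8x8(block):
    """2-D DCT of an 8x8 block, transcribed from the slide's formula."""
    def C(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = sum(block[x][y]
                    * math.cos((2 * x + 1) * u * math.pi / 16)
                    * math.cos((2 * y + 1) * v * math.pi / 16)
                    for x in range(8) for y in range(8))
            out[u][v] = 0.25 * C(u) * C(v) * s
    return out

flat = [[100] * 8 for _ in range(8)]   # a uniform block
coeffs = dct_8x8(flat)                 # DC = 800, every AC coefficient ~ 0
```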
Slide 31: Quantization
- The goal of quantization is to throw out bits.
- Example: 101101_2 = 45 (6 bits). We can truncate this string to 4 bits, 1011_2 = 11, or to 3 bits, 101_2 = 5 (original value 40) or 110_2 = 6 (original value 48).
- Uniform quantization is achieved by dividing each DCT coefficient value S(u,v) by N and rounding the result.
- JPEG uses quantization tables.
Slide 32: Entropy Encoding
- After image processing we have quantized DC and AC coefficients.
- The initial step of entropy encoding is to map the 8x8 plane into a 64-element vector using the zig-zag scan.
- DC coefficient processing: use difference coding.
- AC coefficient processing: apply run-length coding.
- Apply Huffman coding to the DC and AC coefficients.
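The zig-zag mapping can be sketched by sorting coordinates by diagonal, alternating the direction on each diagonal (a compact toy version; function names are ours):

```python
def zigzag(block):
    """Read an 8x8 block in JPEG zig-zag order into a 64-element vector."""
    def key(p):
        x, y = p
        d = x + y
        # even diagonals run bottom-left to top-right, odd ones the reverse
        return (d, y if d % 2 == 0 else x)
    order = sorted(((x, y) for x in range(8) for y in range(8)), key=key)
    return [block[x][y] for x, y in order]

example = [[8 * x + y for y in range(8)] for x in range(8)]
vector = zigzag(example)
```

Scanning this way puts low-frequency coefficients first, so the many near-zero high-frequency coefficients cluster at the tail, where run-length coding is most effective.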
Slide 33: MPEG - Motion Picture Experts Group
- MPEG-1 was designed for video-recorder-quality output (320x240 for NTSC) using a bit rate of 1.2 Mbps.
- MPEG-2 is for broadcast-quality video at 4-6 Mbps (it fits into an NTSC or PAL broadcast channel).
- MPEG takes advantage of temporal and spatial redundancy. Temporal redundancy means that two neighboring frames are similar, almost identical.
- MPEG-2 output consists of three different kinds of frames that have to be processed:
- I (intra-coded) frames: self-contained JPEG-encoded still pictures
- P (predictive) frames: block-by-block difference with the last frame
- B (bidirectional) frames: differences with the last and next frames
Slide 34: The MPEG Standard
- I-frames are self-contained, hence they are used for fast-forward and rewind operations in VOD applications.
- P-frames code interframe differences. The algorithm searches for similar macroblocks in the current and previous frames, and if they are only slightly different, it encodes only the difference and the motion vector needed to find the position of the macroblock for decoding.
- B-frames are encoded if three frames are available at once: the past one, the current one, and the future one. Similar to P-frames, the algorithm takes a macroblock in the current frame and looks for similar macroblocks in the past and future frames.
- MPEG is suitable for stored video because it is an asymmetric lossy compression: encoding takes a long time, but decoding is very fast.
- The frames are delivered to the receiver in dependency order rather than display order, hence we need buffering to reorder the frames.
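The decode-order/display-order split can be illustrated with a toy reordering buffer (frame sequence and function name are ours; frames are modeled as (display_index, type) pairs):

```python
def decode_to_display(frames):
    """Sort frames received in decode (dependency) order back into
    display order using each frame's display index."""
    return sorted(frames, key=lambda f: f[0])

# B-frames arrive AFTER the future reference they depend on:
decode_order = [(0, 'I'), (3, 'P'), (1, 'B'), (2, 'B'),
                (6, 'P'), (4, 'B'), (5, 'B')]
display = decode_to_display(decode_order)
# display order: I B B P B B P
```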
Slide 35: MPEG/Video I-Frames
- I-frames (intra-coded images)
- MPEG uses the JPEG compression algorithm for I-frame encoding.
- I-frames use 8x8 blocks defined within a macroblock; the DCT is performed on these blocks. Quantization is done with a constant value for all DCT coefficients, i.e., no quantization tables exist as is the case in JPEG.
Slide 36: MPEG/Video P-Frames
- P-frames (predictive-coded frames) require the previous I-frame and/or previous P-frame for encoding and decoding.
- A motion estimation method is used at the encoder.
- Define a match window within a given search window. The match window corresponds to a macroblock; the search window corresponds to an arbitrary window size, depending on how far away we are willing to look.
- Matching methods:
- SSD correlation uses SSD = SUM_i (x_i - y_i)^2
- SAD correlation uses SAD = SUM_i |x_i - y_i|
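SAD-based motion estimation can be sketched as an exhaustive search over the search window (a toy version with small blocks; function names are ours):

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def best_motion_vector(prev, cur_block, top, left, radius):
    """Exhaustively search prev around (top, left) for the block that
    minimizes SAD; returns (score, dy, dx)."""
    h, w = len(cur_block), len(cur_block[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= len(prev) - h and 0 <= x <= len(prev[0]) - w:
                cand = [row[x:x + w] for row in prev[y:y + h]]
                score = sad(cur_block, cand)
                if best is None or score < best[0]:
                    best = (score, dy, dx)
    return best

# A 2x2 block copied from position (2, 3) of the previous frame is found
# there exactly (SAD = 0) when we start searching from (1, 1):
prev = [[r * 10 + c for c in range(6)] for r in range(6)]
cur_block = [[23, 24], [33, 34]]
match = best_motion_vector(prev, cur_block, 1, 1, 3)
```

Real encoders use larger macroblocks and faster (non-exhaustive) search strategies; the principle is the same.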
Slide 37: MPEG/Video B-Frame
- B-frames (bi-directionally predictive-coded frames) require information from the previous and following I- and/or P-frames.
- [Figure: B-frame encoding pipeline - the prediction is formed as 1/2 x (sum of the two reference blocks); the residual passes through DCT, quantization, and run-length encoding, and the two motion vectors plus the coefficients are Huffman coded.]
Slide 38: MPEG/Audio Encoding
- Precision is 16 bits.
- Sampling frequency is 32 kHz, 44.1 kHz, or 48 kHz.
- Three compression methods exist: Layer 1, Layer 2, Layer 3 (MP3)
- Layer 1: 32 kbps - 448 kbps, target 192 kbps
- Layer 2: 32 kbps - 384 kbps, target 128 kbps; the decoder also accepts Layer 1 streams
- Layer 3: 32 kbps - 320 kbps, target 64 kbps; the decoder also accepts Layer 1 and Layer 2 streams
Slide 39: MPEG/System Data Stream
- Video is interleaved with audio.
- Audio consists of three layers.
- Video consists of 6 layers:
- (1) sequence layer
- (2) group of pictures layer (Video Param, Bitstream Param, ...)
- (3) picture layer (Time code, GOP Param, ...)
- (4) slice layer (Type, Buffer Param, Encode Param, ...)
- (5) macro-block layer (Qscale, ...)
- (6) block layer (Type, Motion Vector, Qscale, ...)