Title: CS 203A Computer Architecture Lecture 10: Multimedia and Multithreading
1. CS 203A Computer Architecture, Lecture 10: Multimedia and Multithreading
2. Approaches to Mediaprocessing
- General-purpose processors with SIMD extensions
- Vector processors
- VLIW with SIMD extensions (aka mediaprocessors)
- Multimedia processors
- DSPs
- ASICs/FPGAs
3. What is Multimedia Processing?
- Desktop
  - 3D graphics (games)
  - Speech recognition (voice input)
  - Video/audio decoding (MPEG, MP3 playback)
- Servers
  - Video/audio encoding (video servers, IP telephony)
  - Digital libraries and media mining (video servers)
  - Computer animation, 3D modeling, rendering (movies)
- Embedded
  - 3D graphics (game consoles)
  - Video/audio decoding/encoding (set-top boxes)
  - Image processing (digital cameras)
  - Signal processing (cellular phones)
4. Characteristics of Multimedia Apps (1)
- Requirement for real-time response
  - Incorrect result often preferred to a slow result
  - Unpredictability can be bad (e.g., dynamic execution)
- Narrow data types
  - Typical width of data in memory: 8 to 16 bits
  - Typical width of data during computation: 16 to 32 bits
  - 64-bit data types rarely needed
  - Fixed-point arithmetic often replaces floating-point
- Fine-grain (data) parallelism
  - Identical operation applied on streams of input data
  - Branches have high predictability
  - High instruction locality in small loops or kernels
5. Characteristics of Multimedia Apps (2)
- Coarse-grain parallelism
  - Most apps organized as a pipeline of functions
  - Multiple threads of execution can be used
- Memory requirements
  - High bandwidth requirements, but can tolerate high latency
  - High spatial locality (predictable pattern) but low temporal locality
  - Cache bypassing and prefetching can be crucial
6. SIMD Extensions for GPPs
- Motivation
  - Low media-processing performance of GPPs
  - Cost and lack of flexibility of specialized ASICs for graphics/video
  - Underutilized datapaths and registers
- Basic idea: sub-word parallelism
  - Treat a 64-bit register as a vector of 2 32-bit, 4 16-bit, or 8 8-bit values (short vectors)
  - Partition 64-bit datapaths to handle multiple narrow operations in parallel
- Initial constraints
  - No additional architectural state (registers)
  - No additional exceptions
  - Minimum area overhead
7. Overview of SIMD Extensions

Vendor    Extension        Year    Instructions   Registers
HP        MAX-1 and 2      94, 95  9, 8 (int)     Int 32x64b
Sun       VIS              95      121 (int)      FP 32x64b
Intel     MMX              97      57 (int)       FP 8x64b
AMD       3DNow!           98      21 (fp)        FP 8x64b
Motorola  AltiVec          98      162 (int, fp)  32x128b (new)
Intel     SSE              98      70 (fp)        8x128b (new)
MIPS      MIPS-3D          ?       23 (fp)        FP 32x64b
AMD       Enhanced 3DNow!  99      24 (fp)        8x128b (new)
Intel     SSE-2            01      144 (int, fp)  8x128b (new)
8. Intel MMX Pipeline
9. Performance Improvement in the MMX Architecture
10. SIMD Performance
- Limitations
  - Memory bandwidth
  - Overhead of handling alignment and data-width adjustments
11. Other Features for Multimedia
- Support for fixed-point arithmetic
  - Saturation, rounding modes, etc.
- Permutation instructions for vector registers
  - For reductions and FFTs
  - Not general permutations (too expensive)
- Example: permutation for reductions
  - Move the 2nd half of a vector register into another one
  - Repeatedly use with vadd to execute a reduction
  - Vector length halved after each step
12. Multithreading
- Consider the following sequence of instructions through a pipeline (note the dependence chain: each instruction uses a result produced by an earlier one)
  - LW r1, 0(r2)
  - LW r5, 12(r1)
  - ADDI r5, r5, 12
  - SW 12(r1), r5
13. Multithreading
- How can we guarantee no dependencies between instructions in a pipeline?
- One way is to interleave execution of instructions from different program threads on the same pipeline: micro context switching
- Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe
  - T1: LW r1, 0(r2)
  - T2: ADD r7, r1, r4
  - T3: XORI r5, r4, 12
  - T4: SW 0(r7), r5
  - T1: LW r5, 12(r1)
14. Avoiding Memory Latency
- General processors switch to another context on an I/O operation → multithreading, multiprogramming, etc. An O/S function. Large overhead! Why?
- Why not context switch on a cache miss? → Hardware multithreading.
- Can we afford that overhead now? → Need changes in the architecture to avoid stack operations. How to achieve it?
- Have many contexts CPU-resident (not memory-resident) by having separate PCs and registers for each thread. No need to save them on the stack on a context switch.
15. Simple Multithreaded Pipeline
- Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage
16. Multithreading Costs
- Appears to software (including the OS) as multiple slower CPUs
- Each thread requires its own user state
  - GPRs
  - PC
- Also needs its own OS control state
  - Virtual-memory page-table base register
  - Exception-handling registers
- Other costs?
17. What Grain of Multithreading?
- So far we have assumed fine-grained multithreading
  - CPU switches every cycle to a different thread
  - When does this make sense?
- Coarse-grained multithreading
  - CPU switches every few cycles to a different thread
  - When does this make sense (e.g., on a memory access? NPs)?
18. Superscalar Machine Efficiency
- Why horizontal waste (issue slots left unused within a cycle)?
- Why vertical waste (cycles in which no instruction issues at all)?
19. Vertical Multithreading
- Cycle-by-cycle interleaving of a second thread removes vertical waste
20. Ideal Multithreading for Superscalar
- Interleave multiple threads onto the multiple issue slots with no restrictions
21. Simultaneous Multithreading
- Add multiple contexts and fetch engines to a wide out-of-order superscalar processor
  - Tullsen, Eggers, Levy, UW, 1995
- The OOO instruction window already has most of the circuitry required to schedule from multiple threads
- Any single thread can utilize the whole machine
22. Comparison of Issue Capabilities (courtesy of Susan Eggers; used with permission)
23. From Superscalar to SMT
- Small items
  - Per-thread program counters
  - Per-thread return stacks
  - Per-thread bookkeeping for instruction retirement, traps, instruction dispatch queue flush
  - Thread identifiers, e.g., with BTB and TLB entries
24. Simultaneous Multithreaded Processor
25. Intel Pentium-4 Xeon Processor
- Hyperthreading = SMT
- Dual physical processors, each 2-way SMT
- Logical processors share nearly all resources of the physical processor
  - Caches, execution units, branch predictors
- Die area overhead of hyperthreading: ~5%
- When one logical processor is stalled, the other can make progress
  - No logical processor can use all entries in the queues when two threads are active
- A processor running only one active software thread runs at the same speed with or without hyperthreading
26. Intel Hyperthreading Implementation (see attached paper). Note the separate buffer space/registers for the second thread.
28. Intel Xeon Performance