Title: CS 203A Computer Architecture Lecture 10: Multimedia and Multithreading
1. CS 203A Computer Architecture, Lecture 10: Multimedia and Multithreading
2. Approaches to Mediaprocessing
- General-purpose processors with SIMD extensions
- Vector processors
- VLIW with SIMD extensions (aka mediaprocessors)
- Multimedia processors
- DSPs
- ASICs/FPGAs
3. What is Multimedia Processing?
- Desktop
  - 3D graphics (games)
  - Speech recognition (voice input)
  - Video/audio decoding (MPEG, MP3 playback)
- Servers
  - Video/audio encoding (video servers, IP telephony)
  - Digital libraries and media mining (video servers)
  - Computer animation, 3D modeling, rendering (movies)
- Embedded
  - 3D graphics (game consoles)
  - Video/audio decoding/encoding (set-top boxes)
  - Image processing (digital cameras)
  - Signal processing (cellular phones)
4. Characteristics of Multimedia Apps (1)
- Requirement for real-time response
  - Incorrect result often preferred to a slow result
  - Unpredictability can be bad (e.g., dynamic execution)
- Narrow data types
  - Typical width of data in memory: 8 to 16 bits
  - Typical width of data during computation: 16 to 32 bits
  - 64-bit data types rarely needed
  - Fixed-point arithmetic often replaces floating-point
- Fine-grain (data) parallelism
  - Identical operation applied on streams of input data
  - Branches have high predictability
  - High instruction locality in small loops or kernels
5. Characteristics of Multimedia Apps (2)
- Coarse-grain parallelism
  - Most apps organized as a pipeline of functions
  - Multiple threads of execution can be used
- Memory requirements
  - High bandwidth requirements, but can tolerate high latency
  - High spatial locality (predictable pattern) but low temporal locality
  - Cache bypassing and prefetching can be crucial
6. SIMD Extensions for GPPs
- Motivation
  - Low media-processing performance of GPPs
  - Cost and lack of flexibility of specialized ASICs for graphics/video
  - Underutilized datapaths and registers
- Basic idea: sub-word parallelism
  - Treat a 64-bit register as a vector of 2 32-bit, 4 16-bit, or 8 8-bit values (short vectors)
  - Partition 64-bit datapaths to handle multiple narrow operations in parallel
- Initial constraints
  - No additional architectural state (registers)
  - No additional exceptions
  - Minimum area overhead
7. Overview of SIMD Extensions

Vendor    Extension        Year    Instructions   Registers
HP        MAX-1 and 2      94, 95  9, 8 (int)     Int 32x64b
Sun       VIS              95      121 (int)      FP 32x64b
Intel     MMX              97      57 (int)       FP 8x64b
AMD       3DNow!           98      21 (fp)        FP 8x64b
Motorola  AltiVec          98      162 (int, fp)  32x128b (new)
Intel     SSE              98      70 (fp)        8x128b (new)
MIPS      MIPS-3D          ?       23 (fp)        FP 32x64b
AMD       Enhanced 3DNow!  99      24 (fp)        8x128b (new)
Intel     SSE-2            01      144 (int, fp)  8x128b (new)
8. Intel MMX Pipeline
9. Performance Improvement in the MMX Architecture
10. SIMD Performance
- Limitations
  - Memory bandwidth
  - Overhead of handling alignment and data-width adjustments
11. Other Features for Multimedia
- Support for fixed-point arithmetic
  - Saturation, rounding modes, etc.
- Permutation instructions for vector registers
  - For reductions and FFTs
  - Not general permutations (too expensive)
- Example: permutation for reductions
  - Move the 2nd half of a vector register into another one
  - Repeatedly use with vadd to execute a reduction
  - Vector length halved after each step
12. Multithreading
- Consider the following sequence of instructions through a pipeline (note the dependence chain: each instruction uses a result produced by an earlier one)
  - LW r1, 0(r2)
  - LW r5, 12(r1)
  - ADDI r5, r5, 12
  - SW 12(r1), r5
13. Multithreading
- How can we guarantee no dependencies between instructions in a pipeline?
- One way is to interleave execution of instructions from different program threads on the same pipeline: micro context switching
- Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe
  - T1: LW r1, 0(r2)
  - T2: ADD r7, r1, r4
  - T3: XORI r5, r4, 12
  - T4: SW 0(r7), r5
  - T1: LW r5, 12(r1)
14. Avoiding Memory Latency
- General processors switch to another context on an I/O operation → multithreading, multiprogramming, etc. An O/S function. Large overhead! Why?
- Why not context switch on a cache miss? → Hardware multithreading.
- Can we afford that overhead now? → Need changes in the architecture to avoid stack operations. How to achieve it?
- Have many contexts CPU-resident (not memory-resident) by having separate PCs and registers for each thread. No need to save them on the stack on a context switch.
15. Simple Multithreaded Pipeline
- Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage
16. Multithreading Costs
- Appears to software (including the OS) as multiple slower CPUs
- Each thread requires its own user state
  - GPRs
  - PC
- Also needs its own OS control state
  - Virtual-memory page-table base register
  - Exception-handling registers
- Other costs?
17. What Grain of Multithreading?
- So far we have assumed fine-grained multithreading
  - CPU switches every cycle to a different thread
  - When does this make sense?
- Coarse-grained multithreading
  - CPU switches every few cycles to a different thread
  - When does this make sense (e.g., on a memory access? NPs)?
18. Superscalar Machine Efficiency
- Why horizontal waste (issue slots left unused within a cycle)?
- Why vertical waste (cycles in which no instruction issues at all)?
19. Vertical Multithreading
- Cycle-by-cycle interleaving of a second thread removes vertical waste
20. Ideal Multithreading for Superscalar
- Interleave multiple threads onto the multiple issue slots with no restrictions
21. Simultaneous Multithreading
- Add multiple contexts and fetch engines to a wide out-of-order superscalar processor
  - Tullsen, Eggers, Levy, UW, 1995
- The OOO instruction window already has most of the circuitry required to schedule from multiple threads
- Any single thread can utilize the whole machine
22. Comparison of Issue Capabilities (courtesy of Susan Eggers; used with permission)
23. From Superscalar to SMT
- Small items
  - Per-thread program counters
  - Per-thread return stacks
  - Per-thread bookkeeping for instruction retirement, traps, instruction dispatch queue flush
  - Thread identifiers, e.g., with BTB and TLB entries
24. Simultaneous Multithreaded Processor
25. Intel Pentium-4 Xeon Processor
- Hyperthreading = SMT
- Dual physical processors, each 2-way SMT
- Logical processors share nearly all resources of the physical processor
  - Caches, execution units, branch predictors
- Die area overhead of hyperthreading: ~5%
- When one logical processor is stalled, the other can make progress
  - No logical processor can use all entries in the queues when two threads are active
- A processor running only one active software thread runs at the same speed with or without hyperthreading
26. Intel Hyperthreading Implementation (see attached paper). Note the separate buffer space/registers for the second thread.
28. Intel Xeon Performance