CS 203A Computer Architecture Lecture 10: Multimedia and Multithreading

Learn more at: http://www.cs.ucr.edu
1
CS 203A Computer Architecture Lecture 10
Multimedia and Multithreading
  • Instructor: L.N. Bhuyan

2
Approaches to Mediaprocessing
(Diagram: approaches to multimedia processing)
  • General-purpose processors with SIMD extensions
  • Vector processors
  • VLIW with SIMD extensions (aka mediaprocessors)
  • DSPs
  • ASICs/FPGAs
3
What is Multimedia Processing?
  • Desktop
      • 3D graphics (games)
      • Speech recognition (voice input)
      • Video/audio decoding (mpeg-mp3 playback)
  • Servers
      • Video/audio encoding (video servers, IP
        telephony)
      • Digital libraries and media mining (video
        servers)
      • Computer animation, 3D modeling & rendering
        (movies)
  • Embedded
      • 3D graphics (game consoles)
      • Video/audio decoding & encoding (set top boxes)
      • Image processing (digital cameras)
      • Signal processing (cellular phones)

4
Characteristics of Multimedia Apps (1)
  • Requirement for real-time response
      • Incorrect result often preferred to slow result
      • Unpredictability can be bad (e.g. dynamic
        execution)
  • Narrow data-types
      • Typical width of data in memory: 8 to 16 bits
      • Typical width of data during computation: 16 to
        32 bits
      • 64-bit data types rarely needed
      • Fixed-point arithmetic often replaces
        floating-point
  • Fine-grain (data) parallelism
      • Identical operation applied on streams of input
        data
      • Branches have high predictability
      • High instruction locality in small loops or
        kernels
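
The narrow-data-type points above (16-bit values in memory, 32-bit intermediates, fixed-point instead of floating-point) can be illustrated with a small sketch. The Q15 format, the rounding choice, and all function names here are assumptions made for illustration; they are not taken from the lecture.

```python
Q = 15                                   # fractional bits (Q15 is an assumed format)
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1


def saturate16(x):
    """Clamp to the signed 16-bit range instead of wrapping around."""
    return max(INT16_MIN, min(INT16_MAX, x))


def to_q15(value):
    """Convert a float in [-1, 1) to a 16-bit Q15 integer."""
    return saturate16(int(round(value * (1 << Q))))


def q15_add(a, b):
    """Saturating add of two Q15 values."""
    return saturate16(a + b)


def q15_mul(a, b):
    """Multiply two Q15 values using a wider intermediate product."""
    product = a * b                      # fits in 32 bits for 16-bit inputs
    rounded = product + (1 << (Q - 1))   # round to nearest before shifting
    return saturate16(rounded >> Q)


half = to_q15(0.5)
quarter = q15_mul(half, half)            # 0.5 * 0.5 in Q15
clipped = q15_add(30000, 10000)          # would overflow 16 bits, so it saturates
```

Saturating rather than wrapping fits the "incorrect result often preferred to slow result" point: a clipped sample is usually less objectionable than one that wraps to the opposite sign.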

5
Characteristics of Multimedia Apps (2)
  • Coarse-grain parallelism
      • Most apps organized as a pipeline of functions
      • Multiple threads of execution can be used
  • Memory requirements
      • High bandwidth requirements but can tolerate high
        latency
      • High spatial locality (predictable pattern) but
        low temporal locality
      • Cache bypassing and prefetching can be crucial

6
SIMD Extensions for GPP
  • Motivation
      • Low media-processing performance of GPPs
      • Cost and lack of flexibility of specialized ASICs
        for graphics/video
      • Underutilized datapaths and registers
  • Basic idea: sub-word parallelism
      • Treat a 64-bit register as a vector of 2 32-bit
        or 4 16-bit or 8 8-bit values (short vectors)
      • Partition 64-bit datapaths to handle multiple
        narrow operations in parallel
  • Initial constraints
      • No additional architecture state (registers)
      • No additional exceptions
      • Minimum area overhead
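
As a rough illustration of sub-word parallelism, the sketch below treats one 64-bit integer as four 16-bit lanes and adds two such words without letting carries cross lane boundaries. Real SIMD hardware partitions the adder's carry chain; this software emulation with masks is only an assumed stand-in, and the helper names are hypothetical.

```python
LANES = 4
LANE_BITS = 16
MASK64 = (1 << 64) - 1
HIGH_BITS = 0x8000_8000_8000_8000        # top bit of each 16-bit lane
LOW_BITS = 0x7FFF_7FFF_7FFF_7FFF         # lower 15 bits of each lane


def pack16(values):
    """Pack four 16-bit values into one 64-bit word (lane 0 in the low bits)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFFFF) << (i * LANE_BITS)
    return word


def unpack16(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & 0xFFFF for i in range(LANES)]


def packed_add16(a, b):
    """Add four 16-bit lanes at once, wrapping within each lane."""
    # Adding only the low 15 bits keeps every carry inside its own lane.
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    # The top bit of each lane is then fixed up with XOR (add without carry-out).
    return (low_sum ^ ((a ^ b) & HIGH_BITS)) & MASK64


x = pack16([1, 0xFFFF, 0x8000, 100])
y = pack16([2, 0x0001, 0x8000, 200])
result = unpack16(packed_add16(x, y))    # lanes 1 and 2 wrap around to 0
```

One 64-bit operation produces four independent 16-bit results, which is the utilization gain the slide attributes to partitioned datapaths.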

7
Overview of SIMD Extensions
Vendor     Extension     Year    # Instr        Registers
HP         MAX-1 and 2   94,95   9,8 (int)      Int 32x64b
Sun        VIS           95      121 (int)      FP 32x64b
Intel      MMX           97      57 (int)       FP 8x64b
AMD        3DNow!        98      21 (fp)        FP 8x64b
Motorola   Altivec       98      162 (int,fp)   32x128b (new)
Intel      SSE           98      70 (fp)        8x128b (new)
MIPS       MIPS-3D       ?       23 (fp)        FP 32x64b
AMD        E 3DNow!      99      24 (fp)        8x128b (new)
Intel      SSE-2         01      144 (int,fp)   8x128b (new)
8
Intel MMX Pipeline
9
Performance Improvement in MMX Architecture
10
SIMD Performance
  • Limitations
      • Memory bandwidth
      • Overhead of handling alignment and data width
        adjustments

11
Other Features for Multimedia
  • Support for fixed-point arithmetic
      • Saturation, rounding modes, etc.
  • Permutation instructions of vector registers
      • For reductions and FFTs
      • Not general permutations (too expensive)
  • Example: permutation for reductions
      • Move the 2nd half of a vector register into
        another one
      • Repeatedly use with vadd to execute reduction
      • Vector length halved after each step
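
The reduction recipe above can be sketched as follows. Python lists stand in for vector registers, and `vadd` and `move_upper_half` are hypothetical stand-ins for the vector add and the "move 2nd half" permutation; the sketch assumes the vector length is a power of two.

```python
def vadd(a, b):
    """Element-wise add of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def move_upper_half(v):
    """The permutation from the slide: copy the 2nd half into another register."""
    return v[len(v) // 2:]


def reduce_sum(v):
    """Sum all elements in log2(n) vadd steps; returns (sum, steps)."""
    if not v or len(v) & (len(v) - 1):
        raise ValueError("vector length must be a power of two")
    steps = 0
    while len(v) > 1:
        upper = move_upper_half(v)
        v = vadd(v[:len(upper)], upper)  # vector length halves each step
        steps += 1
    return v[0], steps


total, steps = reduce_sum([1, 2, 3, 4, 5, 6, 7, 8])
```

An 8-element vector needs 3 steps instead of 7 scalar adds, at the cost of each step using only half the lanes of the previous one.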

12
Multithreading
  • Consider the following sequence of instructions
    through a pipeline:
      • LW r1, 0(r2)
      • LW r5, 12(r1)
      • ADDI r5, r5, 12
      • SW 12(r1), r5

13
Multithreading
  • How can we guarantee no dependencies between
    instructions in a pipeline?
  • One way is to interleave execution of
    instructions from different program threads on
    same pipeline (micro context switching)
  • Interleave 4 threads, T1-T4, on non-bypassed
    5-stage pipe
      • T1: LW r1, 0(r2)
      • T2: ADD r7, r1, r4
      • T3: XORI r5, r4, 12
      • T4: SW 0(r7), r5
      • T1: LW r5, 12(r1)
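
A minimal sketch of why interleaving helps, under stated assumptions: the pipe is in-order and issues one instruction per cycle, there is no bypassing, and a consumer must issue at least 3 cycles after the producer of one of its source registers (write-back in stage 5, register read in stage 2, write before read within a cycle). That `MIN_GAP` value is an assumption, as is running the same four-instruction load/add/store sequence from the preceding slide in every thread. Each thread has private registers.

```python
MIN_GAP = 3   # assumed producer-to-consumer issue distance without bypassing

# (destination, sources) for the four-instruction sequence
SEQUENCE = [
    ("r1", ["r2"]),          # LW   r1, 0(r2)
    ("r5", ["r1"]),          # LW   r5, 12(r1)
    ("r5", ["r5"]),          # ADDI r5, r5, 12
    (None, ["r1", "r5"]),    # SW   12(r1), r5
]


def round_robin(num_threads, program):
    """Issue order when threads take turns, one instruction each per round."""
    return [(t, dest, srcs) for dest, srcs in program for t in range(num_threads)]


def count_stalls(stream, min_gap=MIN_GAP):
    """Count bubble cycles inserted to respect same-thread RAW dependences."""
    last_write = {}                      # (thread, register) -> issue cycle
    cycle = 0
    stalls = 0
    for thread, dest, srcs in stream:
        issue = cycle
        for src in srcs:
            producer = last_write.get((thread, src))
            if producer is not None:
                issue = max(issue, producer + min_gap)
        stalls += issue - cycle
        if dest is not None:
            last_write[(thread, dest)] = issue
        cycle = issue + 1
    return stalls


single_thread_stalls = count_stalls(round_robin(1, SEQUENCE))
four_thread_stalls = count_stalls(round_robin(4, SEQUENCE))
```

With four threads, consecutive instructions of the same thread are four cycles apart, which exceeds the assumed gap, so no interlock or bypass hardware is exercised in this model.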

14
Avoiding Memory Latency
  • General processors switch to another context on
    an I/O operation => multithreading,
    multiprogramming, etc. This is an O/S function
    with large overhead. Why?
  • Why not context switch on a cache miss? =>
    Hardware multithreading.
  • Can we afford that overhead now? => Need changes
    in architecture to avoid stack operations. How to
    achieve it?
  • Keep many contexts CPU-resident (not
    memory-resident) by having separate PCs and
    registers for each thread. No need to store them
    on the stack on context switching.
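
The idea of CPU-resident contexts can be sketched with a toy model: each thread keeps its own PC and register file, and a "context switch" is just selecting a different context index, with nothing saved to or restored from memory. The trace encoding (`c` for a one-cycle instruction that hits, `m` for one that misses), the fixed miss latency, and the zero-cost switch are all assumptions of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class HardwareContext:
    """Per-thread state held inside the CPU rather than on a stack."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)  # private register file
    ready_at: int = 0        # cycle at which an outstanding miss completes


def runnable(ctx, trace, cycle):
    return ctx.pc < len(trace) and ctx.ready_at <= cycle


def run(traces, miss_latency):
    """Switch-on-miss execution; returns cycles until the last instruction issues."""
    contexts = [HardwareContext() for _ in traces]
    cycle, current = 0, 0
    n = len(contexts)
    while any(ctx.pc < len(t) for ctx, t in zip(contexts, traces)):
        # Prefer the current thread; otherwise pick the next ready one.
        order = [(current + k) % n for k in range(n)]
        chosen = next(
            (i for i in order if runnable(contexts[i], traces[i], cycle)), None
        )
        if chosen is not None:
            current = chosen             # the "context switch": just a new index
            ctx = contexts[chosen]
            op = traces[chosen][ctx.pc]
            ctx.pc += 1
            if op == "m":
                ctx.ready_at = cycle + 1 + miss_latency
        cycle += 1                       # an idle cycle if nothing was runnable
    return cycle


one_context = run(["cmcccmcc"], miss_latency=4)      # both jobs back to back
two_contexts = run(["cmcc", "cmcc"], miss_latency=4)
```

In this model the second context hides part of each miss; with more resident contexts or shorter misses the idle cycles would shrink further.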

15
Simple Multithreaded Pipeline
  • Have to carry thread select down pipeline to
    ensure correct state bits read/written at each
    pipe stage

16
Multithreading Costs
  • Appears to software (including OS) as multiple
    slower CPUs
  • Each thread requires its own user state
      • GPRs
      • PC
  • Also needs its own OS control state
      • virtual memory page table base register
      • exception handling registers
  • Other costs?

17
What Grain Multithreading?
  • So far assumed fine-grained multithreading
      • CPU switches every cycle to a different thread
      • When does this make sense?
  • Coarse-grained multithreading
      • CPU switches every few cycles to a different
        thread
      • When does this make sense? (Example: on a
        memory access? NPs)

18
Superscalar Machine Efficiency
  • Why horizontal waste?
  • Why vertical waste?
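
To make the two terms concrete, the sketch below counts wasted issue slots using the usual definitions: vertical waste is every slot in a cycle where nothing issues, and horizontal waste is the unused slots in a cycle where at least one instruction issues. The function name and the sample issue pattern are hypothetical.

```python
def issue_waste(slots_used_per_cycle, issue_width):
    """Return (vertical_waste, horizontal_waste, total_slots)."""
    vertical = 0
    horizontal = 0
    for used in slots_used_per_cycle:
        if not 0 <= used <= issue_width:
            raise ValueError("slots used must be between 0 and issue_width")
        if used == 0:
            vertical += issue_width            # completely idle cycle
        else:
            horizontal += issue_width - used   # partially filled cycle
    total = issue_width * len(slots_used_per_cycle)
    return vertical, horizontal, total


# Made-up issue pattern for a 4-wide machine over six cycles
vertical, horizontal, total = issue_waste([2, 0, 1, 4, 0, 3], issue_width=4)
utilization = (total - vertical - horizontal) / total
```

Vertical waste typically comes from long stalls such as cache misses, while horizontal waste comes from too little instruction-level parallelism within a single thread.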

19
Vertical Multithreading
  • Cycle-by-cycle interleaving of second thread
    removes vertical waste

20
Ideal Multithreading for Superscalar
  • Interleave multiple threads to multiple issue
    slots with no restrictions
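
A toy comparison of the issue policies discussed so far, under strong assumptions: `ready[t][c]` gives how many instructions thread `t` could issue in cycle `c`, unissued work is not carried over to later cycles, and the "vertical" policy idealistically picks the single best thread each cycle. The numbers are invented for illustration.

```python
def issued_per_cycle(ready, width, policy):
    """Slots filled each cycle under a given policy."""
    issued = []
    for per_thread in zip(*ready):       # one tuple of ready counts per cycle
        if policy == "single":           # plain superscalar: only thread 0 runs
            n = per_thread[0]
        elif policy == "vertical":       # one thread owns all slots this cycle
            n = max(per_thread)
        elif policy == "smt":            # any thread may fill any slot
            n = sum(per_thread)
        else:
            raise ValueError(f"unknown policy: {policy}")
        issued.append(min(n, width))
    return issued


# Two threads, four cycles, 4-wide issue (made-up ready counts)
ready = [
    [2, 0, 1, 3],
    [1, 2, 0, 2],
]
single = sum(issued_per_cycle(ready, 4, "single"))
vertical = sum(issued_per_cycle(ready, 4, "vertical"))
smt = sum(issued_per_cycle(ready, 4, "smt"))
```

In this example the vertical policy removes the idle cycle but leaves partially filled ones, while unrestricted sharing of slots also recovers part of the horizontal waste.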

21
Simultaneous Multithreading
  • Add multiple contexts and fetch engines to wide
    out-of-order superscalar processor
  • Tullsen, Eggers, Levy, UW, 1995
  • OOO instruction window already has most of the
    circuitry required to schedule from multiple
    threads
  • Any single thread can utilize whole machine

22
Comparison of Issue Capabilities
(Courtesy of Susan Eggers. Used with permission.)
23
From Superscalar to SMT
  • Small items
      • per-thread program counters
      • per-thread return stacks
      • per-thread bookkeeping for instruction
        retirement, trap & instruction dispatch queue
        flush
      • thread identifiers, e.g., with BTB & TLB entries

24
Simultaneous Multithreaded Processor
25
Intel Pentium-4 Xeon Processor
  • Hyperthreading = SMT
  • Dual physical processors, each 2-way SMT
  • Logical processors share nearly all resources of
    the physical processor
      • Caches, execution units, branch predictors
  • Die area overhead of hyperthreading: ~5%
  • When one logical processor is stalled, the other
    can make progress
      • No logical processor can use all entries in
        queues when two threads are active
  • A processor running only one active software
    thread should run at the same speed with or
    without hyperthreading

26
Intel Hyperthreading Implementation
  • See attached paper
  • Note the separate buffer space/registers for the
    second thread
28
Intel Xeon Performance