Stream Architecture: Rethinking Media Processor Design

Slides: 38
Provided by: andrew638
Transcript and Presenter's Notes

Title: Stream Architecture: Rethinking Media Processor Design


1
Stream Architecture: Rethinking Media Processor Design
Scott Rixner, April 9, 2001
  • Rice University
  • Computer Systems Laboratory

2
Media Processing
  • Video/image compression and decompression
    • MPEG, JPEG, ...
  • Signal processing
    • DSL modems, cellular base stations, ...
  • Image synthesis
    • Polygon rendering, image-based rendering, ...
  • Image understanding
    • Face recognition, depth extraction, ...

3
Stereo Depth Extraction
Left Camera Image
Right Camera Image
  • 640x480 @ 30 fps
  • Requirements
    • 11 GOPS
  • Imagine stream processor
    • 12.1 GOPS, 4.6 GOPS/W

Depth Map
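The figures on this slide can be turned into a per-pixel budget with simple arithmetic. The sketch below only rearranges the numbers quoted above; the "operations per pixel" and "implied power" values are derived here and do not appear on the slide.

```python
# Back-of-the-envelope check of the stereo depth figures quoted on the slide.
WIDTH, HEIGHT, FPS = 640, 480, 30      # from the slide
REQUIRED_GOPS = 11.0                   # from the slide
IMAGINE_GOPS = 12.1                    # from the slide
IMAGINE_GOPS_PER_WATT = 4.6            # from the slide

# Pixels of one 640x480 frame stream processed each second.
pixels_per_second = WIDTH * HEIGHT * FPS

# Derived quantity (not on the slide): operations available per pixel.
ops_per_pixel = REQUIRED_GOPS * 1e9 / pixels_per_second

# Derived quantity (not on the slide): power implied by GOPS and GOPS/W.
implied_watts = IMAGINE_GOPS / IMAGINE_GOPS_PER_WATT

print(f"pixels per second: {pixels_per_second:,}")
print(f"operations per pixel: {ops_per_pixel:.0f}")
print(f"implied power: {implied_watts:.2f} W")
```

On these assumptions the requirement works out to roughly 1,200 operations per pixel, which is consistent with the "compute intensive" characterization later in the talk.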
4
Outline
  • Stream Processing
  • VLSI Constraints
  • Register Organization
  • Imagine
  • Conclusions

5
Media Processing Characteristics
  • Low-precision data
    • 24% 8-bit integer operations
    • 29% 16-bit integer operations
  • Abundant data parallelism
  • Little global data reuse
    • Average of 1.5 references per global data word
  • Numerous computations per global reference
    • 50-500 operations per global data reference

6
Stream Processing
  • Little data reuse (pixels never revisited)
  • Highly data parallel (output pixels not dependent
    on other output pixels)
  • Compute intensive (>60 operations per memory
    reference)

7
Locality and Concurrency
Operations within a kernel operate on local data
Kernels can be partitioned across chips to exploit control parallelism
Streams expose data parallelism
[Figure: Image 0 and Image 1 each pass through two convolve kernels; SAD combines the results into the Depth Map]
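The kernel-and-stream structure in the figure can be sketched in Python with generators, where each kernel consumes and produces a stream of image rows. This is an illustrative toy, not Imagine code: the function names, the 1-D filters, and the 3-pixel SAD window are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Sequence

Row = Sequence[int]

def convolve(stream: Iterable[Row], taps: Sequence[int]) -> Iterator[List[int]]:
    """Kernel: filter each row of the stream with a 1-D window (hypothetical)."""
    k = len(taps)
    for row in stream:
        yield [sum(taps[j] * row[i + j] for j in range(k))
               for i in range(len(row) - k + 1)]

def sad(left: Iterable[Row], right: Iterable[Row],
        max_disparity: int, window: int = 3) -> Iterator[List[int]]:
    """Kernel: per pixel, pick the disparity with the smallest
    sum of absolute differences over a short horizontal window."""
    for lrow, rrow in zip(left, right):
        out = []
        for x in range(len(lrow) - window + 1):
            def cost(d: int) -> int:
                return sum(abs(lrow[x + i] - rrow[x + i - d])
                           for i in range(window))
            # Only disparities that keep the window inside the row.
            out.append(min(range(min(max_disparity, x) + 1), key=cost))
        yield out

def depth_map(image0, image1, taps_a, taps_b, max_disparity):
    """Pipeline from the figure: two convolve kernels per image, then SAD."""
    left = convolve(convolve(image0, taps_a), taps_b)
    right = convolve(convolve(image1, taps_a), taps_b)
    return list(sad(left, right, max_disparity))

# Toy input: the right row is the left row shifted by two pixels.
left_img = [[0, 0, 0, 5, 9, 5, 0, 0]]
right_img = [[0, 5, 9, 5, 0, 0, 0, 0]]
print(depth_map(left_img, right_img, [1], [1], max_disparity=2))
```

Each kernel touches only the row it is handed (local data), while rows flow between kernels as streams, which is the separation the slide describes.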
8
Sony PlayStation2
Emotion Engine
FPU
MIPS Core
VPU0
Graphics Synthesizer
VPU1
Display
IPU
RDRAM, I/O, DMAC, etc.
9
Special vs. General Purpose
  • Special Purpose
    • Fixed function
    • High performance
  • General Purpose
    • Programmable
    • Insufficient performance

10
Register Files Dwarf ALUs
11
Register File Area
  • Each cell requires
    • 1 word line per port
    • 1 bit line per port
  • Each cell grows as $p^2$
  • R registers in the file
  • Area: $p^2 R \propto N^3$

Register Bit Cell
12
Register File Access Delay
  • Signal must traverse
    • Word line to access cell
    • Bit line to transfer data
  • Wire capacitance dominates
  • Delay: $p R^{1/2} \propto N^{3/2}$

Register File
13
Register File Power Dissipation
  • 100% utilization requires
    • driving all $p R^{1/2}$ bit lines
  • Wire capacitance dominates
  • Power: $p^2 R \propto N^3$

Register File
14
Centralized Register Organization
  • Area, Power $\propto N^3$, Delay $\propto N^{3/2}$
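The centralized scaling laws follow from the per-slide expressions if both the port count p and the register count R grow linearly with the number of ALUs N. That proportionality is implied by the slides rather than stated on them, so it is treated as an assumption in this short derivation:

```latex
\begin{align*}
p &\propto N, \qquad R \propto N
  && \text{(assumed: ports and registers scale with ALU count)} \\
\text{Area}  &\propto p^{2} R \;\propto\; N^{2} \cdot N = N^{3} \\
\text{Delay} &\propto p\, R^{1/2} \;\propto\; N \cdot N^{1/2} = N^{3/2} \\
\text{Power} &\propto p^{2} R \;\propto\; N^{3}
\end{align*}
```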

15
Partitioned Organizations
  • SIMD
    • Data-parallel axis
  • Distributed Register Files (DRF)
    • Instruction-level parallel axis
  • Hierarchical
    • Memory hierarchy axis
  • Stream
    • Optimizing for streams

16
SIMD Register Organization
  • Area, Power $\propto N^3/C^2$, Delay $\propto (N/C)^{3/2}$

17
Distributed Register Organization
  • Area, Power $\propto N^2$, Delay $\propto N$

18
Combining SIMD and DRF
Scalar
SIMD
Central
DRF
19
Hierarchical Register Organization
Hierarchical T40
  • Area, Power $\propto N^3$, Delay $\propto N^{3/2}$

20
Hierarchical Organizations
Scalar
SIMD
Central
DRF
21
Stream Register Organization
  • Area, Power $\propto N^2/C$, Delay $\propto N/C$

22
Stream Organizations
Scalar
SIMD
Central
DRF
23
Comparison of Organizations
  • 48 ALUs (32-bit), 500 MHz
  • Stream organization improves on the central organization by
    • Area 195x, Delay 20x, Power 430x
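To see how the asymptotic expressions from the earlier slides compare, the sketch below evaluates them for N = 48 ALUs. The cluster count C = 8 is an assumption chosen for illustration, and all constant factors are dropped, so the printed ratios show only the trend; they are not expected to reproduce the 195x, 20x, and 430x figures above, which presumably come from a detailed cost model.

```python
def scaling(n_alus: int, clusters: int) -> dict:
    """Asymptotic (area-or-power, delay) terms per organization.

    Constant factors are omitted, so only ratios between rows are meaningful.
    """
    n, c = float(n_alus), float(clusters)
    return {
        "central": (n**3, n**1.5),
        "SIMD": (n**3 / c**2, (n / c)**1.5),
        "DRF": (n**2, n),
        "stream": (n**2 / c, n / c),
    }

# N = 48 is from the slide; clusters = 8 is an assumption for this example.
terms = scaling(n_alus=48, clusters=8)
base_area, base_delay = terms["central"]

for name, (area, delay) in terms.items():
    print(f"{name:8s} area/power reduced x{base_area / area:7.1f}  "
          f"delay reduced x{base_delay / delay:6.1f}")
```

The ordering is the useful part: partitioning along either axis helps, and the stream organization, which combines them, shrinks both terms the most.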

24
Performance
16% performance drop (8% with latency constraints)
180x Improvement
25
Stream Architecture
  • Stream Processing
    • Matched to media processing
    • Exposes locality and concurrency
  • Stream Register Organization
    • Efficiency of special-purpose hardware
    • Optimized for streaming applications
  • Data bandwidth
    • Bandwidth hierarchy
    • Memory access scheduling
    • Conditional streams

26
The Imagine Stream Processor
27
Arithmetic Clusters
Communication Unit
Scratch-pad Register File
Intercluster Network
Local Register File
CU
To SRF
Cross Point
From SRF
28
Bandwidth Hierarchy
SDRAM: 2 GB/s
Stream Register File: 32 GB/s
ALU Clusters: 544 GB/s
  • 41.2 32-bit operations per word of memory
    bandwidth
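The three bandwidth figures and the operations-per-word ratio can be related with a little arithmetic. The sketch below assumes the three numbers belong to SDRAM, the stream register file, and the ALU clusters in that order, that 1 GB/s means 10^9 bytes per second, and that a word is 4 bytes. The "implied peak" value is an inference from those assumptions, not a number stated on the slide.

```python
# Bandwidth at each level of the hierarchy, in GB/s (from the slide;
# the level-to-number mapping follows the order of the figure labels).
BANDWIDTH_GBPS = {
    "SDRAM": 2,
    "stream register file": 32,
    "ALU clusters": 544,
}
OPS_PER_MEMORY_WORD = 41.2   # from the slide
BYTES_PER_WORD = 4           # 32-bit words

# How much more bandwidth each level offers relative to off-chip memory.
ratios = {level: bw / BANDWIDTH_GBPS["SDRAM"]
          for level, bw in BANDWIDTH_GBPS.items()}

# Words per second that memory can deliver, and the operation rate that
# 41.2 operations per memory word would then correspond to (derived).
memory_words_per_s = BANDWIDTH_GBPS["SDRAM"] * 1e9 / BYTES_PER_WORD
implied_peak_gops = OPS_PER_MEMORY_WORD * memory_words_per_s / 1e9

for level, ratio in ratios.items():
    print(f"{level:22s} {ratio:5.0f}x memory bandwidth")
print(f"implied peak rate: {implied_peak_gops:.1f} GOPS")
```

Read this way, each step toward the ALUs multiplies the available bandwidth by 16 to 17 times, which is why keeping intermediate data out of SDRAM matters so much.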

29
Stream Recirculation
30
Bandwidth Demands of FIR Filter
31
Bandwidth Utilization of FIR Filter
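The two FIR slides are charts whose data did not survive in this transcript. As a rough illustration of why an FIR filter suits a bandwidth hierarchy, the sketch below counts traffic at two levels for a software FIR filter. The counting model (one stream word in, one out, two local operand reads per tap) is an assumption made for this example, not a description of how Imagine measures bandwidth.

```python
def fir_stream(samples, taps):
    """Run an FIR filter over a stream and count traffic per level.

    Hypothetical model: the sliding window and coefficients live in local
    registers; only new inputs and finished outputs touch the stream.
    """
    history = [0] * len(taps)            # sliding window in local registers
    stats = {"stream_words": 0, "local_reads": 0, "ops": 0}
    out = []
    for x in samples:
        stats["stream_words"] += 1       # read one input word
        history = [x] + history[:-1]     # shift the window
        acc = 0
        for h, t in zip(history, taps):
            acc += h * t
            stats["local_reads"] += 2    # one sample, one coefficient
            stats["ops"] += 2            # one multiply, one add
        out.append(acc)
        stats["stream_words"] += 1       # write one output word
    return out, stats

outputs, stats = fir_stream(range(1, 101), taps=[1] * 16)
print(stats)
print("local reads per stream word:",
      stats["local_reads"] / stats["stream_words"])
```

With 16 taps, each stream word is matched by 16 local operand reads in this model, so most of the demand lands on the level of the hierarchy with the most bandwidth.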
32
Performance
floating-point application
16-bit kernels
16-bit applications
floating-point kernel
33
Power
GOPS/W: 4.6, 6.9, 4.1, 10.2, 9.6, 2.4, 6.3
34
Relative Performance and Power Efficiency
FFT Performance
Power Efficiency
35
Imagine Floorplan
  • Tapeout Q2 '01
  • 21 million transistors
    • 6M SRF SRAM
    • 6M UC SRAM
    • 6M Clusters
    • 3M Other
  • Target 32 FO4
    • 300 MHz at SSSS
    • 500 MHz at TTSS
  • TI GS30KA
    • 0.15 µm Ldrawn
  • 457 Signal Pins

36
Imagine Team
  • William J. Dally
  • Ujval Kapasi
  • Brucek Khailany
  • Peter Mattson
  • Jinyung Namkoong
  • John Owens
  • Ben Serebrin
  • Brian Towles
  • Scott Rixner
  • Don Alpert (Intel)
  • Ghazi Ben Amor
  • Chris Buehler (MIT)
  • JP Grossman (MIT)
  • Brad Johanson
  • Abelardo Lopez-Lagunas
  • Ben Mowery
  • Manman Ren

37
Conclusions
  • Media Processing
    • Little data reuse
    • Highly data parallel
    • Compute intensive
  • VLSI
    • Stream register organization
    • Bandwidth hierarchy
  • Imagine
    • Stream architecture
    • 10 GOPS sustained application performance
    • 5 GOPS/W application power efficiency