Title: Performance Evaluation of Two Emerging Media Processors: VIRAM and Imagine
Slide 1: Performance Evaluation of Two Emerging Media Processors: VIRAM and Imagine
Leonid Oliker, Future Technologies Group, Computational Research Division, LBNL (www.nersc.gov/oliker)
Sourav Chatterji, Jason Duell, Manikandan Narayanan
Slide 2: Motivation
- Commodity cache-based SMP clusters achieve a small fraction of peak for memory-intensive problems (especially irregular ones)
- The gap between processor performance and DRAM access time continues to grow (60%/yr vs. 7%/yr)
- Power and packaging are becoming significant bottlenecks
- Better software is improving some problems: ATLAS, FFTW, Sparsity, PHiPAC
- Alternative architectures allow tighter integration of processor and memory. Can we build HPC systems with high-end media processor technology?
  - VIRAM: PIM technology combines embedded DRAM with a vector coprocessor to exploit its large bandwidth potential
  - Imagine: stream-aware memory supports the large processing potential of SIMD-controlled VLIW clusters
Slide 3: Motivation
- General-purpose processors are badly suited for data-intensive operations
  - Large caches are not useful
  - Low memory bandwidth
  - Superscalar methods of increasing ILP are inefficient
  - High power consumption
- Application-specific ASICs: good, but expensive and slow to design
- Solution: general-purpose, memory-aware processors
  - Large number of ALUs to exploit data parallelism
  - Huge memory bandwidth to keep the ALUs busy
  - Concurrency to overlap memory access with computation
Slide 4: VIRAM Overview
- MIPS core (200 MHz)
- Main memory system
  - 8 banks with 13 MB of on-chip DRAM
  - Large 6.4 GB/s on-chip peak bandwidth
- Cache-less vector unit
  - Energy-efficient way to express fine-grained parallelism and exploit bandwidth
  - Single-issue, in-order
- Low power consumption: 2.0 W
- Peak vector performance
  - 1.6/3.2/6.4 Gops (64/32/16-bit data)
  - 1.6 GFlop/s (single precision)
- Fabricated by IBM; taped out 02/2003
- To hide DRAM access latency, load/store and arithmetic instructions are deeply pipelined (15 stages)
- We use a simulator with Cray's vcc compiler
Slide 5: VIRAM Vector Lanes
- The parallel lane design has advantages in performance, design complexity, and scalability
- Each lane has 2 ALUs (1 usable for FP) and receives identical control signals
- Vector instructions specify 64-way parallelism; the hardware executes 8-way
- An 8 KB vector register file is partitioned into 32 vector registers
- Variable data widths: 4 lanes for 64-bit data, 8 for 32-bit, 16 for 16-bit
- When the data width is cut in half, the number of elements per register (and the peak rate) doubles
- Limitations: no 64-bit FP, and the compiler doesn't generate fused multiply-add (MADD)
Slide 6: VIRAM Power Efficiency
- Comparable performance at a lower clock rate
- Large power/performance advantage for VIRAM from PIM technology and the data-parallel execution model
Slide 7: Stream Processing
Example: stereo depth extraction
- Data and functional parallelism
- High computation rate
- Little data reuse
- Producer-consumer and spatial locality
- Examples: multimedia, signal processing, graphics
- Stream: an ordered set of records (homogeneous, arbitrary data type)
- Stream programming: data is streams, computation is a kernel
  - A kernel loops through all stream elements (in sequential order)
  - It performs a compound (multiword) operation on each stream element
- Vectors, by contrast, perform a single arithmetic operation on each vector element (then store the result back in a register)
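The kernel-vs-vector contrast above can be sketched in scalar Python (names are illustrative; the point is that a compound stream kernel keeps intermediates local to the cluster, while vector execution writes each operation's full-vector result back to a register):

```python
# Stream style: one kernel pass applies a compound (multi-op) function to
# each record; intermediate t never leaves the "cluster" (local scope).
def stream_kernel(records):
    out = []
    for (a, b) in records:          # sequential pass over the stream
        t = a * 2.0                 # compound operation: a multiply...
        out.append((t + b, t - b))  # ...then two adds, before write-back
    return out

# Vector style: each arithmetic op is a separate full-vector instruction,
# with results stored to vector registers between operations.
def vector_ops(va, vb):
    vt = [a * 2.0 for a in va]             # one vector multiply
    vs = [t + b for t, b in zip(vt, vb)]   # one vector add
    vd = [t - b for t, b in zip(vt, vb)]   # one vector subtract
    return vs, vd
```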
Slide 8: Imagine Overview
- Vector VLIW processor
- Coprocessor to an off-chip host processor
- 8 arithmetic clusters controlled in SIMD with VLIW instructions
- Central 128 KB Stream Register File (SRF) at 32 GB/s
  - The SRF can overlap computation with memory access (double buffering)
  - The SRF can reuse intermediate results (producer-consumer locality)
- Stream-aware memory system with 2.7 GB/s off-chip bandwidth
- 544 GB/s inter-cluster communication
- The host sends instructions to the stream controller; the SC issues commands to the on-chip modules
Slide 9: Imagine Arithmetic Clusters
- 400 MHz clock, 8 clusters with 6 functional units each (48 FUs total)
- Reads/writes streams to the SRF
- Each cluster: 3 ADD, 2 MULT, 1 DIV/SQRT, 1 scratchpad, 1 communication unit
- 32-bit architecture; subword operations support 16- and 8-bit data (no 64-bit support)
- Local registers on the functional units hold 16 words each (1.5 KB total)
- Clusters receive VLIW-style instructions broadcast from a microcontroller
Slide 10: VIRAM and Imagine
- Imagine has an order of magnitude higher peak performance
- VIRAM has twice the memory bandwidth and lower power consumption
- Note the peak Flop/Word ratios
Slide 11: SQMAT Architectural Probe
3x3 Matrix Multiply
- Sqmat: a scalable synthetic probe that controls computational intensity and vector length
- Imagine's stream model requires a large number of ops per word to amortize memory references. Poor use of the SRF: no producer-consumer locality
- Long streams help hide memory latency, but reach only 7% of algorithmic peak
- VIRAM performs well at low ops/word (40% when L=256)
  - The vector pipeline overlaps computation and memory; on-chip DRAM gives high bandwidth at low latency
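A minimal sketch of the Sqmat probe, assuming (from the slides) that each 3x3 matrix is squared N times, so the number of operations per loaded word grows with N while the data volume stays fixed; names are illustrative:

```python
def matmul3(A, B):
    # 3x3 matrix product: 27 multiplies + 18 adds = 45 flops
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def sqmat(mats, N):
    # Each matrix is loaded once (9 words) but squared N times, so
    # arithmetic intensity (ops/word) scales linearly with N; the stream
    # length L is simply len(mats).
    out = []
    for M in mats:
        for _ in range(N):
            M = matmul3(M, M)
        out.append(M)
    return out
```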
Slide 12: SQMAT Performance Crossover
- Large number of ops/word: N=10, where N is the number of squarings of each 3x3 matrix
- Crossover point: L=64 (cycles), L=256 (MFlop/s)
- Imagine's power becomes apparent: almost 4x VIRAM at L=1024. Codes at this end of the spectrum greatly benefit from the Imagine architecture
Slide 13: VIRAM/Imagine Optimization
- Optimization strategy: speed up the slower of computation or memory
  - Restructure the computation for better kernel performance (memory is waiting for the ALUs)
  - Add more computation for better memory performance (the ALUs are memory-starved)
  - Subtle overlap effects: vector chaining, stream double buffering
- Example optimization: RGB→YIQ conversion from EEMBC
  - Input format: R1G1B1R2G2B2R3G3B3
  - Required format: R1R2R3, G1G2G3, B1B2B3
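The conversion itself applies the standard NTSC RGB-to-YIQ transform to each pixel; a floating-point sketch (the EEMBC kernel uses a fixed-point variant of these coefficients):

```python
def rgb_to_yiq(r, g, b):
    # Standard NTSC RGB -> YIQ color transform: 9 multiplies and
    # 6 adds per pixel, with no reuse between pixels
    y = 0.299 * r + 0.587 * g + 0.114 * b
    i = 0.596 * r - 0.274 * g - 0.322 * b
    q = 0.211 * r - 0.523 * g + 0.312 * b
    return y, i, q
```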
Slide 14: VIRAM RGB→YIQ Optimization
- VIRAM: poor memory performance
  - Strided accesses (1/2 performance): RGBRGBRGB → strided loads → RRRGGGBBB
  - Only 4 address generators for 8 addresses (sufficient only for 64-bit data)
  - Word operations on byte data (1/4 performance)
- Optimization: replace strided accesses with unit-stride accesses, using in-register shuffles
  - Increased computational overhead (packing and unpacking)
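The two access patterns above can be contrasted in scalar form (a sketch; function names are illustrative, and VIRAM's actual shuffles operate on vector registers rather than Python lists):

```python
def deinterleave_strided(pix):
    # Naive: three stride-3 passes over the pixel array; on VIRAM each
    # strided load runs at roughly half speed (4 address generators
    # cannot feed 8 addresses per cycle)
    return pix[0::3], pix[1::3], pix[2::3]

def deinterleave_unit(pix):
    # Optimized pattern: one unit-stride pass, with components separated
    # by shuffling (modeled per 3-word group); extra ALU work for
    # pack/unpack buys full-speed memory accesses
    r, g, b = [], [], []
    for k in range(0, len(pix), 3):
        r.append(pix[k]); g.append(pix[k + 1]); b.append(pix[k + 2])
    return r, g, b
```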
Slide 15: VIRAM RGB→YIQ Results
- Used functional units instead of memory to extract the components, increasing the computational overhead
Slide 16: Imagine RGB→YIQ Optimization
- Imagine's bottleneck is computation, due to a poor ALU schedule (left)
  - Unoptimized: 15 cycles per pixel
- Software pipelining makes the VLIW schedule denser (right)
  - Optimized: 8 cycles per pixel
Slide 17: Imagine RGB→YIQ Results
The optimized kernel takes only half the cycles per element; memory is now the new bottleneck.
Slide 18: EEMBC Benchmarks
- Vec-add: one add per element; performance limited by the memory system
- RGB→(YIQ, CMYK): VIRAM limited by processing (cannot use the available bandwidth)
- Grayfilter: difficult to implement efficiently on Imagine (sliding 3x3 window)
- Autocorr: uses short streams; Imagine's host latency is high
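The Grayfilter access pattern, a 3x3 window sliding over the image, can be sketched as follows (a scalar sketch; the kernel coefficients and border handling here are illustrative assumptions, not the EEMBC reference code):

```python
def gray_filter(img, kernel):
    # 3x3 sliding-window convolution over a grayscale image (borders
    # skipped). Adjacent windows overlap in 6 of 9 pixels: spatial reuse
    # that is awkward to express as independent records in a stream
    H, W = len(img), len(img[0])
    out = [[0] * W for _ in range(H)]
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            out[y][x] = sum(kernel[dy][dx] * img[y - 1 + dy][x - 1 + dx]
                            for dy in range(3) for dx in range(3))
    return out
```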
Slide 19: Scientific Kernels: SpMV Performance
- Algorithmic peak: VIRAM 8 ops/cycle, Imagine 32 ops/cycle
- LSHAPE: finite element matrix; LARGEDIS: pseudo-random nonzeros
- Imagine lacks irregular access support, so the matrix is reordered before the kernel runs
- VIRAM is better suited to this class of applications (low computation/memory ratio)
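SpMV's low ops/word and irregular gathers are visible in a scalar sketch; compressed sparse row (CSR) is assumed here, since the slides do not name the storage format:

```python
def spmv_csr(vals, cols, rowptr, x):
    # y = A*x with A stored in CSR form: one multiply-add per stored
    # nonzero (2 flops per 2-3 loaded words), plus an irregular indexed
    # gather of x[cols[k]] -- exactly the low-intensity access pattern
    # discussed on the slide
    y = []
    for r in range(len(rowptr) - 1):
        s = 0.0
        for k in range(rowptr[r], rowptr[r + 1]):
            s += vals[k] * x[cols[k]]
        y.append(s)
    return y
```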
Slide 20: Scientific Kernels: Complex QR Decomposition
- A = QR, with Q orthogonal and R upper triangular
- Blocked Householder variant, rich in level-3 BLAS operations
- Complex elements increase ops/word locality (1 complex MUL = 6 real ops)
- VIRAM uses a CLAPACK port (insertion of vectorization directives)
- Imagine: complex indexing of the matrix stream (each iteration works on a smaller matrix)
- Imagine achieves over 10 GFlop/s (19x VIRAM): QR is well suited to this architecture. Low VIRAM performance is due to strided accesses and compiler limitations
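The ops/word claim for complex arithmetic can be checked directly: one complex multiply costs 4 real multiplies plus 2 real adds, while each complex operand is only 2 words, roughly tripling arithmetic intensity versus real arithmetic. A sketch:

```python
def cmul(a, b):
    # (ar + i*ai) * (br + i*bi): 4 real multiplies + 2 real adds
    # = 6 real ops on 2-word operands
    ar, ai = a
    br, bi = b
    return (ar * br - ai * bi, ar * bi + ai * br)
```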
Slide 21: Overview
- Significantly different balance of memory organization
- Relative performance depends on computational intensity
- Programming complexity is high for both approaches, although VIRAM is based on established vector technology
- For well-suited applications, the Imagine processor can sustain over 10 GFlop/s (simulated results)
- Large homogeneous computation is required to saturate Imagine, while VIRAM can operate on small vector sizes
- Imagine can take advantage of producer-consumer locality
- Both offer significant reductions in power and space
- Both may be used as coprocessors in future-generation architectures
Slide 22: Next Generation
- CODE: the next generation of VIRAM
  - More functional units and a faster clock speed
  - Local registers per functional unit instead of a single register file
  - Looking more like Imagine
- Multi-VIRAM architecture: network interface issues?
- Brook: a new language for Imagine
  - Eliminates exposure of hardware details (e.g., the number of clusters)
- Streaming Supercomputer: a multi-Imagine configuration
  - Streams can be used for functional and data parallelism
- Currently evaluating the DIVA architecture