Title: Performance Evaluation of Two Emerging Media Processors: VIRAM and Imagine
Slide 1: Performance Evaluation of Two Emerging Media Processors: VIRAM and Imagine
Leonid Oliker, Future Technologies Group, Computational Research Division, LBNL (www.nersc.gov/oliker)
Sourav Chatterji, Jason Duell, Manikandan Narayanan
Slide 2: Motivation
- Commodity cache-based SMP clusters achieve a small fraction of peak for memory-intensive problems (especially irregular ones)
- The gap between processor performance and DRAM access time continues to grow (60%/yr vs. 7%/yr)
- Power and packaging are becoming significant bottlenecks
- Better software is improving some problems: ATLAS, FFTW, Sparsity, PHiPAC
- Alternative architectures allow tighter integration of processor and memory. Can we build HPC systems with high-end media processor technology?
  - VIRAM: PIM technology combines embedded DRAM with a vector coprocessor to exploit its large bandwidth potential
  - Imagine: stream-aware memory supports the large processing potential of SIMD-controlled VLIW clusters
Slide 3: Motivation
- General-purpose processors are badly suited for data-intensive operations
  - Large caches are not useful
  - Low memory bandwidth
  - Superscalar methods of increasing ILP are inefficient
  - High power consumption
- Application-specific ASICs: good, but expensive and slow to design
- Solution: general-purpose, memory-aware processors
  - Large number of ALUs to exploit data parallelism
  - Huge memory bandwidth to keep the ALUs busy
  - Concurrency to overlap memory access with computation
Slide 4: VIRAM Overview
- MIPS core (200 MHz)
- Main memory system
  - 8 banks with 13 MB of on-chip DRAM
  - Large 6.4 GB/s on-chip peak bandwidth
- Cache-less vector unit
  - Energy-efficient way to express fine-grained parallelism and exploit bandwidth
  - Single-issue, in-order
- Low power consumption: 2.0 W
- Peak vector performance
  - 1.6/3.2/6.4 Gops (64/32/16-bit data)
  - 1.6 GFlop/s (single precision)
- Fabricated by IBM; taped out 02/2003
- To hide DRAM access latency, load/store and arithmetic instructions are deeply pipelined (15 stages)
- We use a simulator with Cray's vcc compiler
Slide 5: VIRAM Vector Lanes
- The parallel lane design has advantages in performance, design complexity, and scalability
- Each lane has 2 ALUs (1 usable for FP) and receives identical control signals
- Vector instructions specify 64-way parallelism; the hardware executes 8-way
- An 8 KB vector register file is partitioned into 32 vector registers
- Variable data widths: 4 lanes for 64-bit data, 8 for 32-bit, 16 for 16-bit
- When the data width is cut in half, the number of elements per register (and the peak rate) doubles
- Limitations: no 64-bit FP, and the compiler doesn't generate fused multiply-add (MADD)
Slide 6: VIRAM Power Efficiency
- Comparable performance at a lower clock rate
- Large power/performance advantage for VIRAM from PIM technology and the data-parallel execution model
Slide 7: Stream Processing
Example: stereo depth extraction
- Data and functional parallelism
- High computation rate
- Little data reuse
- Producer-consumer and spatial locality
- Examples: multimedia, signal processing, graphics
- Stream: an ordered set of records (homogeneous, arbitrary data type)
- Stream programming: data is streams, computation is a kernel
  - A kernel loops through all stream elements (in sequential order)
  - It performs a compound (multiword) operation on each stream element
- Vectors, by contrast, perform a single arithmetic operation on each vector element (then store the result back in a register)
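The kernel-vs-vector contrast above can be sketched in scalar Python (names are illustrative; the point is that a compound stream kernel keeps intermediates local to the cluster, while vector execution writes each operation's full-vector result back to a register):

```python
# Stream style: one kernel pass applies a compound (multi-op) function to
# each record; intermediate t never leaves the "cluster" (local scope).
def stream_kernel(records):
    out = []
    for (a, b) in records:          # sequential pass over the stream
        t = a * 2.0                 # compound operation: a multiply...
        out.append((t + b, t - b))  # ...then two adds, before write-back
    return out

# Vector style: each arithmetic op is a separate full-vector instruction,
# with results stored to vector registers between operations.
def vector_ops(va, vb):
    vt = [a * 2.0 for a in va]             # one vector multiply
    vs = [t + b for t, b in zip(vt, vb)]   # one vector add
    vd = [t - b for t, b in zip(vt, vb)]   # one vector subtract
    return vs, vd
```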
Slide 8: Imagine Overview
- Vector VLIW processor
- Coprocessor to an off-chip host processor
- 8 arithmetic clusters controlled in SIMD with VLIW instructions
- Central 128 KB Stream Register File (SRF) at 32 GB/s
  - The SRF can overlap computation with memory access (double buffering)
  - The SRF can reuse intermediate results (producer-consumer locality)
- Stream-aware memory system with 2.7 GB/s off-chip bandwidth
- 544 GB/s inter-cluster communication
- The host sends instructions to the stream controller; the SC issues commands to the on-chip modules
Slide 9: Imagine Arithmetic Clusters
- 400 MHz clock, 8 clusters with 6 functional units each (48 FUs total)
- Reads/writes streams to the SRF
- Each cluster: 3 ADD, 2 MULT, 1 DIV/SQRT, 1 scratchpad, 1 communication unit
- 32-bit architecture; subword operations support 16- and 8-bit data (no 64-bit support)
- Local registers on the functional units hold 16 words each (1.5 KB total)
- Clusters receive VLIW-style instructions broadcast from a microcontroller
Slide 10: VIRAM and Imagine
- Imagine has an order of magnitude higher peak performance
- VIRAM has twice the memory bandwidth and lower power consumption
- Note the peak Flop/Word ratios
Slide 11: SQMAT Architectural Probe
3x3 Matrix Multiply
- Sqmat: a scalable synthetic probe that controls computational intensity and vector length
- Imagine's stream model requires a large number of ops per word to amortize memory references. Poor use of the SRF: no producer-consumer locality
- Long streams help hide memory latency, but reach only 7% of algorithmic peak
- VIRAM performs well at low ops/word (40% when L=256)
  - The vector pipeline overlaps computation and memory; on-chip DRAM gives high bandwidth at low latency
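A minimal sketch of the Sqmat probe, assuming (from the slides) that each 3x3 matrix is squared N times, so the number of operations per loaded word grows with N while the data volume stays fixed; names are illustrative:

```python
def matmul3(A, B):
    # 3x3 matrix product: 27 multiplies + 18 adds = 45 flops
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def sqmat(mats, N):
    # Each matrix is loaded once (9 words) but squared N times, so
    # arithmetic intensity (ops/word) scales linearly with N; the stream
    # length L is simply len(mats).
    out = []
    for M in mats:
        for _ in range(N):
            M = matmul3(M, M)
        out.append(M)
    return out
```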
Slide 12: SQMAT Performance Crossover
- Large number of ops/word: N=10, where N is the number of squarings of each 3x3 matrix
- Crossover point: L=64 (cycles), L=256 (MFlop/s)
- Imagine's power becomes apparent: almost 4x VIRAM at L=1024. Codes at this end of the spectrum greatly benefit from the Imagine architecture
Slide 13: VIRAM/Imagine Optimization
- Optimization strategy: speed up the slower of computation or memory
  - Restructure the computation for better kernel performance (memory is waiting for the ALUs)
  - Add more computation for better memory performance (the ALUs are memory-starved)
  - Subtle overlap effects: vector chaining, stream double buffering
- Example optimization: RGB→YIQ conversion from EEMBC
  - Input format: R1G1B1R2G2B2R3G3B3
  - Required format: R1R2R3, G1G2G3, B1B2B3
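The conversion itself applies the standard NTSC RGB-to-YIQ transform to each pixel; a floating-point sketch (the EEMBC kernel uses a fixed-point variant of these coefficients):

```python
def rgb_to_yiq(r, g, b):
    # Standard NTSC RGB -> YIQ color transform: 9 multiplies and
    # 6 adds per pixel, with no reuse between pixels
    y = 0.299 * r + 0.587 * g + 0.114 * b
    i = 0.596 * r - 0.274 * g - 0.322 * b
    q = 0.211 * r - 0.523 * g + 0.312 * b
    return y, i, q
```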
Slide 14: VIRAM RGB→YIQ Optimization
- VIRAM: poor memory performance
  - Strided accesses (1/2 performance): RGBRGBRGB → strided loads → RRRGGGBBB
  - Only 4 address generators for 8 addresses (sufficient only for 64-bit data)
  - Word operations on byte data (1/4 performance)
- Optimization: replace strided accesses with unit-stride accesses, using in-register shuffles
  - Increased computational overhead (packing and unpacking)
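The two access patterns above can be contrasted in scalar form (a sketch; function names are illustrative, and VIRAM's actual shuffles operate on vector registers rather than Python lists):

```python
def deinterleave_strided(pix):
    # Naive: three stride-3 passes over the pixel array; on VIRAM each
    # strided load runs at roughly half speed (4 address generators
    # cannot feed 8 addresses per cycle)
    return pix[0::3], pix[1::3], pix[2::3]

def deinterleave_unit(pix):
    # Optimized pattern: one unit-stride pass, with components separated
    # by shuffling (modeled per 3-word group); extra ALU work for
    # pack/unpack buys full-speed memory accesses
    r, g, b = [], [], []
    for k in range(0, len(pix), 3):
        r.append(pix[k]); g.append(pix[k + 1]); b.append(pix[k + 2])
    return r, g, b
```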
Slide 15: VIRAM RGB→YIQ Results
- Used functional units instead of memory to extract the components, increasing the computational overhead
Slide 16: Imagine RGB→YIQ Optimization
- Imagine's bottleneck is computation, due to a poor ALU schedule (left)
  - Unoptimized: 15 cycles per pixel
- Software pipelining makes the VLIW schedule denser (right)
  - Optimized: 8 cycles per pixel
Slide 17: Imagine RGB→YIQ Results
The optimized kernel takes only half the cycles per element; memory is now the new bottleneck.
Slide 18: EEMBC Benchmarks
- Vec-add: one add per element; performance limited by the memory system
- RGB→(YIQ, CMYK): VIRAM limited by processing (cannot use the available bandwidth)
- Grayfilter: difficult to implement efficiently on Imagine (sliding 3x3 window)
- Autocorr: uses short streams; Imagine's host latency is high
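The Grayfilter access pattern, a 3x3 window sliding over the image, can be sketched as follows (a scalar sketch; the kernel coefficients and border handling here are illustrative assumptions, not the EEMBC reference code):

```python
def gray_filter(img, kernel):
    # 3x3 sliding-window convolution over a grayscale image (borders
    # skipped). Adjacent windows overlap in 6 of 9 pixels: spatial reuse
    # that is awkward to express as independent records in a stream
    H, W = len(img), len(img[0])
    out = [[0] * W for _ in range(H)]
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            out[y][x] = sum(kernel[dy][dx] * img[y - 1 + dy][x - 1 + dx]
                            for dy in range(3) for dx in range(3))
    return out
```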
Slide 19: Scientific Kernels: SpMV Performance
- Algorithmic peak: VIRAM 8 ops/cycle, Imagine 32 ops/cycle
- LSHAPE: finite element matrix; LARGEDIS: pseudo-random nonzeros
- Imagine lacks irregular access support, so the matrix is reordered before the kernel runs
- VIRAM is better suited to this class of applications (low computation/memory ratio)
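SpMV's low ops/word and irregular gathers are visible in a scalar sketch; compressed sparse row (CSR) is assumed here, since the slides do not name the storage format:

```python
def spmv_csr(vals, cols, rowptr, x):
    # y = A*x with A stored in CSR form: one multiply-add per stored
    # nonzero (2 flops per 2-3 loaded words), plus an irregular indexed
    # gather of x[cols[k]] -- exactly the low-intensity access pattern
    # discussed on the slide
    y = []
    for r in range(len(rowptr) - 1):
        s = 0.0
        for k in range(rowptr[r], rowptr[r + 1]):
            s += vals[k] * x[cols[k]]
        y.append(s)
    return y
```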
Slide 20: Scientific Kernels: Complex QR Decomposition
- A = QR, with Q orthogonal and R upper triangular
- Blocked Householder variant, rich in level-3 BLAS operations
- Complex elements increase ops/word locality (1 complex MUL = 6 real ops)
- VIRAM uses a CLAPACK port (insertion of vectorization directives)
- Imagine: complex indexing of the matrix stream (each iteration works on a smaller matrix)
- Imagine achieves over 10 GFlop/s (19x VIRAM): QR is well suited to this architecture. Low VIRAM performance is due to strided accesses and compiler limitations
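The ops/word claim for complex arithmetic can be checked directly: one complex multiply costs 4 real multiplies plus 2 real adds, while each complex operand is only 2 words, roughly tripling arithmetic intensity versus real arithmetic. A sketch:

```python
def cmul(a, b):
    # (ar + i*ai) * (br + i*bi): 4 real multiplies + 2 real adds
    # = 6 real ops on 2-word operands
    ar, ai = a
    br, bi = b
    return (ar * br - ai * bi, ar * bi + ai * br)
```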
Slide 21: Overview
- Significantly different balance of memory organization
- Relative performance depends on computational intensity
- Programming complexity is high for both approaches, although VIRAM is based on established vector technology
- For well-suited applications, the Imagine processor can sustain over 10 GFlop/s (simulated results)
- Large homogeneous computation is required to saturate Imagine, while VIRAM can operate on small vector sizes
- Imagine can take advantage of producer-consumer locality
- Both offer significant reductions in power and space
- Both may be used as coprocessors in future-generation architectures
Slide 22: Next Generation
- CODE: the next generation of VIRAM
  - More functional units and a faster clock speed
  - Local registers per functional unit instead of a single register file
  - Looking more like Imagine
- Multi-VIRAM architecture: network interface issues?
- Brook: a new language for Imagine
  - Eliminates exposure of hardware details (e.g., the number of clusters)
- Streaming Supercomputer: a multi-Imagine configuration
  - Streams can be used for functional and data parallelism
- Currently evaluating the DIVA architecture