Performance Evaluation of Two Emerging Media Processors: VIRAM and Imagine

Slides: 23
Provided by: nasaa5
1
Performance Evaluation of Two Emerging Media
Processors: VIRAM and Imagine

Leonid Oliker
Future Technologies Group, Computational Research Division, LBNL
www.nersc.gov/oliker
Sourav Chatterji, Jason Duell, Manikandan Narayanan
2
Motivation
  • Commodity cache-based SMP clusters perform at a
    small % of peak for memory-intensive problems
    (esp. irregular problems)
  • But the gap between processor performance and DRAM
    access times continues to grow (60%/yr vs. 7%/yr)
  • Power and packaging are becoming significant
    bottlenecks
  • Better software is improving some problems:
    ATLAS, FFTW, Sparsity, PHiPAC
  • Alternative architectures allow tighter integration
    of processor and memory. Can we build HPC systems
    w/ high-end media processor tech?
  • VIRAM: PIM technology combines embedded DRAM with
    a vector coprocessor to exploit large bandwidth
    potential
  • IMAGINE: stream-aware memory supports the large
    processing potential of SIMD-controlled VLIW
    clusters

3
Motivation
  • General purpose procs badly suited for data
    intensive ops
  • Large caches not useful
  • Low memory bandwidth
  • Superscalar methods of increasing ILP inefficient
  • Power consumption
  • Application-specific ASICs
  • Good, but expensive/slow to design.
  • Solution: general purpose, memory-aware
    processors
  • Large number of ALUs to exploit data-parallelism
  • Huge memory bandwidth to keep ALUs busy
  • Concurrency: overlap memory w/ computation

4
VIRAM Overview
  • MIPS core (200 MHz)
  • Main memory system
  • 8 banks w/13 MB of on-chip DRAM
  • Large 6.4 GBytes/s on-chip peak bandwidth
  • Cache-less vector unit
  • Energy efficient way to express fine-grained
    parallelism and exploit bandwidth
  • Single issue, in order
  • Low power consumption: 2.0 W
  • Peak vector performance
  • 1.6/3.2/6.4 Gops
  • 1.6 Gflops (single-precision)
  • Fabricated by IBM; taped out 02/2003
  • To hide DRAM access latency, load/store and
    arithmetic instructions are deeply pipelined
    (15 stages)
  • We use a simulator with Cray's vcc compiler

5
VIRAM Vector Lanes
  • Parallel lane design has advantages in performance,
    design complexity, and scalability
  • Each lane has 2 ALUs (1 for FP) and receives
    identical control signals
  • Vector instr specify 64-way parallelism, hardware
    executes 8-way
  • 8 KB vector register file partitioned into 32
    vector registers
  • Variable data widths: 4 lanes for 64-bit, 8 lanes
    for 32-bit, 16 for 16-bit
  • Data width cut in half: # of elems per register
    (and peak) doubles
  • Limitations: no 64-bit FP; compiler doesn't
    generate fused MADD
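
The lane and register-file figures above can be cross-checked with a little arithmetic. The sketch below is a back-of-the-envelope model, not part of the presentation: the function names are made up for illustration, and it assumes that the 200 MHz clock from the overview slide applies to the vector unit and that each halving of the data width doubles the number of virtual lanes.

```python
CLOCK_HZ = 200e6            # clock from the VIRAM overview slide (assumed to drive the vector unit)
LANES_64BIT = 4             # physical 64-bit lanes
ALUS_PER_LANE = 2           # two ALUs per lane, one of which handles FP
VREG_FILE_BYTES = 8 * 1024  # 8 KB vector register file
NUM_VREGS = 32              # split into 32 vector registers


def virtual_lanes(width_bits):
    """Each halving of the data width doubles the number of lanes."""
    return LANES_64BIT * 64 // width_bits


def peak_gops(width_bits):
    """Peak integer rate: every ALU in every lane busy each cycle."""
    return virtual_lanes(width_bits) * ALUS_PER_LANE * CLOCK_HZ / 1e9


def peak_sp_gflops():
    """Single precision: 32-bit lanes, but only one FP ALU per lane."""
    return virtual_lanes(32) * 1 * CLOCK_HZ / 1e9


def elems_per_vreg(width_bits):
    """Number of elements of the given width held by one vector register."""
    bits_per_vreg = VREG_FILE_BYTES // NUM_VREGS * 8
    return bits_per_vreg // width_bits


def cycles_per_vector_instr(width_bits):
    """Cycles needed for the lanes to work through one full register."""
    return elems_per_vreg(width_bits) // virtual_lanes(width_bits)
```

Under these assumptions, 64-bit and 32-bit data give 1.6 and 3.2 Gops, single precision gives 1.6 Gflops, and a 32-bit vector register holds 64 elements that the 8 lanes work through in 8 cycles, which matches the "64-way specified, 8-way executed" bullet.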

6
VIRAM Power Efficiency
  • Comparable performance with lower clock rate
  • Large power/performance advantage for VIRAM from
    PIM technology and the data-parallel execution
    model

7
Stream Processing
Example: stereo depth extraction
  • Data and functional parallelism
  • High computation rate
  • Little data reuse
  • Producer-consumer and spatial locality
  • Ex: multimedia, signal processing, graphics
  • Stream: ordered set of records (homogeneous,
    arbitrary data type)
  • Stream programming: data is streams, computation
    is kernels
  • Kernel: loop through all stream elements
    (sequential order)
  • Perform compound (multiword) operation on each
    stream elem
  • Vectors: perform single arithmetic op on each
    vector elem (then store in reg)
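
The contrast between the last few bullets can be sketched in plain Python. This is only an illustration of the two programming styles, not code for either chip: the `Pixel` record, the function names, and the luma weights are assumptions chosen for the example.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Pixel(NamedTuple):
    """One multiword stream record (hypothetical example type)."""
    r: float
    g: float
    b: float


def luma_kernel(stream: Iterable[Pixel]) -> Iterator[float]:
    """Stream style: visit records in order and apply one compound
    operation (two adds, three multiplies) to each record."""
    for p in stream:
        yield 0.299 * p.r + 0.587 * p.g + 0.114 * p.b


def vec_scale(v: List[float], s: float) -> List[float]:
    """Vector style primitive: a single multiply per element."""
    return [s * x for x in v]


def vec_add(a: List[float], b: List[float]) -> List[float]:
    """Vector style primitive: a single add per element."""
    return [x + y for x, y in zip(a, b)]


def luma_vector(r: List[float], g: List[float], b: List[float]) -> List[float]:
    """Vector style: one arithmetic op per instruction, with every
    intermediate result written back to a (vector) register."""
    t0 = vec_scale(r, 0.299)
    t1 = vec_scale(g, 0.587)
    t2 = vec_scale(b, 0.114)
    return vec_add(vec_add(t0, t1), t2)
```

Both functions compute the same values; the difference is that the kernel keeps its intermediates local to one record, while the vector version materializes three temporary vectors.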

8
Imagine Overview
  • Vector VLIW processor
  • Coprocessor to off-chip host processor
  • 8 arithmetic clusters controlled in SIMD w/ VLIW
    instr
  • Central 128 KB Stream Register File (SRF) @ 32 GB/s
  • SRF can overlap computation with memory (double
    buffering)
  • SRF can reuse intermediate results
    (producer-consumer locality)
  • Stream-aware memory system with 2.7 GB/s off-chip
  • 544 GB/s intercluster comm
  • Host sends instructions to stream controller (SC);
    SC issues commands to on-chip modules
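
The benefit of double buffering in the SRF can be illustrated with a simple timing model. This is a sketch under idealized assumptions (fixed per-block load and compute times, perfect overlap, no host or controller overhead); the function names and the model itself are not from the presentation.

```python
def serial_time(n_blocks, t_mem, t_comp):
    """No overlap: each block is loaded, then processed."""
    return n_blocks * (t_mem + t_comp)


def double_buffered_time(n_blocks, t_mem, t_comp):
    """Two buffers: while block i is processed, block i+1 is loaded.
    After the first load, each step costs the slower of the two
    activities; the last block's compute has nothing to overlap with."""
    if n_blocks == 0:
        return 0
    return t_mem + (n_blocks - 1) * max(t_mem, t_comp) + t_comp
```

For example, with 4 blocks, 3 time units per load, and 5 per compute, the serial schedule takes 32 units and the double-buffered one takes 23. In the long run the cost per block approaches `max(t_mem, t_comp)`, so only the slower of memory and computation matters.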

9
Imagine Arithmetic Clusters
  • 400 MHz clock, 8 clusters w/ 6 FU each (48 FU
    total)
  • Reads/writes streams to SRF
  • Each cluster: 3 ADD, 2 MULT, 1 DIV/SQRT, 1
    scratch, 1 comm unit
  • 32-bit arch: subword operations support 16- and
    8-bit data (no 64-bit support)
  • Local registers on functional units hold 16 words
    each (total 1.5 KB)
  • Clusters receive VLIW-style instructions
    broadcast from microcontroller.

10
VIRAM and Imagine
  • Imagine: order of magnitude higher performance
  • VIRAM: twice the memory bandwidth, less power
    consumption
  • Notice peak Flop/Word ratios

11
SQMAT Architectural Probe
3x3 Matrix Multiply
  • Sqmat: scalable synthetic probe that controls
    computational intensity and vector length
  • Imagine stream model requires a large # of ops per
    word to amortize memory references. Poor use of
    SRF, no producer-consumer locality
  • Long streams help hide memory latency, but reach
    only 7% of algorithmic peak
  • VIRAM performs well for low ops/word (40% when
    L=256)
  • Vector pipeline overlaps comp/mem; on-chip DRAM
    (high bandwidth, low latency)
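
A probe of this kind can be sketched as follows. The slide only says that the probe controls computational intensity and vector length, so the details here are assumptions made for illustration: the parameter name `m` (how many times each matrix is squared), the naive matrix multiply, and the flop and word counting rule (each matrix read once and written once).

```python
def matmul(a, b):
    """Naive n x n matrix multiply on lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def sqmat(matrices, m):
    """Square each of the L = len(matrices) matrices m times.
    L plays the role of the vector/stream length; m scales the
    amount of arithmetic done per word moved."""
    out = []
    for a in matrices:
        for _ in range(m):
            a = matmul(a, a)
        out.append(a)
    return out


def ops_per_word(n, m):
    """Assumed counting rule: one n x n multiply costs n*n*(2n - 1)
    flops, and each matrix is read once and written once."""
    flops = m * n * n * (2 * n - 1)
    words = 2 * n * n
    return flops / words
```

With this counting rule, a 3x3 matrix squared once gives 2.5 ops/word, and squaring it ten times gives 25 ops/word, which shows how the probe can sweep from memory-bound to compute-bound behavior without changing the data volume.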

12
SQMAT Performance Crossover
  • Large number of ops/word N10 where N3x3
  • Crossover point: L=64 (cycles), L=256 (MFlop)
  • Imagine's power becomes apparent: almost 4x VIRAM
    at L=1024. Codes at this end of the spectrum
    greatly benefit from the Imagine arch

13
VIRAM/Imagine Optimization
  • Optimization strategy: speed up the slower of comp
    or mem
  • Restructure computation for better kernel
    performance
  • Mem is waiting for ALUs
  • Add more computation for better memory performance
  • ALUs are memory starved
  • Subtle overlap effects: vector chaining, stream
    double buffering
  • Example optimization: RGB→YIQ conversion from
    EEMBC
  • Input format: R1G1B1R2G2B2R3G3B3
  • Required format: R1R2R3 G1G2G3 B1B2B3

14
VIRAM RGB→YIQ Optimization
  • VIRAM: poor memory performance
  • Strided accesses (1/2 performance)
  • - RGBRGBRGB -- strided loads → RRRGGGBBB
  • - Only 4 address generators for 8 addresses
    (sufficient for 64 bit)
  • Word operations on byte data (1/4th performance)
  • Optimization: replace strided w/ unit access,
    using in-register shuffle
  • Increased computational overhead (packing and
    unpacking)
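
The two ways of separating the color planes can be mimicked in Python. This is a software analogy, not VIRAM code: the function names and the block size of 12 are assumptions, list slices stand in for vector loads, and the standard-library `colorsys.rgb_to_yiq` stands in for the benchmark's own conversion arithmetic, which the slides do not spell out.

```python
import colorsys


def deinterleave_strided(pixels):
    """Baseline: three stride-3 passes over memory, one per plane."""
    return pixels[0::3], pixels[1::3], pixels[2::3]


def deinterleave_shuffle(pixels, block=12):
    """Optimized idea: contiguous (unit-stride) block loads, followed by
    a permutation done on the loaded block ("in-register shuffle")."""
    if block % 3 or len(pixels) % 3:
        raise ValueError("block and input length must be multiples of 3")
    r, g, b = [], [], []
    for start in range(0, len(pixels), block):
        reg = pixels[start:start + block]          # unit-stride load
        n = len(reg) // 3
        # Permutation that gathers all R, then all G, then all B.
        perm = [3 * i + c for c in range(3) for i in range(n)]
        shuffled = [reg[i] for i in perm]          # extra ALU work
        r += shuffled[:n]
        g += shuffled[n:2 * n]
        b += shuffled[2 * n:]
    return r, g, b


def planar_rgb_to_yiq(r, g, b):
    """Convert 8-bit planar RGB to (Y, I, Q) tuples via colorsys."""
    return [colorsys.rgb_to_yiq(x / 255, y / 255, z / 255)
            for x, y, z in zip(r, g, b)]
```

Both de-interleaving functions return the same planes; the second trades the strided memory traffic for extra shuffle work, which is the packing and unpacking overhead mentioned in the last bullet.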

15
VIRAM RGB→YIQ Results
  • Used functional units instead of memory to
    extract components, increasing the computational
    overhead

16
Imagine RGB→YIQ Optimization
  • Imagine: bottleneck is computation, due to a poor
    ALU schedule (left)
  • Unoptimized: 15 cycles per pixel
  • Software pipelining makes the VLIW schedule denser
    (right)
  • Optimized: 8 cycles per pixel

17
Imagine RGB→YIQ Results
  • Optimized kernel takes only ½ the cycles per
    element
  • Memory is now the new bottleneck

18
EEMBC Benchmark
  • Vec-add: one add/elem, perf limited by memory
    system
  • RGB→(YIQ, CMYK): VIRAM limited by processing
    (cannot use available bandwidth)
  • Grayfilter: difficult to implement efficiently on
    Imagine (sliding 3x3 window)
  • Autocorr: uses short streams; Imagine host
    latency is high

19
Scientific Kernels: SPMV Performance
  • Algorithmic peak: VIRAM 8 ops/cycle, Imagine 32
    ops/cycle
  • LSHAPE: finite element matrix; LARGEDIS:
    pseudo-random nnz
  • Imagine lacks irregular access; reorder matrix
    before KernelC
  • VIRAM better suited for this class of apps (low
    comp/mem)
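
To make the "irregular access, low comp/mem" point concrete, here is a minimal sparse matrix-vector multiply. The slides do not say which storage format the study used; compressed sparse row (CSR) is assumed here purely for illustration, and the function and argument names are hypothetical.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries, row by row
    col_idx : column index of each nonzero
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits row i in values/col_idx
    """
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is read through an index: an irregular (gather) access.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

Each nonzero costs one multiply and one add but needs a value, a column index, and an element of `x` from memory, so the ratio of computation to memory traffic stays low regardless of matrix size.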

20
Scientific Kernels: Complex QR Decomposition
  • A = QR: Q orthogonal, R upper triangular
  • Blocked Householder variant, rich in level 3
    BLAS ops
  • Complex elems increase ops/word and locality (1
    MUL = 6 ops)
  • VIRAM: uses CLAPACK port (insertion of vector
    directives)
  • Imagine: complex indexing of matrix stream (each
    iter smaller matrix)
  • Imagine: over 10 GFlops (19x VIRAM), well suited
    for this arch. Low VIRAM perf due to strided
    access and compiler limitations
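
The "1 MUL, 6 ops" remark follows from expanding a complex product into real arithmetic. The helper below is an illustrative sketch with hypothetical names; it uses the straightforward four-multiply form, which is one common way to count the operations.

```python
def complex_mul(a_re, a_im, b_re, b_im):
    """(a_re + i*a_im) * (b_re + i*b_im) using only real arithmetic."""
    re = a_re * b_re - a_im * b_im   # 2 multiplies + 1 subtract
    im = a_re * b_im + a_im * b_re   # 2 multiplies + 1 add
    return re, im


REAL_OPS_PER_COMPLEX_MUL = 6   # 4 multiplies + 2 adds/subtracts
WORDS_PER_COMPLEX_MUL = 4      # two operands, two real words each

# A real multiply does 1 op on 2 input words; a complex multiply
# does 6 ops on 4 input words, tripling the ops per word loaded.
REAL_OPS_PER_WORD = 1 / 2
COMPLEX_OPS_PER_WORD = REAL_OPS_PER_COMPLEX_MUL / WORDS_PER_COMPLEX_MUL
```

Counting input operands only, the complex case performs 1.5 ops per word against 0.5 for the real case.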

21
Overview
  • Significantly different balance of memory
    organization
  • Relative performance depends on computational
    intensity
  • Programming complexity is high for both
    approaches, although VIRAM is based on
    established vector technology
  • For well-suited applications IMAGINE processor
    can sustain over 10GFlop/s (simulated results)
  • Large homogeneous computation required to
    sufficiently saturate IMAGINE while VIRAM can
    operate on small vector sizes
  • IMAGINE can take advantage of producer-consumer
    locality
  • Both present significant reduction in power and
    space
  • May be used as coprocessors in future generation
    architectures

22
Next Generation
  • CODE: next generation of VIRAM
  • More functional units / faster clock speed
  • Local registers per unit instead of single
    register file.
  • Looking more like Imagine
  • Multi-VIRAM architecture: network interface
    issues?
  • Brook: new language for Imagine
  • Eliminate exposure of hardware details (# of
    clusters)
  • Streaming Supercomputer: multi-Imagine
    configuration
  • Streams can be used for functional/data
    parallelism
  • Currently evaluating DIVA architecture