Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping

Transcript and Presenter's Notes

1
Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping
  • Nathan Clark, Amir Hormati, Scott Mahlke (University of Michigan)
  • Sami Yehia, Krisztián Flautner (ARM Ltd.)

2
Computational Efficiency
  • Low power envelope
  • More useful work/transistors
  • Hardware accelerators
  • Niagara II encryption engine

Source: AMD Analyst Day, 12/14/06
3
How Are Accelerators Used?
  • Control statically placed in binary

4
Problem With Static Control
CPU
  • Not forward/backward compatible

5
Solution: Virtualization
  • Statically identify accelerated computation
  • Abstract accelerator features
  • Dynamically retarget binary

6
Liquid SIMD
  • Virtualize SIMD accelerators
  • Why virtualize SIMD?
  • Intel MMX to SSE2
  • ARM v6 to Neon
  • Wide vectors useful [Lin 06]

7
SIMD Accelerator Assumptions
[Figure: pipeline diagram with Fetch, Decode, Scalar Exec, SIMD Exec, and Retire stages]
  • Same instruction stream
  • Separate pipeline and memory interface

8
How to Virtualize
  • Use scalar ISA to represent SIMD operations
  • Compatibility, low overhead
  • Key: easy to translate

[Figure: diagram with labels "Program" and "Branch"]
9
Virtualization Architecture
10
1. Data Parallel Operations
for (i = 0; i < 8; i++) {
  r1 = A[i];
  r2 = B[i];
  r3 = r1 <op> r2;
  r4 = r3 <op> constant;
  C[i] = r4;
}
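The scalar form of a data-parallel loop can be sketched in C as below. This is only an illustration: the transcript lost the operators, so the `+` and `&` used here are assumptions, as are the 8-bit element type and the function name `scalar_loop`.

```c
#include <stdint.h>

#define N 8  /* trip count shown on the slide */

/* Scalar form of a data-parallel loop. No iteration depends on another,
 * so groups of iterations could be mapped onto SIMD lanes.
 * ASSUMPTION: '+' and '&' stand in for operators missing from the transcript. */
void scalar_loop(const uint8_t *A, const uint8_t *B, uint8_t *C, uint8_t constant)
{
    for (int i = 0; i < N; i++) {
        uint8_t r1 = A[i];                /* load element i of A */
        uint8_t r2 = B[i];                /* load element i of B */
        uint8_t r3 = (uint8_t)(r1 + r2);  /* element-wise operation (assumed add) */
        uint8_t r4 = r3 & constant;       /* operation with a loop-invariant constant (assumed and) */
        C[i] = r4;                        /* store element i of C */
    }
}
```

Because every iteration reads and writes only index `i`, the loop body describes one lane of the equivalent vector operation.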
11
1a. What If There's No Scalar Equivalent?
for (i = 0; i < 8; i++) {
  r1 = A[i];
  r2 = B[i];
  r3 = r1 <op> r2;
  cmp r3, FF
  r3 = movgt FF
  ...
}
Idioms can always be constructed
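The compare-and-conditional-move pair on this slide reads like a saturation idiom. A minimal C sketch of that reading follows, assuming the lost operator is an add on unsigned 8-bit data and that the clamped result is stored to a hypothetical output array `C` (the slide elides what follows the `movgt`).

```c
#include <stdint.h>

#define N 8  /* trip count shown on the slide */

/* Scalar idiom standing in for a saturating SIMD add.
 * ASSUMPTIONS: the operation is an unsigned 8-bit add, and the
 * result is written to C[i]; neither is confirmed by the transcript. */
void saturating_add_loop(const uint8_t *A, const uint8_t *B, uint8_t *C)
{
    for (int i = 0; i < N; i++) {
        uint32_t r1 = A[i];
        uint32_t r2 = B[i];
        uint32_t r3 = r1 + r2;   /* computed in a wider type, so it cannot wrap */
        if (r3 > 0xFF)           /* cmp r3, FF */
            r3 = 0xFF;           /* movgt: clamp to the 8-bit maximum */
        C[i] = (uint8_t)r3;
    }
}
```

A translator that recognizes this fixed instruction pattern could emit a single saturating vector instruction in its place.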
12
2. Scalarizing Permutations
for (i = 0; i < 8; i++) {
  r1 = r2 <op> r3;
  tmp[i] = r1;
}
for (i = 0; i < 8; i++) {
  r1 = offset[i];
  r2 = tmp[r1 + i];
  r3 = r2 <op> const;
}
offset = {4, 4, 4, 4, -4, -4, -4, -4}
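A small C sketch of the offset-array idea: a read-only array of relative offsets encodes the permutation, and the second loop reads `tmp[i + offset[i]]`. The offsets are the ones on the slide, which swap the two halves of an 8-element vector. The function name, the `int` element type, and the use of `+` for the first loop's lost operator are assumptions.

```c
#define N 8  /* trip count shown on the slide */

/* Relative offsets from the slide: element i is read from i + offset[i]. */
static const int offset[N] = {4, 4, 4, 4, -4, -4, -4, -4};

/* Scalarized permutation: compute into tmp, then read tmp through
 * the offset array. ASSUMPTION: '+' replaces the operator lost in
 * the transcript; the final operation with a constant is omitted. */
void swap_halves(const int *A, const int *B, int *out)
{
    int tmp[N];

    for (int i = 0; i < N; i++) {
        int r1 = A[i] + B[i];
        tmp[i] = r1;
    }
    for (int i = 0; i < N; i++) {
        int r1 = offset[i];     /* load the relative offset */
        int r2 = tmp[r1 + i];   /* permuted read */
        out[i] = r2;
    }
}
```

Since the offset array is constant, its contents can be inspected at translation time to decide which shuffle it represents.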
13
3. Scalarizing Reductions
for (i = 0; i < 8; i++) {
  r1 = A[i];
  r2 = r2 <op> r1;
}
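As a hedged illustration of a scalarized reduction, the sketch below pairs the scalar accumulation loop with one way a 4-lane SIMD version might compute the same value (per-lane partial sums combined at the end). The choice of integer addition, the lane count, and both function names are assumptions, not details from the slides.

```c
#include <stdint.h>

#define N 8      /* trip count shown on the slide */
#define LANES 4  /* ASSUMPTION: vector width in elements */

/* Scalar reduction: r2 is carried across iterations. */
int32_t reduce_scalar(const int32_t *A)
{
    int32_t r2 = 0;
    for (int i = 0; i < N; i++) {
        int32_t r1 = A[i];
        r2 = r2 + r1;   /* ASSUMPTION: the lost operator is '+' */
    }
    return r2;
}

/* SIMD-style model: each lane keeps a partial sum, and the lanes are
 * combined after the loop. Valid because integer addition is
 * associative and commutative. */
int32_t reduce_simd_style(const int32_t *A)
{
    int32_t acc[LANES] = {0, 0, 0, 0};
    for (int i = 0; i < N; i += LANES)
        for (int l = 0; l < LANES; l++)
            acc[l] += A[i + l];
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

The loop-carried dependence on `r2` is what distinguishes a reduction from the purely element-wise loops earlier in the deck.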
14
Applied to ARM Neon
  • All instructions supported except:
  • VTBL: indirect indexing
  • v1 = vtbl v2, v3
  • Interleaved memory accesses
  • Not needed in evaluated benchmarks
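To illustrate why indirect indexing differs from the constant-offset permutations, here is a rough C model of a VTBL-style table lookup. The indices are runtime data, so there is no fixed offset array to inspect ahead of time. The function name and signature are hypothetical; returning 0 for an out-of-range index follows the documented Neon VTBL behavior.

```c
#include <stddef.h>
#include <stdint.h>

/* Rough scalar model of a byte table lookup (VTBL-style).
 * Each output byte is selected by a runtime index; out-of-range
 * indices produce 0. Names and signature are illustrative only. */
void table_lookup(const uint8_t *table, size_t table_len,
                  const uint8_t *indices, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t idx = indices[i];                      /* index known only at run time */
        out[i] = (idx < table_len) ? table[idx] : 0;   /* data-dependent read */
    }
}
```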

15
Translation to SIMD
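Scalar loop:
for (i = 0; i < 8; i++) {
  r1 = A[i];
  r2 = B[i];
  r3 = r1 <op> r2;
  r4 = offset[i];
  C[i + r4] = r3;
}
Translated SIMD loop (final animation step):
for (i = 0; i < 8; i += 4) {
  v1 = A[i];
  v2 = B[i];
  v3 = v1 <op> v2;
  v3 = shuffle v3;
  C[i] = v3;
}
Earlier animation steps build up the same loop: the induction variable update becomes i += 4, and the offset load (v4 = offset[i]) gives way to the shuffle of v3.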
  • Update induction variable
  • Use inverse of defined translation rules
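The translation can be checked with a small C model that runs the scalar loop and a 4-lane SIMD-style loop side by side. This is a sketch under stated assumptions: the slide does not give the offset values, so the pattern `{2, 2, -2, -2}` (a swap of adjacent pairs within each 4-element vector) is hypothetical, as are the `+` operator, the lane count, and the function names.

```c
#include <stdint.h>

#define N 8      /* trip count shown on the slide */
#define LANES 4  /* ASSUMPTION: vector width in elements */

/* HYPOTHETICAL offsets; the pattern repeats for every 4-element vector. */
static const int offset[N] = {2, 2, -2, -2, 2, 2, -2, -2};

/* Scalar representation: the result of iteration i lands at i + offset[i]. */
void scalar_version(const int32_t *A, const int32_t *B, int32_t *C)
{
    for (int i = 0; i < N; i++) {
        int32_t r3 = A[i] + B[i];   /* ASSUMPTION: the lost operator is '+' */
        int r4 = offset[i];
        C[i + r4] = r3;
    }
}

/* SIMD-style model: vector op, then a shuffle derived from the offsets,
 * then a contiguous store. The induction variable advances by LANES. */
void simd_style_version(const int32_t *A, const int32_t *B, int32_t *C)
{
    for (int i = 0; i < N; i += LANES) {
        int32_t v3[LANES], shuffled[LANES];
        for (int l = 0; l < LANES; l++)
            v3[l] = A[i + l] + B[i + l];      /* v3 = v1 + v2 */
        for (int l = 0; l < LANES; l++)
            shuffled[l + offset[l]] = v3[l];  /* v3 = shuffle v3 */
        for (int l = 0; l < LANES; l++)
            C[i + l] = shuffled[l];           /* C[i] = v3 */
    }
}
```

Both functions should fill `C` identically for any input, which is the property a dynamic translator has to preserve.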

16
Translator Design
  • Translator efficiency, speed, flexibility

17
Evaluation
  • Trimaran ARM
  • Hand SIMDized loops
  • SimpleScalar model ARM926 w/ Neon SIMD
  • VHDL translator, 130nm std. cell

18
Liquid SIMD Issues
  • Code bloat
  • <1% overhead beyond baseline
  • Register pressure
  • Not a problem
  • Translator cost
  • 0.2 mm², 2 KB cache
  • Translation overhead

19
Translation Overhead
[Figure: translation overhead for MediaBench, kernels, and SPECfp benchmarks]
20
Summary
  • Accelerators are more common and evolving
  • Costly binary migration
  • SIMD virtualization using scalar ISA
  • One binary, forward/backward compatibility
  • Negligible overhead

21
Questions?