Title: Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping
1. Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping
- Nathan Clark, Amir Hormati, Scott Mahlke (University of Michigan)
- Sami Yehia, Krisztián Flautner (ARM Ltd.)
2. Computational Efficiency
- Low power envelope
- More useful work/transistors
- Hardware accelerators
- Niagara II encryption engine
Source: AMD Analyst Day, 12/14/06
3. How Are Accelerators Used?
- Control statically placed in binary
4. Problem With Static Control
- Not forward/backward compatible
5. Solution: Virtualization
- Statically identify accelerated computation
- Abstract accelerator features
- Dynamically retarget binary
6. Liquid SIMD
- Virtualize SIMD accelerators
- Why virtualize SIMD?
- Intel MMX to SSE2
- ARM v6 to Neon
- Wide vectors are useful [Lin 06]
7. SIMD Accelerator Assumptions
[Figure: pipeline diagram with stages Fetch, Decode, Scalar Exec, SIMD Exec, Retire]
- Same instruction stream
- Separate pipeline, memory interface
8. How to Virtualize
- Use scalar ISA to represent SIMD operations
- Compatibility, low overhead
- Key: easy to translate
[Figure: diagram with labels Program, Branch]
9. Virtualization Architecture
10. 1. Data Parallel Operations
for (i = 0; i < 8; i++) {
    r1 = A[i];
    r2 = B[i];
    r3 = r1 <op> r2;
    r4 = r3 <op> constant;
    C[i] = r4;
}
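The scalar loop on this slide can be sketched in plain C. The operators did not survive in the slide text, so this sketch assumes an add followed by a bitwise AND with a constant as stand-ins. The function name `scalar_loop` and the 8-bit element type are also assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar representation of a data-parallel operation.
   Assumption: "+" and "&" stand in for the slide's unrecoverable operators. */
void scalar_loop(const uint8_t *a, const uint8_t *b, uint8_t *c,
                 size_t n, uint8_t constant)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t r1 = a[i];
        uint8_t r2 = b[i];
        uint8_t r3 = (uint8_t)(r1 + r2);  /* element-wise op, wraps mod 256 */
        uint8_t r4 = r3 & constant;       /* element-wise op with a constant */
        c[i] = r4;
    }
}
```

Every iteration touches only element `i`, with no dependence between iterations. That independence is the property that lets such a loop stand in for a SIMD operation.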
11. 1a. What If There's No Scalar Equivalent?
for (i = 0; i < 8; i++) {
    r1 = A[i];
    r2 = B[i];
    r3 = r1 <op> r2;
    cmp r3, FF
    movgt r3, FF
    ...
}
Idioms can always be constructed
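The compare-and-conditional-move idiom on this slide resembles a saturating operation built from scalar instructions. A minimal sketch follows, assuming the underlying operation is an unsigned 8-bit add that clamps at 0xFF. The function name `saturating_add_u8` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Saturating add expressed with a compare plus a conditional move.
   Assumption: the slide's operation is an unsigned 8-bit add. */
void saturating_add_u8(const uint8_t *a, const uint8_t *b, uint8_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned r1 = a[i];
        unsigned r2 = b[i];
        unsigned r3 = r1 + r2;   /* computed in a wider type, so no wraparound */
        if (r3 > 0xFF)           /* cmp r3, FF */
            r3 = 0xFF;           /* movgt r3, FF */
        c[i] = (uint8_t)r3;
    }
}
```

A translator that recognizes this fixed instruction pattern could, in principle, map it to a single saturating SIMD instruction where the hardware provides one.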
12. 2. Scalarizing Permutations
for (i = 0; i < 8; i++) {
    r1 = r2 <op> r3;
    tmp[i] = r1;
}
for (i = 0; i < 8; i++) {
    r1 = offset[i];
    r2 = tmp[r1 + i];
    r3 = r2 <op> const;
}
offset = {4, 4, 4, 4, -4, -4, -4, -4}
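The offset-array permutation can be sketched as a small C function. It uses the offset values from the slide; the function name `permute_with_offsets` and the `int` element type are assumptions.

```c
#include <stddef.h>

enum { N = 8 };

/* Offsets taken from the slide: element i is read from tmp[i + offset[i]]. */
static const int offset[N] = {4, 4, 4, 4, -4, -4, -4, -4};

/* Permutation written as a scalar loop: an indexed load whose address is
   the induction variable plus a per-element constant offset. */
void permute_with_offsets(const int *tmp, int *out)
{
    for (int i = 0; i < N; i++) {
        int r1 = offset[i];
        int r2 = tmp[r1 + i];
        out[i] = r2;
    }
}
```

With these offsets the loop swaps the lower and upper halves of the eight-element array, so an input of 0..7 yields 4, 5, 6, 7, 0, 1, 2, 3.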
13. 3. Scalarizing Reductions
for (i = 0; i < 8; i++) {
    r1 = A[i];
    r2 = r2 <op> r1;
}
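A reduction in scalar form is a loop with a single loop-carried accumulator. The sketch below assumes the reduction operator is addition, since the operator did not survive in the slide text. The name `reduce_sum` and the element types are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar form of a reduction: r2 accumulates across iterations.
   Assumption: the reduction operator is "+". */
int32_t reduce_sum(const int16_t *a, size_t n)
{
    int32_t r2 = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t r1 = a[i];
        r2 = r2 + r1;   /* the only value that depends on the previous iteration */
    }
    return r2;
}
```

The accumulator is what distinguishes this pattern from the element-wise case: the same register is both read and written by one operation in every iteration.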
14. Applied to ARM Neon
- All instructions supported except
- VTBL: indirect indexing
- v1 = vtbl v2, v3
- Interleaved memory accesses
- Not needed in evaluated benchmarks
15. Translation to SIMD
Scalar:
for (i = 0; i < 8; i++) {
    r1 = A[i];
    r2 = B[i];
    r3 = r1 <op> r2;
    r4 = offset[i];
    C[i + r4] = r3;
}
SIMD:
for (i = 0; i < 8; i += 4) {
    v1 = A[i];
    v2 = B[i];
    v3 = v1 <op> v2;
    v3 = shuffle v3;
    C[i] = v3;
}
- Update induction variable
- Use inverse of defined translation rules
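To illustrate what the translation has to preserve, the sketch below pairs a scalar loop with a 4-lane version and lets a caller compare their results. Everything here is an assumption made for illustration: the operator is taken to be addition, the offset values are hypothetical (chosen so the permutation stays within one 4-element vector), and `vec4` with its helpers emulates a SIMD register in plain C rather than using real Neon intrinsics.

```c
#include <stddef.h>

enum { LEN = 8, LANES = 4 };

/* Hypothetical offsets: swap the two halves of every 4-element group. */
static const int offset[LEN] = {2, 2, -2, -2, 2, 2, -2, -2};

typedef struct { int lane[LANES]; } vec4;   /* stand-in for a SIMD register */

static vec4 vload(const int *p)
{
    vec4 v;
    for (int j = 0; j < LANES; j++) v.lane[j] = p[j];
    return v;
}

static void vstore(int *p, vec4 v)
{
    for (int j = 0; j < LANES; j++) p[j] = v.lane[j];
}

static vec4 vadd(vec4 x, vec4 y)
{
    vec4 r;
    for (int j = 0; j < LANES; j++) r.lane[j] = x.lane[j] + y.lane[j];
    return r;
}

/* Shuffle matching the offsets above: lane j receives lane j XOR 2. */
static vec4 vshuffle_swap_halves(vec4 x)
{
    vec4 r;
    for (int j = 0; j < LANES; j++) r.lane[j] = x.lane[j ^ 2];
    return r;
}

/* Scalar representation: offset-indexed store encodes the permutation. */
void scalar_version(const int *a, const int *b, int *c)
{
    for (int i = 0; i < LEN; i++) {
        int r3 = a[i] + b[i];
        int r4 = offset[i];
        c[i + r4] = r3;
    }
}

/* Translated form: induction variable steps by the vector width and the
   offset-indexed store becomes a shuffle followed by a plain store. */
void simd_version(const int *a, const int *b, int *c)
{
    for (int i = 0; i < LEN; i += LANES) {
        vec4 v1 = vload(&a[i]);
        vec4 v2 = vload(&b[i]);
        vec4 v3 = vadd(v1, v2);
        v3 = vshuffle_swap_halves(v3);
        vstore(&c[i], v3);
    }
}
```

Both functions write the same values to `c` for any input, which is the equivalence a translator would need to maintain when it updates the induction variable and replaces the offset pattern with a shuffle.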
16. Translator Design
- Translator efficiency, speed, flexibility
17. Evaluation
- Trimaran ARM
- Hand SIMDized loops
- SimpleScalar model ARM926 w/ Neon SIMD
- VHDL translator, 130nm std. cell
18. Liquid SIMD Issues
- Code bloat
- <1% overhead beyond baseline
- Register pressure
- Not a problem
- Translator cost
- 0.2 mm², 2 KB cache
- Translation overhead
19. Translation Overhead
[Figure: translation overhead for MediaBench, Kernels, and SPECfp benchmarks]
20. Summary
- Accelerators are more common and evolving
- Costly binary migration
- SIMD virtualization using scalar ISA
- One binary forward/backward compatibility
- Negligible overhead
21. Questions
?