Title: Exploring the Design Space of LUT-based Transparent Accelerators
1. Exploring the Design Space of LUT-based Transparent Accelerators
- Sami Yehia, Nathan Clark*, Scott Mahlke*, and Krisztian Flautner
- ARM Ltd.
- *Advanced Computer Architecture Lab, University of Michigan
CASES 2005, September 24-27
2. Embedded Products Convergence
- Growing application demands call for more performance
- Embedded systems win through customization: more performance, low power, etc.
- Traditional ISA customization and hardware specialization cannot cope with the growing number of functionalities
- One way forward: Transparent Instruction Set Customization
3. Transparent Instruction Set Customization
- An alternative path to performance
- No ISA change (or only a minor one)
- Baseline CPU unchanged
- Hardware generates control
- Eases software burden
- Forward compatible
4. Architecture Framework
[Figure: architecture framework. Labels: Subgraph Execution Unit, Application, Subgraph, Inputs, Outputs, BRL, Standard Pipeline, Compiler, Instructions, Control Generation, Augments Instruction Stream.]
5. Pipeline Interface
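6. LUT-based accelerator
[Figure: inputs r1, r2, r4, r5 feed EOR and AND nodes, whose results feed an ORR node producing r12.]

Logic-only subgraph:
inst1  EOR r6, r1, r2
inst2  AND r7, r4, r5
inst3  ORR r12, r6, r7

Truth table for r12 = (r1 EOR r2) ORR (r4 AND r5), applied to each bit position:
r5 r4 r2 r1 | r12
0  0  0  0  | 0
0  0  0  1  | 1
0  0  1  0  | 1
0  0  1  1  | 0
0  1  0  0  | 0
0  1  0  1  | 1
0  1  1  0  | 1
0  1  1  1  | 0
1  0  0  0  | 0
1  0  0  1  | 1
1  0  1  0  | 1
1  0  1  1  | 0
1  1  0  0  | 1
1  1  0  1  | 1
1  1  1  0  | 1
1  1  1  1  | 1

If inst1 is an addition instead:
inst1  ADD r6, r1, r2
$r6_i = r1_i \oplus r2_i \oplus cin_{i-1}$
$cin_i = (r1_i \cdot r2_i) \lor cin_{i-1} \cdot (r1_i \oplus r2_i)$

A Carry Generator that is also programmable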
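The logic-only case above can be sketched in Python: the three instructions collapse into one 16-entry LUT that is applied independently to every bit position. This is a minimal illustration, not the hardware described in the talk; `build_lut` and `apply_lut` are hypothetical helper names, and the 32-bit width is an assumption.

```python
from itertools import product

def build_lut(fn, n_inputs):
    """Tabulate a 1-bit function; entry k holds fn of the bits of k, MSB first."""
    return [fn(*bits) & 1 for bits in product((0, 1), repeat=n_inputs)]

def apply_lut(lut, words, width=32):
    """Apply the same LUT to each bit position of the input words (MSB-first order)."""
    out = 0
    for i in range(width):
        index = 0
        for w in words:
            index = (index << 1) | ((w >> i) & 1)
        out |= lut[index] << i
    return out

# Collapsed subgraph r12 = (r1 EOR r2) ORR (r4 AND r5); column order r5, r4, r2, r1.
lut = build_lut(lambda r5, r4, r2, r1: (r1 ^ r2) | (r4 & r5), 4)

# Arbitrary sample register values (assumed for illustration only).
r1, r2, r4, r5 = 0x0F0F1234, 0x00FF4321, 0xFFFF0000, 0x12345678
collapsed = apply_lut(lut, [r5, r4, r2, r1])
reference = (r1 ^ r2) | (r4 & r5)   # the three instructions executed one by one
print(lut, collapsed == reference)
```

The LUT entries come out in the same row order as the truth table, so the list can be compared against the table column directly.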
7. Programmable Carry Functional Unit (PCFU)
8. Configuration generation
[Figure: Meta Register file (m-r3, m-r4, m-r5) and LUT configuration. Labels: p, g, r1, r2, Output, Cin, in1, in2. Bit patterns from the figure:]
1 0 0 0 1 0 0 0
0 1 1 1 1 0 0 0
0 1 0 1 0 1 0 1
0 0 1 1 0 0 1 1
0 0 0 0 1 1 1 1
0 0 0 1 0 0 0 1
1 0 0 1 0 1 1 0
0 0 0 1 1 1 1 0
0 1 1 0 0 1 1 0
1 0 0 0 1 0 0 0
Meta Function Unit (r3, r4), per operation:
- ADD: $Out = A \oplus B \oplus cin$, $g = A \cdot B$, $p = A \oplus B$
- AND: Out = A AND B
- XOR: $Out = A \oplus B$
Example: LUT(r3) = LUT(r1) AND LUT(r2)
Subgraph:
AND r3, r1, r2
ADD r4, r1, r2
XOR r5, r3, r4
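One way to read the configuration step is that the subgraph is executed on truth-table columns rather than on data, so each meta register ends up holding the LUT contents for the corresponding register. The Python sketch below follows that reading. It is an assumption-laden illustration: `meta_execute`, `show`, the 8-entry column encoding over (cin, in2, in1), and the restriction to a single ADD whose operands are live-in values are all hypothetical choices, not details confirmed by the slides.

```python
# Truth-table columns over the 8 combinations k = (cin << 2) | (in2 << 1) | in1.
# Bit k of each pattern is the value of that variable in combination k.
IN1 = 0b10101010
IN2 = 0b11001100
CIN = 0b11110000

def show(pattern, entries=8):
    """Render a pattern with combination 0 first."""
    return " ".join(str((pattern >> k) & 1) for k in range(entries))

def meta_execute(program):
    """Run a subgraph on meta registers and return the resulting LUT configuration."""
    meta = {"r1": IN1, "r2": IN2}            # live-in registers map to LUT inputs
    config = {}
    for op, dst, a, b in program:
        A, B = meta[a], meta[b]
        if op == "AND":
            meta[dst] = A & B                # LUT(dst) = LUT(a) AND LUT(b)
        elif op == "XOR":
            meta[dst] = A ^ B
        elif op == "ADD":                    # sketch supports one ADD only
            config["p"] = A ^ B              # propagate
            config["g"] = A & B              # generate
            meta[dst] = A ^ B ^ CIN          # sum bit depends on the carry input
        else:
            raise ValueError(f"unsupported op: {op}")
    config["out"] = meta[program[-1][1]]     # LUT of the last destination register
    return config

subgraph = [("AND", "r3", "r1", "r2"),
            ("ADD", "r4", "r1", "r2"),
            ("XOR", "r5", "r3", "r4")]
config = meta_execute(subgraph)
for name in ("p", "g", "out"):
    print(name, show(config[name]))
```

The printed patterns can be compared with the bit rows shown on the slide, keeping in mind that the bit ordering used in the figure is not stated.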
9. Design Space
- Number of Inputs
- Number of Outputs
- Number of Additions/Subtractions
- Shift support
  - At inputs
  - At outputs
[Figure: PCFU datapath. Labels: g1 LUT / p1 LUT, g2 LUT / p2 LUT, two Carry Generators (producing cin1 and cin2), OutLUT; inputs in1 to in4; signals p1, g1, p2, g2; bus widths 16, 32, and 64.]
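The split into propagate/generate LUTs, a carry generator, and an output LUT can be illustrated with a small Python model of one addition followed by a logic operation. This is a sketch under stated assumptions: the function names are hypothetical, the p and g LUTs are fixed to XOR and AND, and the carry chain is written as a simple software loop, which says nothing about how the hardware carry generator is actually built.

```python
MASK32 = 0xFFFFFFFF

def carry_generator(p, g, width=32, carry_in=0):
    """Return a word whose bit i is the carry INTO bit i, given propagate/generate words."""
    carries = 0
    c = carry_in
    for i in range(width):
        carries |= c << i
        # Carry out of bit i: generated here, or propagated from below.
        c = ((g >> i) & 1) | (((p >> i) & 1) & c)
    return carries

def pcfu_add_then_xor(in1, in2, in3):
    """Evaluate r4 = in1 + in2 followed by r5 = in3 XOR r4 in one pass."""
    p = (in1 ^ in2) & MASK32          # p LUT configured as XOR of the adder operands
    g = (in1 & in2) & MASK32          # g LUT configured as AND of the adder operands
    cin = carry_generator(p, g)       # per-bit carry inputs
    # Output LUT: a purely bitwise function of the inputs and the carry word.
    return (in3 ^ in1 ^ in2 ^ cin) & MASK32

a, b, c = 0x89ABCDEF, 0x12345678, 0x0F0F0F0F   # arbitrary sample values
print(hex(pcfu_add_then_xor(a, b, c)))
```

Once the carry word is available, the output stage no longer needs any inter-bit communication, which is what lets it be expressed as a per-bit LUT.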
10. Evaluation
- Ported Trimaran compiler to ARM ISA
- Subgraph identification engine
- Synthesized with Synopsys standard cell library at 0.13 µm
- SimpleScalar configured as ARM926EJ-S
  - 5-stage pipeline, 250 MHz
  - 1-cycle 16k I/D caches
  - Single issue
- Baseline: 1-cycle subgraph execution latency
11. Speedup: Baseline PCFU
- 4-input, 2-output PCFU design
12. Number of inputs/outputs
Area is proportional
13. Number of additions/subtractions
14. Collapsing Emulation
15. Shift support
16. Design points
4I, 2O, 2A, None
5I, 3O, 2A, None
4I, 3O, 2A, None
17. Conclusions
- Transparent Instruction Set Customization needs
  - Extracting computations from the program
  - An efficient substrate to map subgraphs
- PCFU: LUT-based accelerators
  - Flexible, configurable accelerators
  - Efficient configuration
- Up to 66% speedup with a 6-input / 3-output / 2-adder PCFU
- ... but 62% with an 8 times smaller, 40% faster PCFU
18. Q & A
19. Backups
20. PCFU Design Space
Configuration             Latency (ns)  Area (cells)  Speedup
CCA (Michigan)            4.32          278748        1.8
CCA (RD)                  7.07          606345        1.8
PCFU (2 AS/4IN/2OUT)      4.2           171305        1.62
PCFU Logic only           0.59          26007         1.18
PCFU 1 ADD                2.15          63603         1.33
PCFU 2 ADD                3.79          134637        1.62
PCFU 3 ADD                5.82          274939        1.63
PCFU 2 IN                 3.03          52437         1.49
PCFU 3 IN                 3.24          68846         1.56
PCFU 4 IN                 3.79          134637        1.62
PCFU 5 IN                 5.25          214885        1.63
PCFU 6 IN                 5.47          465630        1.63
PCFU (1 OUT)              3.79          134637        1.45
PCFU (2 OUT)              4.2           171305        1.62
PCFU (3 OUT)              4.57          230189        1.63
PCFU (Shift at inputs)    5.02          170529        1.75
PCFU (Shift at outputs)   4.45          158009        1.74
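As a reading aid for the table, the Python sketch below derives two quantities from it: how many clock periods each latency would span at the 250 MHz clock quoted on the Evaluation slide, and each area relative to CCA (Michigan). This is plain arithmetic on the listed numbers. It assumes the full 4 ns period is usable, which the slides do not state, and it is not a claim about how the authors modeled latency (the Evaluation slide gives a 1-cycle baseline). The names `cycles_needed` and `summarize` are hypothetical.

```python
import math

CLOCK_MHZ = 250                     # clock from the Evaluation slide
PERIOD_NS = 1000.0 / CLOCK_MHZ      # 4.0 ns

# (configuration, latency in ns, area in cells, speedup), copied from the table.
DESIGNS = [
    ("CCA (Michigan)", 4.32, 278748, 1.8),
    ("CCA (RD)", 7.07, 606345, 1.8),
    ("PCFU (2 AS/4IN/2OUT)", 4.2, 171305, 1.62),
    ("PCFU Logic only", 0.59, 26007, 1.18),
    ("PCFU 1 ADD", 2.15, 63603, 1.33),
    ("PCFU 2 ADD", 3.79, 134637, 1.62),
    ("PCFU 3 ADD", 5.82, 274939, 1.63),
    ("PCFU 2 IN", 3.03, 52437, 1.49),
    ("PCFU 3 IN", 3.24, 68846, 1.56),
    ("PCFU 4 IN", 3.79, 134637, 1.62),
    ("PCFU 5 IN", 5.25, 214885, 1.63),
    ("PCFU 6 IN", 5.47, 465630, 1.63),
    ("PCFU (1 OUT)", 3.79, 134637, 1.45),
    ("PCFU (2 OUT)", 4.2, 171305, 1.62),
    ("PCFU (3 OUT)", 4.57, 230189, 1.63),
    ("PCFU (Shift at inputs)", 5.02, 170529, 1.75),
    ("PCFU (Shift at outputs)", 4.45, 158009, 1.74),
]

def cycles_needed(latency_ns, period_ns=PERIOD_NS):
    """Whole clock periods spanned, assuming the entire period is usable."""
    return math.ceil(latency_ns / period_ns)

def summarize(designs, reference="CCA (Michigan)"):
    """Return (name, cycles, area relative to the reference, speedup) per row."""
    ref_area = next(area for name, _, area, _ in designs if name == reference)
    return [(name, cycles_needed(lat), round(area / ref_area, 2), speedup)
            for name, lat, area, speedup in designs]

for row in summarize(DESIGNS):
    print(row)
```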
21. LUT-based accelerator
[Figure: r1 and r2 feed an ADD producing r4; r3 and r4 feed an XOR producing r5.]
ADD r4, r1, r2
XOR r5, r3, r4
$r5_i = r3_i \oplus (r1_i \oplus r2_i \oplus cin_{i-1})$
$cin_i = (r1_i \cdot r2_i) \lor (r1_i \oplus r2_i) \cdot cin_{i-1}$
- Closer to FPGA
- Bit-level functions too complex
- Proposed ripple-carry scheme too slow
- May involve a carry propagation network, which is also very complex
- Hard to configure and to keep at a reasonable latency in a GPP
cin_{i-1} r3_i r2_i r1_i | r5_i
0         0    0    0    | 0
0         0    0    1    | 1
0         0    1    0    | 1
0         0    1    1    | 0
0         1    0    0    | 1
0         1    0    1    | 0
0         1    1    0    | 0
0         1    1    1    | 1
1         0    0    0    | 1
1         0    0    1    | 0
1         0    1    0    | 0
1         0    1    1    | 1
1         1    0    0    | 0
1         1    0    1    | 1
1         1    1    0    | 1
1         1    1    1    | 0
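The bit-level table can be regenerated directly from the two equations on this slide. The Python sketch below does that and then chains the per-bit functions across a word, which is the ripple-carry scheme the bullets describe as too slow. Function names and the 32-bit width are assumptions made for illustration.

```python
from itertools import product

def r5_bit(cin_prev, r3, r2, r1):
    """Per-bit output: r5_i = r3_i XOR (r1_i XOR r2_i XOR cin_{i-1})."""
    return r3 ^ (r1 ^ r2 ^ cin_prev)

def carry_bit(cin_prev, r2, r1):
    """Per-bit carry: cin_i = (r1_i AND r2_i) OR ((r1_i XOR r2_i) AND cin_{i-1})."""
    return (r1 & r2) | ((r1 ^ r2) & cin_prev)

# Rows in the same column order as the table: cin_{i-1}, r3_i, r2_i, r1_i, r5_i.
rows = [(c, r3, r2, r1, r5_bit(c, r3, r2, r1))
        for c, r3, r2, r1 in product((0, 1), repeat=4)]

def ripple_eval(r1, r2, r3, width=32):
    """Evaluate r5 = r3 XOR (r1 + r2) one bit at a time, rippling the carry."""
    c, out = 0, 0
    for i in range(width):
        a, b, d = (r1 >> i) & 1, (r2 >> i) & 1, (r3 >> i) & 1
        out |= r5_bit(c, d, b, a) << i
        c = carry_bit(c, b, a)
    return out

for row in rows:
    print(*row)
```

Each bit must wait for the carry from the bit below it, so the loop makes the serial dependence of this scheme explicit.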