Exploring the Design Space of LUT-based Transparent Accelerators presentation


Slides: 22

Provided by: swal159

Learn more at: https://cccp.eecs.umich.edu


Transcript and Presenter's Notes


1
Exploring the Design Space of LUT-based
Transparent Accelerators

Sami Yehia, Nathan Clark*, Scott Mahlke*, and Krisztian Flautner
ARM Ltd.
*Advanced Computer Architecture Lab, University of Michigan

CASES 2005, September 24-27
2
Embedded Products Convergence

Performance needs grow with increasing application demands
Embedded systems win through customization: more performance, lower power, etc.
Traditional ISA customization and hardware specialization cannot cope with the growing number of functionalities
One way: Transparent Instruction Set Customization

3
Transparent Instruction Set Customization

An alternative way to performance

No (or only minor) ISA change
Baseline CPU unchanged
Hardware generates control
Eases software burden
Forward compatible

Transparent

4
Architecture Framework
[Figure: architecture framework. Labels: Application, Compiler, Instructions, Control Generation (augments instruction stream), Standard Pipeline, BRL, Subgraph Execution Unit, Subgraph, Inputs, Outputs]
5
Pipeline Interface
6
LUT-based accelerator

LUT-Based

inst1 EOR r6,r1,r2
inst2 AND r7,r4,r5
inst3 ORR r12,r6,r7

[Figure: dataflow graph. r1 and r2 feed EOR, r4 and r5 feed AND, and both results feed ORR, which produces r12]

r5 r4 r2 r1 | r12 = (r1 XOR r2) OR (r4 AND r5)
0  0  0  0  | 0
0  0  0  1  | 1
0  0  1  0  | 1
0  0  1  1  | 0
0  1  0  0  | 0
0  1  0  1  | 1
0  1  1  0  | 1
0  1  1  1  | 0
1  0  0  0  | 0
1  0  0  1  | 1
1  0  1  0  | 1
1  0  1  1  | 0
1  1  0  0  | 1
1  1  0  1  | 1
1  1  1  0  | 1
1  1  1  1  | 1

Addition/Subtraction

inst1 ADD r6,r1,r2

$r6_i = r1_i \oplus r2_i \oplus Cin_{i-1}$
$Cin_i = r1_i \cdot r2_i + Cin_{i-1} \cdot (r1_i \oplus r2_i)$

A Carry Generator that is also programmable
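
The two ideas on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the hardware described in the talk: the helper names `lut_config`, `lut_apply`, and `ripple_add` are hypothetical, and the 32-bit width is an assumption. The first part builds the 16-entry table for the EOR/AND/ORR subgraph and applies it to every bit position; the second part evaluates the per-bit sum and carry equations in a simple ripple loop.

```python
from itertools import product

def lut_config(func, n_inputs):
    """Build the 2**n_inputs-entry truth table of a 1-bit function."""
    return [func(*bits) for bits in product((0, 1), repeat=n_inputs)]

def lut_apply(table, words, width=32):
    """Apply the same LUT independently at every bit position.

    The first word in `words` supplies the most significant index bit.
    """
    out = 0
    for i in range(width):
        index = 0
        for w in words:
            index = (index << 1) | ((w >> i) & 1)
        out |= table[index] << i
    return out

# Column order r5, r4, r2, r1, matching the truth table on the slide.
SUBGRAPH_LUT = lut_config(lambda r5, r4, r2, r1: (r1 ^ r2) | (r4 & r5), 4)

def ripple_add(a, b, width=32):
    """Bit-serial addition using the sum and carry equations of the slide."""
    out, carry = 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        out |= (ai ^ bi ^ carry) << i                # sum bit
        carry = (ai & bi) | (carry & (ai ^ bi))      # carry into bit i+1
    return out

if __name__ == "__main__":
    r1, r2, r4, r5 = 0x0F0F1234, 0x00FF5678, 0xAAAA0000, 0xCCCC00FF
    print(hex(lut_apply(SUBGRAPH_LUT, [r5, r4, r2, r1])))
    print(hex(ripple_add(r1, r2)))
```

A purely bitwise subgraph collapses into one table lookup per bit, while addition needs the carry chain because bit i depends on all lower bits; that dependency is what motivates a separate, programmable carry generator.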
7
Programmable Carry Functional Unit (PCFU)
8
Configuration generation
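Subgraph:
AND r3, r1, r2
ADD r4, r1, r2
XOR r5, r3, r4

Meta Function Unit:
AND: $Out = A \cdot B$, so LUT(r3) = LUT(r1) AND LUT(r2)
ADD: $Out = A \oplus B \oplus cin$, $g = A \cdot B$, $p = A \oplus B$
XOR: $Out = A \oplus B$

[Figure: Meta Register file (m-r3, m-r4, m-r5) connected to the Meta Function Unit. Signal labels: p, g, r1, r2, r3, r4, in1, in2, Cin, Output. The bit patterns shown in the figure follow; their row labels were lost in extraction]
1 0 0 0 1 0 0 0
0 1 1 1 1 0 0 0
0 1 0 1 0 1 0 1
0 0 1 1 0 0 1 1
0 0 0 0 1 1 1 1
0 0 0 1 0 0 0 1
1 0 0 1 0 1 1 0
0 0 0 1 1 1 1 0
0 1 1 0 0 1 1 0
1 0 0 0 1 0 0 0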
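
One way to read this slide is that the configuration is produced by running each subgraph instruction on truth-table patterns instead of on data. The sketch below illustrates that reading. The function names are hypothetical, and the assignment of the patterns 01010101, 00110011, and 00001111 to r1, r2, and the carry-in is an assumption, since the figure's row labels did not survive.

```python
MASK = 0xFF  # 8 truth-table rows: three 1-bit variables

# Assumed truth-table patterns of the free variables (labels are a guess).
LUT_R1 = 0b01010101
LUT_R2 = 0b00110011
LUT_CIN = 0b00001111

def meta_and(a, b):
    """AND on patterns: LUT(dst) = LUT(a) AND LUT(b)."""
    return a & b & MASK

def meta_xor(a, b):
    """XOR on patterns: LUT(dst) = LUT(a) XOR LUT(b)."""
    return (a ^ b) & MASK

def meta_add(a, b, cin):
    """ADD on patterns: returns (sum, propagate, generate) patterns."""
    p = (a ^ b) & MASK
    g = a & b & MASK
    return p ^ cin, p, g

# Meta-execute the subgraph: AND r3,r1,r2 ; ADD r4,r1,r2 ; XOR r5,r3,r4
lut_r3 = meta_and(LUT_R1, LUT_R2)
lut_r4, p, g = meta_add(LUT_R1, LUT_R2, LUT_CIN)
lut_r5 = meta_xor(lut_r3, lut_r4)

if __name__ == "__main__":
    for name, value in [("r3", lut_r3), ("p", p), ("g", g), ("r5", lut_r5)]:
        print(name, format(value, "08b"))
```

Under these assumptions the computed patterns for r3 (00010001), p (01100110), and r5 (01111000) coincide with bit rows that appear in the figure, which is consistent with this interpretation but does not confirm it.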
9
Design Space

Number of Inputs
Number of Outputs
Number of Addition/Subtractions
Shift support
At inputs
At outputs

[Figure: PCFU datapath. Labels: g1 LUT and p1 LUT (16, 16); g2 LUT and p2 LUT (32, 32); two Carry Generators producing cin1 and cin2; OutLUT (64); 32-bit inputs in1 to in4; 32-bit signals p1, g1, p2, g2, cin1, cin2; 32-bit output]
10
Evaluation

Ported Trimaran compiler to the ARM ISA
Subgraph identification engine
Synthesized with a Synopsys standard cell library at 0.13 µm
SimpleScalar configured as ARM926EJ-S
5-stage pipe, 250 MHz
1-cycle 16k I/D caches
Single issue
Baseline: 1-cycle subgraph execution latency

11
Speedup: Baseline PCFU

4-input, 2-output PCFU design

12
Number of inputs/outputs
Area is proportional
13
Number of addition/subtractions
14
Collapsing Emulation
15
Shift support
16
Design points
4I, 2O, 2A, None
5I, 3O, 2A, None
4I, 3O, 2A, None
17
Conclusions

Transparent Instruction Set Customization needs:
Extracting computations from the program
An efficient substrate to map subgraphs
PCFU: LUT-based accelerators
Flexible, configurable accelerators
Efficient configuration
You can get up to a 66% speedup with a 6-input / 3-output / 2-adder PCFU
... but you get 62% with an 8 times smaller, 40% faster PCFU

18
Q & A
19
Backups
20
PCFU Design Space
Design                    Latency (ns)  Area (cells)  Speedup
CCA (Michigan)            4.32          278748        1.8
CCA (RD)                  7.07          606345        1.8
PCFU (2 AS/4 IN/2 OUT)    4.2           171305        1.62
PCFU Logic only           0.59          26007         1.18
PCFU 1 ADD                2.15          63603         1.33
PCFU 2 ADD                3.79          134637        1.62
PCFU 3 ADD                5.82          274939        1.63
PCFU 2 IN                 3.03          52437         1.49
PCFU 3 IN                 3.24          68846         1.56
PCFU 4 IN                 3.79          134637        1.62
PCFU 5 IN                 5.25          214885        1.63
PCFU 6 IN                 5.47          465630        1.63
PCFU (1 OUT)              3.79          134637        1.45
PCFU (2 OUT)              4.2           171305        1.62
PCFU (3 OUT)              4.57          230189        1.63
PCFU (Shift at inputs)    5.02          170529        1.75
PCFU (Shift at outputs)   4.45          158009        1.74
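
One way to read a table like this is to ask which design points are not beaten on every axis by some other point. The sketch below applies a plain Pareto filter to the PCFU rows (lower latency and area are better, higher speedup is better). The filter is an illustration added here, not an analysis from the talk, and the names `dominates` and `pareto_front` are hypothetical.

```python
# (name, latency_ns, area_cells, speedup), copied from the PCFU rows of the table
POINTS = [
    ("PCFU (2 AS/4 IN/2 OUT)", 4.2, 171305, 1.62),
    ("PCFU Logic only", 0.59, 26007, 1.18),
    ("PCFU 1 ADD", 2.15, 63603, 1.33),
    ("PCFU 2 ADD", 3.79, 134637, 1.62),
    ("PCFU 3 ADD", 5.82, 274939, 1.63),
    ("PCFU 2 IN", 3.03, 52437, 1.49),
    ("PCFU 3 IN", 3.24, 68846, 1.56),
    ("PCFU 4 IN", 3.79, 134637, 1.62),
    ("PCFU 5 IN", 5.25, 214885, 1.63),
    ("PCFU 6 IN", 5.47, 465630, 1.63),
    ("PCFU (1 OUT)", 3.79, 134637, 1.45),
    ("PCFU (2 OUT)", 4.2, 171305, 1.62),
    ("PCFU (3 OUT)", 4.57, 230189, 1.63),
    ("PCFU (Shift at inputs)", 5.02, 170529, 1.75),
    ("PCFU (Shift at outputs)", 4.45, 158009, 1.74),
]

def dominates(a, b):
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = a[1] <= b[1] and a[2] <= b[2] and a[3] >= b[3]
    better = a[1] < b[1] or a[2] < b[2] or a[3] > b[3]
    return no_worse and better

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

FRONT_NAMES = [p[0] for p in pareto_front(POINTS)]

if __name__ == "__main__":
    for name in FRONT_NAMES:
        print(name)
```

For example, the 3 ADD and 6 IN rows drop out because the 3 OUT row reaches the same 1.63 speedup with lower latency and smaller area.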
21
LUT-based accelerator
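ADD r4,r1,r2
XOR r5,r3,r4

[Figure: dataflow graph. r1 and r2 feed the adder, which produces r4; r3 and r4 feed the XOR, which produces r5]

$r5_i = r3_i \oplus (r1_i \oplus r2_i \oplus Cin_{i-1})$
$Cin_i = (r1_i \cdot r2_i) + (r1_i \oplus r2_i) \cdot Cin_{i-1}$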

Closer to FPGA
Bit-level functions too complex
Proposed ripple-carry scheme too slow
May involve a carry propagation network, which is also very complex
Hard to configure with a reasonable latency in a GPP

Cin_{i-1} r3_i r2_i r1_i | r5_i
0         0    0    0    | 0
0         0    0    1    | 1
0         0    1    0    | 1
0         0    1    1    | 0
0         1    0    0    | 1
0         1    0    1    | 0
0         1    1    0    | 0
0         1    1    1    | 1
1         0    0    0    | 1
1         0    0    1    | 0
1         0    1    0    | 0
1         0    1    1    | 1
1         1    0    0    | 0
1         1    0    1    | 1
1         1    1    0    | 1
1         1    1    1    | 0
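
The per-bit equations of this slide can be enumerated directly, which gives a quick way to check the r5 column against the formula. This is a small sketch with hypothetical function names (`r5_bit`, `carry_out`); it evaluates the equations as written on the slide over all 16 input combinations, in the same column order as the table (carry-in, r3, r2, r1).

```python
from itertools import product

def r5_bit(cin_prev, r3, r2, r1):
    """Bit i of r5 for ADD r4,r1,r2 followed by XOR r5,r3,r4."""
    r4 = r1 ^ r2 ^ cin_prev      # sum bit of the addition
    return r3 ^ r4

def carry_out(cin_prev, r2, r1):
    """Carry passed to bit i+1: generate OR (propagate AND carry-in)."""
    return (r1 & r2) | ((r1 ^ r2) & cin_prev)

# One row per input combination: (cin_prev, r3, r2, r1, r5)
ROWS = [
    (c, r3, r2, r1, r5_bit(c, r3, r2, r1))
    for c, r3, r2, r1 in product((0, 1), repeat=4)
]

if __name__ == "__main__":
    for row in ROWS:
        print(*row)
```

Because r5 is the XOR of all four inputs, each output bit is simply the parity of its row, so any row with an odd number of ones must produce 1.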
