Exploring the Design Space of LUT-based Transparent Accelerators

1
Exploring the Design Space of LUT-based
Transparent Accelerators
  • Sami Yehia, Nathan Clark*, Scott Mahlke*, and
    Krisztian Flautner
  • ARM Ltd.
  • *Advanced Computer Architecture Lab, University
    of Michigan

CASES 2005, September 24-27
2
Embedded Products Convergence
  • Need for more performance to meet increasing
    application demands
  • Embedded systems win through customization: more
    performance, lower power, etc.
  • Traditional ISA customization and hardware
    specialization cannot cope with the growing number
    of functionalities.
  • One way: Transparent Instruction Set
    Customization

3
Transparent Instruction Set Customization
  • An alternative path to performance
  • No (or only minor) ISA change
  • Baseline CPU unchanged
  • Hardware generates control
  • Eases software burden
  • Forward compatible
  • Transparent

4
Architecture Framework
[Figure labels: Application, Compiler, Instructions, Subgraph, Control Generation (augments instruction stream), Standard Pipeline, BRL, Subgraph Execution Unit (inputs, outputs)]
5
Pipeline Interface
6
LUT-based accelerator
  • LUT-Based
  • Addition/Subtraction

Logic subgraph (inputs r1, r2, r4, r5; output r12):
  inst1 EOR r6,r1,r2
  inst2 AND r7,r4,r5
  inst3 ORR r12,r6,r7

LUT contents for r12 = (r1 EOR r2) ORR (r4 AND r5):
r5 r4 r2 r1 r12
0 0 0 0 0
0 0 0 1 1
0 0 1 0 1
0 0 1 1 0
0 1 0 0 0
0 1 0 1 1
0 1 1 0 1
0 1 1 1 0
1 0 0 0 0
1 0 0 1 1
1 0 1 0 1
1 0 1 1 0
1 1 0 0 1
1 1 0 1 1
1 1 1 0 1
1 1 1 1 1

Variant with addition:
  inst1 ADD r6,r1,r2
  $r6_i = r1_i \oplus r2_i \oplus Cin_{i-1}$
  $Cin_i = r1_i \cdot r2_i \lor Cin_{i-1} \cdot (r1_i \oplus r2_i)$

A Carry Generator that is also programmable
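As a rough illustration of the logic-only case, the sketch below collapses the EOR/AND/ORR subgraph into one 16-entry table and applies it at every bit position of 32-bit registers. The helper names `build_lut` and `apply_lut`, the index ordering, and the sample register values are assumptions made for this example, not details taken from the slides.

```python
MASK32 = 0xFFFFFFFF

def build_lut(func, n_inputs):
    """Return the truth table of func as a list of 2**n_inputs output bits."""
    table = []
    for index in range(2 ** n_inputs):
        # Bit k of the index carries the value of input k.
        bits = [(index >> k) & 1 for k in range(n_inputs)]
        table.append(func(*bits) & 1)
    return table

def apply_lut(table, words, width=32):
    """Evaluate the same LUT independently at every bit position."""
    result = 0
    for i in range(width):
        index = 0
        for k, word in enumerate(words):
            index |= ((word >> i) & 1) << k
        result |= table[index] << i
    return result

# Assumed input order (r1, r2, r4, r5): r1 is the least significant index
# bit, which reproduces the row order of the r5 r4 r2 r1 truth table.
subgraph_lut = build_lut(lambda r1, r2, r4, r5: (r1 ^ r2) | (r4 & r5), 4)

# Arbitrary sample register values (hypothetical).
r1, r2, r4, r5 = 0x0F0F1234, 0x00FF5678, 0xDEADBEEF, 0xCAFEF00D
r12 = apply_lut(subgraph_lut, [r1, r2, r4, r5])

# One LUT lookup per bit replaces the three dependent instructions.
assert r12 == ((r1 ^ r2) | (r4 & r5)) & MASK32
```

The same table serves all 32 bit positions because the three logic instructions are purely bitwise; addition breaks that independence, which is why a carry generator is needed.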
7
Programmable Carry Functional Unit (PCFU)
8
Configuration generation
[Figure labels: Meta Register file (m-r3, m-r4, m-r5), Meta Function Unit, p, g, r1, r2, r3, r4, in1, in2, Cin, Output]

Subgraph:
  AND r3, r1, r2
  ADD r4, r1, r2
  XOR r5, r3, r4

Meta functions:
  ADD: $Out = A \oplus B \oplus cin$, $g = A \cdot B$, $p = A \oplus B$
  AND: Out = A AND B, so LUT(r3) = LUT(r1) AND LUT(r2)
  XOR: $Out = A \oplus B$

LUT bit patterns shown in the figure:
1 0 0 0 1 0 0 0
0 1 1 1 1 0 0 0
0 1 0 1 0 1 0 1
0 0 1 1 0 0 1 1
0 0 0 0 1 1 1 1
0 0 0 1 0 0 0 1
1 0 0 1 0 1 1 0
0 0 0 1 1 1 1 0
0 1 1 0 0 1 1 0
1 0 0 0 1 0 0 0
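One way to read this slide is that configuration is produced by "executing" the subgraph on truth tables instead of data values. The sketch below follows that reading under several assumptions: the function names, the dictionary standing in for the meta register file, the bit ordering of the tables, and the restriction to a single ADD are all hypothetical choices for illustration.

```python
MASK32 = 0xFFFFFFFF

# Truth-table columns for the three LUT inputs, one entry per input
# combination; index = in1 + 2*in2 + 4*cin (this ordering is an assumption).
IN1 = [0, 1, 0, 1, 0, 1, 0, 1]
IN2 = [0, 0, 1, 1, 0, 0, 1, 1]
CIN = [0, 0, 0, 0, 1, 1, 1, 1]

def table_and(a, b):
    return [x & y for x, y in zip(a, b)]

def table_xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def generate_config(instructions, live_ins):
    """Walk the subgraph once, keeping one truth table per register."""
    meta = dict(live_ins)   # hypothetical stand-in for the meta register file
    carry_cfg = {}          # generate/propagate tables (one ADD supported here)
    for op, dst, src_a, src_b in instructions:
        a, b = meta[src_a], meta[src_b]
        if op == "AND":
            meta[dst] = table_and(a, b)
        elif op == "XOR":
            meta[dst] = table_xor(a, b)
        elif op == "ADD":
            carry_cfg["g"] = table_and(a, b)             # g = A.B
            carry_cfg["p"] = table_xor(a, b)             # p = A xor B
            meta[dst] = table_xor(carry_cfg["p"], CIN)   # Out = A xor B xor cin
        else:
            raise ValueError(f"unsupported opcode: {op}")
    return meta, carry_cfg

def execute(out_table, carry_cfg, in1, in2, width=32):
    """Run the configured LUT bit by bit, rippling the carry upward."""
    result, carry = 0, 0
    for i in range(width):
        idx = ((in1 >> i) & 1) | (((in2 >> i) & 1) << 1)
        result |= out_table[idx | (carry << 2)] << i
        # g and p do not depend on cin here, so the cin = 0 half of the
        # table is enough to look them up.
        carry = carry_cfg["g"][idx] | (carry_cfg["p"][idx] & carry)
    return result

subgraph = [("AND", "r3", "r1", "r2"),
            ("ADD", "r4", "r1", "r2"),
            ("XOR", "r5", "r3", "r4")]
meta, carry_cfg = generate_config(subgraph, {"r1": IN1, "r2": IN2})

a, b = 0x89ABCDEF, 0x7FFF0001   # arbitrary sample inputs
assert execute(meta["r5"], carry_cfg, a, b) == (a & b) ^ ((a + b) & MASK32)
```

Under the assumed bit ordering, the tables this produces for g, p and r5 coincide with three of the bit rows listed on the slide (0 0 0 1 0 0 0 1, 0 1 1 0 0 1 1 0 and 0 1 1 1 1 0 0 0), which suggests the reading is at least plausible.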
9
Design Space
  • Number of Inputs
  • Number of Outputs
  • Number of Addition/Subtractions
  • Shift support
      • At inputs
      • At outputs

[Figure: PCFU datapath. Labels: inputs in1, in2, in3, in4 (32 bits each); g1 LUT and p1 LUT (16); g2 LUT and p2 LUT (32); OutLUT (64); signals p1, g1, p2, g2 (32 bits each); two Carry Generators with carries cin1 and cin2 (32 bits each); 32-bit output]
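The figures 16, 32 and 64 attached to the g1/p1, g2/p2 and output LUTs are consistent with truth tables over 4, 5 and 6 inputs. Assuming that interpretation (each later stage sees the data inputs plus the carries produced before it), a minimal sketch of how configuration size would scale across the design space could look like this; `pcfu_config_bits` and its structure are hypothetical, not a description of the actual hardware.

```python
def lut_bits(n_inputs):
    """A full truth table over n_inputs needs 2**n_inputs configuration bits."""
    return 2 ** n_inputs

def pcfu_config_bits(n_inputs, n_adders):
    """Per-LUT configuration sizes under the assumed PCFU structure."""
    sizes = {}
    for k in range(1, n_adders + 1):
        # Assumption: stage k sees the data inputs plus the k-1 earlier carries.
        visible = n_inputs + (k - 1)
        sizes[f"g{k}"] = lut_bits(visible)
        sizes[f"p{k}"] = lut_bits(visible)
    # Assumption: the output LUT sees the data inputs plus every carry.
    sizes["out"] = lut_bits(n_inputs + n_adders)
    return sizes

baseline = pcfu_config_bits(n_inputs=4, n_adders=2)
print(baseline, sum(baseline.values()))
```

Because each extra input or adder doubles the table sizes, this model would predict configuration storage growing exponentially along those two axes of the design space.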
10
Evaluation
  • Ported Trimaran compiler to ARM ISA
  • Subgraph identification engine
  • Synthesized with Synopsys standard cell library
    at 0.13 µm
  • SimpleScalar configured as ARM926EJ-S
  • 5-stage pipeline, 250 MHz
  • 1-cycle 16 KB I/D caches
  • Single issue
  • Baseline: 1-cycle subgraph execution latency

11
Speedup: Baseline PCFU
  • 4-input, 2-output PCFU design

12
Number of inputs/outputs
Area is proportional
13
Number of addition/subtractions
14
Collapsing Emulation
15
Shift support
16
Design points
(I: inputs, O: outputs, A: addition/subtraction units, last field: shift support)
4I, 2O, 2A, None
5I, 3O, 2A, None
4I, 3O, 2A, None
17
Conclusions
  • Transparent Instruction Set Customization needs:
  • Extracting computations from the program
  • An efficient substrate to map subgraphs
  • PCFU: LUT-based accelerators
  • Flexible, configurable accelerators
  • Efficient configuration
  • You can get up to 66% speedup with a 6-input /
    3-output / 2-adder PCFU
  • ... but you get 62% with an 8 times smaller, 40%
    faster PCFU

18
Q & A
19
Backups
20
PCFU Design Space
Design                   Latency (ns)  Area (cells)  Speedup
CCA (Michigan)           4.32          278748        1.8
CCA (RD)                 7.07          606345        1.8
PCFU (2 AS/4IN/2OUT)     4.2           171305        1.62
PCFU Logic only          0.59          26007         1.18
PCFU 1 ADD               2.15          63603         1.33
PCFU 2 ADD               3.79          134637        1.62
PCFU 3 ADD               5.82          274939        1.63
PCFU 2 IN                3.03          52437         1.49
PCFU 3 IN                3.24          68846         1.56
PCFU 4 IN                3.79          134637        1.62
PCFU 5 IN                5.25          214885        1.63
PCFU 6 IN                5.47          465630        1.63
PCFU (1 OUT)             3.79          134637        1.45
PCFU (2 OUT)             4.2           171305        1.62
PCFU (3 OUT)             4.57          230189        1.63
PCFU (Shift at inputs)   5.02          170529        1.75
PCFU (Shift at outputs)  4.45          158009        1.74
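To compare design points, the rows above can be normalized against one baseline. The sketch below does this for a handful of rows relative to the CCA (Michigan) entry; the numbers are copied from the table, while the helper `relative_to` and the choice of baseline are assumptions made for illustration.

```python
# (latency_ns, area_cells, speedup) copied from a few rows of the table.
designs = {
    "CCA (Michigan)": (4.32, 278748, 1.8),
    "PCFU (2 AS/4IN/2OUT)": (4.2, 171305, 1.62),
    "PCFU Logic only": (0.59, 26007, 1.18),
    "PCFU (Shift at inputs)": (5.02, 170529, 1.75),
    "PCFU (Shift at outputs)": (4.45, 158009, 1.74),
}

def relative_to(name, baseline="CCA (Michigan)"):
    """Express latency, area and speedup gain as fractions of the baseline."""
    latency, area, speedup = designs[name]
    base_latency, base_area, base_speedup = designs[baseline]
    return {
        "latency": latency / base_latency,
        "area": area / base_area,
        # Fraction of the baseline's improvement over 1.0 that is retained.
        "gain": (speedup - 1.0) / (base_speedup - 1.0),
    }

for name in designs:
    ratios = relative_to(name)
    print(f"{name:26s} " + "  ".join(f"{k}={v:.2f}" for k, v in ratios.items()))
```

Reading the ratios side by side makes the trade-off visible: a design point is attractive when its area and latency fractions fall faster than its retained gain.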
21
LUT-based accelerator
Subgraph (inputs r1, r2, r3; output r5):
  ADD r4,r1,r2
  XOR r5,r3,r4

  $r5_i = r3_i \oplus (r1_i \oplus r2_i \oplus cin_{i-1})$
  $cin_i = (r1_i \cdot r2_i) \lor (r1_i \oplus r2_i) \cdot cin_{i-1}$
  • Closer to an FPGA
  • Bit-level functions too complex
  • Proposed ripple-carry scheme too slow
  • May involve a carry propagation network, which is
    also very complex
  • Hard to configure while keeping a reasonable
    latency in a GPP

cin_{i-1} r3_i r2_i r1_i r5_i
0 0 0 0 0
0 0 0 1 1
0 0 1 0 1
0 0 1 1 0
0 1 0 0 1
0 1 0 1 0
0 1 1 0 0
0 1 1 1 1
1 0 0 0 1
1 0 0 1 0
1 0 1 0 0
1 0 1 1 1
1 1 0 0 0
1 1 0 1 1
1 1 1 0 1
1 1 1 1 0
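The per-bit equations for this ADD/XOR subgraph can be turned into a small model of the ripple-carry scheme. The sketch below builds the 16-entry output table from the equation for r5 and evaluates it one bit at a time, feeding each carry into the next position. Function names and the table index order (carry as the most significant index bit, following the column order of the truth table) are assumptions for this example.

```python
MASK32 = 0xFFFFFFFF

def out_bit(cin, r3, r2, r1):
    """r5_i = r3_i xor (r1_i xor r2_i xor cin_{i-1})."""
    return r3 ^ (r1 ^ r2 ^ cin)

def carry_bit(cin, r2, r1):
    """cin_i = (r1_i . r2_i) or ((r1_i xor r2_i) . cin_{i-1})."""
    return (r1 & r2) | ((r1 ^ r2) & cin)

# 16-entry table indexed by the bits (cin, r3, r2, r1), cin most significant.
OUT_LUT = [out_bit((n >> 3) & 1, (n >> 2) & 1, (n >> 1) & 1, n & 1)
           for n in range(16)]

def ripple_eval(r1, r2, r3, width=32):
    """Evaluate ADD r4,r1,r2 ; XOR r5,r3,r4 with one LUT lookup per bit."""
    carry, r5 = 0, 0
    for i in range(width):
        a, b, c = (r1 >> i) & 1, (r2 >> i) & 1, (r3 >> i) & 1
        r5 |= OUT_LUT[(carry << 3) | (c << 2) | (b << 1) | a] << i
        # Bit i+1 cannot start until this carry is known.
        carry = carry_bit(carry, b, a)
    return r5

r1, r2, r3 = 0x12345678, 0x0FEDCBA9, 0xA5A5A5A5   # arbitrary sample values
assert ripple_eval(r1, r2, r3) == r3 ^ ((r1 + r2) & MASK32)

for n, bit in enumerate(OUT_LUT):
    print(f"{n:04b} -> {bit}")
```

The loop-carried dependence on `carry` is the software counterpart of the serial carry chain, which is the latency problem the bullets on this slide point to.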