Title: Spatial Computation: Computing without General-Purpose Processors
1. Spatial Computation: Computing without General-Purpose Processors
- Mihai Budiu
- mihaib_at_cs.cmu.edu
- Carnegie Mellon University
- July 8, 2004
2. Spatial Computation
- A computation model based on
- application-specific hardware
- no interpretation
- minimal resource sharing
3. The Engine Behind This Talk
    main()
    {
        signal(SIGINT, welcome);
        while (slides() && time())
            talk();
    }
4. Research Scope
Object: future architectures
Tool: compilers
Evaluation: simulators
5. Research Methodology
[Figure: constraint space with axes X (e.g., power) and Y (e.g., cost), marking the state of the art and reasonable limits]
6. Outline
- Introduction: problems of current architectures
- Compiling Application-Specific Hardware
- Pipelining
- ASH Evaluation
- Conclusions
7. Resources
[Figure: performance plot; labels: Intel, 1000]
- We do not worry about not having hardware resources
- We worry about being able to use hardware resources
8. Design Complexity
[Figure: chip size (transistors, axis from 10^4 to 10^10) and designer productivity plotted over the years 1981 to 2009]
9. Communication vs. Computation
[Figure: delay of a wire versus a gate; values shown: 5ps, 20ps]
Power consumption on wires is also dominant
10. Power Consumption
Toasted CPU: about 2 sec after removing cooler. (Tom's Hardware Guide)
11. Energy Efficiency
Pentium 4
12. Clock Speed
3GHz 6GHz 10GHz
Cannot rely on global signals
(clock is a global signal)
13. Instruction-Set Architecture
Software
ISA
Hardware
14. Our Proposal
- ASH addresses these problems
- ASH is not a panacea
- ASH is complementary to the CPU
15. Outline
- Problems of current architectures
- CASH: Compiling ASH
- program representation
- compiling C programs
- Pipelining
- ASH Evaluation
- Conclusions
16. Application-Specific Hardware
C program
Compiler
Dataflow IR
Reconfigurable/custom hw
17. Application-Specific Hardware
Soft
C program
Compiler
Dataflow IR
SW backend
Machine code
CPU predication
18. Key Intermediate Representation
Traditionally: CFG, def-use, may-dep., ...
Our IR:
- SSA + predication + speculation
- Uniform for scalars and memory
- Explicitly encodes may-depend
- Executable
- Precise semantics
- Dataflow IR
- Close to asynchronous target
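19. Computation = Dataflow
Programs:
    x = a & 7;
    ...
    y = x >> 2;
Circuits:
[Figure: dataflow circuit in which a and 7 feed the operation producing x, and x and 2 feed the >> operation]
- Operations => functional units
- Variables => wires
- No interpretation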
20. Basic Computation
[Figure labels: latch, data, valid, ack]
21. Asynchronous Computation
[Figure labels: data, valid, ack, 1]
22. Distributed Control Logic
[Figure labels: rdy, ack]
- short, local wires
- asynchronous control
23. Outline
- Problems of current architectures
- CASH: Compiling ASH
- program representation
- compiling C programs
- Pipelining
- ASH Evaluation
- Conclusions
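24. MUX: Forward Branches
    if (x > 0) y = -x; else y = b*x;
[Figure: dataflow circuit for the conditional, with inputs x, b, and 0, operators including -, >, and !, and a multiplexer producing y; the critical path is highlighted]
Conditionals => Speculation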
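The conditional on this slide, if (x > 0) y = -x; else y = b*x;, can be sketched in C as a branch-free selection: both sides are computed speculatively and a multiplexer picks one of them using the predicate. This is only an illustration, not CASH output. The names mux2 and select_y are hypothetical, and giving the multiplexer one predicate per data input is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: a multiplexer with one predicate per data input.
   Exactly one of the two predicates is expected to be true. */
static int mux2(bool p_then, int v_then, bool p_else, int v_else)
{
    assert(p_then != p_else);   /* predicates are mutually exclusive */
    return p_then ? v_then : v_else;
}

/* Branch-free form of: if (x > 0) y = -x; else y = b * x; */
int select_y(int x, int b)
{
    bool p      = x > 0;    /* predicate of the "then" side */
    int  y_then = -x;       /* computed speculatively */
    int  y_else = b * x;    /* computed speculatively */
    return mux2(p, y_then, !p, y_else);
}
```

Because neither side has side effects, computing both is safe; only the selection depends on the predicate.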
25. Control Flow => Data Flow
[Figure: a Merge (label) node joining data inputs, and a Gateway node passing data under control of a predicate]
26. Loops
    int sum = 0, i;
    for (i = 0; i < 100; i++)
        sum += i*i;
    return sum;
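A rough C sketch of how this loop might look once control flow has been turned into data flow: loop-carried values re-enter at the loop head (the merge), and the loop predicate steers each value either around the back edge or to the exit (the gateway). The token type, the gateway helper, and the parameter n (the slide uses the constant 100) are hypothetical names introduced for illustration; this is not the compiler's actual representation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a value that may or may not be forwarded. */
typedef struct { bool present; int value; } token;

/* A gateway forwards its data only when the predicate holds. */
static token gateway(bool pred, int value)
{
    token t = { pred, pred ? value : 0 };
    return t;
}

int sum_of_squares(int n)
{
    assert(n >= 0);
    int i = 0, sum = 0;                      /* initial values enter the merges */
    for (;;) {
        bool p = i < n;                      /* loop predicate */
        token exit_sum = gateway(!p, sum);   /* steered to the loop exit */
        if (exit_sum.present)
            return exit_sum.value;
        token next_sum = gateway(p, sum + i * i);  /* back-edge values */
        token next_i   = gateway(p, i + 1);
        sum = next_sum.value;                /* merge takes the back-edge value */
        i   = next_i.value;
    }
}
```

Calling sum_of_squares(100) computes the same value as the loop on the slide.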
27. Predication and Side-Effects
[Figure: a Load operation with inputs addr, pred, and token, a connection to memory, and outputs data and token]
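One way to read the Load figure is sketched below in C. The predicate decides whether memory is touched at all, and the token orders the load with respect to other side effects. The type load_result, the function predicated_load, and the choice to forward the token even when the predicate is false are assumptions made for this sketch rather than details given on the slide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  data;       /* loaded value; meaningful only if executed is true */
    bool executed;   /* true if memory was actually accessed */
    bool token_out;  /* ordering token handed to dependent operations */
} load_result;

load_result predicated_load(const int *mem, size_t size, size_t addr,
                            bool pred, bool token_in)
{
    load_result r = { 0, false, false };
    if (!token_in)            /* earlier side effects have not completed */
        return r;
    if (pred) {               /* only a true predicate touches memory */
        assert(addr < size);
        r.data = mem[addr];
        r.executed = true;
    }
    r.token_out = true;       /* assumed: the token passes on either way */
    return r;
}
```

With a false predicate the address is never dereferenced, so a speculated load cannot fault.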
28. Memory Access
[Figure: LD and ST operations (local communication) connected through a pipelined arbitrated network to a monolithic memory (global structures)]
Future work: fragment this!
29. CASH Optimizations
- SSA-based optimizations: unreachable/dead code, gcse, strength reduction, loop-invariant code motion, software pipelining, reassociation, algebraic simplifications, induction variable optimizations, loop unrolling, inlining
- Memory optimizations: dependence and alias analysis, register promotion, redundant load/store elimination, memory access pipelining, loop decoupling
- Boolean optimizations: Espresso CAD tool, bitwidth analysis
30. Outline
- Problems of current architectures
- Compiling ASH
- Pipelining
- Evaluation: CASH vs. clocked designs
- Conclusions
31. Pipelining (slides 31-38, one animation)
    int sum = 0, i;
    for (i = 0; i < 100; i++)
        sum += i*i;
    return sum;
[Figure: animation in steps 1-7 of values flowing through the loop's dataflow circuit; labels: i, 1, 100, <, sum, pipelined multiplier (8 stages); the labels i=0 and i=1 appear in steps 5 and 6; the final frame marks the critical path]
The predicate ack edge is on the critical path.
39. Pipeline balancing (slides 39-40)
[Figure: the same circuit at step 7 with a decoupling FIFO added; labels: i, 1, 100, <, sum, i's loop, sum's loop, critical path]
41. Outline
- Problems of current architectures
- Compiling ASH
- Pipelining
- Evaluation: CASH vs. clocked designs
- Conclusions
42. Evaluating ASH
Mediabench kernels (1 hot function/benchmark)
[Figure: tool flow with labels: C, CASH core, Verilog back-end, Synopsys and Cadence P/R, 180nm std. cell library (2V, 1999 technology), ASIC, Mem, ModelSim (Verilog simulation), performance numbers]
43. ASH Area
[Figure: normalized area; reference marks: minimal RISC core, P4: 217]
44. ASH vs 600MHz CPU (.18 μm)
45. Bottleneck: Memory Protocol
LD
Memory
ST
46. Power
Xeon cache: 67000
μP: 4000
DSP: 110
47. Energy Efficiency
[Figure: energy efficiency in operations/nJ, comparing dedicated hardware, ASH media kernels, asynchronous μP, FPGAs, general-purpose DSP, and microprocessors]
48. Outline
- Problems of current architectures
- Compiling ASH
- Pipelining
- ASH Evaluation
- Future/related work, conclusions
49. Related Work
Nanotechnology
Dataflow machines
Asynchronous circuits
High-level synthesis
Embedded systems
Reconfigurable computing
Computer architecture
Compilation
50. Future Work
- Optimizations for area/speed/power
- Memory partitioning
- Concurrency
- Compiler-guided layout
- Explore extensible ISAs
- Hybridization with superscalar mechanisms
- Reconfigurable hardware support for ASH
- Formal verification
51. Grand Vision: Certified Circuit Generation
- Translation validation: input ≡ output
- Preserve input properties
- e.g., C programs cannot deadlock
- e.g., type-safe programs cannot crash
- Debug, test, verify only at source-level
[Figure: compilation chain HLL -> IR -> IRopt -> Verilog -> gates -> layout, with a span marked "formally validated"]
How far can you go?
52. Conclusions
Spatial computation strengths

Feature                 Advantages
No interpretation       Energy efficiency, speed
Spatial layout          Short wires, no contention
Asynchronous            Low power, scalable
Distributed             No global signals
Automatic compilation   Design productivity, no ISA
53. Backup Slides
- Reconfigurable hardware
- Critical paths
- Control logic
- ASH vs ...
- ASH weaknesses
- Exceptions
- Normalized area
- Why C?
- Splitting memory
- More performance
- Recursive calls
54. Reconfigurable Hardware
55. Main RH Ingredient: RAM Cell
[Figure labels: data in, 0, control]
Switch controlled by a 1-bit RAM cell
56. Critical Paths
    if (x > 0) y = -x; else y = b*x;
[Figure: dataflow circuit for the conditional; labels: x, b, 0, -, >, !, y]
57. Lenient Operations
    if (x > 0) y = -x; else y = b*x;
[Figure: dataflow circuit for the conditional; labels: x, b, 0, !, y]
Solves the problem of unbalanced paths
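A lenient operation produces its output as soon as the inputs that determine it have arrived, without waiting for the rest. The C sketch below models this for a two-input multiplexer: if the predicate of the fast side (for example -x) is true and its data is available, the result is produced even though the slow side (for example b*x) has not finished. The wire type and the functions known, unknown, and lenient_mux are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a wire: a value that may not have arrived yet. */
typedef struct { bool known; int value; } wire;

static wire unknown(void) { wire w = { false, 0 }; return w; }
static wire known(int v)  { wire w = { true, v };  return w; }

/* Lenient 2-input multiplexer: the output is available as soon as one
   predicate is known to be true and its data input has arrived. */
wire lenient_mux(wire p_then, wire v_then, wire p_else, wire v_else)
{
    /* At most one predicate may be true. */
    assert(!(p_then.known && p_then.value && p_else.known && p_else.value));
    if (p_then.known && p_then.value && v_then.known)
        return known(v_then.value);
    if (p_else.known && p_else.value && v_else.known)
        return known(v_else.value);
    return unknown();   /* not enough information yet */
}
```

A strict multiplexer would instead wait for all four inputs, so its latency would always be that of the slowest path.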
58. Asynchronous Control
[Figure: control circuit; labels: C, D, Reg, rdyin, rdyout, ackin, ackout, datain, dataout]
59. HLL to HW
- High-level synthesis: behavioral HDL -> synchronous hardware
- Reconfigurable computing: C subsets -> hardware configuration (spatial computation)
- Asynchronous circuits: concurrent language -> asynchronous hardware
[Figure legend: prior work, this research]
60. CASH vs High-Level Synthesis
- CASH is the only existing tool to translate complete ANSI C to hardware
- CASH generates asynchronous circuits
- CASH does not treat C as an HDL
- no annotations required
- no reactivity model
- does not handle non-C, e.g., concurrency
61. ASH Weaknesses
- Low efficiency for low-ILP code
- Does not adapt at runtime
- Monolithic memory
- Resource waste
- Not flexible
- No support for exceptions
62. ASH Weaknesses (2)
- Both branch and join not free
- Static dataflow (no re-issue of same instr)
- Memory is far
- Fully static
- No branch prediction
- No dynamic unrolling
- No register renaming
- Calls/returns not lenient
63. Branch Prediction
    for (i = 0; i < N; i++) {
        ...
        if (exception) break;
    }
[Figure: loop circuit; labels: i, 1, <, exception, !]
64. Exceptions
- Strictly speaking, C has no exceptions
- In practice it is hard to accommodate exceptions in hardware implementations
- An advantage of software flexibility: the PC is a single point of execution control
[Figure: CPU (low-ILP computation, OS, VM, exceptions) and ASH (high-ILP computation), both connected to Memory]
65. Why C
- Huge installed base
- Embedded specifications written in C
- Small and simple language
- Can leverage existing tools
- Simpler compiler
- Techniques generally applicable
- Not a toy language
66. Performance
67. Parallelism Profile
68. Normalized Area
69. Memory Partitioning
- MIT RAW project: Babb FCCM '99, Barua HiPC '00, Lee ASPLOS '00
- Stanford SpC: Semeria DAC '01, TVLSI '02
- Berkeley CCured: Necula POPL '02
- Illinois FlexRAM: Fraguela PPoPP '03
- Hand annotations: #pragma
70. Memory Complexity
[Figure labels: RAM, LSQ, addr, data]
71. Recursion
save live values
recursive call
restore live values
stack
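The save/call/restore sequence can be sketched in C with an explicit stack for the value that is live across the recursive call. This is a minimal illustration under assumptions, not the circuit CASH generates: the helpers save and restore, the fixed STACK_DEPTH, and the choice of factorial as the example are all hypothetical.

```c
#include <assert.h>

#define STACK_DEPTH 64              /* assumed capacity for the sketch */

static unsigned stack[STACK_DEPTH]; /* explicit stack for live values */
static int sp = 0;

static void save(unsigned v)
{
    assert(sp < STACK_DEPTH);
    stack[sp++] = v;
}

static unsigned restore(void)
{
    assert(sp > 0);
    return stack[--sp];
}

unsigned long factorial(unsigned n)
{
    if (n <= 1)
        return 1;
    save(n);                             /* save live values */
    unsigned long r = factorial(n - 1);  /* recursive call */
    n = restore();                       /* restore live values */
    return n * r;
}
```

Only n needs saving here because it is the only value used after the call returns.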
72. Me?