Title: Spatial Computation
1. Spatial Computation
Mihai Budiu CMU CS
- Thesis committee
- Seth Goldstein
- Peter Lee
- Todd Mowry
- Babak Falsafi
- Nevin Heintze
- Ph.D. Thesis defense, December 8, 2003
SCS
2. Spatial Computation
A model of general-purpose computation based on Application-Specific Hardware.
3. Thesis Statement
- Application-Specific Hardware (ASH)
- can be synthesized by adapting software compilation for predicated architectures,
- provides high performance for programs with high ILP, with very low power consumption,
- is a more scalable and efficient computation substrate than monolithic processors.
not!
4. Outline
- Introduction
- Compiling for ASH
- Media processing on ASH
- ASH vs. superscalar processors
- Conclusions
5. CPU Problems
- Complexity
- Power
- Global Signals
- Limited ILP
6. Design Complexity
[Figure: from Michael Flynn's FCRC 2003 talk]
7. Communication vs. Computation
[Figure: wire delay compared with gate delay; labels: wire, gate, 5ps, 20ps]
Power consumption on wires is also dominant
8. Our Approach: ASH (Application-Specific Hardware)
9. Resource Binding Time
[Figure: binding of programs to resources in two numbered steps (1., 2.), shown for CPU and for ASH]
10. Hardware Interface
[Figure: software/hardware layer stacks for CPU and ASH; labels: software, ISA, virtual ISA, gates, hardware]
11. Application-Specific Hardware
[Figure: a compiler translates a C program into a dataflow IR, which is implemented as a dataflow machine on reconfigurable/custom hw]
12. Contributions
- Computer architecture
- Embedded systems
- Reconfigurable computing
- Compilation
- Asynchronous circuits
- High-level synthesis
- Nanotechnology
- Dataflow machines
13. Outline
- Introduction
- CASH: Compiling for ASH
- Media processing on ASH
- ASH vs. superscalar processors
- Conclusions
14. Computation = Dataflow
Programs:
x = a & 7;
...
y = x >> 2;
Circuits:
[Figure: dataflow circuit; a and 7 feed an & node that produces x; x and 2 feed a >> node]
- Operations ⇒ functional units
- Variables ⇒ wires
- No interpretation
15. Basic Operation
[Figure: an operation with an output latch; signals: data, valid, ack]
16. Asynchronous Computation
[Figure: handshake between operations; signals: data, valid, ack]
17. Distributed Control Logic
[Figure: per-operation control logic; signals: rdy, ack]
- short, local wires
- asynchronous control
18. Forward Branches
if (x > 0)
    y = -x;
else
    y = b*x;
[Figure: dataflow graph with inputs x, b, 0 and nodes -, >, ! producing y; the critical path is marked]
Conditionals ⇒ Speculation
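One way to read the slogan above in software terms: both arms of the forward branch are computed unconditionally, and a multiplexor picks the result once the predicate is known. The C sketch below only illustrates that reading. The function names `branchy` and `speculated` are made up for this example and are not CASH output.

```c
/* Original control flow from the slide. */
int branchy(int x, int b)
{
    int y;
    if (x > 0)
        y = -x;
    else
        y = b * x;
    return y;
}

/* Speculated form: both arms are evaluated, then a mux selects one.
 * This is safe here because neither arm has side effects. */
int speculated(int x, int b)
{
    int p = x > 0;      /* predicate (the ">" node) */
    int t = -x;         /* "then" arm, computed speculatively */
    int f = b * x;      /* "else" arm, computed speculatively */
    return p ? t : f;   /* multiplexor producing y */
}
```

For small inputs the two functions agree, for example `speculated(3, 5)` and `branchy(3, 5)` both give -3. The difference is that the speculated form has no control dependence between the comparison and the arithmetic.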
19. Control Flow ⇒ Data Flow
[Figure: a Merge (label) node joining data inputs, and a Gateway node passing data under a predicate]
20. Loops
int sum = 0, i;
for (i = 0; i < 100; i++)
    sum += i*i;
return sum;
21. Predication and Side-Effects
[Figure: a Load node with inputs addr, pred, and token; the request goes to memory; outputs are data and token]
22. Thesis Statement
- Application-Specific Hardware
- can be synthesized by adapting software compilation for predicated architectures,
- provides high performance for programs with high ILP, with very low power consumption,
- is a more scalable and efficient computation substrate than monolithic processors.
not!
23. Outline
- Introduction
- CASH: Compiling for ASH
- An optimization on the SIDE
- Media processing on ASH
- ASH vs. superscalar processors
- Conclusions
24. Availability Dataflow Analysis
y = a*b;
...
if (x) {
    ...
    ... = a*b;
}
25. Dataflow Analysis Is Conservative
if (x) {
    ...
    y = a*b;
}
...
... = a*b;
y?
26. Static Instantiation, Dynamic Evaluation
flag = false;
if (x) {
    ...
    y = a*b;
    flag = true;
}
...
... = flag ? y : a*b;
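The flag-based rewrite above can be tried out as ordinary C. The sketch below is an illustration only: `mul`, `mul_count`, and `side_example` are hypothetical names introduced here to make the saved recomputation observable, and they are not part of the CASH compiler.

```c
int mul_count = 0;   /* counts evaluations of a*b (instrumentation for this example) */

static int mul(int a, int b)
{
    mul_count++;
    return a * b;
}

/* Returns the value of the second use of a*b. */
int side_example(int x, int a, int b)
{
    int y = 0;
    int flag = 0;                 /* flag = false */
    if (x) {
        y = mul(a, b);            /* y = a*b on the taken path */
        flag = 1;                 /* flag = true: y now holds a*b */
    }
    return flag ? y : mul(a, b);  /* reuse y if available, else recompute */
}
```

On either path a*b is evaluated exactly once. The static analysis alone could not prove y available at the second use, but the run-time flag decides it exactly.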
27. SIDE Register Promotion Impact
[Chart: reduction in loads and stores]
28. Outline
- Introduction
- CASH: Compiling for ASH
- Media processing on ASH
- ASH vs. superscalar processors
- Conclusions
29. Performance Evaluation
[Figure: evaluation setup with ASH and a 4-way OOO CPU; memory system labels: LSQ (limited BW), L1 8K, L2 1/4M, Mem]
Assumption: all operations have the same latency.
30. Media Kernels, vs. 4-way OOO
31. Media Kernels, IPC
32. Speed-up / IPC Correlation
33. Low-Level Evaluation
[Figure: tool flow: C → CASH core → Verilog back-end → Synopsys, Cadence P/R → ASIC]
Results shown so far. All results in thesis.
180nm std. cell library, 2V
1999 technology
Results in the next two slides.
34. Area
Reference: P4 in 180nm has 217 mm²
35. Power
vs. 4-way OOO superscalar, 600 MHz, with clock gating (Wattch), 6W
36. Thesis Statement
- Application-Specific Hardware
- can be synthesized by adapting software compilation for predicated architectures,
- provides high performance for programs with high ILP, with very low power consumption,
- is a more scalable and efficient computation substrate than monolithic processors.
not!
37. Outline
- Introduction
- CASH: Compiling for ASH
- Media processing on ASH
- dataflow pipelining
- ASH vs. superscalar processors
- Conclusions
38. Pipelining
int sum = 0, i;
for (i = 0; i < 100; i++)
    sum += i*i;
return sum;
[Figure: dataflow graph of the loop; labels: i, 1, 100, <, sum, pipelined multiplier (8 stages); cycle 1]
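39. Pipelining
[Figure: same loop dataflow graph, cycle 2]
40. Pipelining
[Figure: same loop dataflow graph, cycle 3]
41. Pipelining
[Figure: same loop dataflow graph, cycle 4]
42. Pipelining
[Figure: same loop dataflow graph, cycle 5; labels: i=1, i=0, pipeline balancing]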
43. Outline
- Introduction
- CASH: Compiling for ASH
- Media processing on ASH
- ASH vs. superscalar processors
- Conclusions
44. This Is Obvious!
wrong!
- ASH runs at full dataflow speed, so CPU cannot do any better (if compilers are equally good).
45. SpecInt95, ASH vs. 4-way OOO
46. Branch Prediction
for (i = 0; i < N; i++) {
    ...
    if (exception) break;
}
[Figure: loop dataflow graph; labels: i, 1, <, exception, !]
47. SpecInt95, perfect prediction
48. ASH Problems
- Both branch and join not free
- Static dataflow (no re-issue of same instr)
- Memory is far
- Fully static
- No branch prediction
- No dynamic unrolling
- No register renaming
- Calls/returns not lenient
- ...
49. Thesis Statement
- Application-Specific Hardware
- can be synthesized by adapting software compilation for predicated architectures,
- provides high performance for programs with high ILP, with very low power consumption,
- is a more scalable and efficient computation substrate than monolithic processors.
not!
50. Outline
- Introduction
- CASH: Compiling for ASH
- Media processing on ASH
- ASH vs. superscalar processors
- Conclusions
51. Strengths
- low power
- simple verification?
- specialized to app.
- unlimited ILP
- simple hardware
- no fixed window
- economies of scale
- highly optimized
- branch prediction
- control speculation
- full-dataflow
- global signals/decision
52. Conclusions
- Compiling around the ISA is a fruitful research approach.
- Distributed computation structures require more synchronization overhead.
- Spatial Computation efficiently implements high-ILP computation with very low power.
53. Backup Slides
- Control logic
- Pipeline balancing
- Lenient execution
- Dynamic Critical Path
- Memory PRE
- Critical path analysis
- CPU + ASH
54. Control Logic
[Figure: control circuit for one operation; labels: rdyin, ackin, rdyout, ackout, datain, dataout, C, D, Reg]
55. Last-Arrival Events
- Event enabling the generation of a result
- May be an ack
- Critical path = collection of last-arrival edges
[Figure: operation with data, valid, and ack signals]
56. Dynamic Critical Path
- Some edges may repeat
- Trace back along last-arrival edges
- Start from last node
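The trace-back step can be sketched in a few lines of C. This is a minimal illustration under an assumed trace format, not the tool used in the thesis: each dynamic event is taken to have an integer id, and `last_arrival[e]` is assumed to record the event whose arrival enabled event `e`. The names `NO_EVENT` and `trace_critical_path` are hypothetical.

```c
#define NO_EVENT (-1)

/* last_arrival[e] is the event whose arrival enabled event e,
 * or NO_EVENT if e has no predecessor. Writes the critical path
 * (last event first) into path[] and returns its length, which
 * never exceeds max_len. */
int trace_critical_path(const int *last_arrival, int last_event,
                        int *path, int max_len)
{
    int len = 0;
    int e = last_event;
    while (e != NO_EVENT && len < max_len) {
        path[len++] = e;        /* record the event on the path */
        e = last_arrival[e];    /* follow its last-arrival edge backwards */
    }
    return len;
}
```

Because the events are dynamic, the same static edge of the circuit can show up many times along the path, which is how repeated edges arise.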
57. Critical Paths
if (x > 0)
    y = -x;
else
    y = b*x;
[Figure: dataflow graph with inputs x, b, 0 and nodes -, >, ! producing y]
58. Lenient Operations
if (x > 0)
    y = -x;
else
    y = b*x;
[Figure: dataflow graph with inputs x, b, 0 and node ! producing y]
Solve the problem of unbalanced paths
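A small timing model makes the benefit concrete. The sketch below assumes a lenient multiplexor may fire as soon as the predicate and the selected input have arrived, while a strict one waits for every input. The function names and the arrival times used in the example are invented for illustration.

```c
static int max2(int a, int b)
{
    return a > b ? a : b;
}

/* Strict mux: output is ready only after the predicate and both
 * data inputs have arrived. */
int strict_mux_ready(int t_pred, int t_then, int t_else)
{
    return max2(t_pred, max2(t_then, t_else));
}

/* Lenient mux: output is ready once the predicate and the input it
 * selects have arrived; the unselected input is not waited for. */
int lenient_mux_ready(int pred, int t_pred, int t_then, int t_else)
{
    return max2(t_pred, pred ? t_then : t_else);
}
```

With x > 0, suppose the predicate and -x arrive at time 1 and the slow b*x at time 8. The strict mux produces y at time 8, the lenient one at time 1, so the long multiply path no longer delays the short negation path.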
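59. Pipelining
[Figure: loop dataflow graph; labels: i, 1, 100, <, sum, i=1, i=0; cycle 6]
60. Pipelining
[Figure: same loop dataflow graph, cycle 7]
61. Pipelining
[Figure: same loop dataflow graph with the critical path marked]
Predicate ack edge is on the critical path.
62. Pipeline balancing
[Figure: same loop dataflow graph with a decoupling FIFO; cycle 7]
63. Pipeline balancing
[Figure: same loop dataflow graph with a decoupling FIFO; labels: critical path, i's loop, sum's loop]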
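64. Register Promotion
[Figure: a store *p=... under predicate (p1) followed by a load ...=*p under predicate (p2); after the transformation the load's predicate is (p2 ∧ ¬p1)]
Load is executed only if store is not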
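In C terms, the rule above amounts to guarding the load with `p2 && !p1` and forwarding the stored value when the store did execute. The sketch below is an illustration under that reading. `load`, `load_count`, and `store_then_load` are hypothetical names, and the counter exists only to show when memory is actually read.

```c
int load_count = 0;   /* counts loads actually issued (instrumentation only) */

static int load(const int *p)
{
    load_count++;
    return *p;
}

/* Models: (p1) *p = v;  (p2) result = *p;
 * with the load guarded by p2 && !p1. */
int store_then_load(int *p, int v, int p1, int p2)
{
    int result = 0;         /* value if the load is not requested (p2 false) */
    if (p1)
        *p = v;             /* predicated store */
    if (p2 && !p1)
        result = load(p);   /* load runs only if the store did not */
    else if (p2)
        result = v;         /* otherwise forward the stored value */
    return result;
}
```

When p1 and p2 are both true, the stored value is returned without touching memory. When only p2 is true, the load is issued as before.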
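65. Register Promotion (2)
[Figure: a store *p=... under (p1) and a load ...=*p under (p2); after the transformation the store keeps (p1) and the load's predicate is (false)]
- When p2 ⇒ p1 the load becomes dead...
- ...i.e., when store dominates load in CFG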
66. ≈ PRE
[Figure: two loads ...=*p under predicates (p1) and (p2) are replaced by a single load ...=*p under (p1 ∨ p2)]
This corresponds in the CFG to lifting the load to a basic block dominating the original loads
67. Store-store (1)
[Figure: two stores *p=... under predicates (p1) and (p2); after the transformation the first store's predicate is (p1 ∧ ¬p2)]
- When p1 ⇒ p2 the first store becomes dead...
- ...i.e., when second store post-dominates first in CFG
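The store-store rule can be sketched the same way: the first store is guarded by `p1 && !p2`, so it is skipped whenever the second store is going to overwrite it. This is an illustration only. `store`, `store_count`, and `store_store` are hypothetical names, and the sketch assumes nothing reads `*p` between the two stores.

```c
int store_count = 0;   /* counts stores actually issued (instrumentation only) */

static void store(int *p, int v)
{
    store_count++;
    *p = v;
}

/* Models: (p1) *p = v1;  (p2) *p = v2;
 * with the first store guarded by p1 && !p2. */
void store_store(int *p, int v1, int v2, int p1, int p2)
{
    if (p1 && !p2)
        store(p, v1);   /* skipped when the second store overwrites it */
    if (p2)
        store(p, v2);
}
```

The final contents of `*p` are the same as with two unconditional predicated stores, but at most one store reaches memory.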
68. Store-store (2)
[Figure: the same two stores *p=..., with predicates (p1 ∧ ¬p2) and (p2), and the token edge between them removed]
- Token edge eliminated, but...
- ...transitive closure of tokens preserved
69. A Code Fragment
for (i = 0; i < 64; i++) {
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
    Y[i] = X[j].q;
}
SpecINT95: 124.m88ksim: init_processor, stylized
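To experiment with the fragment, it can be wrapped into a compilable function. The sketch below is an assumption-laden stand-in: `struct entry` with two int fields and the name `init_table` are hypothetical, since the slide only shows the fields `r` and `q`. The MIPS code on a later slide steps through X by 20 bytes, so the real structure is larger than this one.

```c
struct entry {
    int r;   /* key field; 0xF marks the end of the table */
    int q;   /* payload copied into Y */
};

/* X must contain an element with r == 0xF (the sentinel);
 * Y must have room for 64 ints. */
void init_table(const struct entry *X, int *Y)
{
    int i, j;
    for (i = 0; i < 64; i++) {
        for (j = 0; X[j].r != 0xF; j++)
            if (X[j].r == i)
                break;
        Y[i] = X[j].q;   /* on a miss, j indexes the 0xF sentinel */
    }
}
```

The inner loop has two exits, the sentinel test and the `break`, and both depend on the value just loaded from `X[j].r`.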
70. Dynamic Critical Path
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
[Figure: dataflow graph of the inner loop; labels: definition, sizeof(X[j]), load predicate, loop predicate]
71. MIPS gcc Code
LOOP:
L1: beq   v0,a1,EXIT    # X[j].r == i
L2: addiu v1,v1,20      # X[j+1].r
L3: lw    v0,0(v1)      # X[j+1].r
L4: addiu a0,a0,1       # j++
L5: bne   v0,a3,LOOP    # X[j+1].r == 0xF
EXIT:
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
L1 → L2 → L3 → L5 → L1: 4-instruction loop-carried dependence
72. If Branch Prediction Correct
LOOP:
L1: beq   v0,a1,EXIT    # X[j].r == i
L2: addiu v1,v1,20      # X[j+1].r
L3: lw    v0,0(v1)      # X[j+1].r
L4: addiu a0,a0,1       # j++
L5: bne   v0,a3,LOOP    # X[j+1].r == 0xF
EXIT:
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
L1 → L2 → L3 → L5 → L1
Superscalar is issue-limited! 2 cycles/iteration sustained
73. Critical Path with Prediction
Loads are not speculative
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
74. Prediction + Load Speculation
[Figure: dataflow graph of the inner loop; the critical path includes an ack edge]
4 cycles! Load not pipelined (self-anti-dependence)
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
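75. OOO Pipe Snapshot
LOOP:
L1: beq   v0,a1,EXIT    # X[j].r == i
L2: addiu v1,v1,20      # X[j+1].r
L3: lw    v0,0(v1)      # X[j+1].r
L4: addiu a0,a0,1       # j++
L5: bne   v0,a3,LOOP    # X[j+1].r == 0xF
EXIT:
[Figure: pipeline snapshot across stages IF, DA, EX, WB, CT; instruction labels as extracted: L5 L1 L2 / L1 L2 L3 L4 / L1 L3 / L5 L3 L2 / L1 L3 L3]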
76. Unrolling?
for (i = 0; i < 64; i++) {
    for (j = 0; X[j].r != 0xF; j += 2) {
        if (X[j].r == i)
            break;
        if (X[j+1].r == 0xF)
            break;
        if (X[j+1].r == i)
            break;
    }
    Y[i] = X[j].q;
}
when 1 iteration
77. Ideal Architecture
[Figure: a CPU (low-ILP computation + OS + VM) and ASH (high-ILP computation), both attached to Memory]