Mihai Budiu - PowerPoint PPT Presentation

About This Presentation

Title:

Mihai Budiu

Slides: 45

Provided by: MIh73

Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

1
Spatial Computation: Computing without General-Purpose Processors

Mihai Budiu
Microsoft Research Silicon Valley
Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein
Carnegie Mellon University

2
Outline

Intro: Problems of current architectures
Compiling Application-Specific Hardware
ASH Evaluation
Conclusions

3
Resources
[Figure: performance chart; surviving labels: 1000, Performance, Intel]

We do not worry about not having hardware resources
We worry about being able to use hardware resources

4
Complexity
Cannot rely on global signals (clock is a global signal)
5
Complexity
Automatic translation: C → HW
Simple, short, unidirectional interconnect
Simple hw, mostly idle
gate: 5ps, wire: 20ps
No interpretation
Distributed control, asynchronous
Cannot rely on global signals (clock is a global signal)
6
Our Proposal: Application-Specific Hardware

ASH addresses these problems
ASH is not a panacea
ASH complementary to CPU

7
Paper Content

Automatic translation of C to hardware dataflow machines
High-level comparison of dataflow and superscalar
Circuit-level evaluation: power, performance, area

8
Outline

Problems of current architectures
CASH: Compiling Application-Specific Hardware
ASH Evaluation
Conclusions

9
Application-Specific Hardware
C program
Compiler
Dataflow IR
Reconfigurable/custom hw
10
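Computation = Dataflow
Program: x = a & 7; ... y = x >> 2;
[Figure: the program's dataflow IR (nodes & and >> with inputs a, 7, 2 and intermediate value x) next to the corresponding circuit]

Program      IR              Circuits
Operations   Nodes           Pipeline stages
Variables    Def-use edges   Channels (wires)

No interpretation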
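
The mapping from operations to nodes and from variables to def-use edges can be sketched in C. This is a minimal illustration, not the CASH intermediate representation: the `Node` type, the `OpKind` names, and `example_y` are hypothetical, and the program is assumed to read `x = a & 7; y = x >> 2;` (the first operator is lost in the transcript, so the `&` is an assumption).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node kinds for a tiny dataflow IR (illustrative names). */
typedef enum { OP_INPUT, OP_CONST, OP_AND, OP_SHR } OpKind;

typedef struct Node {
    OpKind kind;
    uint32_t value;          /* payload for OP_INPUT and OP_CONST */
    const struct Node *lhs;  /* def-use edge to the first operand */
    const struct Node *rhs;  /* def-use edge to the second operand */
} Node;

/* Evaluate a node by pulling values along its def-use edges. */
uint32_t eval(const Node *n) {
    assert(n != NULL);
    switch (n->kind) {
    case OP_INPUT:
    case OP_CONST: return n->value;
    case OP_AND:   return eval(n->lhs) & eval(n->rhs);
    case OP_SHR:   return eval(n->lhs) >> eval(n->rhs);
    }
    assert(0 && "unknown node kind");
    return 0;
}

/* Build and evaluate the graph for: x = a & 7; y = x >> 2; */
uint32_t example_y(uint32_t a_value) {
    Node a     = { OP_INPUT, a_value, NULL, NULL };
    Node seven = { OP_CONST, 7, NULL, NULL };
    Node two   = { OP_CONST, 2, NULL, NULL };
    Node x     = { OP_AND, 0, &a, &seven };  /* operation -> node */
    Node y     = { OP_SHR, 0, &x, &two };    /* variable x -> edge */
    return eval(&y);
}
```

In hardware each node would be its own pipeline stage and each pointer a wire, so nothing plays the role of an interpreter loop; the recursive `eval` here only stands in for values flowing along the edges.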
11
Basic Computation = Pipeline Stage

[Figure: pipeline stage with an output latch; signals: data, valid, ack]
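
A software model can make the data/valid/ack handshake concrete. The sketch below is an assumption-laden simplification, not the circuit from the talk: `Stage`, `stage_put`, and `stage_take` are hypothetical names, and the ack is modeled as simply clearing the `valid` flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One pipeline stage: an output latch plus a flag saying it is full. */
typedef struct {
    uint32_t data;  /* latched output value */
    bool valid;     /* true while the latch holds an unconsumed value */
} Stage;

/* Producer side: latch a new value, or stall if the old one is not acked. */
bool stage_put(Stage *s, uint32_t v) {
    assert(s != NULL);
    if (s->valid) return false;  /* consumer has not acked yet */
    s->data = v;
    s->valid = true;
    return true;
}

/* Consumer side: take the value; clearing valid models the ack. */
bool stage_take(Stage *s, uint32_t *out) {
    assert(s != NULL && out != NULL);
    if (!s->valid) return false;  /* nothing to consume */
    *out = s->data;
    s->valid = false;
    return true;
}
```

Each stage talks only to its direct producer and consumer, which is why no global clock or global stall signal appears anywhere in the model.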
12
Distributed Control Logic
[Figure: control logic built from ack and rdy signals over short, local wires]
13
MUX: Forward Branches
if (x > 0) y = -x; else y = b*x;
[Figure: dataflow graph with inputs x, b, 0; nodes -, *, >, !; a mux producing y]
Conditionals ⇒ Speculation
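
Turning a forward branch into a mux means both arms are computed unconditionally and the predicate only selects the result. A minimal C sketch, assuming the slide's statement reads `if (x > 0) y = -x; else y = b*x;` (the operators are reconstructed from a garbled transcript); the function names are hypothetical, and wrap-around arithmetic is used so that the speculated arm can never trap.

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-around helpers: like hardware adders and multipliers, they never
   trap. The unsigned-to-signed cast is implementation-defined in C
   (two's complement on common compilers). */
int32_t wrap_neg(int32_t v) { return (int32_t)(0u - (uint32_t)v); }
int32_t wrap_mul(int32_t a, int32_t b) {
    return (int32_t)((uint32_t)a * (uint32_t)b);
}

/* Control-flow version: only one arm executes. */
int32_t branchy(int32_t x, int32_t b) {
    int32_t y;
    if (x > 0) y = wrap_neg(x); else y = wrap_mul(b, x);
    return y;
}

/* Mux version: both arms are computed speculatively, then one is chosen. */
int32_t speculative(int32_t x, int32_t b) {
    int32_t neg  = wrap_neg(x);     /* computed even when x <= 0 */
    int32_t mul  = wrap_mul(b, x);  /* computed even when x > 0  */
    int     pred = x > 0;
    return pred ? neg : mul;        /* the mux */
}

/* Sanity check: both versions agree for a given input. */
void check_same(int32_t x, int32_t b) {
    assert(branchy(x, b) == speculative(x, b));
}
```

Speculation is only safe here because both arms are side-effect free; an arm containing a store or a faulting operation could not be executed unconditionally in this way.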
14
Memory Access
[Figure: LD and ST operations reach a monolithic memory through a pipelined arbitrated network; regions labeled "local communication" and "global structures"]
Future work: fragment this!
15
Outline

Problems of current architectures
Compiling ASH
ASH Evaluation
Conclusions

16
Evaluating ASH
Mediabench kernels (1 hot function/benchmark)
C → CASH core → Verilog back-end → commercial tools (Synopsys, Cadence P/R) → ASIC
180nm std. cell library, 2V
1999 technology
ModelSim (Verilog simulation) → performance numbers
[Figure labels: Mem, ASIC]
17
Compile Time
C
200 lines
CASH core
20 seconds
Verilog back-end
10 seconds
20 minutes
Synopsys, Cadence P/R
1 hour
[Figure labels: Mem, ASIC]
18
ASH Area
P4 217
minimal RISC core
19
ASH vs 600MHz CPU, 0.18 µm
20
Bottleneck: Memory Protocol
LD
Memory
ST
21
Power
Xeon cache 67000
µP 4000
DSP 110
22
Energy-delay vs. Wattch
23
Energy Efficiency
[Figure: energy efficiency in operations/nJ for dedicated hardware, ASH media kernels, asynchronous µP, FPGA, general-purpose DSP, and microprocessors]
24
Outline

Problems of current architectures
Compiling ASH
Evaluation
Related work, Conclusions

25
Related Work

Optimizing compilers
High-level synthesis
Reconfigurable computing
Dataflow machines
Asynchronous circuits
Spatial computation

We target an extreme point in the design space: no interpretation, fully distributed computation and control
26
ASH Design Point

Design an ASIC in a day
Fully automatic synthesis to layout
Fully distributed control and computation
(spatial computation)
Replicate computation to simplify wires
Energy/op rivals custom ASIC
Performance rivals superscalar
Energy-delay (E×t) 100 times better than any processor

27
Conclusions
Spatial computation strengths
Feature                 Advantages
No interpretation       Energy efficiency, speed
Spatial layout          Short wires, no contention
Asynchronous            Low power, scalable
Distributed             No global signals
Automatic compilation   Designer productivity
28
Backup Slides

Absolute performance
Control logic
Exceptions
Leniency
Normalized area
Loops
ASH weaknesses
Splitting memory
Recursive calls
Leakage
Why not compare to
Targeting FPGAs

29
Absolute Performance
30
Pipeline Stage
[Figure: pipeline stage with a data register (Reg, D) and a control element (C); signals: datain, dataout, rdyin, rdyout, ackin, ackout]
31
Exceptions

Strictly speaking, C has no exceptions
In practice it is hard to accommodate exceptions in hardware implementations
An advantage of software flexibility: the PC is a single point of execution control

[Figure labels: CPU (low-ILP computation, OS, VM, exceptions); ASH (high-ILP computation); Memory]
32
Critical Paths
if (x > 0) y = -x; else y = b*x;
[Figure: dataflow graph with inputs x, b, 0; nodes -, *, >, !; output y]
33
Lenient Operations
if (x > 0) y = -x; else y = b*x;
[Figure: dataflow graph with inputs x, b, 0; node !; output y]
Solves the problem of unbalanced paths
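
A lenient operation can produce its output before all of its inputs have arrived. The sketch below contrasts a strict mux with a lenient one; it is an illustrative model only. `Token`, `strict_mux`, and `lenient_mux` are hypothetical names, and "arrival" is modeled with a `ready` flag rather than real timing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A value on a wire: it may not have arrived yet. */
typedef struct {
    bool ready;
    int32_t value;
} Token;

/* Strict mux: waits for the predicate and for both data inputs. */
Token strict_mux(Token pred, Token t, Token f) {
    Token out = { false, 0 };
    if (pred.ready && t.ready && f.ready) {
        out.ready = true;
        out.value = pred.value ? t.value : f.value;
    }
    return out;
}

/* Lenient mux: fires once the predicate and the selected input are known,
   without waiting for the input that will be discarded. */
Token lenient_mux(Token pred, Token t, Token f) {
    Token out = { false, 0 };
    if (!pred.ready) return out;
    Token chosen = pred.value ? t : f;
    if (chosen.ready) {
        out.ready = true;
        out.value = chosen.value;
    }
    return out;
}

/* Sanity check: whenever the strict mux has fired, the lenient one agrees. */
void check_agree(Token p, Token t, Token f) {
    Token s = strict_mux(p, t, f);
    Token l = lenient_mux(p, t, f);
    if (s.ready) assert(l.ready && l.value == s.value);
}
```

With `x = 5`, the negation arm can arrive long before the slower multiply; the lenient mux then emits `-5` immediately, so the slow arm is no longer on the critical path.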
34
Normalized Area
35
Control Flow ⇒ Data Flow
[Figure: a Merge (label) node joining data inputs, and a Gateway node forwarding data under control of a predicate]
36
Loops

int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
37
ASH Weaknesses

Both branch and join not free
Static dataflow (no re-issue of same instr)
Memory is far
Fully static
No branch prediction
No dynamic unrolling
No register renaming
Calls/returns not lenient

38
Branch Prediction
for (i=0; i < N; i++) {
    ...
    if (exception) break;
}
[Figure: loop dataflow graph with nodes i, 1, +, <, exception, !]
39
Memory Partitioning

MIT RAW project: Babb FCCM 99, Barua HiPC 00, Lee ASPLOS 00
Stanford SpC: Semeria DAC 01, TVLSI 02
Illinois FlexRAM: Fraguela PPoPP 03
Hand annotations (pragma)
40
Recursion
[Figure: a recursive call bracketed by "save live values" and "restore live values", both using a stack]
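
The save/call/restore pattern can be illustrated in software by replacing a recursive function with an explicit stack. This is only an analogy for what the slide describes, not the code CASH generates; `fact_explicit` and `STACK_MAX` are hypothetical names, and factorial is chosen purely because it has a single live value to save.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_MAX 32

/* Factorial with the recursion made explicit:
   fact(n) = n * fact(n - 1), where n is live across the call. */
uint64_t fact_explicit(unsigned n) {
    unsigned stack[STACK_MAX];
    size_t sp = 0;
    uint64_t result = 1;

    assert(n <= 20);  /* 20! is the largest factorial that fits in 64 bits */

    /* "Call" phase: save the live value n before each recursive call. */
    while (n > 1) {
        stack[sp++] = n;
        n--;
    }

    /* "Return" phase: restore each saved n and finish its multiply. */
    while (sp > 0) {
        result *= stack[--sp];
    }
    return result;
}
```

Only values that are live across the call need to be saved, which keeps the stack traffic proportional to the state actually carried through the recursion.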
41
Leakage Power

$P_s = k \cdot \mathrm{Area} \cdot e^{-V_T}$
Employ circuit-level techniques
Cut power supply of idle circuit portions
most of the circuit is idle most of the time
strong locality of activity
High VT transistors on non-critical path

42
Why Not Compare To

In-order processor
  Worse in all metrics than superscalar, except power
  We beat it in all metrics, including performance
DSP
  We expect roughly the same results as for superscalar (Wattch maintains high IPC for these kernels)
ASIC
  No available tool-flow supports C to the same degree
Asynchronous ASIC
  We compared with a Balsa synthesis system
  We are 15 times better in E×t compared to the resulting ASIC
Async processor
  We are 350 times better in E×t than Amulet (scaled to 0.18 µm)