Mihai Budiu - PowerPoint PPT Presentation

Slides: 45
Provided by: MIh73
Learn more at: http://www.cs.cmu.edu

1
Spatial Computation: Computing without General-Purpose Processors
  • Mihai Budiu
  • Microsoft Research Silicon Valley
  • Girish Venkataramani, Tiberiu Chelcea, Seth Copen
    Goldstein
  • Carnegie Mellon University

2
Outline
  • Intro: Problems of current architectures
  • Compiling Application-Specific Hardware
  • ASH Evaluation
  • Conclusions

3
Resources
[Figure: chart with labels 1000, Performance, Intel]
  • We do not worry about not having hardware
    resources
  • We worry about being able to use hardware
    resources

4
Complexity
Cannot rely on global signals
(clock is a global signal)
5
Complexity
Automatic translation C → HW
Simple, short, unidirectional interconnect
Simple hardware, mostly idle
Figure labels: gate 5 ps, wire 20 ps
No interpretation
Distributed control, asynchronous
Cannot rely on global signals
(clock is a global signal)
6
Our Proposal: Application-Specific Hardware
  • ASH addresses these problems
  • ASH is not a panacea
  • ASH is complementary to the CPU

7
Paper Content
  • Automatic translation of C to hardware dataflow
    machines
  • High-level comparison of dataflow and
    superscalar
  • Circuit-level evaluation -- power, performance,
    area

8
Outline
  • Problems of current architectures
  • CASH: Compiling Application-Specific Hardware
  • ASH Evaluation
  • Conclusions

9
Application-Specific Hardware
C program → Compiler → Dataflow IR → Reconfigurable/custom hw
10
Computation = Dataflow
x = a [?] 7; ... y = x >> 2;
[Figure: dataflow graph and circuit for the program fragment; labels: a, 7, x, 2, >>]
Program       IR              Circuits
Operations    Nodes           Pipeline stages
Variables     Def-use edges   Channels (wires)
No interpretation
11
Basic Computation = Pipeline Stage
Figure labels: latch, data, valid, ack
12
Distributed Control Logic
Figure labels: rdy, ack
Short, local wires
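
The data/valid/ack signalling between pipeline stages can be illustrated in software. The following is only a sketch under stated assumptions: the `Channel` type and the functions `channel_send`, `channel_receive`, and `add_stage_fire` are hypothetical names invented here, and the model is sequential C rather than the asynchronous circuit the slides describe. It shows the local rule that a stage fires only when its inputs are valid and its own output latch has been acknowledged, so no global signal is needed.

```c
#include <stdbool.h>

/* Hypothetical model of one output latch plus its handshake wires. */
typedef struct {
    int  data;   /* value held in the latch */
    bool valid;  /* "data is ready" signal seen by the consumer */
} Channel;

/* Producer side: latch a new value only if the previous one was acknowledged. */
bool channel_send(Channel *ch, int value) {
    if (ch->valid) return false;   /* not acknowledged yet: stall */
    ch->data = value;
    ch->valid = true;
    return true;
}

/* Consumer side: take the value and acknowledge it, which frees the latch. */
bool channel_receive(Channel *ch, int *out) {
    if (!ch->valid) return false;  /* nothing to consume */
    *out = ch->data;
    ch->valid = false;             /* plays the role of the ack signal */
    return true;
}

/* An adder stage: fires only when both inputs are valid and its output latch is free. */
bool add_stage_fire(Channel *a, Channel *b, Channel *out) {
    int x = 0, y = 0;
    if (!a->valid || !b->valid || out->valid) return false;
    channel_receive(a, &x);
    channel_receive(b, &y);
    return channel_send(out, x + y);
}
```

Each decision in this sketch looks only at the channels attached to the stage, which mirrors the "short, local wires" point of the slide.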
13
MUX: Forward Branches
if (x > 0) y = -x; else y = b*x;
[Figure: both arms feed a multiplexor; labels: x, b, 0, -, >, !, f, y]
Conditionals ⇒ Speculation
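
A minimal C sketch of the idea that a forward branch becomes a multiplexor with both arms evaluated speculatively. The source statement is read here as `if (x > 0) y = -x; else y = b*x;`; the `*` is an assumption, because the operator did not survive in the transcript. The function names `branchy`, `mux`, and `speculative` are hypothetical.

```c
/* Control-flow version of the slide's example. */
int branchy(int x, int b) {
    int y;
    if (x > 0) y = -x; else y = b * x;
    return y;
}

/* Multiplexor node: selects one of two already-computed values. */
int mux(int pred, int if_true, int if_false) {
    return pred ? if_true : if_false;
}

/* Dataflow-style version: no branch, both arms are always computed. */
int speculative(int x, int b) {
    int pred  = x > 0;   /* comparison node */
    int neg_x = -x;      /* "then" arm, computed unconditionally */
    int b_x   = b * x;   /* "else" arm, computed unconditionally */
    return mux(pred, neg_x, b_x);
}
```

Both functions return the same value for every input on which neither arm overflows; the second form spends extra work on the unused arm in exchange for removing the control dependence.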
14
Memory Access
LD
Monolithic Memory
pipelined arbitrated network
ST
LD
local communication
global structures
Future work: fragment this!
15
Outline
  • Problems of current architectures
  • Compiling ASH
  • ASH Evaluation
  • Conclusions

16
Evaluating ASH
Mediabench kernels (1 hot function/benchmark)
C → CASH core → Verilog back-end → commercial tools (Synopsys, Cadence P/R)
180 nm std. cell library, 2 V
1999 technology
ModelSim (Verilog simulation) → performance numbers
Figure labels: Mem, ASIC
17
Compile Time
C
200 lines
CASH core
20 seconds
Verilog back-end
10 seconds
20 minutes
Synopsys, Cadence P/R
1 hour
Mem
ASIC
18
ASH Area
P4 217
minimal RISC core
19
ASH vs 600 MHz CPU (0.18 µm)
20
Bottleneck: Memory Protocol
LD
Memory
ST
21
Power
Xeon cache 67000
µP 4000
DSP 110
22
Energy-delay vs. Wattch
23
Energy Efficiency
[Figure: energy efficiency in operations/nJ for dedicated hardware, ASH media kernels, asynchronous µP, FPGA, general-purpose DSP, and microprocessors]
24
Outline
  • Problems of current architectures
  • Compiling ASH
  • Evaluation
  • Related work, Conclusions

25
Related Work
  • Optimizing compilers
  • High-level synthesis
  • Reconfigurable computing
  • Dataflow machines
  • Asynchronous circuits
  • Spatial computation

We target an extreme point in the design space: no interpretation, fully distributed computation and control.
26
ASH Design Point
  • Design an ASIC in a day
  • Fully automatic synthesis to layout
  • Fully distributed control and computation (spatial computation)
  • Replicate computation to simplify wires
  • Energy/op rivals custom ASIC
  • Performance rivals superscalar
  • Energy-delay (E·t) 100 times better than any processor

27
Conclusions
Spatial computation strengths
Feature                  Advantages
No interpretation        Energy efficiency, speed
Spatial layout           Short wires, no contention
Asynchronous             Low power, scalable
Distributed              No global signals
Automatic compilation    Designer productivity
28
Backup Slides
  • Absolute performance
  • Control logic
  • Exceptions
  • Leniency
  • Normalized area
  • Loops
  • ASH weaknesses
  • Splitting memory
  • Recursive calls
  • Leakage
  • Why not compare to
  • Targeting FPGAs

29
Absolute Performance
30
Pipeline Stage
Figure labels: rdyin, rdyout, ackin, ackout, datain, dataout, C, D, Reg
31
Exceptions
  • Strictly speaking, C has no exceptions
  • In practice, it is hard to accommodate exceptions in hardware implementations
  • An advantage of software flexibility: the PC is a single point of execution control

CPU: low-ILP computation, OS, VM, exceptions
ASH: high-ILP computation
Memory
32
Critical Paths
if (x > 0) y = -x; else y = b*x;
Figure labels: x, b, 0, -, >, !, y
33
Lenient Operations
if (x > 0) y = -x; else y = b*x;
Figure labels: x, b, 0, !, y
Solves the problem of unbalanced paths
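
One way to picture leniency is a multiplexor that produces its output as soon as the predicate and the selected input are known, without waiting for the slower, unselected arm. The sketch below rests on that reading of the slide; the `Signal` type and the names `sig_known`, `sig_unknown`, `strict_mux`, and `lenient_mux` are hypothetical and not taken from the talk.

```c
#include <stdbool.h>

/* A wire whose value may not have arrived yet. */
typedef struct {
    bool known;
    int  value;
} Signal;

Signal sig_unknown(void) { Signal s = { false, 0 }; return s; }
Signal sig_known(int v)  { Signal s = { true, v };  return s; }

/* Strict mux: waits until the predicate and BOTH data inputs have arrived. */
Signal strict_mux(Signal pred, Signal t, Signal f) {
    if (!pred.known || !t.known || !f.known) return sig_unknown();
    return sig_known(pred.value ? t.value : f.value);
}

/* Lenient mux: needs only the predicate and the input it actually selects. */
Signal lenient_mux(Signal pred, Signal t, Signal f) {
    Signal chosen;
    if (!pred.known) return sig_unknown();
    chosen = pred.value ? t : f;
    return chosen.known ? sig_known(chosen.value) : sig_unknown();
}
```

With `x = 5`, the predicate `x > 0` and the fast arm `-x` are available early; `lenient_mux` can already deliver `-5` while the slow arm is still pending, whereas `strict_mux` cannot.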
34
Normalized Area
35
Control Flow ⇒ Data Flow
Figure labels: data, f, Merge (label), data, data, predicate, Gateway
36
Loops
  • int sum = 0, i;
  • for (i = 0; i < 100; i++)
  •     sum += i*i;
  • return sum;
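
As a rough illustration of how a loop like this one can be expressed with merge and gateway nodes, the sketch below reads the loop as summing `i*i` for `i` from 0 to 99 (the operators are reconstructed, so this is an assumption). The `gateway` helper and the function names are hypothetical; a gateway forwards its data only when its predicate is true, and the loop-carried variables play the role of merge nodes that accept either the initial value or the back-edge value.

```c
/* Reference version, written as ordinary control flow. */
int sum_squares(void) {
    int sum = 0, i;
    for (i = 0; i < 100; i++)
        sum += i * i;
    return sum;
}

/* Gateway node: forwards data into *out only when pred is true.
   Returns 1 if a value was forwarded, 0 otherwise. */
int gateway(int pred, int data, int *out) {
    if (!pred) return 0;
    *out = data;
    return 1;
}

/* Same loop, phrased as values flowing through merge and gateway nodes. */
int sum_squares_dataflow(void) {
    int i = 0, sum = 0;          /* merge nodes, seeded from the loop entry */
    int result = 0;
    for (;;) {
        int next_sum = 0, next_i = 0;
        int pred = i < 100;      /* loop predicate node */
        /* Exit gateway: sum leaves the loop once the predicate is false. */
        if (gateway(!pred, sum, &result)) break;
        /* Back-edge gateways: feed the next iteration's merge nodes. */
        gateway(pred, sum + i * i, &next_sum);
        gateway(pred, i + 1, &next_i);
        sum = next_sum;
        i = next_i;
    }
    return result;
}
```

Both functions return 328350, the sum of the squares of 0 through 99.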

37
ASH Weaknesses
  • Both branch and join not free
  • Static dataflow (no re-issue of same instr)
  • Memory is far
  • Fully static
  • No branch prediction
  • No dynamic unrolling
  • No register renaming
  • Calls/returns not lenient

38
Branch Prediction
  • for (i = 0; i < N; i++)
  •     ...
  •     if (exception) break;

Figure labels: i, 1, <, exception, !

39
Memory Partitioning
  • MIT RAW project: Babb FCCM '99, Barua HiPC '00, Lee ASPLOS '00
  • Stanford SpC: Semeria DAC '01, TVLSI '02
  • Illinois FlexRAM: Fraguela PPoPP '03
  • Hand annotations: pragma

40
Recursion
save live values
recursive call
restore live values
stack
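
The save/call/restore pattern can be sketched in C with an explicit stack. This is an illustrative assumption about what the slide's diagram conveys, not the compiler's actual scheme; factorial is chosen only because `n` is live across the recursive call, and `STACK_MAX`, `fact_rec`, and `fact_stack` are hypothetical names.

```c
#define STACK_MAX 64  /* hypothetical depth limit; inputs are assumed small */

/* Recursive reference: n must survive the recursive call. */
unsigned long fact_rec(unsigned n) {
    return n <= 1 ? 1UL : n * fact_rec(n - 1);
}

/* Same computation with the live value saved and restored explicitly. */
unsigned long fact_stack(unsigned n) {
    unsigned stack[STACK_MAX];
    int top = 0;
    unsigned long result = 1;

    /* Call phase: save the live value n, then "call" with n - 1. */
    while (n > 1 && top < STACK_MAX) {
        stack[top++] = n;
        n = n - 1;
    }
    /* Return phase: restore each saved n and finish the pending multiply. */
    while (top > 0) {
        unsigned live_n = stack[--top];
        result = live_n * result;
    }
    return result;
}
```

For example, `fact_stack(5)` pushes 5, 4, 3, 2 and then pops them while multiplying, giving 120.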
41
Leakage Power
  • $P_s = k \cdot \mathrm{Area} \cdot e^{-V_T}$
  • Employ circuit-level techniques
  • Cut power supply of idle circuit portions
  • most of the circuit is idle most of the time
  • strong locality of activity
  • High-$V_T$ transistors on non-critical path

42
Why Not Compare To
  • In-order processor
      • Worse in all metrics than superscalar, except power
      • We beat it in all metrics, including performance
  • DSP
      • We expect roughly the same results as for superscalar (Wattch maintains high IPC for these kernels)
  • ASIC
      • No available tool-flow supports C to the same degree
  • Asynchronous ASIC
      • We compared with a Balsa synthesis system
      • We are 15 times better in E·t compared to the resulting ASIC
  • Async processor
      • We are 350 times better in E·t than Amulet (scaled to 0.18 µm)

43
Compared to Next Talk
Engine (180 nm)   Performance (MIPS)   E/instruction (pJ)
SNAP/LE           28                   24
SNAP/LE           240                  218
ASH               1100                 20
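
The table's figures can be combined into a per-instruction energy-delay product. The sketch below assumes that the delay per instruction is simply the inverse of the MIPS rating (1 MIPS is one instruction per microsecond), which is a simplification introduced here and not a method stated in the slides; `energy_delay_pj_us` is a hypothetical helper name.

```c
/* Energy-delay per instruction in pJ*us.
   Assumption: delay per instruction = 1 / MIPS microseconds,
   so E*t = energy_pj * (1 / mips). */
double energy_delay_pj_us(double mips, double energy_pj) {
    return energy_pj / mips;
}

/* Ratio of a baseline's energy-delay to ASH's, using the table values for ASH. */
double ratio_vs_ash(double mips, double energy_pj) {
    double ash = energy_delay_pj_us(1100.0, 20.0);
    return energy_delay_pj_us(mips, energy_pj) / ash;
}
```

Under that assumption, `ratio_vs_ash(28.0, 24.0)` is about 47 and `ratio_vs_ash(240.0, 218.0)` is about 50, so both SNAP/LE operating points come out roughly 50 times worse than ASH in energy-delay per instruction.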
44
Why not target FPGA
  • Do not support asynchronous circuits
  • Very inefficient in area, power, delay
  • Too fine-grained for datapath circuits
  • We are designing an async FPGA
