Title: Operation Chaining Asynchronous Pipelined Circuits
1Operation Chaining Asynchronous Pipelined Circuits
- Girish Venkataramani
- Seth C. Goldstein
2Introduction
- Operation chaining coalesces nodes across time steps
- Optimizes energy efficiency in bundled-data asynchronous pipelined circuits
- Formulated as a vertex covering problem
- Implemented within CASH [IWLS 04]
- Average energy-delay improves by 1.4x
[Figure: nodes placed in time steps t1 and t2]
3Outline
- Background and Motivation
- Problem Formulation and Related Work
- Algorithm Overview
- Experimental Results
- Conclusions
4Asynchronous Pipelines
- Self-timed circuits
  - Clocked by handshake controller (H/S)
- Protocol-based communication
  - Dynamically scheduled using Req/Ack signals
- Can be clustered into stages
[Figure: two-stage asynchronous pipeline (Stage 1, Stage 2). Datapath labels: ALU1, ALU2, Latch, Data. Control-path labels: H/S, Delay, Req, Ack.]
5Motivation for Operation Chaining
- Energy profile for gsm_encoder kernel
  - Generated by CASH [IWLS 04, TCAD 06]
  - Protocol: four-phase bundled data
[Figure: energy profile chart; labels: C, Delay, H/S, reg; annotated value: 60]
6The Op-Chaining Idea
[Figure: pipeline stages before and after op-chaining; labels per stage: C, Delay, H/S, reg]
- Benefits
  - Eliminated H/S and Reg
  - Eliminated some H/S signals
  - Faster datapath
7The Op-Chaining Idea
[Figure: pipeline stages before and after op-chaining; labels per stage: C, Delay, H/S, reg]
Objective of operation chaining: minimize overall energy without degrading performance compared to a fully pipelined system.
- Drawbacks
  - Reduced pipeline parallelism
  - Increased datapath power
    - Regs act as glitch filters
9Problem Formulation
- Given G(V,E)
  - Node v is a potential pipeline stage
  - Edge (u,v) is a potential bundled data channel
- Find vertex cover, S = {S_1, ..., S_n}
  - Each sub-graph, S_i, is a pipeline stage
  - For a fully pipelined system, S = V
- Constraints
  - Correctness is preserved
  - Performance is equal to or better than the fully pipelined system
- Objective: maximize energy savings
  - Control-path energy savings > datapath energy increase
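A minimal sketch of the formulation above, assuming stages are disjoint node sets (the slide only calls S a vertex cover, so disjointness is an assumption here). The helper names `is_valid_cover`, `fully_pipelined`, and `remaining_channels` are hypothetical and not part of CASH.

```python
from typing import Hashable, List, Set, Tuple

Node = Hashable


def is_valid_cover(vertices: Set[Node], stages: List[Set[Node]]) -> bool:
    """True if the stages are non-empty, disjoint, and together cover V.

    Disjointness is an assumption of this sketch.
    """
    seen: Set[Node] = set()
    for stage in stages:
        if not stage or not stage <= vertices or stage & seen:
            return False
        seen |= stage
    return seen == vertices


def fully_pipelined(vertices: Set[Node]) -> List[Set[Node]]:
    """Fully pipelined system: every node is its own stage."""
    return [{v} for v in vertices]


def remaining_channels(
    edges: List[Tuple[Node, Node]], stages: List[Set[Node]]
) -> List[Tuple[Node, Node]]:
    """Edges that still cross a stage boundary after chaining."""
    owner = {v: i for i, stage in enumerate(stages) for v in stage}
    return [(u, v) for (u, v) in edges if owner[u] != owner[v]]


V = {"a", "b", "c", "d"}
E = [("a", "b"), ("b", "c"), ("c", "d")]
chained = [{"a", "b"}, {"c"}, {"d"}]  # a and b share one stage
print(is_valid_cover(V, chained), remaining_channels(E, chained))
```

In this toy chain, merging a and b removes one of the three channels that the fully pipelined cover would keep.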
10Related Work
- Minimizing registers in asynchronous circuits is NP-complete [Kim, ICCAD 00]
  - Control-path unchanged
- Retiming in synchronous world
  - Clock cycle constraints exist
  - Registers per pipeline loop is constant
- This work addresses pipeline register and handshake controller minimization
  - NP-hard
  - Potential for more energy savings
12Requirements
- Correctness of solution
  - Output functionality must be preserved
  - System should be deadlock-free
- Predict impact of op-chaining on performance
  - Use Global Critical Path and global slack [DAC 07]
- Predict impact on datapath energy
  - Bit operations heuristic
- Simplifying constraint: single-output stages
13Divide-and-Conquer Strategy
- Use constraints to partition graph
- Dynamic programming evaluates candidates
- Dual problem: find stages assigned regs
14Correctness Constraints
- Pre-assign some nodes to contain registers and handshake controllers
  - I/O functionality
  - Deadlock-free guarantees
- Every primary input and output node (say, I/O) of G must contain a handshake controller and register
15Deadlock Constraints
- Protocol induces pipeline requirements
  - Regs(d): registers required in a loop with d execution threads
- Pre-assign (at least) Regs(d) stages in every pipeline loop to contain registers
- Let all pre-assigned stages be
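The deadlock constraint can be illustrated with a small check that every loop in the graph contains enough pre-assigned register stages. This is a sketch under stated assumptions: the slide does not give Regs(d), so the required count is passed in as a single hypothetical `required` parameter for all loops, and the brute-force cycle enumeration is only suitable for small graphs.

```python
from typing import Dict, List, Set


def simple_cycles(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Enumerate simple cycles, each reported once from its smallest node."""
    nodes = sorted(set(graph) | {w for succs in graph.values() for w in succs})
    rank = {v: i for i, v in enumerate(nodes)}
    cycles: List[List[str]] = []

    def dfs(start: str, v: str, path: List[str], on_path: Set[str]) -> None:
        for w in graph.get(v, []):
            if w == start:
                cycles.append(list(path))
            elif rank[w] > rank[start] and w not in on_path:
                path.append(w)
                on_path.add(w)
                dfs(start, w, path, on_path)
                on_path.remove(w)
                path.pop()

    for s in nodes:
        dfs(s, s, [s], {s})
    return cycles


def loops_missing_registers(
    graph: Dict[str, List[str]], preassigned: Set[str], required: int
) -> List[List[str]]:
    """Loops holding fewer than `required` pre-assigned register stages."""
    return [
        cycle
        for cycle in simple_cycles(graph)
        if len(preassigned.intersection(cycle)) < required
    ]


# Hypothetical graph with two loops: a-b-c and c-d.
graph = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["c"]}
print(loops_missing_registers(graph, {"a"}, 1))
```

Pre-assigning only node a leaves the c-d loop without a register, while pre-assigning node c satisfies both loops because c lies on each of them.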
16Constraints Partition the Graph
[Figure: example graph with nodes x, i, k, w, m, d, c, z, v, n, s, e, r, b, h, a, partitioned by the pre-assigned nodes]
17Single-Output Constraint
- x post-dominates y if all paths from y to a primary output pass through x
- Use post-dominator relationships to further partition each candidate sub-graph
[Figure: candidate sub-graph with nodes w, d, c, v, n, s, e, b, a; one node is annotated "Cannot be within an op-chained sub-graph"]
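Post-dominator sets can be computed with the standard iterative dataflow method, using the standard definition that x post-dominates y when every path from y to an exit passes through x. The sketch below is generic and not taken from CASH; the function name `post_dominators` and the example graph are hypothetical.

```python
from typing import Dict, List, Set


def post_dominators(
    succ: Dict[str, List[str]], exits: Set[str]
) -> Dict[str, Set[str]]:
    """Map each node to the set of nodes that post-dominate it (itself included)."""
    nodes = set(succ) | {w for ws in succ.values() for w in ws}
    # Start from the most optimistic guess and shrink to a fixed point.
    pdom = {v: ({v} if v in exits else set(nodes)) for v in nodes}
    changed = True
    while changed:
        changed = False
        for v in nodes:
            if v in exits:
                continue
            outs = succ.get(v, [])
            # A node is post-dominated by whatever post-dominates all successors.
            new = set.intersection(*(pdom[w] for w in outs)) if outs else set()
            new.add(v)
            if new != pdom[v]:
                pdom[v] = new
                changed = True
    return pdom


# Hypothetical diamond: a fans out to b and c, which reconverge at d.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["out"], "out": []}
pdom = post_dominators(succ, exits={"out"})
print(sorted(pdom["a"]))
```

In this diamond, d post-dominates a because both branches reconverge there, while neither b nor c does.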
18Post-Dominator Trees
[Figure: candidate sub-graph with nodes w, d, c, v, n, s, e, b, a, shown next to its post-dominator trees]
All post-dom tree roots must contain registers
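Given post-dominator sets, the post-dominator trees and their roots can be derived as sketched below. The input sets and the function name `post_dominator_forest` are hypothetical. The sketch relies on the fact that the strict post-dominators of a node form a chain, so the nearest one is the one with the largest post-dominator set.

```python
from typing import Dict, List, Optional, Set, Tuple


def post_dominator_forest(
    pdom: Dict[str, Set[str]]
) -> Tuple[Dict[str, Optional[str]], List[str]]:
    """Return (parent map, sorted roots) of the post-dominator forest.

    pdom[v] is the set of nodes post-dominating v, including v itself.
    """
    parent: Dict[str, Optional[str]] = {}
    for v, doms in pdom.items():
        strict = doms - {v}
        if not strict:
            parent[v] = None  # no strict post-dominator: v is a tree root
        else:
            # The immediate post-dominator is the closest strict one.
            parent[v] = max(strict, key=lambda u: len(pdom[u]))
    roots = sorted(v for v, p in parent.items() if p is None)
    return parent, roots


# Hypothetical sets: n fans out to two outputs o1 and o2, so nothing
# post-dominates n except itself; k feeds m, which feeds n.
pdom = {
    "o1": {"o1"},
    "o2": {"o2"},
    "n": {"n"},
    "m": {"m", "n"},
    "k": {"k", "m", "n"},
}
parent, roots = post_dominator_forest(pdom)
print(roots)
```

Here n becomes a root because it feeds two different outputs, which matches the rule that every post-dominator tree root must hold a register.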
19Evaluate Single-Output Candidates
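[Figure: example graph with nodes x, i, k, w, m, d, c, z, v, n, s, e, r, b, h, a, divided into single-output candidates]
- Each candidate is evaluated independently of the others
- Final step: find the best partitions within each candidate, i.e., partition the post-dominator tree of each candidate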
20Performance Cost
[Figure: example graph with nodes m, b, c, e, a; delay annotations: D() 1, D() 3]
- Cycle time determines system-level timing
  - Largest cycle in the underlying Petri net
- Global Critical Path [DAC 07]
- Compute Global Slack for each node
  - How much can a node be delayed without affecting cycle time?
  - Timing budget for system-level timing
21Performance Cost
For each primary input (P.I.), there exists an ack signal.
- Find delay in response time for each ack at P.I.
- Compare with Global Slack of each ack, leading to the constraint:
  (Ack response time after op-chaining) - (Ack response time before op-chaining) <= Global Slack
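A minimal sketch of the timing check, assuming the constraint is that the increase in each ack's response time must not exceed that ack's global slack. The function name `violating_acks`, the signal names, and the numbers are all hypothetical.

```python
from typing import Dict, List


def violating_acks(
    before: Dict[str, float], after: Dict[str, float], slack: Dict[str, float]
) -> List[str]:
    """Acks whose response-time increase exceeds their global slack."""
    return sorted(
        ack for ack in before if after[ack] - before[ack] > slack[ack]
    )


# Hypothetical ack response times at two primary inputs (arbitrary units).
before = {"ack_x": 4.0, "ack_y": 6.0}
after = {"ack_x": 5.0, "ack_y": 9.0}
slack = {"ack_x": 2.0, "ack_y": 1.0}
print(violating_acks(before, after, slack))
```

With these numbers only ack_y is flagged, since its response time grows by 3 units against a slack of 1.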
22Datapath Power Cost
- Glitches (intermediate results) in datapath lead to useless switching
  - Registers between stages filter these glitches
  - With op-chaining, glitches increase
- Heuristic: bit-operations is an indicator of potential glitching
  - Roughly, the number of std-cell gates
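The trade-off between control-path savings and datapath glitching can be sketched as a simple comparison. Everything here is an illustrative assumption: the per-node bit-operation counts, the linear energy model, and the function names `glitch_exposure` and `chaining_saves_energy` are hypothetical and are not the cost model used in CASH.

```python
from typing import Dict, Iterable


def glitch_exposure(bit_ops: Dict[str, int], chained_nodes: Iterable[str]) -> int:
    """Sum of bit-operations over nodes that lose their input register."""
    return sum(bit_ops[n] for n in chained_nodes)


def chaining_saves_energy(
    removed_stages: int,
    control_energy_per_stage: int,
    bit_ops: Dict[str, int],
    chained_nodes: Iterable[str],
    energy_per_bit_op: int,
) -> bool:
    """True if control-path savings exceed the estimated datapath increase."""
    savings = removed_stages * control_energy_per_stage
    increase = glitch_exposure(bit_ops, chained_nodes) * energy_per_bit_op
    return savings > increase


# Hypothetical bit-operation counts (roughly, std-cell gates per node).
bit_ops = {"add": 32, "and": 8, "mul": 512}
print(chaining_saves_energy(1, 100, bit_ops, ["add", "and"], 2))
print(chaining_saves_energy(1, 100, bit_ops, ["mul"], 2))
```

Under these made-up weights, chaining the small adder and AND gate pays off, while chaining into the multiplier does not because its glitch exposure dominates.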
23Algorithm Overview
- Requirements
  - Correctness
  - Single-output
  - Timing
  - Power
- Steps
  1. Pre-assign nodes
  2. Enumerate post-dominator (PD) tree candidates
  3. Partition each PD tree independently
  4. Dyn. prog. evaluates timing/power costs
- Complexity: O(E + V^2)
25Experimental Results
- Applied op-chaining within CASH
  - C to asynchronous circuits compiler
- Benchmarks: Mediabench [Lee 97]
- Circuits mapped to 180nm/2V ST Microelectronics standard-cell library
- Four data points: Greedy, R1, R2, R3
  - Range from no timing/power constraints (most aggressive) to strict timing/power constraints (most conservative)
26H/S and Regs Eliminated
27Energy Efficiency
Relative to fully pipelined system
[Charts: Energy-Delay; Performance]
29Conclusions
- Formulated as a vertex covering problem
- Divide-and-conquer algorithm
  - Global Slack predicts performance impact
  - Bit-operations predict datapath glitch impact
  - Quadratic complexity
- Energy-delay improved by 1.4x, on average
  - No performance degradation
- Operation chaining exploits asynchronous design opportunities
  - Use non-uniform stage latencies
  - Power-performance trade-off
30Thank you for your attention!