Operation Chaining Asynchronous Pipelined Circuits

Slides: 31
Provided by: Gir957
1
Operation Chaining Asynchronous Pipelined Circuits
  • Girish Venkataramani
  • Seth C. Goldstein

2
Introduction
  • Operation chaining coalesces nodes across time
    steps
  • Optimizes energy efficiency in bundled-data
    asynchronous pipelined circuits
  • Formulated as a vertex covering problem
  • Implemented within CASH [IWLS 04]
  • Average energy-delay improves by 1.4x

[Figure: nodes labeled with time steps t1 and t2]

3
Outline
  • Background and Motivation
  • Problem Formulation and Related Work
  • Algorithm Overview
  • Experimental Results
  • Conclusions

4
Asynchronous Pipelines
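[Figure: two-stage asynchronous pipeline (Stage 1, Stage 2). Datapath: ALU1 and ALU2, each followed by a latch. Control path: handshake controllers (H/S) with delay elements, exchanging Req and Ack signals.]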
  • Self-timed circuits
  • Clocked by handshake controller (H/S)
  • Protocol-based communication
  • Dynamically scheduled using Req/Ack signals
  • Can be clustered into stages

5
Motivation for Operation Chaining
  • Energy profile for gsm_encoder kernel
  • Generated by CASH [IWLS 04, TCAD 06]
  • Protocol: four-phase bundled data
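
As a reminder of what a four-phase (return-to-zero) handshake looks like, here is a minimal sketch of the event order on one bundled-data channel. The function name and the event labels are illustrative only and do not come from the slides.

```python
def four_phase_handshake(items):
    """Yield the control events for sending each item over one channel."""
    for item in items:
        yield ("req+", item)  # sender: data is valid, raise Req
        yield ("ack+", item)  # receiver: data latched, raise Ack
        yield ("req-", item)  # sender: return Req to zero
        yield ("ack-", item)  # receiver: return Ack to zero, channel idle


events = list(four_phase_handshake(["d0", "d1"]))
```

Every transfer costs four control transitions, which is one reason the control path can account for a large share of the energy.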

[Figure: energy profile of a pipeline stage; labels: C, Delay, H/S, reg; value shown: 60]
6
The Op-Chaining Idea
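[Figure: three stages, each with a C block, Delay, H/S controller, and reg, chained into one stage with a single C block, Delay, H/S controller, and reg]

  • Benefits:
  • Eliminated H/S controllers and regs
  • Eliminated some H/S signals
  • Faster datapath
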
7
The Op-Chaining Idea
[Figure: same op-chaining diagram as on the previous slide]

Objective of operation chaining: minimize overall energy without degrading performance compared to a fully pipelined system.

  • Drawbacks:
  • Reduced pipeline parallelism
  • Increased datapath power (regs act as glitch
    filters)

8
Outline
  • Background and Motivation
  • Problem Formulation and Related Work
  • Algorithm Overview
  • Experimental Results
  • Conclusions

9
Problem Formulation
  • Given G(V,E)
  • Node v is a potential pipeline stage
  • Edge (u,v) is a potential bundled data channel
  • Find vertex cover, $S = \{S_1, \dots, S_n\}$
  • Each sub-graph, $S_i$, is a pipeline stage
  • For a fully pipelined system, $S = V$
  • Constraints
  • Correctness is preserved
  • Performance is equal to or better than fully
    pipelined system
  • Objective: maximize energy savings
  • Control-path energy savings > datapath energy
    increase
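
The formulation can be made concrete with a small validity check. This is a sketch under one assumption the slide does not state outright: stages are treated as disjoint, so every node belongs to exactly one stage. The function name and the example graph are hypothetical.

```python
def is_valid_stage_cover(vertices, stages):
    """True if the stages cover every vertex exactly once (assumed disjoint)."""
    covered = [v for stage in stages for v in stage]
    no_overlap = len(covered) == len(set(covered))
    return no_overlap and set(covered) == set(vertices)


V = ["a", "b", "c", "d"]
fully_pipelined = [{v} for v in V]      # every node is its own stage
op_chained = [{"a", "b"}, {"c", "d"}]   # two chained stages
incomplete = [{"a", "b"}, {"c"}]        # node d is not covered

checks = (
    is_valid_stage_cover(V, fully_pipelined),
    is_valid_stage_cover(V, op_chained),
    is_valid_stage_cover(V, incomplete),
)
```

Under this reading, the number of stages equals the number of handshake controller and register pairs, so fewer stages means less control-path energy.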

10
Related Work
  • Minimizing registers in asynchronous circuits is
    NP-complete [Kim, ICCAD 00]
  • Control-path unchanged
  • Retiming in the synchronous world
  • Clock cycle constraints exist
  • Number of registers per pipeline loop is constant
  • This work addresses pipeline register and
    handshake controller minimization
  • NP-hard
  • Potential for more energy savings

11
Outline
  • Background and Motivation
  • Problem Formulation and Related Work
  • Algorithm Overview
  • Experimental Results
  • Conclusions

12
Requirements
  • Correctness of solution
  • Output functionality must be preserved
  • System should be deadlock-free
  • Predict impact of op-chaining on performance
  • Use Global Critical Path and global slack [DAC
    07]
  • Predict impact on datapath energy
  • Bit-operations heuristic
  • Simplifying constraint: single-output stages

13
Divide-and-Conquer Strategy
  • Use constraints to partition the graph
  • Dynamic programming evaluates candidates
  • Dual problem: find the stages that are assigned
    regs
14
Correctness Constraints
  • Pre-assign some nodes to contain registers and
    handshake controllers
  • I/O functionality
  • Deadlock-free guarantees
  • Every primary input and output node, say I/O, of
    G must contain a handshake controller and register
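
The I/O pre-assignment can be sketched as follows, assuming the graph is given as a successor map and that primary inputs are the nodes with no predecessors and primary outputs the nodes with no successors. Both the representation and the function name are assumptions made for illustration.

```python
def preassign_io_nodes(graph):
    """Return the nodes that must keep a handshake controller and register."""
    successors = {s for succs in graph.values() for s in succs}
    nodes = set(graph) | successors
    primary_inputs = nodes - successors                        # no predecessors
    primary_outputs = {n for n in nodes if not graph.get(n)}   # no successors
    return primary_inputs | primary_outputs


# Hypothetical graph: in1 and in2 feed add, which feeds mul, which drives out.
g = {"in1": ["add"], "in2": ["add"], "add": ["mul"], "mul": ["out"]}
io_nodes = preassign_io_nodes(g)
```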

15
Deadlock Constraints
  • Protocol induces pipeline requirements
  • Regs(d) registers in a loop with d
    execution threads
  • Pre-assign (at least) Regs(d) stages in every
    pipeline loop to contain registers
  • Let all pre-assigned stages be
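
A minimal sketch of checking the loop requirement on a small graph is shown below. The value of Regs(d) is not reproduced in this transcript, so the required count is passed in as the parameter `required`. The brute-force cycle enumeration and all names are illustrative and only suitable for small graphs.

```python
def simple_cycles(graph):
    """Yield every simple cycle once, as a node list starting at its smallest node."""
    for start in sorted(graph):
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            for succ in graph.get(node, ()):
                if succ == start:
                    yield path
                elif succ > start and succ not in path:
                    # Only visit nodes larger than the start, so each cycle
                    # is reported from a single root.
                    stack.append((succ, path + [succ]))


def loops_short_of_registers(graph, preassigned, required):
    """Return the loops that contain fewer than `required` pre-assigned nodes."""
    return [cycle for cycle in simple_cycles(graph)
            if sum(1 for n in cycle if n in preassigned) < required]


# Hypothetical graph with two loops: a-b-c and c-d.
g = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["c"]}
short = loops_short_of_registers(g, preassigned={"a"}, required=1)
```

Here the loop through c and d has no pre-assigned node, so it is reported as needing a register.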

16
Constraints Partition the Graph
[Figure: example graph with nodes x, i, k, w, m, d, c, z, v, n, s, e, r, b, h, a]
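
One plausible reading of this partitioning step, offered as an assumption rather than the authors' exact procedure, is that the pre-assigned nodes cut the graph, and each connected region of the remaining nodes becomes a candidate sub-graph. A sketch with hypothetical names:

```python
def candidate_subgraphs(graph, preassigned):
    """Group the non-pre-assigned nodes into weakly connected components."""
    all_nodes = set(graph) | {s for succs in graph.values() for s in succs}
    free = all_nodes - set(preassigned)
    # Undirected adjacency restricted to the free nodes.
    adj = {n: set() for n in free}
    for u, succs in graph.items():
        for v in succs:
            if u in free and v in free:
                adj[u].add(v)
                adj[v].add(u)
    seen, components = set(), []
    for n in sorted(free):
        if n in seen:
            continue
        comp, stack = set(), [n]
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adj[x] - comp)
        seen |= comp
        components.append(comp)
    return components


# Hypothetical chain i -> a -> b -> r -> c -> o with i, r, o pre-assigned.
g = {"i": ["a"], "a": ["b"], "b": ["r"], "r": ["c"], "c": ["o"]}
parts = candidate_subgraphs(g, preassigned={"i", "r", "o"})
```

Each component can then be handled independently, which is what makes a divide-and-conquer strategy possible.
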
17
Single-Output Constraint
x post-dominates y if all paths from y to a primary output pass through x. Use post-dominator relationships to further partition each candidate sub-graph.

[Figure: candidate sub-graph with nodes w, d, c, v, n, s, e, b, a; callout: "Cannot be within an op-chained sub-graph"]
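
Post-dominator sets can be computed with the standard iterative dataflow scheme. The sketch below assumes a single exit node (a virtual sink can be added when there are several primary outputs). The graph and names are hypothetical and are not the sub-graph from the slide.

```python
def post_dominators(graph, exit_node):
    """Return {node: set of nodes that post-dominate it}, including itself."""
    nodes = set(graph) | {s for succs in graph.values() for s in succs}
    pdom = {n: set(nodes) for n in nodes}   # start from "everything"
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            succs = graph.get(n, ())
            if succs:
                # A node is post-dominated by itself plus whatever
                # post-dominates all of its successors.
                new = set.intersection(*(pdom[s] for s in succs)) | {n}
            else:
                new = {n}
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


# Hypothetical diamond: a fans out to b and c, which reconverge at d.
g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["out"]}
pdom = post_dominators(g, "out")
```

In this example d post-dominates a, b, and c, so a sub-graph made of a, b, c, and d has the single output d.
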
18
Post-Dominator Trees
All post-dominator tree roots must contain registers.

[Figure: sub-graph with nodes w, d, c, v, n, s, e, b, a and its post-dominator trees]
19
Evaluate Single-Output Candidates
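[Figure: example graph with nodes x, i, k, w, m, d, c, z, v, n, s, e, r, b, h, a, showing the single-output candidates]

Each candidate is evaluated independently of the others. Final step: find the best partitions within each candidate, that is, partition the post-dominator tree of each candidate.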
20
Performance Cost
[Figure: example graph with nodes m, b, c, e, a and delay labels "D() 1" and "D() 3"]

  • Cycle time determines system-level timing
  • Largest cycle in the underlying Petri net
  • Global Critical Path [DAC 07]
  • Compute Global Slack for each node
  • How much can a node be delayed without affecting
    cycle time?
  • Timing budget for system-level timing

21
Performance Cost
For each primary input (P.I.), there exists an ack signal.
  • Find the delay in response time for each ack at a
    P.I.
  • Compare it with the Global Slack of each ack,
    leading to the constraint: the increase in ack
    response time (after op-chaining minus before
    op-chaining) must not exceed that ack's Global
    Slack
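
Read as a check, the comparison above could look like the sketch below. It assumes the constraint is that the increase in each ack's response time stays within that ack's Global Slack; this is an interpretation of the slide, and the function name and all numbers are hypothetical.

```python
def acks_violating_slack(before, after, slack):
    """Return the acks whose added response delay exceeds their global slack."""
    return sorted(
        ack for ack in before
        if after[ack] - before[ack] > slack[ack]
    )


# Hypothetical response times and slacks, in arbitrary time units.
before = {"ack_x": 4.0, "ack_y": 2.0}
after = {"ack_x": 5.0, "ack_y": 6.0}
slack = {"ack_x": 1.5, "ack_y": 3.0}
violations = acks_violating_slack(before, after, slack)
```

A candidate partition with an empty violation list would be expected to leave the cycle time unchanged.
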
22
Datapath Power Cost
  • Glitches (intermediate results) in datapath lead
    to useless switching
  • Registers between stages filter these glitches
  • With op-chaining, glitches increase
  • Heuristic: the bit-operations count is an
    indicator of potential glitching
  • Roughly the number of standard-cell gates

23
Algorithm Overview
  • Requirements
  • Correctness
  • Single-output
  • Timing
  • Power

Flowchart steps (numbered 1 to 4 on the slide):
  • Pre-assign nodes
  • Enumerate post-dominator (PD) tree candidates
  • Dynamic programming evaluates timing/power costs
  • Partition each PD tree independently

Complexity: $O(E \cdot V^2)$
24
Outline
  • Background and Motivation
  • Problem Formulation and Related Work
  • Algorithm Overview
  • Experimental Results
  • Conclusions

25
Experimental Results
  • Applied op-chaining within CASH
  • C to asynchronous circuits compiler
  • Benchmarks: Mediabench [Lee 97]
  • Circuits mapped to 180nm/2V ST Microelectronics
    standard-cell library
  • Four data points

[Figure: the four data points (Greedy, R1, R2, R3) on a spectrum from most aggressive (no timing, power constraints) to most conservative (strict timing, power constraints)]
26
H/S and Regs Eliminated
27
Energy Efficiency
Relative to fully pipelined system

[Figure: two panels, "Energy-Delay" and "Performance"]
28
Outline
  • Background and Motivation
  • Problem Formulation and Related Work
  • Algorithm Overview
  • Experimental Results
  • Conclusions

29
Conclusions
  • Formulated as a vertex covering problem
  • Divide-and-conquer algorithm
  • Global-Slack predicts performance impact
  • Bit-operations predict datapath glitch impact
  • Quadratic complexity
  • Energy-delay boosted by 1.4x, on average
  • No performance degradation
  • Operation chaining exploits asynchronous design
    opportunities
  • Use non-uniform stage latencies
  • Power-performance trade-off

30
Thank you for your attention!
  • Questions?