Design Automation for Streaming Systems

Learn more at: http://brass.cs.berkeley.edu

1
Design Automation for Streaming Systems

Eylon Caspi
University of California, Berkeley
12/2/05

2
Outline

Streaming for Hardware
From Programming Model to Hardware Model
Synthesis Methodology for FPGA
Streams, Queues, SFSM Logic
Characterization of 7 Multimedia Apps
Optimizations
Pipelining, Placement, Queue Sizing, Decomposition

3
Large System Design Challenges

Devices growing with Moore's Law
AMD Opteron dual-core CPU: ~230M transistors
Xilinx Virtex 4 / Altera Stratix-II FPGAs: 200K LUTs
Problems of DSM (deep submicron), large systems
Growing interconnect delay, timing closure
"Routing delays typically account for 45% to 65% of the total path delays" (Xilinx Constraints Guide)
Slow place-and-route
Design complexity
Designs do not scale well to next-gen. devices; must redesign
Same problems in FPGAs

4
Limitations of RTL

RTL = Register Transfer Level
Fully exposed timing behavior
always @(posedge clk) ...
Laborious, error prone
Unpredictable interconnect delay
How deep to pipeline?
Redesign on next-gen device
Undermines reuse
Existing solutions
Modular design, floorplanning
Physical synthesis, hierarchical CAD
Latency-insensitive communication

5
Streams

A better communication abstraction
Streams connect modules
FIFO buffered channel (queue)
Blocking read
Timing independent (deterministic)
Robust to communication delay
Pipeline across long distances
Robust to unknown delay
Post-placement pipelining
Alternate transport (packet switched NOC)
Flexibly timed module interfaces
Robust to module optimization (pipeline,
reschedule, etc.)
Enhances modular design reuse

6
Streaming Applications

Persistent compute structure (infrequent
changes)
Large data sets, mostly sequential access
Limited feedback
Implement with deep, system-level pipelining
E.g. DSP, multimedia
JPEG Encoder

7
Ad Hoc Streaming

Every module needs streaming flow control
Block if inputs not available, output not ready
to receive
Every stream needs queueing
Pipeline to match interconnect delay
Queue to absorb delay mismatch, dynamic rates
Manual implementation, in HDL
Laborious (flow control, queues)
Error prone (deadlock if violate protocol, queue
too small)
No automation (pipeline depth, queue choice /
width / depth)
Interconnect / queue IP (e.g. OCP / Sonics Bus)
Still no automation

8
Systematic Streaming

Strong stream semantics: Process Networks
Stream = FIFO channel with (a flavor of) blocking read
E.g. Kahn Process Networks, Dataflow Process Networks (E. A. Lee)
Streams as programming primitive
Language support hides flow control
Compiler support
Compiler generated flow control
Compiler controlled pipelining, queue depth,
queue impl.
Compiler optimizations (e.g. module merging,
partitioning)
Benefits
Easy, correct, high performance, portable
Paging / Virtualization is a logical extension
(Automatic page partitioning)

9
Outline

Streaming for Hardware
From Programming Model to Hardware Model
Synthesis Methodology for FPGA
Streams, Queues, SFSM Logic
Characterization of 7 Multimedia Apps
Optimizations
Pipelining, Placement, Queue Sizing, Decomposition

10
SCORE Model: Stream Computations Organized for Reconfigurable Execution

Application = Graph of stream-connected operators
Operator = Process with local state
Stream = FIFO channel, unbounded capacity, blocking read
Segment = Memory, accessed via streams
Dynamics
Dynamic I/O rates
Dynamic graph construction
(omitted in this work)

11
SCORE Programming Model: TDF

TDF = behavioral language for
SFSM operators (Streaming Extended Finite State Machine)
Static operator graphs
State machine for
Firing control
Sequencing, branching
Firing semantics
In state X, wait for X's inputs, then evaluate X's action

Example (operator with input streams i, j and output stream o):
state foo (i, j): o = i + j; goto bar;
12
SCORE / TDF Process Networks

A process from M inputs to N outputs, unified stream type S (i.e. $S^M \to S^N$)
SFSM $= (\Sigma, \sigma_0, \sigma, R, f_{NS}, f_O)$
$\Sigma$ = Set of states
$\sigma_0 \in \Sigma$ = Initial state
$\sigma \in \Sigma$ = Present state
$R \subseteq (\Sigma \times S^M)$ = Set of firing rules
$f_{NS} : R \to \Sigma$ = Next state function
$f_O : R \to S^N$ = Output function
Similar to dataflow process networks [Lee+Parks, IEEE May '95], but with stateful actors
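A minimal Python sketch of these firing semantics, under stated assumptions: streams are modeled as unbounded deques, a firing rule is reduced to the set of inputs a state waits for, and the second state `bar` (which simply forwards `i`) and the `+` in `foo` are hypothetical choices for illustration. The class name `SFSM` and its `step` method are not part of TDF or tdfc.

```python
from collections import deque

class SFSM:
    """Toy streaming FSM: each state names the inputs it waits for and an action."""

    def __init__(self, states, initial):
        # states: name -> (input signature, action(tokens) -> (outputs, next state))
        self.states = states
        self.state = initial

    def step(self, inputs, outputs):
        """Fire once if every input in the current state's signature has a token."""
        signature, action = self.states[self.state]
        if any(not inputs[name] for name in signature):
            return False  # blocking read: stall until all desired inputs arrive
        tokens = {name: inputs[name].popleft() for name in signature}
        emitted, self.state = action(tokens)
        for name, value in emitted.items():
            outputs[name].append(value)
        return True

# State foo consumes i and j and emits their sum; state bar forwards i.
adder = SFSM(
    states={
        "foo": (("i", "j"), lambda t: ({"o": t["i"] + t["j"]}, "bar")),
        "bar": (("i",), lambda t: ({"o": t["i"]}, "foo")),
    },
    initial="foo",
)
inputs = {"i": deque([1, 10, 2]), "j": deque([5])}
outputs = {"o": deque()}
while adder.step(inputs, outputs):
    pass  # keep firing until the present state has to wait for input
```

The loop stops in state `foo` with one token left on `i`, because `j` is empty: the operator waits rather than firing on partial input, which is what makes the result independent of arrival timing.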

13
Related Streaming Models

Streaming Models
Kahn PN, DFPN, BDF, SDF, CSDF, HDF, StreamsC,
YAPI, Catapult C, SHIM
Streaming Platforms
Pleiades, Philips VSP, Imagine, TRIPS
How do we differ?
Stateful processes
Deterministic
Dynamic dataflow rates (FSM nodes)
Direct synthesis to hardware
Bounded Buffers

14
Streaming Platforms

FPGA (this work)
Paged FPGA
Page = fixed-size partition, connected by streams
Stylized page-to-page interconnect
Hierarchical PAR
Paged, Virtual FPGA (SCORE)
Time shared pages
Area abstraction (virtually large)
Multiprocessor on Chip
Heterogeneous

15
The Compilation Problem

Programming Model: TDF            | Execution Model: FPGA
Communicating SFSMs               | Single circuit / configuration
- unrestricted size, IOs, timing  | - one or more clocks
Unbounded stream buffering        | Fixed size queues

Compile (TDF → FPGA): big semantic gap
16
The Semantic Gap

Semantic gap between TDF, HW
Need to bind
Stream protocol
Stream pipelining
Queue implementation
Queue depths
SFSM synthesis style (behavioral synthesis)
Memory allocation
Primary I/O
SCORE device binds some implementation decisions
(custom hardware), raw FPGA does not
Want to characterize cost of implementation
decisions

17
Outline

Streaming for Hardware
From Programming Model to Hardware Model
Synthesis Methodology for FPGA
Streams, Queues, SFSM Logic
Characterization of 7 Multimedia Apps
Optimizations
Pipelining, Placement, Queue Sizing, Decomposition

18
Compilation Tool Flow

Application (TDF)
→ tdfc: local optimization; system optimization; queue sizing; pipeline extraction; SFSM partitioning / merging; pipelining; generate flow ctl, streams, queues
→ Verilog
→ Synplify: behavioral synthesis; retiming
→ EDIF (unplaced LUTs, etc.)
→ Xilinx ISE: slice packing; place and route
→ Device configuration bits
19
Wire Protocol for Streams

D = Data, V = Valid, B = Backpressure
Synchronous transaction protocol
Producer asserts V when D ready, Consumer deasserts B when ready
Transaction commits if $(\neg B \wedge V)$ at clock edge
Encode EOS (E) as an extra D bit (out of band, easy to enqueue)
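A small cycle-level model of this handshake, as a sketch only: the function name `simulate_link` and the per-cycle `consumer_ready` list are assumptions made for illustration, and EOS is left out. It shows that a token stays on D, with V asserted, until a clock edge where B is deasserted.

```python
def simulate_link(tokens, consumer_ready):
    """Cycle-by-cycle model of one stream link.

    tokens: values the producer wants to send, in order.
    consumer_ready: one boolean per cycle; B is deasserted when the consumer is ready.
    Returns the (cycle, value) transactions that committed.
    """
    committed = []
    pending = list(tokens)
    for cycle, ready in enumerate(consumer_ready):
        v = bool(pending)   # producer asserts V while it has data on D
        b = not ready       # consumer asserts B (backpressure) when not ready
        if v and not b:     # transaction commits at this clock edge
            committed.append((cycle, pending.pop(0)))
    return committed

log = simulate_link(["a", "b", "c"], [True, False, False, True, True, True])
```

Here the second token is held through two backpressured cycles and commits only in cycle 3; no token is lost or duplicated.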

20
Operator Firing

In state X, fire if
Inputs desired by X are ready (Valid, EOS)
Outputs emitted by X are ready (Backpressure)
Firing guard / control flow
if (iv && !ie && !ob) begin ib=0; ov=1; ... end
Subtlety: master, slave
Operator is slave
To synchronize streams, (1) wait for flow
control in, (2) fire / emit out
Connecting two slaves would deadlock
Need master (queue) between every pair of
operators

21
SFSM Synthesis

Implemented as behavioral Verilog, using state case in FSM and DP
FSM handles firing control, branching
FSM sends state to DP
DP sends bool. flags to FSM

[Figure: FSM block and Datapath block; the Datapath contains "Datapath for State 1" and "Datapath for State 2"]
22
FSM Module, Firing Control

TDF

foo (input unsigned[16] x, input unsigned[16] y, output unsigned[16] o)
state one (x, eos(y)): o = x + 1; ...

Verilog FSM Module

module foo_fsm (clock, reset, x_e, x_v, x_b, y_e, y_v, y_b, o_e, o_v, o_b, state, statecase);
  ...
  always @* begin
    // Default is stall
    x_b_ = 1;  y_b_ = 1;  o_e_ = 0;  o_v_ = 0;
    state_reg_ = state_reg;
    statecase_ = statecase_stall;
    did_goto_ = 0;
    case (state_reg)
      state_one: begin
        // Firing condition(s) for state one
        if (x_v && !x_e && y_v && y_e && !o_b) begin
          // Stream flow ctl for state one
          statecase_ = statecase_1;
          x_b_ = 0;  y_b_ = 0;  o_v_ = 1;  o_e_ = 0;
        end
        ...
  end // always @*
endmodule // foo_fsm
23
Data-Path Module

TDF

foo (input unsigned[16] x, input unsigned[16] y, output unsigned[16] o)
state one (x, eos(y)): o = x + 1; ...

Verilog Data-path Module

module foo_dp (clock, reset, x_d, y_d, o_d, state, statecase);
  ...
  always @* begin
    // Default is stall
    o_d_ = 16'bx;
    did_goto_ = 0;
    case (state)
      state_one: begin
        // Firing condition(s) for state one
        if (statecase_ == statecase_1) begin
          // Data-path for state one
          o_d_ = (x_d + 1'd1);
        end
        ...
  end // always @*
endmodule // foo_dp
24
Stream Buffers (Queues)

Systolic
Cascade of depth-1 stages (or depth-N)
Shift register
Put: shift all entries
Get: tail pointer
Circular buffer
Memory with head / tail pointers

25
Enabled Register Queue

Systolic, depth-1 stage
1 state bit (empty/full) = V
Shift in data unless
Full and downstream not ready to consume queued element
Area ≈ 1 FF per data bit
On FPGA ≈ 1 LUT cell per data bit
Depth-1 (single stage) nearly free, since FFs pack with logic
Speed: as fast as FF
But combinationally connects producer and consumer via B
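A behavioral sketch of one such depth-1 stage, assuming the enable rule stated above (shift in unless full and downstream not ready). The class and method names are hypothetical; the model is cycle-level and ignores data width.

```python
class EnabledRegisterStage:
    """Depth-1 queue stage: one data register plus a valid (full) bit."""

    def __init__(self):
        self.valid = False
        self.data = None

    def input_backpressure(self, out_backpressure):
        # Combinational path: B passes straight through when the stage is full.
        return self.valid and out_backpressure

    def clock(self, in_valid, in_data, out_backpressure):
        """Advance one clock edge; return the token consumed downstream, if any."""
        taken = self.data if (self.valid and not out_backpressure) else None
        enable = not (self.valid and out_backpressure)  # hold only if full and blocked
        if enable:
            self.valid = in_valid
            self.data = in_data if in_valid else None
        return taken

stage = EnabledRegisterStage()
to_send, received = [1, 2, 3], []
blocked = [False, True, True, False, False, False]  # downstream B, per cycle
for ob in blocked:
    in_valid = bool(to_send)
    accepted = in_valid and not stage.input_backpressure(ob)
    token = stage.clock(in_valid, to_send[0] if in_valid else None, ob)
    if token is not None:
        received.append(token)
    if accepted:
        to_send.pop(0)
```

Note that `input_backpressure` depends on the downstream B in the same cycle; this is the combinational producer-to-consumer path that the last bullet warns about.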

26
Xilinx SRL16

SRL16 = Shift register of depth 16 in one 4-LUT cell
Shift register of arbitrary width: parallel SRL16; arbitrary depth: cascade SRL16
Improve queue density by 16x

27
Shift Register Queue

State: empty bit + capacity counter
Data stored in shift register
In at position 0
Out at position Address
Address = number of stored elements minus 1
Synplify infers SRL16E from Verilog array
Parameterized depth, width
Flow control
ov = (State == Non-Empty)
ib = !(Address < Depth-1)
Performance improvements
Registered data out
Registered flow control
Specialized, pre-computed fullness
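A behavioral Python sketch of the basic shift register queue (without the registered-output improvements), assuming that "full" means Address equals Depth-1 while non-empty. It models the state and flow control only, not the inferred SRL16E hardware; the class and method names are hypothetical.

```python
class ShiftRegisterQueue:
    """Queue stored in a shift register: in at position 0, out at position Address."""

    def __init__(self, depth):
        self.depth = depth
        self.regs = [None] * depth  # position 0 holds the newest element
        self.empty = True
        self.address = 0            # stored elements minus 1 (valid when non-empty)

    @property
    def ov(self):
        """Output valid: asserted when the queue is non-empty."""
        return not self.empty

    @property
    def ib(self):
        """Input backpressure: asserted when the queue is full."""
        return (not self.empty) and self.address == self.depth - 1

    def clock(self, put=None, get=False):
        """One clock edge: optionally shift in `put` and/or dequeue the oldest element."""
        do_get = get and self.ov
        do_put = put is not None and not self.ib
        out = self.regs[self.address] if do_get else None
        count = 0 if self.empty else self.address + 1
        if do_put:
            self.regs = [put] + self.regs[:-1]  # shift every entry by one position
            count += 1
        if do_get:
            count -= 1
        self.empty = count == 0
        self.address = max(count - 1, 0)
        return out

q = ShiftRegisterQueue(4)
for x in [1, 2, 3, 4]:
    q.clock(put=x)
full_after_four = q.ib
drained = [q.clock(get=True) for _ in range(4)]
```

Only the address moves on a get; the data itself shifts only on a put, which is why the structure maps onto an addressable shift register cell.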

28
SRL Queue with Registered Data Out

Registered data out
od (clock-to-Q delay)
Non-retimable
Data output register extends shift register
Bypass shift register when queue empty
3 States
Address = number of stored elements minus 2
Flow control
ov = !(State == Empty)
ib = (Address == Depth-2)

29
SRL Queue with Registered Flow Ctl.

Registered flow ctl.
ov (clock-to-Q delay)
ib (clock-to-Q delay)
Non-retimable
Flow control
ov_next = !(State_next == Empty)
ib_next = (Address_next == Depth-2)
Based on pre-computed fullness
full_next = (Address_next == Depth-2)

30
SRL Queue with Specialized, Pre-Computed Fullness

Speed up critical full pre-computation by special-casing full_next for each state
Flow control
ov_next = !(State_next == Empty)
ib_next = full_next
Zero pre-computation is less critical
Result
>200MHz unless very large (e.g. 128 x 128)
All output delays are clock-to-Q
Area ≈ 3 x (SRL16E area)

31
SRL Queue Speed
32
SRL Queue Area
33
Page Synthesis

Page = Cluster of Operator(s) + Queues
SFSMs
One or more per page
Further decomposed into FSM, data-path
Page Input Queues
Deep
Drain pipelined page-to-page streams before
reconfiguration
In-page Queues
Shallow
Separately Synthesizable Modules
Separate characterization
Consider custom resources

34
Page Synthesis

Module Hierarchy

35
Outline

Streaming for Hardware
From Programming Model to Hardware Model
Synthesis Methodology for FPGA
Streams, Queues, SFSM Logic
Characterization of 7 Multimedia Apps
Optimizations
Pipelining, Placement, Queue Sizing, Decomposition

36
Tool Flow, Revisited

Separate compilation for application, SFSMs
(Page, SFSM, Datapath, FSM)

Tool options

tdfc (Application → Verilog)
Identical queuing for every stream (SRL16 based, depth 16)
I/O boundary regs (for Xilinx static timing analysis)

Synplify 8.0 (Verilog → EDIF: unplaced LUTs, etc.)
Target 200MHz
Optimize FSM, retiming, pipelining
Retain monolithic FSM encodings

Xilinx ISE 6.3i (EDIF → device configuration bits)
Constrain to minimum square area, at least max slice packing + 20%, expand if fail PAR

Device: XC2VP70 -7

37
PAR Flow for Minimum Area

EDIF netlist from Synplify
Constraints file
Page area
Target Period
ngdbuild
Convert netlist EDIF → NGD
map
Pack LUTs, MUXes, etc. into slices
trce (pre-PAR)
Static timing analysis, logic only
par
Place and route
trce (post-PAR)
Static timing analysis

[Flowchart: EDIF + Constraints → ngdbuild → map → OK? (no: Target = packed slices, retry; yes: continue) → trce (Target = packed timing) → par → OK? (no: Target += 1 extra row/col, retry; yes: continue) → trce]
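The outer loop of this flow can be sketched as follows. This is an illustration under assumptions, not the actual scripts: `try_par` is a hypothetical callback standing in for a full ngdbuild / map / par run, area is counted in abstract slice units, and "1 extra row/col" is read as growing the square by one row and one column per retry.

```python
import math

def find_min_square(packed_slices, try_par, max_side=256):
    """Return the smallest square (rows, cols) for which place-and-route succeeds.

    packed_slices: slice count reported by the packing step.
    try_par(rows, cols) -> bool: hypothetical hook that runs PAR under that
    area constraint and reports success.
    """
    # Start from the smallest square able to hold the packed slices.
    side = math.isqrt(max(packed_slices, 1) - 1) + 1
    while side <= max_side:
        if try_par(side, side):
            return side, side
        side += 1  # PAR failed: allow one extra row and column, then retry
    return None

attempts = []

def fake_par(rows, cols):
    """Stand-in for PAR: pretend routing needs about 30% slack over 100 slices."""
    attempts.append((rows, cols))
    return rows * cols >= 130

result = find_min_square(100, fake_par)
```

Each failed attempt costs a full PAR run, so starting from the packed slice count rather than from a loose bound keeps the number of retries small.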
38
SCORE Applications

7 Multimedia Applications / 279 Operators
MPEG, JPEG, Wavelet, IIR; written by Joe Yeh
Mostly feed-forward streaming
Constant consumption / production ratios, except compressors (ZLE, Huffman)

39
Page Area
[Figure labels: DCT, IDCT]

87% of SFSMs are smaller than 512 LUTs, by design
FSMs are small → datapaths dominate in most large pages

40
Page Speed
[Chart labels: 43, 47, 10]

FSMs (flow control) are fast, never critical
Queues are critical for the fastest 1/3 of pages → datapaths dominate

41
Outline

Streaming for Hardware
From Programming Model to Hardware Model
Synthesis Methodology for FPGA
Streams, Queues, SFSM Logic
Characterization of 7 Multimedia Apps
Optimizations
Pipelining, Placement, Queue Sizing, Decomposition

42
Improving Performance, Area

Local (module) Optimization
Traditional compiler optimization
(const folding, CSE, etc.)
Datapath pipelining / loop scheduling
Granularity transforms
(composition / decomposition)
System Level Optimization
Interconnect pipelining
Shrink / remove queues
Area-time transformations
(rate matching, serialization, parallelization)

43
Pipelining With Streams

Datapath pipelining
Add registers at output (or input)
Retime into place
Harder in practice (FSM, cycles)
Add registers at strategic locations
Rewrite control
Avoid violating communication protocol
Stream pipelining
Add registers on streams
Retime into datapath
Modify queues, not processes

44
Logic Pipelining

Add L pipeline registers to D, V
Retime backwards
This pipelines feed-forward parts of producer's data-path
Stale flow control may overflow queue (by L)
Modify queue to emit back-pressure when empty slots ≤ L
No manual modification of processes

[Figure: Producer → D (Data), V (Valid) → Queue with L Reserve → Consumer; B (Backpressure) returns to the Producer; the added registers on D, V are retimed into the producer]
45
Logic Relaying + Retiming

Break up deep logic in a process
Relay through enabled register queue(s)
Retime registers into adjacent process
This pipelines feed-forward parts of process's datapath
Can retime into producer or consumer
No manual modification of processes

[Figure: Producer, Original Queue, enabled-register relay stage(s) (en), Consumer, linked by D, V, B; the relay registers are retimed into the adjacent process]
46
Benefits, Limitations

Benefits
Simple to implement, relies only on retiming
Sufficient for many cases, e.g. DCT, IDCT
Limitations
Feed-forward only (weaker than loop sched.)
Resource sharing obfuscates retiming
opportunities
Extends to interconnect pipelining
Do not retime into logic; register placement only
Also pipeline B, modify queue

47
Pipelining Configuration

Pipeline depth parameters: Li + Lp + Lr
Uniform pipelining: same depths for every stream

48
Speedup from Logic Pipelining
Enabled Regs (Lr)
D FFs (Lp)
49
Expansion from Logic Pipelining
Enabled Regs (Lr)
D FFs (Lp)
50
Some Things Are Better Left Unpipelined

Page speedup
Page expansion
Initially fast pages should not be pipelined

51
Page Specific Logic Pipelining

Separate pipelining of each SFSM
Assumption: application speed = slowest page speed
Critical Page
Repeatedly improve slowest page until no further improvement is possible
Page improvement heuristics
Greedy Lr: add one level of pipelining in 0+0+Lr
Greedy Lp: add one level of pipelining in 1+Lp+0
Max: pipeline to best page speed (brute force)
Greedy heuristics may end early
Non-monotonicity: adding a level of pipelining may slow page
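A sketch of the critical-page loop and of why greedy improvement can end early. The speed table is invented for illustration (MHz per pipelining depth, per page), a single depth parameter stands in for Lr or Lp, and `brute_force` simplifies the Max heuristic to picking each page's best depth independently.

```python
def app_speed(speeds, depth):
    """Application speed = speed of the slowest page at the chosen depths."""
    return min(speeds[page][d] for page, d in depth.items())

def greedy(speeds):
    """Repeatedly add one pipelining level to the slowest page while that helps."""
    depth = {page: 0 for page in speeds}
    while True:
        slowest = min(depth, key=lambda p: speeds[p][depth[p]])
        current, nxt = depth[slowest], depth[slowest] + 1
        if nxt >= len(speeds[slowest]) or speeds[slowest][nxt] <= speeds[slowest][current]:
            return depth  # no improvement on the critical page: stop
        depth[slowest] = nxt

def brute_force(speeds):
    """Try every depth of every page and keep each page's fastest one."""
    return {page: max(range(len(s)), key=s.__getitem__) for page, s in speeds.items()}

# Hypothetical measurements: one more level slows "dct" before a second level helps.
speeds = {"dct": [100, 95, 150], "huff": [120, 130, 131]}
greedy_depth = greedy(speeds)
best_depth = brute_force(speeds)
```

With this table the greedy loop stops immediately at 100 MHz because the first added level slows the critical page, while the exhaustive search reaches 131 MHz.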

52
Speedup from Page Specific
Enabled Regs (Lr)
D FFs (Lp)
53
Expansion from Page Specific
Enabled Regs (Lr)
D FFs (Lp)
54
Interconnect Delay

Critical routing delay grows with circuit size
Routing delay for an application: avg. 45% - 56%
Routing delay for its slowest page: avg. 40% - 50%
Ratio (appl. to slowest page): avg. 0.99x - 1.34x
Averaged over 7 apps / varies with logic
pipelining
Modular design helps
Retain critical routing delay of page, not
application
Page-to-page delays (streams) can be pipelined

55
Interconnect Pipelining

Add W pipeline registers to D, V, B
Mobile registers for placer; not retimable
Stale flow control may overflow queue (by 2W)
Staleness = total delay on B-V feedback loop = 2W
Modify downstream queue to emit back-pressure when empty slots ≤ 2W

[Figure: Producer → long-distance D (Data), V (Valid) → Queue with 2W Reserve → Consumer; B (Backpressure) returns over the same distance]
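The 2W figure can be checked with a small worst-case simulation: a producer that sends whenever it sees no backpressure, a consumer that never drains, and W registers on each of the D/V and B paths. This is a sketch with hypothetical names; occupancy is allowed to exceed capacity so that the overflow can be measured.

```python
from collections import deque

def max_occupancy(capacity, reserve, w, cycles=200):
    """Peak queue occupancy when D/V and B each cross W pipeline registers.

    The queue asserts B once its empty slots are <= reserve. The consumer
    never reads, which is the worst case for overflow.
    """
    dv_pipe = deque([False] * w)  # W registers on D/V (True = token in flight)
    b_pipe = deque([False] * w)   # W registers on B
    occupancy = peak = 0
    for _ in range(cycles):
        b_pipe.append((capacity - occupancy) <= reserve)  # B as seen at the queue
        b_at_producer = b_pipe.popleft()                  # B from W cycles ago
        dv_pipe.append(not b_at_producer)                 # send unless B is seen
        if dv_pipe.popleft():                             # token sent W cycles ago
            occupancy += 1
        peak = max(peak, occupancy)
    return peak

no_reserve = max_occupancy(capacity=16, reserve=0, w=2)    # overflows by 2W
with_reserve = max_occupancy(capacity=16, reserve=4, w=2)  # reserve = 2W
```

With no reserve the peak exceeds the capacity by exactly 2W tokens; reserving 2W slots brings the peak back to the capacity, at the cost of asserting backpressure earlier.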
56
Speedup from Interconnect Pipelining
57
Speedup from Interconnect Pipelining, No Area
Constraint
58
Expansion from Interconnect Pipelining, No Area
Constraint
59
Interconnect Register Allocation

Commercial FPGAs / tool flows
No dedicated interconnect registers
Allocation: add to netlist, slice pack, place-and-route
If pack registers with logic → limited register mobility
If pack registers alone → area overhead
Better: post-placement register allocation
[Weaver et al., "Post-Placement C-Slow Retiming for the Xilinx Virtex FPGA," FPGA 2003]
Allocation: PAR, c-slow, retime, scavenge registers, reroute
No area overhead (scavenge registers from existing placement)
Better performance, since routing delay is known
Modification for streaming
PAR, pipeline, retime, scavenge registers, reroute, modify queue depths (configuration specialization)

60
Throughput Modeling

Pipelining feedback loops may reduce throughput
(tokens per clock period)
Which loops / streams are critical?
Throughput model for PN
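Feedback cycle $C$ with $M$ tokens, $N$ pipe delays, has token period $T_C = N/M$ (clock periods per token)
Overall token period $T = \max_C T_C$
Available slack: $\mathrm{CycleSlack}_C = (T - T_C)$
Generalize to multi-rate, dynamic rate by unfolding equivalent single-rate PN

[Figure: example with two feedback cycles, $T_{C1} = 3$, $T_{C2} = 2$]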
61
Throughput Aware Optimizations

Throughput aware placement
Adapt [Singh+Brown, FPGA 2002]
Stream slack: $T_e = \max_{C \,\mathrm{s.t.}\, e \in C} T_C$
Stream net criticality: $\mathrm{crit} = 1 - ((T - T_e) / T)$
Throughput aware pipelining
Pipeline stream w/o exceeding slack
Pipeline module s.t. depth does not exceed any
output stream slack
Pipeline balancing (by retiming)
Process Serialization
Serial arithmetic for process with low
throughput, high slack
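A small worked example of the throughput model and the criticality formula, assuming a cycle's token period is its pipeline delay count divided by its token count (so a loop with 3 delays carrying one token has period 3, matching the T_C1 = 3 example). The cycle and stream names are hypothetical.

```python
def analyze(cycles):
    """Return (T, cycle slack, stream criticality) for a set of feedback cycles.

    cycles: name -> {"delays": N, "tokens": M, "streams": set of stream names}
    """
    period = {name: c["delays"] / c["tokens"] for name, c in cycles.items()}
    T = max(period.values())                       # overall token period
    slack = {name: T - tc for name, tc in period.items()}
    stream_period = {}                             # T_e: worst cycle through each stream
    for name, c in cycles.items():
        for e in c["streams"]:
            stream_period[e] = max(stream_period.get(e, 0), period[name])
    crit = {e: 1 - (T - te) / T for e, te in stream_period.items()}
    return T, slack, crit

cycles = {
    "C1": {"delays": 3, "tokens": 1, "streams": {"a", "b"}},
    "C2": {"delays": 2, "tokens": 1, "streams": {"b", "c"}},
}
T, slack, crit = analyze(cycles)
```

Streams `a` and `b` lie on the slowest cycle and get criticality 1, so a throughput-aware placer should keep them short; stream `c` has one clock period of slack and can absorb one extra pipeline register without lowering throughput.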

62
Stream Buffer Sizing

Fixed size buffers in hardware
For minimum area (want smallest feasible queue)
For performance (want deep enough to avoid stalls
from producer-consumer timing mismatch)
Semantic gap
Buffers are unbounded in TDF,
bounded in HW
Small buffer may create artificial deadlock
(bufferlock)
Theorem: memory bound is undecidable for a Turing-complete process network
In practice, our buffering requirements are
small

Bounded
Unbounded
63
Dealing with Undecidability

Handle unbounded streams
Buffer expansion [Parks '95]
Detect bufferlock, expand buffers
Hardware implementation
Buffer expansion = rewire to another queue
Storage in off-chip memory or queue bank
Guarantee depth bound for some cases
User depth annotation
Analysis
Identify compatible SFSMs with balanced
schedules
Detect bufferlock and fail
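A software sketch of buffer expansion in the style of Parks' bounded scheduling: run with small buffers and, when every process is blocked and at least one is blocked on a write, grow the smallest full buffer. Processes are modeled as Python generators; the `run` scheduler and the `("put", ...)` / `("get", ...)` protocol are assumptions made for this illustration, not the hardware mechanism.

```python
from collections import deque

def run(processes, capacity):
    """Run generator processes over bounded FIFOs, expanding buffers on bufferlock.

    Each process yields ("put", stream, value) or ("get", stream); the value read
    by a get is sent back into the generator. `capacity` maps stream -> slots and
    is updated in place whenever a buffer has to grow.
    """
    queues = {stream: deque() for stream in capacity}
    pending = {p: next(p) for p in processes}  # process -> operation it waits on
    while pending:
        progressed = False
        for proc, op in list(pending.items()):
            kind, stream = op[0], op[1]
            queue = queues[stream]
            if kind == "put" and len(queue) < capacity[stream]:
                queue.append(op[2])
                reply = None
            elif kind == "get" and queue:
                reply = queue.popleft()
            else:
                continue  # blocked in this round
            progressed = True
            try:
                pending[proc] = proc.send(reply)
            except StopIteration:
                del pending[proc]
        if not progressed:
            blocked_writes = [op[1] for op in pending.values() if op[0] == "put"]
            if not blocked_writes:
                raise RuntimeError("true deadlock: all processes blocked on reads")
            smallest = min(blocked_writes, key=capacity.get)
            capacity[smallest] += 1  # buffer expansion
    return capacity

def producer(n):
    for k in range(n):
        yield ("put", "a", k)    # n tokens on stream a ...
    yield ("put", "b", "done")   # ... before the first token on stream b

def consumer(n, out):
    out.append((yield ("get", "b")))  # reads b first, so stream a must buffer n tokens
    for _ in range(n):
        out.append((yield ("get", "a")))

received = []
final_capacity = run([producer(4), consumer(4, received)], {"a": 1, "b": 1})
```

The producer and consumer disagree on stream order, so a depth-1 buffer on `a` causes bufferlock even though the network is deadlock-free with unbounded streams; expansion stops once `a` can hold all four tokens.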

64
Interface Automata
[de Alfaro + Henzinger, Symp. Found. SW Eng. (FSE) 2001]

A finite state machine that transitions on I/O actions
Not input-enabled (not every I/O on every cycle)
$G = (V, E, A_i, A_o, A_h, V_{start})$
$A_i$ = input actions, x? (in CSP notation)
$A_o$ = output actions, y!
$A_h$ = internal actions, z
$E \subseteq V \times (A_i \cup A_o \cup A_h) \times V$ (transition on action)
Execution trace $(v, a, v, a, \ldots)$ (non-deterministic branching)

65
Automata Composition

Composition = product FSM with synchronization (rendezvous) on common actions

Composition edges: (i) step A on unshared action, (ii) step B on unshared action, (iii) step both on shared action
Compatible composition → bounded memory
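The three kinds of composition edge can be sketched as a reachable-product construction. This is a simplification under stated assumptions: automata are plain dictionaries, actions are matched by name without distinguishing input from output, and the illegal-state analysis of interface automata is omitted. The example automata are hypothetical.

```python
def compose(a, b):
    """Reachable product of two automata, synchronizing on shared action names."""
    shared = a["actions"] & b["actions"]
    start = (a["start"], b["start"])
    edges, seen, frontier = [], {start}, [start]
    while frontier:
        state = frontier.pop()
        va, vb = state
        steps = []
        for src, act, dst in a["edges"]:
            if src == va and act not in shared:
                steps.append((act, (dst, vb)))       # (i) step A on unshared action
        for src, act, dst in b["edges"]:
            if src == vb and act not in shared:
                steps.append((act, (va, dst)))       # (ii) step B on unshared action
        for sa, act_a, da in a["edges"]:
            for sb, act_b, db in b["edges"]:
                if sa == va and sb == vb and act_a == act_b and act_a in shared:
                    steps.append((act_a, (da, db)))  # (iii) step both on shared action
        for act, nxt in steps:
            edges.append((state, act, nxt))
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return {"start": start, "states": seen, "edges": edges}

producer = {"start": "p0", "actions": {"work", "x"},
            "edges": [("p0", "work", "p1"), ("p1", "x", "p0")]}
consumer = {"start": "c0", "actions": {"x", "use"},
            "edges": [("c0", "x", "c1"), ("c1", "use", "c0")]}
product = compose(producer, consumer)
deadlocked = {s for s in product["states"]
              if not any(src == s for src, _, _ in product["edges"])}
```

A reachable product state with no outgoing edge is a deadlock; here there are none, so this pair can run forever with a rendezvous on `x`.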
66
Stream Buffer Bounds Analysis

Given a process network, find minimum buffer sizes to avoid bufferlock
Buffer (queue) is also automaton
Symbolic Parks algorithm
Compose network using arbitrary buffer sizes
If deadlock, try larger sizes
Practical considerations: avoiding state explosion
Multi-action automata
Know which streams to expand first
Compose pairwise in clever order
Composition is associative
Cull states reaching deadlock
Partition system

67
SFSM Decomposition (Partitioning)

Why decompose
To improve locality
To fit into custom page resources
Decomposition by state clustering
1 state (i.e. 1 cluster) active at a time
Cluster states to contain transitions
Fast local transitions, slow external trans.
Formulation: minimize cut of transition probability under area, I/O constraints
Similar to
VLIW trace scheduling [Fisher '81]
FSM decomp. for low power [Benini/DeMicheli, ISCAS '98]
GarpCC HW/SW partitioning [Callahan '00]
VM/cache code placement

68
Early SFSM Decomposition Results

Approach 1: Balanced, multi-way, min-cut
Modified Wong FBB [Yang+Wong, ACM '94]
Edge weight is mix: c·(transition probability) + (1-c)·(wire bits)
Poor at simultaneous I/O constraint + cut optimization
Approach 2: Spectral order + Extent cover
Spectral ordering clusters connected components in 1D
Minimize squared weighted distance, weight is mix (as above)
Then choose area + I/O feasible extents [start, end] using dynamic programming
Effective for partitioning to custom page resources
Under 2% external transitions
Amdahl's law: few slow transitions → small performance loss
Achievable with either approach

69
Summary

Streaming addresses large system design
challenges
Growing interconnect delay; design complexity
Flexibly timed module interfaces; reuse
Methodology to compile SCORE applications to Virtex-II Pro
Language + compiler support for streaming
Characterized 7 applications on Virtex-II Pro
Queue area 38%, flow control FSM area 6%
Improve by merging SFSMs, eliminating queues
Stream pipelining
For logic; for interconnect
Stream based optimizations
Pipelining, queue sizing, module merging, partitioning, placement, serialization

70
Supplemental Material
71
TDF → Dataflow Process Networks

Dataflow Process Networks [Lee+Parks, IEEE May '95]
Process enabled by set of firing rules $R = \{ r_1, r_2, \ldots, r_K \}$
Firing rule = set of input patterns $r_i = ( r_{i,1}, r_{i,2}, \ldots, r_{i,M} )$
DF process for a TDF operator
Feedback arc for state
Firing rule(s) per state
Patterns match state + input presence
E.g. for state $\sigma$: $r_\sigma = ( \sigma, r_{\sigma,1}, r_{\sigma,2}, \ldots )$
Patterns: $r_{\sigma,j} = [*]$ if input j is in the input signature of state $\sigma$; $r_{\sigma,j} = \bot$ if input j is not in the input signature of state $\sigma$
Single firing rule per state → DFPN sequential firing rules
Multiple firing rules per state translate the same way, with restrictions to retain determinism

72
SFSM Partitioning Transform

Only 1 partition active at a time
Transform to activate via streams
New state in each partition: "wait"
Used when not active
Waits for activation from other partition(s)
Has one input signature (firing rule) per
activator
Firing rules are not sequential, but determinism guaranteed
Only 1 possible activator
Activation streams from given source to given dest. partitions can be merged + binary-encoded

73
Virtual Paged Hardware (SCORE)