Title: Design Automation for Streaming Systems
1Design Automation for Streaming Systems
- Eylon Caspi
- University of California, Berkeley
- 12/2/05
2Outline
- Streaming for Hardware
- From Programming Model to Hardware Model
- Synthesis Methodology for FPGA
- Streams, Queues, SFSM Logic
- Characterization of 7 Multimedia Apps
- Optimizations
- Pipelining, Placement, Queue Sizing, Decomposition
3Large System Design Challenges
- Devices growing with Moore's Law
- AMD Opteron dual core CPU: 230M transistors
- Xilinx Virtex 4 / Altera Stratix-II FPGAs: 200K LUTs
- Problems of DSM, large systems
- Growing interconnect delay, timing closure
- Routing delays typically account for 45% to 65% of the total path delays (Xilinx Constraints Guide)
- Slow place-and-route
- Design complexity
- Designs do not scale well on next gen. device; must redesign
- Same problems in FPGAs
4Limitations of RTL
- RTL = Register Transfer Level
- Fully exposed timing behavior
- always @(posedge clk) ...
- Laborious, error prone
- Unpredictable interconnect delay
- How deep to pipeline?
- Redesign on next-gen device
- Undermines reuse
- Existing solutions
- Modular design, floorplanning
- Physical synthesis, hierarchical CAD
- Latency insensitive communication
5Streams
- A better communication abstraction
- Streams connect modules
- FIFO buffered channel (queue)
- Blocking read
- Timing independent (deterministic)
- Robust to communication delay
- Pipeline across long distances
- Robust to unknown delay
- Post-placement pipelining
- Alternate transport (packet switched NOC)
- Flexibly timed module interfaces
- Robust to module optimization (pipeline, reschedule, etc.)
- Enhances modular design and reuse
6Streaming Applications
- Persistent compute structure (infrequent changes)
- Large data sets, mostly sequential access
- Limited feedback
- Implement with deep, system level pipelining
- E.g. DSP, multimedia
- JPEG Encoder
7Ad Hoc Streaming
- Every module needs streaming flow control
- Block if inputs not available, output not ready to receive
- Every stream needs queueing
- Pipeline to match interconnect delay
- Queue to absorb delay mismatch, dynamic rates
- Manual implementation, in HDL
- Laborious (flow control, queues)
- Error prone (deadlock if violate protocol, queue too small)
- No automation (pipeline depth, queue choice / width / depth)
- Interconnect / queue IP (e.g. OCP / Sonics Bus)
- Still no automation
8Systematic Streaming
- Strong stream semantics: Process Networks
- Stream = FIFO channel with (flavor of) blocking read
- E.g. Kahn Process Networks, Dataflow Process Networks (E.A. Lee)
- Streams as programming primitive
- Language support hides flow control
- Compiler support
- Compiler generated flow control
- Compiler controlled pipelining, queue depth, queue impl.
- Compiler optimizations (e.g. module merging, partitioning)
- Benefits
- Easy, correct, high performance, portable
- Paging / Virtualization is a logical extension (Automatic page partitioning)
9Outline
- Streaming for Hardware
- From Programming Model to Hardware Model
- Synthesis Methodology for FPGA
- Streams, Queues, SFSM Logic
- Characterization of 7 Multimedia Apps
- Optimizations
- Pipelining, Placement, Queue Sizing, Decomposition
10SCORE Model
Stream Computations Organized for Reconfigurable Execution
- Application = Graph of stream-connected operators
- Operator = Process with local state
- Stream = FIFO channel, unbounded capacity, blocking read
- Segment = Memory, accessed via streams
- Dynamics
- Dynamic I/O rates
- Dynamic graph construction
- (omitted in this work)
11SCORE Programming Model: TDF
- TDF = behavioral language for
- SFSM Operators (Streaming Extended Finite State Machine)
- Static operator graphs
- State machine for
- Firing control, sequencing, branching
- Firing semantics
- In state X, wait for X's inputs, then evaluate X's action
[Figure: operator with input streams i and j and output stream o; example TDF state "state foo (i, j)" computes o from i and j, then "goto bar"]
12SCORE / TDF Process Networks
- A process from M inputs to N outputs, unified stream type S (i.e. $S^M \to S^N$)
- SFSM $= (\Sigma, \sigma_0, \sigma, R, f_{NS}, f_O)$
- $\Sigma$ = Set of states
- $\sigma_0 \in \Sigma$ = Initial state
- $\sigma \in \Sigma$ = Present state
- $R \subseteq (\Sigma \times S^M)$ = Set of firing rules
- $f_{NS} : R \to \Sigma$ = Next state function
- $f_O : R \to S^N$ = Output function
- Similar to dataflow process networks [Lee+Parks, IEEE May '95], but with stateful actors
13Related Streaming Models
- Streaming Models
- Kahn PN, DFPN, BDF, SDF, CSDF, HDF, StreamsC, YAPI, Catapult C, SHIM
- Streaming Platforms
- Pleiades, Philips VSP, Imagine, TRIPS
- How do we differ?
- Stateful processes
- Deterministic
- Dynamic dataflow rates (FSM nodes)
- Direct synthesis to hardware
- Bounded Buffers
14Streaming Platforms
- FPGA (this work)
- Paged FPGA
- Page = fixed size partition, connected by streams
- Stylized page-to-page interconnect
- Hierarchical PAR
- Paged, Virtual FPGA (SCORE)
- Time shared pages
- Area abstraction (virtually large)
- Multiprocessor on Chip
- Heterogeneous
15The Compilation Problem
- Compile: Programming Model (TDF) → Execution Model (FPGA)
- Communicating SFSMs → Single circuit / configuration
- Unrestricted size, I/Os, timing → One or more clocks
- Unbounded stream buffering → Fixed size queues
- Big semantic gap
16The Semantic Gap
- Semantic gap between TDF, HW
- Need to bind
- Stream protocol
- Stream pipelining
- Queue implementation
- Queue depths
- SFSM synthesis style (behavioral synthesis)
- Memory allocation
- Primary I/O
- SCORE device binds some implementation decisions (custom hardware), raw FPGA does not
- Want to characterize cost of implementation decisions
17Outline
- Streaming for Hardware
- From Programming Model to Hardware Model
- Synthesis Methodology for FPGA
- Streams, Queues, SFSM Logic
- Characterization of 7 Multimedia Apps
- Optimizations
- Pipelining, Placement, Queue Sizing, Decomposition
18Compilation Tool Flow
Flow: Application (TDF) → tdfc → Verilog → Synplify → EDIF (unplaced LUTs, etc.) → Xilinx ISE → device configuration bits
tdfc
- Local optimization
- System optimization
- Queue sizing
- Pipeline extraction
- SFSM partitioning / merging
- Pipelining
- Generate flow ctl, streams, queues
Synplify
- Behavioral synthesis
- Retiming
Xilinx ISE
- Slice packing
- Place and route
19Wire Protocol for Streams
- D = Data, V = Valid, B = Backpressure
- Synchronous transaction protocol
- Producer asserts V when D ready, consumer deasserts B when ready
- Transaction commits if (¬B ∧ V) at clock edge
- Encode EOS E as extra D bit (out of band, easy to enqueue)
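The commit rule above can be modeled cycle by cycle. The following is a minimal behavioral sketch in Python, not the generated hardware; the function names `commits` and `transfer` and the `ready_pattern` argument are illustrative assumptions.

```python
def commits(valid: bool, backpressure: bool) -> bool:
    """A token transfers at a clock edge only when V is high and B is low."""
    return valid and not backpressure


def transfer(tokens, ready_pattern):
    """Cycle-by-cycle model of one stream.

    The producer holds V high while it still has tokens to send.
    The consumer drives B low on cycles where ready_pattern is True.
    Returns the tokens that committed, in order.
    """
    pending = list(tokens)
    received = []
    for ready in ready_pattern:
        v = bool(pending)      # V: data is available
        b = not ready          # B: consumer cannot accept this cycle
        if commits(v, b):
            received.append(pending.pop(0))
    return received


# The token order is preserved no matter when the consumer stalls.
print(transfer([1, 2, 3], [True, False, True, True, False]))
```

An end-of-stream marker could travel in the same model as one extra field per token, mirroring the extra D bit on the slide.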
20Operator Firing
- In state X, fire if
- Inputs desired by X are ready (Valid, EOS)
- Outputs emitted by X are ready (Backpressure)
- Firing guard / control flow
- if (iv && !ie && !ob) begin ib=0; ov=1; ... end
- Subtlety: master, slave
- Operator is slave
- To synchronize streams, (1) wait for flow control in, (2) fire / emit out
- Connecting two slaves would deadlock
- Need master (queue) between every pair of operators
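As a rough illustration of the firing rule, the guard for one state can be written as a pure function of the flow control signals. This is a sketch under assumptions: the `InPort` type and the `can_fire` name are hypothetical, and each consumed input is assumed to carry a valid bit and an EOS bit.

```python
from dataclasses import dataclass


@dataclass
class InPort:
    valid: bool  # V from the upstream queue
    eos: bool    # E bit of the token at the head of that queue


def can_fire(inputs, want_eos, outputs_backpressured):
    """Firing guard for one state.

    inputs: ports the state consumes.
    want_eos: for each consumed port, whether the state expects EOS.
    outputs_backpressured: B for each stream the state emits to.
    """
    inputs_ok = all(p.valid and p.eos == e for p, e in zip(inputs, want_eos))
    outputs_ok = not any(outputs_backpressured)
    return inputs_ok and outputs_ok


# A state with signature (x, eos(y)) that writes one output stream:
x = InPort(valid=True, eos=False)
y = InPort(valid=True, eos=True)
print(can_fire([x, y], [False, True], [False]))
```

When the guard is false the operator stalls: it keeps backpressure asserted on its inputs and emits nothing.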
21SFSM Synthesis
- Implemented as behavioral Verilog, using state case in FSM and DP
- FSM handles firing control, branching
- FSM sends state to DP
- DP sends bool. flags to FSM
[Figure: FSM block connected to a Datapath block containing "Datapath for State 1" and "Datapath for State 2"]
22FSM Module, Firing Control
TDF:
  foo (input unsigned[16] x, input unsigned[16] y, output unsigned[16] o)
  { state one (x, eos(y)) : o = x + 1; ... }
Verilog (FSM module):
  module foo_fsm (clock, reset, x_e, x_v, x_b, y_e, y_v, y_b,
                  o_e, o_v, o_b, state, statecase);
    ...
    always @* begin
      // Default is stall
      x_b_ = 1;  y_b_ = 1;  o_e_ = 0;  o_v_ = 0;
      state_reg_ = state_reg;  statecase_ = statecase_stall;  did_goto_ = 0;
      case (state_reg)
        state_one: begin
          // Firing condition(s) for state one
          if (x_v && !x_e && y_v && y_e && !o_b) begin
            statecase_ = statecase_1;
            // Stream flow ctl for state one
            x_b_ = 0;  y_b_ = 0;  o_v_ = 1;  o_e_ = 0;
          end
          ...
    end // always @*
  endmodule // foo_fsm
23Data-Path Module
TDF:
  foo (input unsigned[16] x, input unsigned[16] y, output unsigned[16] o)
  { state one (x, eos(y)) : o = x + 1; ... }
Verilog (data-path module):
  module foo_dp (clock, reset, x_d, y_d, o_d, state, statecase);
    ...
    always @* begin
      // Default is stall
      o_d_ = 16'bx;  did_goto_ = 0;
      case (state)
        state_one: begin
          // Firing condition(s) for state one
          if (statecase_ == statecase_1) begin
            // Data-path for state one
            o_d_ = (x_d + 1'd1);
          end
          ...
    end // always @*
  endmodule // foo_dp
24Stream Buffers (Queues)
- Systolic
- Cascade of depth-1 stages (or depth-N)
- Shift register
- Put: shift all entries
- Get: tail pointer
- Circular buffer
- Memory with head / tail pointers
25Enabled Register Queue
- Systolic, depth-1 stage
- 1 state bit (empty/full) = V
- Shift in data unless
- Full and downstream not ready to consume queued element
- Area ≈ 1 FF per data bit
- On FPGA ≈ 1 LUT cell per data bit
- Depth-1 (single stage) nearly free, since FFs pack with logic
- Speed: as fast as FF
- But combinationally connects producer and consumer via B
26Xilinx SRL16
- SRL16 = Shift register of depth 16 in one 4-LUT cell
- Shift register of arbitrary width: parallel SRL16; arbitrary depth: cascade SRL16
- Improve queue density by 16x
27Shift Register Queue
- State: empty bit + capacity counter
- Data stored in shift register
- In at position 0
- Out at position Address
- Address = number of stored elements minus 1
- Synplify infers SRL16E from Verilog array
- Parameterized depth, width
- Flow control
- ov = (State == Non-Empty)
- ib: asserted when the queue is full (Address has reached Depth-1)
- Performance improvements
- Registered data out
- Registered flow control
- Specialized, pre-computed fullness
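The addressing scheme of the basic shift register queue can be checked with a small behavioral model. This is a sketch only: it models the unregistered variant, assumes that input backpressure is asserted exactly when the queue is full, and uses a hypothetical class name and `clock` method.

```python
class ShiftRegisterQueue:
    """Behavioral model: data enters at position 0, leaves at position Address."""

    def __init__(self, depth):
        self.depth = depth
        self.regs = [None] * depth
        self.empty = True
        self.address = 0  # stored elements minus 1 (meaningful when non-empty)

    @property
    def ov(self):
        """Output valid: the queue holds at least one element."""
        return not self.empty

    @property
    def ib(self):
        """Input backpressure: assumed asserted when the queue is full."""
        return (not self.empty) and self.address == self.depth - 1

    def clock(self, put=None, get=False):
        """One clock edge. put: data to enqueue (None = no put); get: dequeue."""
        do_put = put is not None and not self.ib
        do_get = get and self.ov
        out = self.regs[self.address] if do_get else None
        if do_put:
            self.regs = [put] + self.regs[:-1]  # shift all entries by one
        if do_put and not do_get:
            if self.empty:
                self.empty = False              # first element sits at address 0
            else:
                self.address += 1
        elif do_get and not do_put:
            if self.address == 0:
                self.empty = True
            else:
                self.address -= 1
        return out


q = ShiftRegisterQueue(depth=2)
q.clock(put="a")
q.clock(put="b")
print(q.ib, q.clock(get=True), q.clock(get=True), q.ov)
```

A simultaneous put and get leaves Address unchanged, because the shift moves the next-oldest element into the position that was just read.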
28SRL Queue with Registered Data Out
- Registered data out
- od: clock-to-Q delay
- Non-retimable
- Data output register extends shift register
- Bypass shift register when queue empty
- 3 States
- Address = number of stored elements minus 2
- Flow control
- ov = !(State == Empty)
- ib = (Address == Depth-2)
29SRL Queue with Registered Flow Ctl.
- Registered flow ctl.
- ov: clock-to-Q delay
- ib: clock-to-Q delay
- Non-retimable
- Flow control
- ov_next = !(State_next == Empty)
- ib_next = (Address_next == Depth-2)
- Based on pre-computed fullness
- full_next = (Address_next == Depth-2)
30SRL Queue with Specialized, Pre-Computed Fullness
- Speed up critical full pre-computation by special-casing full_next for each state
- Flow control
- ov_next = !(State_next == Empty)
- ib_next = full_next
- zero pre-computation is less critical
- Result
- >200MHz unless very large (e.g. 128 x 128)
- All output delays are clock-to-Q
- Area ≈ 3 x (SRL16E area)
31SRL Queue Speed
32SRL Queue Area
33Page Synthesis
- Page = Cluster of Operator(s) + Queues
- SFSMs
- One or more per page
- Further decomposed into FSM, data-path
- Page Input Queues
- Deep
- Drain pipelined page-to-page streams before reconfiguration
- In-page Queues
- Shallow
- Separately Synthesizable Modules
- Separate characterization
- Consider custom resources
34Page Synthesis
35Outline
- Streaming for Hardware
- From Programming Model to Hardware Model
- Synthesis Methodology for FPGA
- Streams, Queues, SFSM Logic
- Characterization of 7 Multimedia Apps
- Optimizations
- Pipelining, Placement, Queue Sizing, Decomposition
36Tool Flow, Revisited
Flow: Application → tdfc → Verilog → Synplify → EDIF (unplaced LUTs, etc.) → Xilinx ISE → device configuration bits
- Separate compilation for application, SFSMs
- Page SFSM
- Datapath FSM
tdfc tool options
- Identical queuing for every stream (SRL16 based, depth 16)
- I/O boundary regs (for Xilinx static timing analysis)
Synplify 8.0
- Target 200MHz
- Optimize FSM, retiming, pipelining
- Retain monolithic FSM encodings
Xilinx ISE 6.3i
- Constrain to minimum square area, at least max slice packing 20, expand if fail PAR
37PAR Flow for Minimum Area
- EDIF netlist from Synplify
- Constraints file
- Page area
- Target Period
- ngdbuild
- Convert netlist EDIF → NGD
- map
- Pack LUTs, MUXes, etc. into slices
- trce (pre-PAR)
- Static timing analysis, logic only
- par
- Place and route
- trce (post-PAR)
- Static timing analysis
[Flowchart: EDIF + Constraints → ngdbuild → map → Ok? (no: Target = packed slices) → trce (Target = packed timing) → par → Ok? (no: Target += 1 extra row/col) → trce]
38SCORE Applications
- 7 Multimedia Applications / 279 Operators
- MPEG, JPEG, Wavelet, IIR; written by Joe Yeh
- Mostly feed-forward streaming
- Constant consumption / production ratios, except compressors (ZLE, Huffman)
39Page Area
DCT, IDCT
- 87% of SFSMs are smaller than 512 LUTs by design
- FSMs small → Datapaths dominate in most large pages
40Page Speed
[Figure: page speed chart with values 43, 47, 10]
- FSMs (flow control) are fast, never critical
- Queues are critical for 1/3 fastest pages; datapaths dominate
41Outline
- Streaming for Hardware
- From Programming Model to Hardware Model
- Synthesis Methodology for FPGA
- Streams, Queues, SFSM Logic
- Characterization of 7 Multimedia Apps
- Optimizations
- Pipelining, Placement, Queue Sizing, Decomposition
42Improving Performance, Area
- Local (module) Optimization
- Traditional compiler optimization
- (const folding, CSE, etc.)
- Datapath pipelining / loop scheduling
- Granularity transforms
- (composition / decomposition)
- System Level Optimization
- Interconnect pipelining
- Shrink / remove queues
- Area-time transformations
- (rate matching, serialization, parallelization)
43Pipelining With Streams
- Datapath pipelining
- Add registers at output (or input)
- Retime into place
- Harder in practice (FSM, cycles)
- Add registers at strategic locations
- Rewrite control
- Avoid violating communication protocol
- Stream pipelining
- Add registers on streams
- Retime into datapath
- Modify queues, not processes
44Logic Pipelining
- Add L pipeline registers to D, V
- Retime backwards
- This pipelines feed-forward parts of producer's data-path
- Stale flow control may overflow queue (by L)
- Modify queue to emit back-pressure when empty slots ≤ L
- No manual modification of processes
[Figure: Producer drives D (Data) and V (Valid) through L retimable registers into a Queue with L Reserve, which feeds the Consumer; B (Backpressure) returns from the queue to the producer]
45Logic Relaying + Retiming
- Break-up deep logic in a process
- Relay through enabled register queue(s)
- Retime registers into adjacent process
- This pipelines feed-forward parts of process's datapath
- Can retime into producer or consumer
- No manual modification of processes
[Figure: Producer → enabled register stage(s) (en) with retiming → Original Queue → Consumer; D, V, B signals on each hop]
46Benefits, Limitations
- Benefits
- Simple to implement, relies only on retiming
- Sufficient for many cases, e.g. DCT, IDCT
- Limitations
- Feed-forward only (weaker than loop sched.)
- Resource sharing obfuscates retiming opportunities
- Extends to interconnect pipelining
- Do not retime into logic; register placement only
- Also pipeline B, modify queue
47Pipelining Configuration
- Pipeline depth parameters: Li + Lp + Lr
- Uniform pipelining: same depths for every stream
48Speedup from Logic Pipelining
[Figure axes: Enabled Regs (Lr), D FFs (Lp)]
49Expansion from Logic Pipelining
[Figure axes: Enabled Regs (Lr), D FFs (Lp)]
50Some Things Are Better Left Unpipelined
- Page speedup
- Page expansion
- Initially fast pages should not be pipelined
51Page Specific Logic Pipelining
- Separate pipelining of each SFSM
- Assumption: application speed = slowest page speed
- Critical Page
- Repeatedly improve slowest page until no further improvement is possible
- Page improvement heuristics
- Greedy Lr: Add one level of pipelining in 0+0+Lr
- Greedy Lp: Add one level of pipelining in 1+Lp+0
- Max: Pipeline to best page speed (brute force)
- Greedy heuristics may end early
- Non-monotonicity: adding a level of pipelining may slow page
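The critical-page loop with a greedy heuristic can be sketched as follows. This is an illustration under assumptions, not the tool's implementation: `speed_of` stands in for a per-page characterization (page speed in MHz at a given pipelining depth), and the page names and speed tables are made up.

```python
def improve_critical_pages(pages, speed_of, max_steps=100):
    """Repeatedly add one pipelining level to the slowest page.

    pages: dict mapping page name to its starting pipelining depth.
    speed_of(page, depth): hypothetical characterization, returns MHz.
    Stops as soon as one more level does not speed up the slowest page.
    """
    depth = dict(pages)
    for _ in range(max_steps):
        critical = min(depth, key=lambda p: speed_of(p, depth[p]))
        now = speed_of(critical, depth[critical])
        nxt = speed_of(critical, depth[critical] + 1)
        if nxt <= now:
            break  # greedy ends here, even if a deeper pipeline would be faster
        depth[critical] += 1
    app_speed = min(speed_of(p, d) for p, d in depth.items())
    return depth, app_speed


# Made-up speed tables; "huff" is non-monotonic (depth 2 is slower than depth 1).
speeds = {"dct": [100, 140, 150, 150], "huff": [120, 125, 110, 160]}


def lookup(page, d):
    table = speeds[page]
    return table[min(d, len(table) - 1)]


print(improve_critical_pages({"dct": 0, "huff": 0}, lookup))
```

In this example the greedy search stops at depth 1 for "huff" and never sees the faster depth 3 entry, which is the early termination the slide warns about; a brute-force search over depths would find it.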
52Speedup from Page Specific
[Figure axes: Enabled Regs (Lr), D FFs (Lp)]
53Expansion from Page Specific
[Figure axes: Enabled Regs (Lr), D FFs (Lp)]
54Interconnect Delay
- Critical routing delay grows with circuit size
- Routing delay for an application: avg. 45% - 56%
- Routing delay for its slowest page: avg. 40% - 50%
- Ratio (appl. to slowest page): avg. 0.99x - 1.34x
- Averaged over 7 apps / varies with logic pipelining
- Modular design helps
- Retain critical routing delay of page, not application
- Page-to-page delays (streams) can be pipelined
55Interconnect Pipelining
- Add W pipeline registers to D, V, B
- Mobile registers for placer; not retimable
- Stale flow control may overflow queue (by 2W)
- Staleness = total delay on B-V feedback loop = 2W
- Modify downstream queue to emit back-pressure when empty slots ≤ 2W
[Figure: Producer drives D (Data) and V (Valid) over a long distance through W registers into a Queue with 2W Reserve, which feeds the Consumer; B (Backpressure) returns to the producer through W registers]
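The reserve requirement can be checked with a worst-case simulation. This sketch assumes a particular timing model: the queue asserts B when its empty slots are at or below the reserve, a token needs `fwd` cycles to reach the queue, B needs `back` cycles to reach the producer, the producer always has data, and the consumer never drains. The function name and parameters are illustrative.

```python
def max_occupancy(capacity, reserve, fwd, back, cycles=200):
    """Peak queue occupancy under stale flow control (worst case)."""
    occ = 0
    in_flight = []   # arrival cycles of tokens travelling on D/V
    b_history = []   # B as emitted by the queue, one entry per cycle
    peak = 0
    for t in range(cycles):
        b_now = (capacity - occ) <= reserve
        b_history.append(b_now)
        # The producer sees B as it was `back` cycles ago.
        b_seen = b_history[t - back] if t >= back else False
        if not b_seen:
            in_flight.append(t + fwd)
        # Tokens whose forward delay has elapsed are written at this edge.
        arrived = in_flight.count(t)
        in_flight = [a for a in in_flight if a != t]
        occ += arrived
        peak = max(peak, occ)
    return peak


W = 2
print(max_occupancy(16, 2 * W, W, W))      # reserve 2W: fills exactly to capacity
print(max_occupancy(16, 2 * W - 1, W, W))  # reserve 2W-1: exceeds capacity
```

With `back=0` the same model covers logic pipelining, where only D and V are delayed and a reserve of L suffices.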
56Speedup from Interconnect Pipelining
57Speedup from Interconnect Pipelining, No Area Constraint
58Expansion from Interconnect Pipelining, No Area Constraint
59Interconnect Register Allocation
- Commercial FPGAs / tool flows
- No dedicated interconnect registers
- Allocation: add to netlist, slice pack, place-and-route
- If pack registers with logic → limited register mobility
- If pack registers alone → area overhead
- Better: Post-placement register allocation
- Weaver et al., "Post-Placement C-Slow Retiming for the Xilinx Virtex FPGA," FPGA 2003
- Allocation: PAR, c-slow, retime, scavenge registers, reroute
- No area overhead (scavenge registers from existing placement)
- Better performance, since know routing delay
- Modification for streaming
- PAR, pipeline, retime, scavenge registers, reroute, modify queue depths (configuration specialization)
60Throughput Modeling
- Pipelining feedback loops may reduce throughput (tokens per clock period)
- Which loops / streams are critical?
- Throughput model for PN
- Feedback cycle C with M tokens, N pipe delays, has token period $T_C = N/M$ (clock periods per token)
- Overall token period $T = \max_C T_C$
- Available slack $\mathrm{CycleSlack}_C = (T - T_C)$
- Generalize to multi-rate, dynamic rate by unfolding equivalent single-rate PN
[Figure: example network with two feedback cycles, $T_{C1} = 3$ and $T_{C2} = 2$]
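The throughput model reduces to a few lines of arithmetic. The sketch below assumes the token period of a cycle is its pipe delays divided by its tokens (clock periods per token), which reproduces the figure's values of 3 and 2 for cycles holding one token each; the function names and the cycle table are illustrative.

```python
from fractions import Fraction


def token_period(tokens, delays):
    """Clock periods per token for one feedback cycle (assumed delays / tokens)."""
    return Fraction(delays, tokens)


def analyze(cycles):
    """cycles: dict mapping cycle name to (tokens M, pipe delays N).

    Returns per-cycle token periods, the overall token period,
    and the slack of each cycle.
    """
    periods = {c: token_period(m, n) for c, (m, n) in cycles.items()}
    overall = max(periods.values())
    slack = {c: overall - t for c, t in periods.items()}
    return periods, overall, slack


periods, overall, slack = analyze({"C1": (1, 3), "C2": (1, 2)})
print(periods, overall, slack)
```

Here C1 is the critical loop with zero slack, while C2 could absorb one more pipeline delay before it starts to limit throughput.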
61Throughput Aware Optimizations
- Throughput aware placement
- Adapt [Singh+Brown, FPGA 2002]
- Stream slack: $T_e = \max_{C \,:\, e \in C} T_C$
- Stream net criticality: $\mathrm{crit} = 1 - ((T - T_e) / T)$
- Throughput aware pipelining
- Pipeline stream w/o exceeding slack
- Pipeline module s.t. depth does not exceed any output stream slack
- Pipeline balancing (by retiming)
- Process Serialization
- Serial arithmetic for process with low
throughput, high slack
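A small worked example of the stream criticality formula may help. This sketch assumes each stream's token period is the largest token period among the feedback cycles that contain it; streams that lie on no cycle are left out. The stream and cycle names are made up.

```python
from fractions import Fraction


def stream_criticality(stream_cycles, cycle_period):
    """crit = 1 - ((T - Te) / T) for each stream e.

    stream_cycles: dict mapping stream to the cycles that contain it.
    cycle_period: dict mapping cycle to its token period.
    """
    T = max(cycle_period.values())
    crit = {}
    for e, cs in stream_cycles.items():
        Te = max(cycle_period[c] for c in cs)
        crit[e] = 1 - (T - Te) / T
    return crit


cycle_period = {"C1": Fraction(3), "C2": Fraction(2)}
stream_cycles = {"a": ["C1"], "b": ["C1", "C2"], "c": ["C2"]}
print(stream_criticality(stream_cycles, cycle_period))
```

Streams a and b lie on the critical cycle and get criticality 1; stream c lies only on the faster cycle and gets 2/3, so a placer could give it a longer route.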
62Stream Buffer Sizing
- Fixed size buffers in hardware
- For minimum area (want smallest feasible queue)
- For performance (want deep enough to avoid stalls from producer-consumer timing mismatch)
- Semantic gap
- Buffers are unbounded in TDF, bounded in HW
- Small buffer may create artificial deadlock (bufferlock)
- Theorem: memory bound is undecidable for a Turing complete process network
- In practice, our buffering requirements are small
[Figure: Bounded vs. Unbounded buffers]
63Dealing with Undecidability
- Handle unbounded streams
- Buffer expansion [Parks '95]
- Detect bufferlock, expand buffers
- Hardware implementation
- Buffer expansion = rewire to another queue
- Storage in off-chip memory or queue bank
- Guarantee depth bound for some cases
- User depth annotation
- Analysis
- Identify compatible SFSMs with balanced schedules
- Detect bufferlock and fail
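Detecting bufferlock and expanding buffers can be illustrated with two sequential processes. This is a simplified sketch in the spirit of Parks' buffer expansion, not the hardware mechanism: each process is just a fixed list of stream names to write or read in order, and the function names and the `limit` parameter are assumptions.

```python
from collections import deque


def run(producer, consumer, capacity):
    """Run one producer and one consumer over bounded FIFO streams.

    producer: stream names written in order (blocking when a stream is full).
    consumer: stream names read in order (blocking when a stream is empty).
    Returns ("ok", None), ("bufferlock", stream) or ("starved", stream).
    """
    queues = {s: deque() for s in capacity}
    p = c = 0
    while p < len(producer) or c < len(consumer):
        progressed = False
        if p < len(producer):
            s = producer[p]
            if len(queues[s]) < capacity[s]:
                queues[s].append(p)
                p += 1
                progressed = True
        if c < len(consumer):
            s = consumer[c]
            if queues[s]:
                queues[s].popleft()
                c += 1
                progressed = True
        if not progressed:
            if p < len(producer):
                return "bufferlock", producer[p]  # blocked on a full stream
            return "starved", consumer[c]         # no data will ever arrive
    return "ok", None


def size_buffers(producer, consumer, streams, limit=64):
    """Start with depth 1 everywhere; on bufferlock, grow the full stream."""
    capacity = {s: 1 for s in streams}
    while True:
        status, s = run(producer, consumer, capacity)
        if status != "bufferlock":
            return status, capacity
        if capacity[s] >= limit:
            return "gave up", capacity
        capacity[s] += 1


# The producer writes three tokens to A before B; the consumer reads B first.
print(size_buffers(["A", "A", "A", "B"], ["B", "A", "A", "A"], ["A", "B"]))
```

The `limit` parameter plays the role of a user depth annotation: without some bound the search need not terminate, which reflects the undecidability result.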
64Interface Automata
[de Alfaro + Henzinger, Symp. Found. SW Eng. (FSE) 2001]
- A finite state machine that transitions on I/O actions
- Not input-enabled (not every I/O on every cycle)
- $G = (V, E, A^i, A^o, A^h, V^{start})$
- $A^i$ = input actions x? (in CSP notation)
- $A^o$ = output actions y!
- $A^h$ = internal actions z
- $E \subseteq V \times (A^i \cup A^o \cup A^h) \times V$ (transition on action)
- Execution trace = (v, a, v, a, ...) (non-deterministic branching)
65Automata Composition
- Composition = product FSM with synchronization (rendezvous) on common actions
- Composition edges: (i) step A on unshared action, (ii) step B on unshared action, (iii) step both on shared action
- Compatible Composition ⇒ Bounded Memory
66Stream Buffer Bounds Analysis
- Given a process network, find minimum buffer sizes to avoid bufferlock
- Buffer (queue) is also automaton
- Symbolic Parks algorithm
- Compose network using arbitrary buffer sizes
- If deadlock, try larger sizes
- Practical considerations: avoiding state explosion
- Multi-action automata
- Know which streams to expand first
- Compose pairwise in clever order
- Composition is associative
- Cull states reaching deadlock
- Partition system
67SFSM Decomposition (Partitioning)
- Why decompose
- To improve locality
- To fit into custom page resources
- Decomposition by state clustering
- 1 state (i.e. 1 cluster) active at a time
- Cluster states to contain transitions
- Fast local transitions, slow external trans.
- Formulation: minimize cut of transition probability under area, I/O constraints
- Similar to
- VLIW trace scheduling [Fisher '81]
- FSM decomp. for low power [Benini/DeMicheli ISCAS '98]
- GarpCC HW/SW partitioning [Callahan '00]
- VM/cache code placement
68Early SFSM Decomposition Results
- Approach 1: Balanced, multi-way, min-cut
- Modified Wong FBB [Yang+Wong, ACM '94]
- Edge weight is mix: c*(transition probability) + (1-c)*(wire bits)
- Poor at simultaneous I/O constraint + cut optimization
- Approach 2: Spectral order + Extent cover
- Spectral ordering clusters connected components in 1D
- Minimize squared weighted distance, weight is mix (as above)
- Then choose area + I/O feasible extents [start, end] using dynamic programming
- Effective for partitioning to custom page resources
- Under 2% external transitions
- Amdahl's law: few slow transitions → small performance loss
- Achievable with either approach
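The underlying objective, cutting as little transition probability as possible subject to an area limit, can be shown with an exhaustive two-way search on a toy state machine. This is only an illustration of the formulation: it is neither the flow-based min-cut nor the spectral approach named above, it ignores the I/O constraint, and the state names, areas, and probabilities are made up.

```python
from itertools import combinations


def best_two_way_cut(area, trans_prob, max_area):
    """Exhaustive 2-way state clustering for a small SFSM.

    area: dict mapping state to its area.
    trans_prob: dict mapping (from_state, to_state) to transition probability.
    max_area: area limit for each cluster.
    Returns (cut probability, cluster A, cluster B), or None if infeasible.
    """
    states = sorted(area)
    best = None
    for k in range(1, len(states)):
        for part in combinations(states, k):
            a = set(part)
            b = set(states) - a
            if sum(area[s] for s in a) > max_area:
                continue
            if sum(area[s] for s in b) > max_area:
                continue
            # Probability mass of transitions that cross between clusters.
            cut = sum(p for (u, v), p in trans_prob.items()
                      if (u in a) != (v in a))
            if best is None or cut < best[0]:
                best = (cut, a, b)
    return best


area = {"s0": 2, "s1": 2, "s2": 2, "s3": 2}
trans_prob = {("s0", "s1"): 0.40, ("s1", "s0"): 0.35, ("s1", "s2"): 0.05,
              ("s2", "s3"): 0.10, ("s3", "s2"): 0.08, ("s3", "s0"): 0.02}
print(best_two_way_cut(area, trans_prob, max_area=4))
```

The search keeps the two tight loops (s0 with s1, s2 with s3) inside their own clusters, so only the rare transitions become slow external ones.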
69Summary
- Streaming addresses large system design challenges
- Growing interconnect delay, design complexity
- Flexibly timed module interfaces, reuse
- Methodology to compile SCORE applications to Virtex-II Pro
- Language + compiler support for streaming
- Characterized 7 applications on Virtex-II Pro
- Queue area 38%, flow control FSM area 6%
- Improve by merging SFSMs, eliminating queues
- Stream pipelining
- For logic, for interconnect
- Stream based optimizations
- Pipelining, queue sizing, module merging, partitioning, placement, serialization
70Supplemental Material
71TDF → Dataflow Process Networks
- Dataflow Process Networks [Lee+Parks, IEEE May '95]
- Process enabled by set of firing rules $R = \{ r_1, r_2, \ldots, r_K \}$
- Firing rule = set of input patterns $r_i = ( r_{i,1}, r_{i,2}, \ldots, r_{i,M} )$
- DF process for a TDF operator
- Feedback arc for state
- Firing rule(s) per state
- Patterns match state + input presence
- E.g. for state $\sigma$: $r_\sigma = ( \sigma, r_{\sigma,1}, r_{\sigma,2}, \ldots )$
- Patterns: $r_{\sigma,j} = [*]$ if input j is in input signature of state $\sigma$; $r_{\sigma,j} = \bot$ if input j is not in input signature of state $\sigma$
- Single firing rule per state: DFPN sequential firing rules
- Multiple firing rules per state translate the same way, with restrictions to retain determinism
72SFSM Partitioning Transform
- Only 1 partition active at a time
- Transform to activate via streams
- New state in each partition: wait
- Used when not active
- Waits for activation from other partition(s)
- Has one input signature (firing rule) per activator
- Firing rules are not sequential, but determinism guaranteed
- Only 1 possible activator
- Activation streams from given source to given dest. partitions can be merged + binary-encoded
73Virtual Paged Hardware (SCORE)
- Compute model has unbounded resources
- Programmer does not target a particular device size
- Paging
- Compute pages swapped in/out (like virtual memory)
- Page context = thread (FSM to block on stream access)
- Efficient virtualization
- Amortize reconfiguration cost over an entire input buffer
- Requires working sets of tightly-communicating pages to fit on device
[Figure: pipeline of pages Transform → Quantize → RLE → Encode]