1
Design Automation for Streaming Systems
  • Eylon Caspi
  • University of California, Berkeley
  • 12/2/05

2
Outline
  • Streaming for Hardware
  • From Programming Model to Hardware Model
  • Synthesis Methodology for FPGA
  • Streams, Queues, SFSM Logic
  • Characterization of 7 Multimedia Apps
  • Optimizations
  • Pipelining, Placement, Queue Sizing, Decomposition

3
Large System Design Challenges
  • Devices growing with Moore's Law
  • AMD Opteron dual core CPU: ~230M transistors
  • Xilinx Virtex 4 / Altera Stratix-II FPGAs: 200K
    LUTs
  • Problems of DSM, large systems
  • Growing interconnect delay, timing closure
  • "Routing delays typically account for 45% to 65%
    of the total path delays" (Xilinx Constraints
    Guide)
  • Slow place-and-route
  • Design complexity
  • Designs do not scale well on next-gen. device:
    must redesign
  • Same problems in FPGAs

4
Limitations of RTL
  • RTL = Register Transfer Level
  • Fully exposed timing behavior
  • always @(posedge clk) ...
  • Laborious, error prone
  • Unpredictable interconnect delay
  • How deep to pipeline?
  • Redesign on next-gen device
  • Undermines reuse
  • Existing solutions
  • Modular design, floorplanning
  • Physical synthesis, hierarchical CAD
  • Latency insensitive communication

5
Streams
  • A better communication abstraction
  • Streams connect modules
  • FIFO buffered channel (queue)
  • Blocking read
  • Timing independent (deterministic)
  • Robust to communication delay
  • Pipeline across long distances
  • Robust to unknown delay
  • Post-placement pipelining
  • Alternate transport (packet switched NOC)
  • Flexibly timed module interfaces
  • Robust to module optimization (pipeline,
    reschedule, etc.)
  • Enhances modular design reuse

6
Streaming Applications
  • Persistent compute structure (infrequent
    changes)
  • Large data sets, mostly sequential access
  • Limited feedback
  • Implement with deep, system-level pipelining
  • E.g. DSP, multimedia
  • JPEG Encoder

7
Ad Hoc Streaming
  • Every module needs streaming flow control
  • Block if inputs not available, output not ready
    to receive
  • Every stream needs queueing
  • Pipeline to match interconnect delay
  • Queue to absorb delay mismatch, dynamic rates
  • Manual implementation, in HDL
  • Laborious (flow control, queues)
  • Error prone (deadlock if violate protocol, queue
    too small)
  • No automation (pipeline depth, queue choice /
    width / depth)
  • Interconnect / queue IP (e.g. OCP / Sonics Bus)
  • Still no automation

8
Systematic Streaming
  • Strong stream semantics: Process Networks
  • Stream = FIFO channel with (flavor of) blocking
    read
  • E.g. Kahn Process Networks, Dataflow
    Process Networks (E. A. Lee)
  • Streams as programming primitive
  • Language support hides flow control
  • Compiler support
  • Compiler generated flow control
  • Compiler controlled pipelining, queue depth,
    queue impl.
  • Compiler optimizations (e.g. module merging,
    partitioning)
  • Benefits
  • Easy, correct, high performance, portable
  • Paging / Virtualization is a logical extension
    (Automatic page partitioning)

9
Outline
  • Streaming for Hardware
  • From Programming Model to Hardware Model
  • Synthesis Methodology for FPGA
  • Streams, Queues, SFSM Logic
  • Characterization of 7 Multimedia Apps
  • Optimizations
  • Pipelining, Placement, Queue Sizing, Decomposition

10
SCORE Model
Stream Computations Organized
for Reconfigurable Execution
  • Application = Graph of stream-connected
    operators
  • Operator = Process with local state
  • Stream = FIFO channel, unbounded
    capacity, blocking read
  • Segment = Memory, accessed via streams
  • Dynamics
  • Dynamic I/O rates
  • Dynamic graph construction
  • (omitted in this work)

11
SCORE Programming Model: TDF
  • TDF = behavioral language for
  • SFSM operators (Streaming Extended Finite
    State Machine)
  • Static operator graphs
  • State machine for
  • Firing control, sequencing, branching
  • Firing semantics
  • In state X, wait for X's inputs, then evaluate
    X's action

[Figure: operator with input streams i, j and output stream o]
state foo (i, j): o = i + j; goto bar;
12
SCORE / TDF Process Networks
  • A process from M inputs to N outputs, unified
    stream type S (i.e. $S^M \to S^N$)
  • SFSM = $(\Sigma, \sigma_0, \sigma, R, f_{NS}, f_O)$
  • $\Sigma$ = Set of states
  • $\sigma_0 \in \Sigma$ = Initial state
  • $\sigma \in \Sigma$ = Present state
  • $R \subseteq (\Sigma \times S^M)$ = Set of firing rules
  • $f_{NS} : R \to \Sigma$ = Next state function
  • $f_O : R \to S^N$ = Output function
  • Similar to dataflow process networks [Lee & Parks,
    IEEE May '95], but with stateful actors

13
Related Streaming Models
  • Streaming Models
  • Kahn PN, DFPN, BDF, SDF, CSDF, HDF, StreamsC,
    YAPI, Catapult C, SHIM
  • Streaming Platforms
  • Pleiades, Philips VSP, Imagine, TRIPS
  • How do we differ?
  • Stateful processes
  • Deterministic
  • Dynamic dataflow rates (FSM nodes)
  • Direct synthesis to hardware
  • Bounded Buffers

14
Streaming Platforms
  • FPGA (this work)
  • Paged FPGA
  • Page = fixed size partition, connected by
    streams
  • Stylized page-to-page interconnect
  • Hierarchical PAR
  • Paged, Virtual FPGA (SCORE)
  • Time shared pages
  • Area abstraction (virtually large)
  • Multiprocessor on Chip
  • Heterogeneous

15
The Compilation Problem
  • Programming model: TDF → Execution model: FPGA
  • Communicating SFSMs → Single circuit /
    configuration
  • Unrestricted size, I/Os, timing → One or
    more clocks
  • Unbounded stream buffering → Fixed size queues
  • Compile: big semantic gap

16
The Semantic Gap
  • Semantic gap between TDF, HW
  • Need to bind
  • Stream protocol
  • Stream pipelining
  • Queue implementation
  • Queue depths
  • SFSM synthesis style (behavioral synthesis)
  • Memory allocation
  • Primary I/O
  • SCORE device binds some implementation decisions
    (custom hardware), raw FPGA does not
  • Want to characterize cost of implementation
    decisions

17
Outline
  • Streaming for Hardware
  • From Programming Model to Hardware Model
  • Synthesis Methodology for FPGA
  • Streams, Queues, SFSM Logic
  • Characterization of 7 Multimedia Apps
  • Optimizations
  • Pipelining, Placement, Queue Sizing, Decomposition

18
Compilation Tool Flow
Application (TDF)
  → tdfc
  • Local optimization
  • System optimization
  • Queue sizing
  • Pipeline extraction
  • SFSM partitioning / merging
  • Pipelining
  • Generate flow ctl, streams, queues
  → Verilog
  → Synplify
  • Behavioral synthesis
  • Retiming
  → EDIF (unplaced LUTs, etc.)
  → Xilinx ISE
  • Slice packing
  • Place and route
  → Device configuration bits

19
Wire Protocol for Streams
  • D = Data, V = Valid, B = Backpressure
  • Synchronous transaction protocol
  • Producer asserts V when D ready, consumer
    deasserts B when ready
  • Transaction commits if (¬B ∧ V) at clock edge
  • Encode EOS (E) as extra D bit (out of band, easy
    to enqueue)
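
A minimal Python sketch of the commit rule above, assuming a producer that asserts V whenever it still holds a token and a consumer whose readiness is given per clock cycle. The names Wires, commits and run are hypothetical and exist only for this illustration; EOS is modelled as one final token with the extra E bit set.

```python
from dataclasses import dataclass

@dataclass
class Wires:
    d: int    # D: data word
    e: bool   # E: end-of-stream flag, carried as an extra data bit
    v: bool   # V: asserted by the producer when D is ready
    b: bool   # B: deasserted by the consumer when it is ready

def commits(w):
    """A transaction commits at the clock edge iff V is high and B is low."""
    return w.v and not w.b

def run(tokens, consumer_ready):
    """Clock a producer against a per-cycle consumer readiness pattern."""
    # Assumption: the stream ends with one EOS token (d=0, e=True).
    pending = [(t, False) for t in tokens] + [(0, True)]
    received = []
    for ready in consumer_ready:
        d, e = pending[0] if pending else (0, False)
        w = Wires(d=d, e=e, v=bool(pending), b=not ready)
        if commits(w):
            received.append(pending.pop(0))
    return received

if __name__ == "__main__":
    print(run([7, 8], [True, False, True, True, True]))
```

A cycle in which the consumer holds B high simply delays the token; nothing is lost or duplicated.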

20
Operator Firing
  • In state X, fire if
  • Inputs desired by X are ready (Valid, EOS)
  • Outputs emitted by X are ready (Backpressure)
  • Firing guard / control flow
  • if (iv && !ie && !ob) begin ib=0; ov=1;
    ... end
  • Subtlety: master, slave
  • Operator is slave
  • To synchronize streams, (1) wait for flow
    control in, (2) fire / emit out
  • Connecting two slaves would deadlock
  • Need master (queue) between every pair of
    operators
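
The firing condition can be sketched as a small predicate. This is an illustrative model, not the generated Verilog: firing_guard and its argument layout are assumptions made here. Each input in the state's signature must be valid and match the expected EOS flag, and each output the state emits must not be back-pressured.

```python
def firing_guard(signature, emits, in_wires, out_wires):
    """Return True if the state may fire this cycle.

    signature: input name -> True if the state expects eos(input), else False
    emits:     names of outputs written by the state
    in_wires:  input name -> (valid, eos) as seen this cycle
    out_wires: output name -> backpressure as seen this cycle
    """
    for name, expect_eos in signature.items():
        valid, eos = in_wires[name]
        if not valid or eos != expect_eos:
            return False
    return all(not out_wires[name] for name in emits)

if __name__ == "__main__":
    # A state that consumes data on x, EOS on y, and writes o.
    sig = {"x": False, "y": True}
    print(firing_guard(sig, ["o"],
                       {"x": (True, False), "y": (True, True)},
                       {"o": False}))
```

When the guard is false the operator stalls: it keeps back-pressure asserted on its inputs and Valid deasserted on its outputs.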

21
SFSM Synthesis
[Figure: FSM connected to a datapath, with a separate datapath for each state (state 1, state 2)]
  • Implemented as behavioral Verilog, using state
    case in FSM and DP
  • FSM handles firing control, branching
  • FSM sends state to DP
  • DP sends bool. flags to FSM

22
FSM Module, Firing Control
TDF:

foo (input unsigned[16] x, input unsigned[16] y, output unsigned[16] o)
  state one (x, eos(y)): o = x + 1; ...

Verilog FSM module:

module foo_fsm (clock, reset, x_e, x_v, x_b, y_e, y_v, y_b,
                o_e, o_v, o_b, state, statecase);
  ...
  always @* begin
    // Default is stall
    x_b_ = 1;  y_b_ = 1;  o_e_ = 0;  o_v_ = 0;
    state_reg_ = state_reg;
    statecase_ = statecase_stall;
    did_goto_ = 0;
    case (state_reg)
      state_one: begin
        // Firing condition(s) for state one
        if (x_v && !x_e && y_v && y_e && !o_b) begin
          statecase_ = statecase_1;
          // Stream flow ctl for state one
          x_b_ = 0;  y_b_ = 0;  o_v_ = 1;  o_e_ = 0;
        end
        ...
  end // always @*
endmodule // foo_fsm
23
Data-Path Module
TDF:

foo (input unsigned[16] x, input unsigned[16] y, output unsigned[16] o)
  state one (x, eos(y)): o = x + 1; ...

Verilog data-path module:

module foo_dp (clock, reset, x_d, y_d, o_d, state, statecase);
  ...
  always @* begin
    // Default is stall
    o_d_ = 16'bx;
    did_goto_ = 0;
    case (state)
      state_one: begin
        // Firing condition(s) for state one
        if (statecase_ == statecase_1) begin
          // Data-path for state one
          o_d_ = (x_d + 1'd1);
        end
        ...
  end // always @*
endmodule // foo_dp
24
Stream Buffers (Queues)
  • Systolic
  • Cascade of depth-1 stages (or depth-N)
  • Shift register
  • Put: shift all entries
  • Get: tail pointer
  • Circular buffer
  • Memory with head / tail pointers

25
Enabled Register Queue
  • Systolic, depth-1 stage
  • 1 state bit (empty/full) = V
  • Shift in data unless
  • Full and downstream not ready to consume queued
    element
  • Area ≈ 1 FF per data bit
  • On FPGA ≈ 1 LUT cell per data bit
  • Depth-1 (single stage) nearly free, since FFs
    pack with logic
  • Speed: as fast as FF
  • But combinationally connects producer and consumer
    via B
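
A behavioral sketch of one depth-1 enabled-register stage, written in Python for illustration only. The class and method names are assumptions, not taken from the generated hardware. It shows the single state bit doubling as Valid, and the combinational path that carries downstream back-pressure to the upstream side.

```python
class EnabledRegisterQueue:
    """Depth-1 systolic queue stage: one data register plus one full bit."""

    def __init__(self):
        self.full = False   # the single state bit, also the output Valid
        self.data = None

    def in_backpressure(self, out_b):
        # Combinational: upstream is stalled only when the stage is full
        # and downstream is not ready to consume the queued element.
        return self.full and out_b

    def clock(self, in_v, in_d, out_b):
        """Advance one clock edge; return the element consumed, or None."""
        stall = self.in_backpressure(out_b)
        consumed = self.data if (self.full and not out_b) else None
        if not stall:
            # Register is enabled: shift in the input (possibly a bubble).
            self.full = in_v
            self.data = in_d if in_v else None
        return consumed

if __name__ == "__main__":
    q = EnabledRegisterQueue()
    q.clock(True, "a", out_b=True)           # stage fills with "a"
    print(q.clock(True, "b", out_b=False))   # "a" leaves, "b" enters
```

Because in_backpressure depends directly on out_b, a cascade of such stages forms one long combinational path on B, which is the drawback noted in the last bullet.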

26
Xilinx SRL16
  • SRL16 = Shift register of depth 16 in one 4-LUT
    cell
  • Shift register of arbitrary width: parallel
    SRL16s; arbitrary depth: cascade SRL16s
  • Improves queue density by 16x

27
Shift Register Queue
  • State: empty bit + capacity counter
  • Data stored in shift register
  • In at position 0
  • Out at position Address
  • Address = number of stored elements minus 1
  • Synplify infers SRL16E from Verilog array
  • Parameterized depth, width
  • Flow control
  • ov = (State == Non-Empty)
  • ib = !(Address < Depth-1)
  • Performance improvements
  • Registered data out
  • Registered flow control
  • Specialized, pre-computed fullness
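
The basic shift-register queue (before the performance improvements) can be modelled as follows. This is a Python sketch under stated assumptions: the class name and the clock method are hypothetical, and input back-pressure is taken to mean "non-empty and Address equals Depth-1", which is one reading of the flow-control bullets.

```python
class ShiftRegisterQueue:
    """Queue held in a shift register: in at position 0, out at Address."""

    def __init__(self, depth=16):
        self.depth = depth
        self.regs = [None] * depth
        self.empty = True      # state bit
        self.address = 0       # number of stored elements minus 1

    @property
    def ov(self):
        """Output valid: the queue is non-empty."""
        return not self.empty

    @property
    def ib(self):
        """Input back-pressure: the queue is full (assumed definition)."""
        return (not self.empty) and self.address == self.depth - 1

    def clock(self, in_v, in_d, ob):
        """Advance one clock edge; return the element read out, or None."""
        put = in_v and not self.ib
        get = self.ov and not ob
        out = self.regs[self.address] if get else None
        if put:
            # Put: shift all entries, new element enters at position 0.
            self.regs = [in_d] + self.regs[:-1]
        if put and not get:
            if self.empty:
                self.empty = False
            else:
                self.address += 1
        elif get and not put:
            if self.address == 0:
                self.empty = True
            else:
                self.address -= 1
        return out

if __name__ == "__main__":
    q = ShiftRegisterQueue(depth=4)
    for x in (1, 2, 3):
        q.clock(True, x, ob=True)
    print([q.clock(False, None, ob=False) for _ in range(3)])
```

A simultaneous put and get leaves Address unchanged: the shift moves the next-oldest element into the position that was just read.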

28
SRL Queue with Registered Data Out
  • Registered data out
  • od: clock-to-Q delay
  • Non-retimable
  • Data output register extends shift register
  • Bypass shift register when queue empty
  • 3 states
  • Address = number of stored elements minus 2
  • Flow control
  • ov = !(State == Empty)
  • ib = (Address == Depth-2)

29
SRL Queue with Registered Flow Ctl.
  • Registered flow ctl.
  • ov: clock-to-Q delay
  • ib: clock-to-Q delay
  • Non-retimable
  • Flow control
  • ov_next = !(State_next == Empty)
  • ib_next = (Address_next == Depth-2)
  • Based on pre-computed fullness
  • full_next = (Address_next == Depth-2)

30
SRL Queue with Specialized, Pre-Computed Fullness
  • Speed up critical full pre-computation by
    special-casing full_next for each state
  • Flow control
  • ov_next = !(State_next == Empty)
  • ib_next = full_next
  • zero pre-computation is less critical
  • Result
  • >200MHz unless very large (e.g. 128 x 128)
  • All output delays are clock-to-Q
  • Area ≈ 3 x (SRL16E area)

31
SRL Queue Speed
32
SRL Queue Area
33
Page Synthesis
  • Page = Cluster of Operator(s) + Queues
  • SFSMs
  • One or more per page
  • Further decomposed into FSM, data-path
  • Page Input Queues
  • Deep
  • Drain pipelined page-to-page streams before
    reconfiguration
  • In-page Queues
  • Shallow
  • Separately Synthesizable Modules
  • Separate characterization
  • Consider custom resources

34
Page Synthesis
  • Module Hierarchy

35
Outline
  • Streaming for Hardware
  • From Programming Model to Hardware Model
  • Synthesis Methodology for FPGA
  • Streams, Queues, SFSM Logic
  • Characterization of 7 Multimedia Apps
  • Optimizations
  • Pipelining, Placement, Queue Sizing, Decomposition

36
Tool Flow, Revisited
Application
  • Separate compilation for application, SFSMs
  • Page, SFSM
  • Datapath, FSM
  → tdfc (tool options)
  • Identical queuing for every stream (SRL16
    based, depth 16)
  • I/O boundary regs (for Xilinx static
    timing analysis)
  → Verilog
  → Synplify
  • Synplify 8.0
  • Target 200MHz
  • Optimize FSM, retiming, pipelining
  • Retain monolithic FSM encodings
  → EDIF (unplaced LUTs, etc.)
  → Xilinx ISE
  • ISE 6.3i
  • Constrain to minimum square area, at least
    max slice packing + 20%, expand if fail PAR
  → Device configuration bits
  • Device XC2VP70 -7

37
PAR Flow for Minimum Area
  • EDIF netlist from Synplify
  • Constraints file
  • Page area
  • Target Period
  • ngdbuild
  • Convert netlist EDIF → NGD
  • map
  • Pack LUTs, MUXes, etc. into slices
  • trce (pre-PAR)
  • Static timing analysis, logic only
  • par
  • Place and route
  • trce (post-PAR)
  • Static timing analysis

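[Flowchart: EDIF and constraints feed ngdbuild, then map. If map is not OK, adjust the target (packed slices) and retry; otherwise run trce (packed timing), then par. If par is not OK, enlarge the target by 1 extra row/col and retry; otherwise run the final trce.]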
38
SCORE Applications
  • 7 Multimedia Applications / 279 Operators
  • MPEG, JPEG, Wavelet, IIR (written by Joe Yeh)
  • Mostly feed-forward streaming
  • Constant consumption / production ratios, except
    compressors (ZLE, Huffman)

39
Page Area
DCT, IDCT
  • 87% of SFSMs are smaller than 512 LUTs by
    design
  • FSMs small → datapaths dominate in most large
    pages

40
Page Speed
[Chart values: 43%, 47%, 10%]
  • FSMs (flow control) are fast, never critical
  • Queues are critical for 1/3 of fastest pages →
    datapaths dominate

41
Outline
  • Streaming for Hardware
  • From Programming Model to Hardware Model
  • Synthesis Methodology for FPGA
  • Streams, Queues, SFSM Logic
  • Characterization of 7 Multimedia Apps
  • Optimizations
  • Pipelining, Placement, Queue Sizing, Decomposition

42
Improving Performance, Area
  • Local (module) Optimization
  • Traditional compiler optimization
  • (const folding, CSE, etc.)
  • Datapath pipelining / loop scheduling
  • Granularity transforms
  • (composition / decomposition)
  • System Level Optimization
  • Interconnect pipelining
  • Shrink / remove queues
  • Area-time transformations
  • (rate matching, serialization, parallelization)

43
Pipelining With Streams
  • Datapath pipelining
  • Add registers at output (or input)
  • Retime into place
  • Harder in practice (FSM, cycles)
  • Add registers at strategic locations
  • Rewrite control
  • Avoid violating communication protocol
  • Stream pipelining
  • Add registers on streams
  • Retime into datapath
  • Modify queues, not processes

44
Logic Pipelining
  • Add L pipeline registers to D, V
  • Retime backwards
  • This pipelines feed-forward parts of producer's
    data-path
  • Stale flow control may overflow queue (by L)
  • Modify queue to emit back-pressure when empty
    slots ≤ L
  • No manual modification of processes

[Figure: Producer sends D (Data) and V (Valid) through L pipeline registers, retimed backwards, into a queue with L reserve, then to the consumer; B (Backpressure) returns to the producer]
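
The need for a reserve can be checked with a small worst-case simulation. This is a Python sketch, assuming a producer that sends whenever B is low, a consumer that never drains, and L registers on D/V only. The function simulate and its parameters are hypothetical names introduced for illustration.

```python
from collections import deque

def simulate(depth, L, reserve, cycles=50):
    """Return the peak queue occupancy when the consumer never drains.

    depth:   nominal queue capacity
    L:       pipeline registers between producer and queue (on D, V)
    reserve: queue asserts back-pressure when empty slots <= reserve
    """
    pipe = deque([False] * L)   # Valid bits in flight in the pipeline
    occupancy = 0
    peak = 0
    for _ in range(cycles):
        b = (depth - occupancy) <= reserve   # back-pressure this cycle
        pipe.append(not b)                   # producer emits iff B is low
        if pipe.popleft():                   # token reaching the queue
            occupancy += 1
        peak = max(peak, occupancy)
    return peak

if __name__ == "__main__":
    print(simulate(depth=4, L=2, reserve=0))   # overflows the nominal depth
    print(simulate(depth=4, L=2, reserve=2))   # stays within the depth
```

With no reserve the peak occupancy exceeds the depth by L, since L tokens are already in flight when back-pressure rises; a reserve of L absorbs exactly those tokens. The same model covers the interconnect case by passing a reserve equal to the round-trip staleness.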
45
Logic Relaying + Retiming
  • Break up deep logic in a process
  • Relay through enabled register queue(s)
  • Retime registers into adjacent process
  • This pipelines feed-forward parts of process's
    datapath
  • Can retime into producer or consumer
  • No manual modification of processes

[Figure: producer, original queue, and consumer connected by D, V, B, with an enabled (en) register relay whose registers are retimed into the adjacent process]
46
Benefits, Limitations
  • Benefits
  • Simple to implement, relies only on retiming
  • Sufficient for many cases, e.g. DCT, IDCT
  • Limitations
  • Feed-forward only (weaker than loop sched.)
  • Resource sharing obfuscates retiming
    opportunities
  • Extends to interconnect pipelining
  • Do not retime into logic: register placement
    only
  • Also pipeline B, modify queue

47
Pipelining Configuration
  • Pipeline depth parameters: Li + Lp + Lr
  • Uniform pipelining: same depths for every stream

48
Speedup from Logic Pipelining
Enabled Regs (Lr)
D FFs (Lp)
49
Expansion from Logic Pipelining
Enabled Regs (Lr)
D FFs (Lp)
50
Some Things Are Better Left Unpipelined
  • Page speedup
  • Page expansion
  • Initially fast pages should not be pipelined

51
Page Specific Logic Pipelining
  • Separate pipelining of each SFSM
  • Assumption: application speed = slowest page
    speed
  • Critical page
  • Repeatedly improve slowest page until no further
    improvement is possible
  • Page improvement heuristics
  • Greedy Lr: add one level of pipelining in 0+0+Lr
  • Greedy Lp: add one level of pipelining in 1+Lp+0
  • Max: pipeline to best page speed (brute force)
  • Greedy heuristics may end early
  • Non-monotonicity: adding a level of pipelining
    may slow page
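
The critical-page loop with a greedy heuristic can be sketched in Python as below. The speed table stands in for place-and-route results at each pipelining depth. The page names and MHz figures are invented purely for illustration, and greedy_page_pipelining is a hypothetical helper.

```python
def greedy_page_pipelining(speed_table):
    """Greedily pipeline the slowest page until it stops improving.

    speed_table: page -> list of speeds (MHz), indexed by pipelining depth.
    Returns (chosen depth per page, resulting application speed).
    """
    depth = {page: 0 for page in speed_table}
    while True:
        # Application speed is set by the slowest (critical) page.
        slowest = min(depth, key=lambda p: speed_table[p][depth[p]])
        d = depth[slowest]
        options = speed_table[slowest]
        # Stop if one more level is unavailable or does not help.
        if d + 1 >= len(options) or options[d + 1] <= options[d]:
            return depth, options[d]
        depth[slowest] = d + 1

if __name__ == "__main__":
    table = {                      # made-up numbers, not measurements
        "dct":  [100, 150, 140],
        "huff": [120, 125, 200],
        "fast": [250, 240],
    }
    print(greedy_page_pipelining(table))
```

The initially fast page is never touched, and a page whose speed dips before rising (for example [100, 90, 180]) makes the greedy loop stop at depth 0, which is the early termination caused by non-monotonicity.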

52
Speedup from Page Specific
Enabled Regs (Lr)
D FFs (Lp)
53
Expansion from Page Specific
Enabled Regs (Lr)
D FFs (Lp)
54
Interconnect Delay
  • Critical routing delay grows with circuit size
  • Routing delay for an application: avg. 45% - 56%
  • Routing delay for its slowest page: avg. 40% -
    50%
  • Ratio (appl. to slowest page): avg. 0.99x -
    1.34x
  • Averaged over 7 apps / varies with logic
    pipelining
  • Modular design helps
  • Retain critical routing delay of page, not
    application
  • Page-to-page delays (streams) can be pipelined

55
Interconnect Pipelining
  • Add W pipeline registers to D, V, B
  • Mobile registers for placer; not retimable
  • Stale flow control may overflow queue (by 2W)
  • Staleness = total delay on B-V feedback loop
    = 2W
  • Modify downstream queue to emit back-pressure
    when empty slots ≤ 2W

[Figure: Producer sends D (Data) and V (Valid) over a long distance into a queue with 2W reserve, then to the consumer; B (Backpressure) returns over the same distance]
56
Speedup from Interconnect Pipelining
57
Speedup from Interconnect Pipelining, No Area
Constraint
58
Expansion from Interconnect Pipelining, No Area
Constraint
59
Interconnect Register Allocation
  • Commercial FPGAs / tool flows
  • No dedicated interconnect registers
  • Allocation: add to netlist, slice pack,
    place-and-route
  • If pack registers with logic → limited register
    mobility
  • If pack registers alone → area overhead
  • Better: post-placement register allocation
  • Weaver et al., "Post-Placement C-Slow Retiming
    for the Xilinx Virtex FPGA," FPGA 2003
  • Allocation: PAR, c-slow, retime, scavenge
    registers, reroute
  • No area overhead (scavenge registers from
    existing placement)
  • Better performance, since know routing delay
  • Modification for streaming
  • PAR, pipeline, retime, scavenge registers,
    reroute, modify queue depths (configuration
    specialization)

60
Throughput Modeling
  • Pipelining feedback loops may reduce throughput
    (tokens per clock period)
  • Which loops / streams are critical?
  • Throughput model for PN
  • Feedback cycle C with M tokens, N pipe
    delays, has token period $T_C = M/N$
  • Overall token period $T = \max_C T_C$
  • Available slack $\mathit{CycleSlack}_C = (T - T_C)$
  • Generalize to multi-rate, dynamic rate by
    unfolding equivalent single-rate PN

[Figure: two feedback cycles with $T_{C1} = 3$ and $T_{C2} = 2$]
61
Throughput Aware Optimizations
  • Throughput aware placement
  • Adapt [Singh & Brown, FPGA 2002]
  • Stream slack: $T_e = \max_{C \,:\, e \in C} T_C$
  • Stream net criticality: $\mathit{crit} = 1 - ((T - T_e) / T)$
  • Throughput aware pipelining
  • Pipeline stream w/o exceeding slack
  • Pipeline module s.t. depth does not exceed any
    output stream slack
  • Pipeline balancing (by retiming)
  • Process Serialization
  • Serial arithmetic for process with low
    throughput, high slack
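
A short Python sketch of the stream criticality computation, taking the token period of each feedback cycle as given. The function stream_criticality and the assignment of streams to cycles in the example are assumptions made for illustration; streams that lie on no cycle are not handled.

```python
def stream_criticality(cycle_period, cycle_streams):
    """Compute per-stream slack and criticality.

    cycle_period:  cycle name -> token period T_C
    cycle_streams: cycle name -> set of streams on that cycle
    Returns (T, {stream: {"Te": ..., "slack": ..., "crit": ...}}).
    """
    T = max(cycle_period.values())            # overall token period
    streams = set().union(*cycle_streams.values())
    result = {}
    for e in streams:
        # T_e: largest token period among the cycles containing e.
        Te = max(cycle_period[c]
                 for c, members in cycle_streams.items() if e in members)
        result[e] = {"Te": Te, "slack": T - Te, "crit": 1 - (T - Te) / T}
    return T, result

if __name__ == "__main__":
    periods = {"C1": 3, "C2": 2}
    members = {"C1": {"a", "b"}, "C2": {"b", "c"}}   # hypothetical streams
    T, info = stream_criticality(periods, members)
    for name in sorted(info):
        print(name, info[name])
```

Streams on the slowest cycle get criticality 1 and no slack. A stream that lies only on a faster cycle has positive slack, so it can be pipelined or placed farther apart without lowering throughput.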

62
Stream Buffer Sizing
  • Fixed size buffers in hardware
  • For minimum area (want smallest feasible queue)
  • For performance (want deep enough to avoid stalls
    from producer-consumer timing mismatch)
  • Semantic gap
  • Buffers are unbounded in TDF,
    bounded in HW
  • Small buffer may create artificial deadlock
    (bufferlock)
  • Theorem: memory bound is undecidable for a
    Turing complete process network
  • In practice, our buffering requirements are
    small

Bounded
Unbounded
63
Dealing with Undecidability
  • Handle unbounded streams
  • Buffer expansion [Parks '95]
  • Detect bufferlock, expand buffers
  • Hardware implementation
  • Buffer expansion = rewire to another queue
  • Storage in off-chip memory or queue bank
  • Guarantee depth bound for some cases
  • User depth annotation
  • Analysis
  • Identify compatible SFSMs with balanced
    schedules
  • Detect bufferlock and fail

64
Interface Automata
[de Alfaro & Henzinger, Symp. Found. SW Eng. (FSE)
2001]
  • A finite state machine that transitions on I/O
    actions
  • Not input-enabled (not every I/O on every
    cycle)
  • $G = (V, E, A_i, A_o, A_h, V_{start})$
  • $A_i$ = input actions x? (in CSP notation)
  • $A_o$ = output actions y!
  • $A_h$ = internal actions z
  • $E \subseteq V \times (A_i \cup A_o \cup A_h) \times V$ (transition on action)
  • Execution trace $(v, a, v, a, \ldots)$ (non-deterministic
    branching)

65
Automata Composition
  • Composition = product FSM with synchronization
    (rendezvous) on common actions

Composition edges: (i) step A on unshared
action, (ii) step B on unshared action, (iii)
step both on shared action
Compatible composition ⇒ bounded memory
66
Stream Buffer Bounds Analysis
  • Given a process network, find minimum buffer
    sizes to avoid bufferlock
  • Buffer (queue) is also automaton
  • Symbolic Parks algorithm
  • Compose network using arbitrary buffer sizes
  • If deadlock, try larger sizes
  • Practical considerations: avoiding state
    explosion
  • Multi-action automata
  • Know which streams to expand first
  • Compose pairwise in clever order
  • Composition is associative
  • Cull states reaching deadlock
  • Partition system
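
The compose-and-retry idea can be sketched in Python on a toy network. This is a simplified illustration, not the tool's algorithm: actions are plain names (input/output direction is ignored), each shared action is assumed to be shared by exactly two automata, and compose, queue and min_depth are hypothetical helpers. The example producer writes two tokens before signalling "done", and the consumer reads only after "done".

```python
from collections import deque

def compose(a, b):
    """Product automaton, synchronizing on actions common to a and b."""
    shared = a["actions"] & b["actions"]
    start = (a["start"], b["start"])
    edges, seen, todo = [], {start}, deque([start])
    while todo:
        va, vb = todo.popleft()
        steps = []
        for s, act, d in a["edges"]:          # (i) step A on unshared action
            if s == va and act not in shared:
                steps.append((act, (d, vb)))
        for s, act, d in b["edges"]:          # (ii) step B on unshared action
            if s == vb and act not in shared:
                steps.append((act, (va, d)))
        for sa, act, da in a["edges"]:        # (iii) step both on shared action
            if sa == va and act in shared:
                for sb, act_b, db in b["edges"]:
                    if sb == vb and act_b == act:
                        steps.append((act, (da, db)))
        for act, nxt in steps:
            edges.append(((va, vb), act, nxt))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    sources = {src for src, _, _ in edges}
    return {"start": start, "actions": a["actions"] | b["actions"],
            "edges": edges, "stuck": seen - sources}

def queue(depth):
    """A bounded buffer as an automaton over occupancy 0..depth."""
    edges = []
    for k in range(depth):
        edges.append((k, "put", k + 1))
        edges.append((k + 1, "get", k))
    return {"start": 0, "actions": {"put", "get"}, "edges": edges}

def min_depth(producer, consumer, limit=8):
    """Try increasing buffer sizes until the composition has no deadlock."""
    for depth in range(1, limit + 1):
        system = compose(compose(producer, queue(depth)), consumer)
        if not system["stuck"]:
            return depth
    return None

PRODUCER = {"start": "p0", "actions": {"put", "done"},
            "edges": [("p0", "put", "p1"), ("p1", "put", "p2"),
                      ("p2", "done", "p0")]}
CONSUMER = {"start": "c0", "actions": {"done", "get"},
            "edges": [("c0", "done", "c1"), ("c1", "get", "c2"),
                      ("c2", "get", "c0")]}

if __name__ == "__main__":
    print(min_depth(PRODUCER, CONSUMER))
```

With a depth-1 buffer the composition reaches a state with no outgoing edge (the producer is blocked on its second write while the consumer waits for "done"), which is a bufferlock; depth 2 removes it. State explosion is not addressed here; the practical measures listed above would be needed for real networks.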

67
SFSM Decomposition (Partitioning)
  • Why decompose
  • To improve locality
  • To fit into custom page resources
  • Decomposition by state clustering
  • 1 state (i.e. 1 cluster) active at a time
  • Cluster states to contain transitions
  • Fast local transitions, slow external trans.
  • Formulation: minimize cut of transition
    probability under area, I/O constraints
  • Similar to
  • VLIW trace scheduling [Fisher '81]
  • FSM decomp. for low power [Benini/DeMicheli, ISCAS
    '98]
  • GarpCC HW/SW partitioning [Callahan '00]
  • VM/cache code placement

68
Early SFSM Decomposition Results
  • Approach 1: Balanced, multi-way, min-cut
  • Modified Wong FBB [Yang & Wong, ACM '94]
  • Edge weight is mix: c*(transition probability) +
    (1-c)*(wire bits)
  • Poor at simultaneous I/O constraint + cut
    optimization
  • Approach 2: Spectral order + extent cover
  • Spectral ordering clusters connected components
    in 1D
  • Minimize squared weighted distance, weight is mix
    (as above)
  • Then choose area + I/O feasible extents [start,
    end] using dynamic programming
  • Effective for partitioning to custom page
    resources
  • Under 2% external transitions
  • Amdahl's law: few slow transitions → small
    performance loss
  • Achievable with either approach

69
Summary
  • Streaming addresses large system design
    challenges
  • Growing interconnect delay, design complexity
  • Flexibly timed module interfaces, reuse
  • Methodology to compile SCORE applications to
    Virtex-II Pro
  • Language + compiler support for streaming
  • Characterized 7 applications on Virtex-II Pro
  • Queue area 38%, flow control FSM area 6%
  • Improve by merging SFSMs, eliminating queues
  • Stream pipelining
  • For logic, for interconnect
  • Stream based optimizations
  • Pipelining, queue sizing, module merging,
    partitioning
  • Placement, serialization

70
Supplemental Material
71
TDF → Dataflow Process Networks
  • Dataflow Process Networks [Lee & Parks, IEEE May
    '95]
  • Process enabled by set of firing rules $R = \{r_1,
    r_2, \ldots, r_K\}$
  • Firing rule = set of input patterns $r_i = (r_{i,1},
    r_{i,2}, \ldots, r_{i,M})$
  • DF process for a TDF operator
  • Feedback arc for state
  • Firing rule(s) per state
  • Patterns match state + input presence
  • E.g. for state $\sigma$: $r_\sigma = (\sigma, r_{\sigma,1}, r_{\sigma,2}, \ldots)$
  • Patterns: $r_{\sigma,j} = [*]$ if input j is in input
    signature of state $\sigma$; $r_{\sigma,j} = \bot$ if input j is not
    in input signature of state $\sigma$
  • Single firing rule per state = DFPN sequential
    firing rules
  • Multiple firing rules per state translate the
    same way, with restrictions to retain
    determinism

72
SFSM Partitioning Transform
  • Only 1 partition active at a time
  • Transform to activate via streams
  • New state in each partition: wait
  • Used when not active
  • Waits for activation from other partition(s)
  • Has one input signature (firing rule) per
    activator
  • Firing rules are not sequential, but determinism
    guaranteed
  • Only 1 possible activator
  • Activation streams from given source to given
    dest. partitions can be merged + binary-encoded

73
Virtual Paged Hardware (SCORE)
  • Compute model has unbounded resources
  • Programmer does not target a particular device
    size
  • Paging
  • Compute pages swapped in/out (like virtual
    memory)
  • Page context thread (FSM to block on stream
    access)
  • Efficient virtualization
  • Amortize reconfiguration cost over an entire
    input buffer
  • Requires working sets of tightly-communicating
    pages to fit on device

[Figure: pipeline of pages: Transform → Quantize → RLE → Encode]