Title: Forrest Brewer forrest@ece.ucsb.edu
1NDFA Based Scheduling
- Forrest Brewer, Steve Haynal
- University of California
- Santa Barbara
2Scheduling is Behavioral Synthesis
- Exploits fundamental freedom -- ordering and binding of operations, operands
- Subdivided into DFG transformation, resource allocation, time-scheduling, operation binding, memory binding, communication binding, resource modeling, reallocation...
- Complexity of tasks requires top-down flow -- yet evaluations/constraints are bottom-up
- Behavioral Synthesis difficult to use!
- Seemingly trivial changes cause vast output changes
- Design tradeoffs tied to a particular point language (VHDL, Verilog, Silage, Esterel...)
- No direct control of implementation
- No direct control of binding, mapping
- No distinction between problem statement and constraints
- No canonical representation of design space
- Fundamental problem covers enormous scope
- Universality issues in specification
- How to capture design mapping knowledge?
- How to create verifiable design representation without canonical model?
- Our viewpoint -- wrong problem
3Simpler Problem
- Assume Designer creates the design
- Support incremental refinement of design at all levels of representation
- Support incremental design synthesis when possible
- Provide well defined hierarchy on which to place constraints, trial implementations ...
- Provide mechanism for subsystem abstraction, modeling and evaluation at each level
- How to do this?
- Drop representation distinction between logic, module, and sub-system levels
- Drop potential for universality in internal representations
- Create mechanism for automatic design abstraction within designer's design decomposition
- Use efficient representation of fundamental model
- Provide feedback to designer for evaluating both the design itself and the representation
- Where do we start?
- Interface Protocols are key complexity growth problem
- Designer constructs system model with abstract protocols, required data-flows, possible maps
- Generalize scheduling to provide possible sequencing of sub-systems into systems meeting external protocol constraints (models)
4Protocol Constrained Scheduling
- Problem: Conventional scheduling algorithms cannot accommodate the typical complex sequencing and timing constraints of modern design.
- Three Problems: Specification, Scheduling, and Problem Scale
- Specification: How to specify the required timing in a concise, explicit way?
- Scheduling: How to systematically exploit mapping freedom while meeting the timing requirements?
- Problem Scale: Problems of interest to industry are enormously complex!
- Idea: Protocol specification is amenable to NDFA modeling -- so create automata-based model to represent Control/Data-flow freedom => All possible implementations exist as sequences of states of the joint automaton
5Protocol Specification
- Sequencing complexity of digital system interfaces increasing
- Specification languages (Verilog, VHDL) require implicit protocol specification
- Alternative: specification via NDFA automata (e.g. PBS, Esterel, Custom point language)
- Representation is finite
- Synthesis can be very efficient -- can handle very complex designs
- Provides mechanism for time sequence specification relatively independent of data-flow control semantics
- Protocol + CDFG semantics + mapping abstractions make a complete model
- No ad-hoc mapping library (beyond control of designer)
- No convenient dependency binding assumptions (to be worked around by designer)
- No encrypting desired sequential FSM in higher level language!
- Designer specifies event sequences he wants
- System evaluates/synthesizes ensemble FSM
6Design Representation
- Model System as hierarchy of design frames
- Frames have external protocol specification NDFA, CDFG, and allowed Mappings
- Frames contain instances of other frames' abstractions (abstracted NDFA/CDFG model)
- Resource utilization and sharing restricted to within a design frame
[Figure: a frame with its external protocol, control data flow graph, and sub-frame model]
7Hierarchy of Refinement
- Exact protocol scheduling intractable for practical large problems
- Hierarchy of Refinement
- partition the problem into manageable abstractions
- hides lower level details
- allows systematic high-level pruning of designs before more detailed treatment
- Completed sub-frame designs can be abstracted to high level component models
- allows incremental design change/refinement at any level
- provides mechanism for consistency verification
8Protocol Scheduling Implementation
- Represent CDFG model as Causal (NDFA) Automaton
- Generalization of current scheduling model
- Models all valid data flows
- Models code hoisting, unrolling, transformations...
- Represent External Protocol as NDFA automaton
- Very general, efficient model
- Synchronous timing model (can be generalized -- future work)
- Alternative behavior as NDFA alternatives
- CDFG maps I/O operations among sub-frames
- Sub-frames have interface protocols, abstracted CDFG semantics
- Construct ensemble automata model with all valid sequences of events meeting internal and external protocols and causal data-flow constraints
- Need only find complete sub-set of all possible states for solution
9Scheduling Solution
- Every schedule is some subset of states of the ensemble automaton
- Must construct causal and complete set of states
- Exact solution strategies
- Construct all states up to resource bounds
- Depth-first search of states
- Heuristic search -- choose good path, complete schedule automatically
- Prune solution space
- Additional constraints or objectives -- technique works best when highly constrained
- Heuristic strategies
- Sub-set BDD representation of reachable states
- Incremental search (this is not verification!)
- Possible objectives
- Communication
- Temporary storage (memory)
- Performance
- Control complexity
10DFA Model of Two Stage Pipe
- Input 1 indicates operands are supplied to the pipe
- Output 1 indicates operand is produced by the pipe
[Figure: state table for the two-stage pipe DFA, states a through d]
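The state table of this slide did not survive, so the following is only a minimal sketch of what a two-stage pipe DFA could look like. It assumes a shift-register model in which the state is the occupancy of the two stages; the encoding and the names `step` and `run` are hypothetical, not taken from the slides.

```python
def step(state, inp):
    """Advance one clock of a two-stage pipe (assumed shift-register model).

    state is (stage1_full, stage2_full). The input bit enters stage 1,
    stage 1 moves to stage 2, and the output bit is 1 when the operand
    held in stage 2 leaves the pipe.
    """
    s1, s2 = state
    return (inp, s1), s2


def run(inputs, state=(0, 0)):
    """Feed a sequence of input bits; return the output bits and final state."""
    outputs = []
    for bit in inputs:
        state, out = step(state, bit)
        outputs.append(out)
    return outputs, state
```

Under this assumption the four occupancy pairs play the role of the four DFA states, and every output is the input delayed by two clocks.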
11NDFA Protocol for Two Stage Pipe
- Inputs and outputs same as DFA model
- Some transitions produce no outputs
12Operand Scheduling a CDFG on NDFA Protocol
- CDFG to Schedule
- Two stage NDFA protocol description for component
- Protocol alone is insufficient -- need internal data-flow requirements
- Mapping is trivial (in this case)
- Protocol + CDFG is sufficient -- but also describes information not needed externally
- Solution: Simplify scheduling solution of sub-frame to make abstracted model
13Operand Schedule on NDFA Protocol
- Optimal one multiplier schedule (co-execution of
protocol and causal automata)
14Causal Automaton Formulation of Scheduling
- Scheduling Problem (V, E, C, R)
- vertex v ∈ V is an operation
- edge (u, v) ∈ E is a directed edge representing a data dependency
- hyper-edge (v_c, V_Tc, V_Fc) ∈ C groups a control operation and corresponding subsets of operations
- hyper-edge (bound, T ⊆ V) ∈ R represents a resource bound applied to a subset of (mapped) operations
- The edge set is partitioned into a forest of forward edges and a subset of looping edges which point backward
- Scheduling solution is a complete, compatible set of deterministic sequences of vertices such that all dependencies are causal and all resource bounds are met at each state, and the set has sequences for each possible future value of the set of controls.
- In the following, we will discuss minimum latency and maximal throughput as objective functions.
15Single-Cycle Operation Modeling Automata
[Figure: two-state automaton with transitions 0 -> 0, 0 -> 1, 1 -> 1, 1 -> 0]
- 0 -> 0: Operation unscheduled and remains so
- 0 -> 1: Operation scheduled next cycle
- 1 -> 1: Operation scheduled and remains so
- 1 -> 0: Operation scheduled but result lost
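The four transitions of the single-operation automaton can be written down as a lookup table. This is a small sketch: the table contents come from the slide, while the helper names `describe` and `result_kept` are hypothetical.

```python
# Meaning of each (current bit, next bit) pair, as listed on the slide.
TRANSITIONS = {
    (0, 0): "unscheduled, remains so",
    (0, 1): "scheduled next cycle",
    (1, 1): "scheduled, remains so",
    (1, 0): "scheduled, but result lost",
}


def describe(trace):
    """Label each consecutive pair of bits in a per-operation trace."""
    return [TRANSITIONS[(a, b)] for a, b in zip(trace, trace[1:])]


def result_kept(trace):
    """True if the trace never takes the transition that loses the result."""
    return all((a, b) != (1, 0) for a, b in zip(trace, trace[1:]))
```

For example, the trace `[0, 0, 1, 1]` idles for one cycle, schedules the operation, and then holds its result.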
16Scheduling Automata
- State represents current set of available operands and state of modeling protocol automata
- Constraints on transitions
- Representation Compact
- Product of Mapped Modeling automata for each resource protocol
17Resource Bounds
- 0 -> 1 indicates resource use
- Resource bounds constrain simultaneous 0 -> 1 transitions
- Iterative constraint on CA
- ROBDD representation
- 2 × bound × operations
One Resource
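A resource bound limits how many operations mapped to the same resource may take the scheduling transition in one cycle. The sketch below checks that constraint on explicit state tuples rather than on an ROBDD; the `resource_of` mapping and the function names are assumptions made for illustration.

```python
def newly_scheduled(cur, nxt):
    """Indices of operations whose bit goes from 0 to 1 in this step."""
    return [i for i, (a, b) in enumerate(zip(cur, nxt)) if (a, b) == (0, 1)]


def within_bounds(cur, nxt, resource_of, bound):
    """Check that no resource class starts more operations than it has units.

    resource_of[i] names the class of operation i (hypothetical mapping);
    bound[name] is the number of units of that class available per cycle.
    """
    used = {}
    for i in newly_scheduled(cur, nxt):
        name = resource_of[i]
        used[name] = used.get(name, 0) + 1
    return all(count <= bound[name] for name, count in used.items())
```

With two multiplications and one addition sharing one multiplier and one adder, scheduling both multiplications in the same step is rejected, while scheduling them in consecutive steps is accepted.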
18Dependency Implication
- All transitions in which j is active before all of its predecessors are known are removed
- BDD Complexity is O(predecessors × operations)
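The dependency rule can be phrased as a filter on individual transitions. This is a minimal explicit-state sketch, not the BDD formulation; `preds` is a hypothetical dictionary from an operation index to the indices of its predecessors.

```python
def causal(cur, nxt, preds):
    """Keep a transition only if it respects the data dependencies.

    Every operation j that becomes known in this step (bit 0 -> 1) must
    already have all of its predecessors known in the current state.
    """
    for j, (a, b) in enumerate(zip(cur, nxt)):
        if (a, b) == (0, 1):
            if not all(cur[p] == 1 for p in preds.get(j, ())):
                return False
    return True
```

Note that a predecessor scheduled in the same step does not count as known, which is what makes the resulting automaton causal.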
19Example NFA
- Transition relation induces graph
- Any path from all operations unknown to all known
is a valid schedule
- Shortest paths are minimum latency schedules
20All Minimum Latency Schedules
- Symbolic reachable state analysis
- Newly reached states are saved each cycle
- Backward pruning preserves transitions used in
all shortest paths
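The forward and backward passes can be sketched on an explicit state graph. The slides use symbolic (BDD) traversal; the version below enumerates states as bitmasks instead, assumes a single resource class with `bound` units, and leaves out the 1 -> 0 transition. All names are hypothetical.

```python
from itertools import combinations


def successors(state, preds, bound):
    """Next states: schedule between 1 and `bound` ready operations."""
    n = len(preds)
    ready = [j for j in range(n)
             if not (state >> j) & 1
             and all((state >> p) & 1 for p in preds[j])]
    for k in range(1, min(bound, len(ready)) + 1):
        for chosen in combinations(ready, k):
            yield state | sum(1 << j for j in chosen)


def min_latency_schedules(preds, bound):
    """Return (latency, per-cycle sets of transitions on shortest paths)."""
    goal = (1 << len(preds)) - 1
    # Forward pass: save the newly reached states of each cycle.
    frontiers, seen = [{0}], {0}
    while goal not in frontiers[-1]:
        new = {t for s in frontiers[-1]
               for t in successors(s, preds, bound)} - seen
        if not new:
            raise ValueError("no schedule exists")
        seen |= new
        frontiers.append(new)
    # Backward pass: keep only transitions that lead toward the goal.
    keep, kept = {goal}, []
    for layer in reversed(frontiers[:-1]):
        edges = {(s, t) for s in layer
                 for t in successors(s, preds, bound) if t in keep}
        kept.append(edges)
        keep = {s for s, _ in edges}
    return len(frontiers) - 1, kept[::-1]
```

For `preds = [(), (), (0, 1), (2,)]` and one resource, the result has latency 4 and keeps both orders of the two independent operations in the first cycle; with two resources the latency drops to 3.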
24All Minimum Latency Schedules
- Described construction is Exact
- Suitable heuristics are available and, since they can use arbitrary subsets of the potential schedules, are powerful
25CDFG Representation
26CDFGs Multiple Control Paths
- Guard automata differentiate control paths
- Before control operation scheduled: control value unknown
- After control operation scheduled: control value known
- Guards are implemented as modified operation automata
27CDFGs Multiple Control Paths
- All control paths form ensemble schedule
- Possibly 2^c control paths to schedule (non-looping case)
- Dummy operation identifies when control path terminates
- Only one termination operation
- Ensemble schedule need not be causal!
- Need solution for each control path (Completeness)
- Need compatibility between paths whose control is not resolved (Causality)
- Solution: validation algorithm
- Validation is a path to path property for all control paths in ensemble schedule
- Fixed Point Iteration
28CDFG Example
29Validated CDFG Example
- Validation algorithm ensures control paths don't bifurcate before control value is known
30Validated CDFG Example
- Validation algorithm ensures control paths don't bifurcate before control value is known
- Pruned for all shortest paths as before
31Validation Algorithm
- Validation Proceeds on potential traces
- Re-traverse Automata, Dynamically Modifying Transition Relation based on current available states in each time step: allow guard computation only for states with matching histories if the guard is true or false.
- Iterate until fixed point on all paths
- Apply the following non-linear filter to each transition
32Selected CDFG Benchmarks
33Large Benchmarks
34Comparison of CPU Times
35Required CPU Seconds
36Construction for Looping DFGs
- Use trick: 0/1 representation of the MA could be interpreted as 2 mutually exclusive operand productions
- Schedule from know -> known -> known where each 0 -> 1 or 1 -> 0 transition requires a resource.
- Since dependencies are on operands, add new dependencies in 1 -> 0 sense as well
- Idea is to remove all transitions which do not have complete set of known or known predecessors for respective sense of operation
- So -- get looping DFG automata as nearly same automata as before
- preserve efficient representation
- Selection of Minimal Latency solutions is more difficult
37Loop construction resources
- Resources: we now count both 0 -> 1 and 1 -> 0 transitions as requiring a resource.
- Use Tuple BDD construction: at most k bits of n BDD
- Despite exponential number of product terms, BDD complexity O(bound × |V|)
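The size claim for the "at most k bits of n" constraint can be checked by counting the nodes of an ordered decision diagram in which nodes are shared by (bit index, ones seen so far). This is an illustrative sketch, not the authors' Tuple BDD construction; the function name is hypothetical.

```python
def at_most_k_nodes(n, k):
    """Count decision nodes for the constraint 'at most k of n bits are 1'."""
    nodes = set()

    def visit(i, ones):
        if ones > k:
            return  # terminal 0: bound already exceeded
        if n - i <= k - ones:
            return  # terminal 1: remaining bits cannot exceed the bound
        if (i, ones) in nodes:
            return  # node already built, so it is shared
        nodes.add((i, ones))
        visit(i + 1, ones)      # branch where bit i is 0
        visit(i + 1, ones + 1)  # branch where bit i is 1

    visit(0, 0)
    return len(nodes)
```

The count never exceeds n × (k + 1), so it grows with the product of the bound and the number of operations, even though the number of satisfying bit patterns grows combinatorially.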
38Example CA
- State order (v1,v2,v3,v4)
- Path 0,9,C,7,2,9,C,7,2,... is a valid schedule.
- By construction, only 1 instance of any operator
can occur in a state.
39Strategy to Find Maximal Throughput
- CA automata construction simple
- How to find closed subset of paths guaranteeing optimal throughput?
- Could start from known initial state and prune slow paths as before -- but this is not optimal!
- Instead find all reachable states (without resource bounds)
- Use state set to prune unreachable transitions from CA
- Choose operator at random to be pinned (marked)
- Propagate all states with chosen operator until it appears again in same sense
- Verify closure of constructed paths by Fixed Point iteration
- If set is empty -- add one clock to latency and verify again
- Result is maximal closed set of paths for which optimal throughput is guaranteed
40Maximal Throughput Example
- DFG above has closed 3-cycle solution (2 resources)
- However, average latency is 2.5 cycles
- (a,d) (b,e) (a,c) (b,d) (c,e) (a,d)
- Requires 5 states to implement optimal throughput instance
- In general, it is possible that a k-cycle closed solution may exist, even if no k-state solution can be found
- Current implementation finds all possible k-cycle solutions
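The 2.5-cycle figure follows from counting how often each operation appears in the five-state cycle. The sketch below does that arithmetic, assuming each operation executes exactly once per loop iteration; the function name is hypothetical.

```python
from collections import Counter

# Operation pairs from the slide; the final (a, d) repeats the first state.
STATES = [("a", "d"), ("b", "e"), ("a", "c"), ("b", "d"), ("c", "e")]


def average_latency(states):
    """Cycles per loop iteration for a repeating sequence of states."""
    counts = Counter(op for state in states for op in state)
    iterations = set(counts.values())
    if len(iterations) != 1:
        raise ValueError("operations execute unequal numbers of times")
    return len(states) / iterations.pop()
```

Each of the five operations appears twice, so the five states complete two iterations, which gives 5 / 2 = 2.5 cycles per iteration on two resources.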
41EWF Looping Benchmarks
42Synthetic Benchmarks
- Over 100 synthetic benchmarks tested
- Sizes: 50 operator, 100 operator, randomly assigned dependency chains, resources
- 32 had no causal schedule
- 35 had all maximum throughput schedules found in 15 minute timeout (1 minute Reachable States, 14 minute Fixed Point)
- 33 Timed Out
- Analysis of timeout cases: most included disconnected independent sub-graphs
- Trial partitioning of the Transition Relation looks very promising on these cases (time/space reduction nearly quadratic!)
43Synthetic Loop Benchmarks
44Schedule Exploration Loops
- Idea: Use partial symbolic traversal to find states bounding minimal latency paths
- Latency -- Identify all paths completing cycle in given number of steps
- Repeatability -- Fixed Point Algorithm to eliminate all paths which cannot repeat in given latency
- Validation -- Ensure all possible control paths are present for each remaining path
- Optimization -- Selection of Performance Objective
45Kernel Execution Sequence Set
- Path from Loop cut to first repeating states
- Represents candidates for loop kernel
Loop Kernel
46Repeatable Kernel Execution Sequence Set
- Fixed-point prunes non-repeating states
- Only repeatable loop kernels remain
- Paths not all same length
- Average latency < shortest Repeating Kernel
Loop Cut
Repeatable Loop Kernel
47Validation I
- Schedule Consists of bundle of compatible paths for each possible future
- Not Feasible to identify all schedules
- Instead, eliminate all states which do not belong to some ensemble schedule
- Fragile since any further pruning requires re-validation
- Double fixed point
48Validation II
- Path Divergence -- Control Behavior
- Ensure each path is part of some complete set for each control outcome
- Ensure that each set is Causal
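The completeness half of validation can be sketched as a greatest fixed point on an explicit state graph: states are dropped until every remaining state has a surviving successor for each control outcome it must handle. This covers only completeness, not the causality check, and the data layout and the name `validate` are assumptions.

```python
def validate(transitions, final):
    """Return the states that belong to some complete set of paths.

    transitions[s] maps each control outcome at s to the set of possible
    successor states (hypothetical model); final is the set of terminal
    states, which are always kept.
    """
    live = set(transitions) | set(final)
    changed = True
    while changed:
        changed = False
        for s in list(live):
            if s in final:
                continue
            by_outcome = transitions[s]
            # A state survives only if every outcome has a live successor.
            ok = bool(by_outcome) and all(
                succ & live for succ in by_outcome.values())
            if not ok:
                live.discard(s)
                changed = True
    return live
```

Removing one state can invalidate its predecessors, which is why the loop repeats until nothing changes, and why any further pruning requires the validation to be run again.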
49Loop Cuts and Kernels
- Method Covers all Conventional Loop Transformations
- Sequential Loop
- Loop winding
- Loop Pipelining
[Figure: loop cuts and loop kernels]
50Results
- Conventional Scheduling
- 100-500x speedup over ILP
- Control Scheduling: Complexity typically pseudo polynomial in number of branching variables
- Cyclic Scheduling
- Reduced preamble complexity
- Capacity: 200-500 operands in exact implementation
- General Control Dominated Scheduling
- Implicit formulation of all forms of CDFG transformation
- Exact Solutions with Millions of Control paths
- Protocol Constrained Scheduling
- Exact for small instances; needs sensible pruning of domain
51MIPS Model
- SimpleScalar (MIPS IV superset) Model
- Trace Probabilities from MediaBench
- Hierarchical Model
- Collection of Instruction Tasks in Flight
- Each Instruction Task is Complete Behavioral Model of Instruction Execution, including all instruction types, hazards, controls, and Contention for Physical Resources
- Additional Sequential Protocols for Memory Subsystem, both Fetch and Load/Store
52Processor Composition
- Ordered Fetch/Commit
- 3 Simultaneous Instruction Executions
- Sequencing of Instructions separated from pipeline
- Out of Order Prefetch or Commit can be Modeled
53PC update Speculative Fetch
- Speculate Joins to allow early prefetch and
address computation
54MIPS Transaction Dependencies
55MIPS Results Constraints
- Scenario A
- 1/2 cycle tasks, Single Bypass
- 2 cycle Pipelined Double Word Memory Fetch
- 2 cycle Pipelined Multiply
- 2R/1W Register File, 2 ALU's, 2 port Memory
- Scenario B
- 2 cycle Memory Read/Write/Fetch
- 2R -1R/1W Register File, 1 ALU, 1 port Memory
- Cache 1 cycle hit/3 cycle miss, Deferred Pipeline
56MIPS Results Instruction Mix
- Media Bench Tuning
- 88 reg-reg, reg-imm, br taken, load single
- 80 branch taken
- 35 Single Bypass Hazard
- 1 Multiple Bypass (Stall in model)
- Two Sets of Priority Mixes
- Mix 1 favors (reg-reg, reg-imm, br-taken)
- Mix 2 favors (load-sw, br-taken)
57MIPS Results Mix 1
- Mix 1 favors reg-reg, reg-imm, and br-taken
58MIPS Results Mix 2
- Mix 2 Favors loads, reg-reg w. branches
59Cache and I/O Protocol
- For 3 instructions in flight > 542,000 control paths!
- Schedules still exact: every optimal sequence is constructed
60Conclusions
- NFA protocol modeling shown to be effective representation for generalized scheduling problem
- Efficiency of algorithms so far is comparable or superior to any known exact technique
- Potential for powerful heuristics based on sub-set representation
- First exact solutions for a wide variety of generalized scheduling problems