Forrest Brewer forrest@ece.ucsb.edu - PowerPoint PPT Presentation

About This Presentation

Description:

Santa Barbara. Forrest Brewer forrest_at_ece.ucsb.edu. UCSB CAD and Test Group. ECE/UCSB Santa Barbara CA 93106. Scheduling is Behavioral Synthesis ... – PowerPoint PPT presentation

Slides: 61

Provided by: steveh90

Learn more at: http://bears.ece.ucsb.edu

Transcript and Presenter's Notes

1
NDFA Based Scheduling

Forrest Brewer, Steve Haynal
University of California
Santa Barbara

2
Scheduling is Behavioral Synthesis

Exploits fundamental freedom -- ordering and
binding of operations, operands
Subdivided into DFG transformation, resource
allocation, time-scheduling, operation binding,
memory binding, communication binding, resource
modeling, reallocation...
Complexity of tasks requires top-down flow -- yet
evaluations/constraints are bottom-up
Behavioral Synthesis difficult to use!
Seemingly trivial changes cause vast output
changes
Design tradeoffs tied to a particular point
language (VHDL, Verilog, Silage, Esterel...)
No direct control of implementation
No direct control of binding, mapping
No distinction between problem statement and
constraints
No canonical representation of design space
Fundamental problem covers enormous scope
Universality issues in specification
How to capture design mapping knowledge?
How to create verifiable design representation
without canonical model?

Our viewpoint -- wrong problem

3
Simpler Problem

Assume Designer creates the design
Support incremental refinement of design at all
levels of representation
Support incremental design synthesis when
possible
Provide well defined hierarchy on which to place
constraints, trial implementations ...
Provide mechanism for subsystem abstraction,
modeling and evaluation at each level
How to do this?
Drop representation distinction between logic,
module, and sub-system levels
Drop potential for universality in internal
representations
Create mechanism for automatic design abstraction
within designer's design decomposition
Use efficient representation of fundamental model
Provide feedback to designer for evaluating both
the design itself and the representation
Where do we start?
Interface Protocols are key complexity growth
problem
Designer constructs system model with abstract
protocols, required data-flows, possible maps
Generalize scheduling to provide possible
sequencing of sub-systems into systems meeting
external protocol constraints (models)

4
Protocol Constrained Scheduling

Problem: Conventional scheduling algorithms
cannot accommodate the typical complex sequencing
and timing constraints of modern design.
Three Problems: Specification, Scheduling, and
Problem Scale
Specification: How to specify the required timing
in a concise, explicit way?
Scheduling: How to systematically exploit mapping
freedom while meeting the timing requirements?
Problem Scale: Problems of interest to industry
are enormously complex!
Idea: Protocol specification is amenable to NDFA
modeling -- so create automata-based model to
represent Control/Data-flow freedom => All
possible implementations exist as sequences of
states of the joint automaton

5
Protocol Specification

Sequencing complexity of digital system
interfaces increasing
Specification languages (Verilog, VHDL) require
implicit protocol specification
Alternative specification via NDFA automata (e.g.
PBS, Esterel, Custom point language)
Representation is finite
Synthesis can be very efficient -- can handle
very complex designs
Provides mechanism for time sequence
specification relatively independent of data-flow
control semantics
Protocol + CDFG semantics + mapping abstractions
make a complete model
No ad-hoc mapping library (beyond control of
designer)
No convenient dependency binding assumptions (to
be worked around by designer)
No encrypting desired sequential FSM in higher
level language!
Designer specifies event sequences he wants
System evaluates/synthesizes ensemble FSM

6
Design Representation

Model System as hierarchy of design frames
Frames have external protocol specification NDFA,
CDFG, and allowed Mappings
Frames contain instances of other frames
abstractions (abstracted NDFA/CDFG model)
Resource utilization and sharing restricted to
within a design frame

[Figure: a design frame with its external protocol, control data flow graph, and sub-frame models]
7
Hierarchy of Refinement

Exact protocol scheduling intractable for
practical large problems
Hierarchy of Refinement
partition the problem into manageable
abstractions
hides lower level details
allows systematic high-level pruning of designs
before more detailed treatment
Completed sub-frame designs can be abstracted to
high level component models
allows incremental design change/refinement at
any level
--provides mechanism for consistency verification

8
Protocol Scheduling Implementation

Represent CDFG model as Causal (NDFA) Automaton
Generalization of current scheduling model
Models all valid data flows
Models code hoisting, unrolling,
transformations...
Represent External Protocol as NDFA automaton
Very general, efficient model
Synchronous timing model (can be generalized--
future work)
Alternative behavior as NDFA alternatives
CDFG maps I/O operations among sub-frames
Sub-frames have interface protocols, abstracted
CDFG semantics
Construct ensemble automata model with all valid
sequences of events meeting internal and external
protocols and causal data-flow constraints
Need only find complete sub-set of all possible
states for solution

9
Scheduling Solution

Every schedule is some subset of states of the
ensemble automaton
Must construct causal and complete set of states
Exact solution strategies
Construct all states up to resource bounds
Depth-first search of states
Heuristic search -- choose good path, complete
schedule automatically
Prune solution space
Additional constraints or objectives -- technique
works best when highly constrained
Heuristic strategies
Sub-set BDD representation of reachable states
Incremental search (this is not verification!)
Possible objectives
Communication
Temporary storage (memory)
Performance
Control complexity

10
DFA Model of Two Stage Pipe

Input 1 indicates operands are supplied to the
pipe
Output 1 indicates operand is produced by the
pipe

[Figure: state table for the two-stage pipe DFA, states a, b, c, d]
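As a rough illustration of the kind of machine this slide describes, the sketch below encodes a two-stage pipe as a deterministic automaton. The state encoding (a pair of occupancy bits) and the names `step` and `run` are assumptions made for illustration; they are not the a-d state table from the slide.

```python
# Hypothetical two-stage pipe: state = (stage1_full, stage2_full).
# Input 1 means operands enter the pipe; output 1 means a result leaves it.

def step(state, inp):
    """Advance one clock and return (next_state, output)."""
    stage1, stage2 = state
    output = stage2              # whatever sat in stage 2 is produced now
    next_state = (inp, stage1)   # new operand enters stage 1, old one moves on
    return next_state, output

def run(inputs, state=(0, 0)):
    """Feed a sequence of input bits and collect the output bits."""
    outputs = []
    for inp in inputs:
        state, out = step(state, inp)
        outputs.append(out)
    return outputs
```

Under this encoding each output bit is simply the input bit from two cycles earlier, which is the behaviour a deterministic two-stage pipe would be expected to show.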
11
NDFA Protocol for Two Stage Pipe

Inputs and outputs same as DFA model
Some transitions produce no outputs

12
Operand Scheduling a CDFG on NDFA Protocol

CDFG to Schedule
Two stage NDFA protocol description for component
Protocol alone is insufficient -- need internal
data-flow requirements
Mapping is trivial (in this case)
Protocol + CDFG is sufficient -- but also
describes information not needed externally
Solution: Simplify scheduling solution of
sub-frame to make abstracted model

13
Operand Schedule on NDFA Protocol

Optimal one multiplier schedule (co-execution of
protocol and causal automata)

14
Causal Automaton Formulation of Scheduling

Scheduling Problem $(V, E, C, R)$
vertex $v \in V$ is an operation
edge $(u,v) \in E$ is a directed edge representing a
data dependency
hyper-edge $\{v_c, V_{Tc}, V_{Fc}\} \in C$ groups a control
operation and corresponding subsets of operations
hyper-edge $\{\text{bound}, (T \subseteq V)\} \in R$ represents a
resource bound applied to a subset of (mapped)
operations
The edge set is partitioned into a forest of
forward edges and a subset of looping edges which
point backward
Scheduling solution is a complete, compatible set
of deterministic sequences of vertices such that
all dependencies are causal and all resource
bounds are met at each state, and the set has
sequences for each possible future value of the
set of controls.
In the following, we will discuss minimum latency
and maximal throughput as objective functions.

15
Single-Cycle Operation Modeling Automata
[Figure: two-state automaton for one operation, with transitions 0->0, 0->1, 1->1, 1->0]

0->0 Operation unscheduled and remains so

0->1 Operation scheduled next cycle

1->1 Operation scheduled and remains so

1->0 Operation scheduled but result lost
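The four transitions of the per-operation automaton can be written down as a small lookup table. This is only a sketch: the names `TRANSITIONS` and `scheduled_ops`, and the choice of a tuple of bits as the state, are assumptions for illustration.

```python
# Meaning of each (current_bit, next_bit) pair for one operation's automaton.
TRANSITIONS = {
    (0, 0): "unscheduled, remains so",
    (0, 1): "scheduled next cycle",
    (1, 1): "scheduled, remains so",
    (1, 0): "scheduled but result lost",
}

def scheduled_ops(state, next_state):
    """Indices of operations that make the 0->1 transition in this step."""
    return [i for i, pair in enumerate(zip(state, next_state)) if pair == (0, 1)]
```

For example, moving from state `(0, 1, 0)` to `(1, 1, 0)` schedules only operation 0.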

16
Scheduling Automata

State represents current set of available
operands and state of modeling protocol automata
Constraints on transitions
Representation Compact
Product of Mapped Modeling automata for each
resource protocol

17
Resource Bounds

0->1 indicates resource use
Resource bounds constrain simultaneous 0->1
transitions
Iterative constraint on CA

ROBDD representation
2 x bound x operations

[Figure: constraint for one resource]
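A resource bound can be checked on a single transition by counting how many of the mapped operations go from 0 to 1. The sketch below does this with explicit state tuples instead of the ROBDD used in the slides; the function name `within_bound` and its parameters are assumptions.

```python
def within_bound(state, next_state, ops, bound):
    """True if at most `bound` of the operations in `ops` go 0->1 in this step.

    `ops` lists the indices of operations mapped to the bounded resource.
    """
    started = sum(1 for i in ops if state[i] == 0 and next_state[i] == 1)
    return started <= bound
```

With one resource shared by three operations, a step that starts two of them at once would be rejected, while a step that starts only one would be kept.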
18
Dependency Implication

All transitions in which j is active before
all of its predecessors are known are removed
BDD Complexity is O(predecessors x
operations)
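The dependency rule can be sketched as a filter on explicit transitions: a step is rejected if any operation becomes known while one of its predecessors is still unknown. The name `is_causal` and the `preds` dictionary (operation index to predecessor indices) are assumptions for illustration, and the slides apply this rule symbolically on a BDD rather than per transition.

```python
def is_causal(state, next_state, preds):
    """True if every operation that becomes known had all predecessors known.

    `preds` maps an operation index to the indices it depends on.
    """
    for j, (cur, nxt) in enumerate(zip(state, next_state)):
        becomes_known = cur == 0 and nxt == 1
        if becomes_known and not all(state[p] == 1 for p in preds.get(j, ())):
            return False
    return True
```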

19
Example NFA

Assume 1 resource

Transition relation induces graph

Any path from all operations unknown to all known
is a valid schedule

Shortest paths are minimum latency schedules
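To make the path view concrete, here is a minimal explicit-state sketch: breadth-first search from the all-unknown state to the all-known state gives the minimum latency. It assumes a single resource class with one bound, ignores the 1->0 transition, and enumerates states directly instead of using BDDs; `successors`, `min_latency`, and `preds` are hypothetical names.

```python
from collections import deque
from itertools import combinations

def successors(state, preds, bound):
    """Next states: start up to `bound` ready operations; known bits stay known."""
    ready = [j for j, bit in enumerate(state)
             if bit == 0 and all(state[p] for p in preds.get(j, ()))]
    for k in range(0, min(bound, len(ready)) + 1):
        for chosen in combinations(ready, k):
            yield tuple(1 if j in chosen else bit for j, bit in enumerate(state))

def min_latency(n_ops, preds, bound):
    """Length of a shortest path from all-unknown to all-known, or None."""
    start, goal = (0,) * n_ops, (1,) * n_ops
    dist = {start: 0}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        if s == goal:
            return dist[s]
        for t in successors(s, preds, bound):
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    return None
```

For a three-operation graph in which operation 2 depends on operations 0 and 1, one resource gives a latency of 3 cycles and two resources give 2 cycles.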

20
All Minimum Latency Schedules

Symbolic reachable state analysis

Newly reached states are saved each cycle

Backward pruning preserves transitions used in
all shortest paths
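The two passes described on this slide can be sketched with explicit sets standing in for the symbolic (BDD) state sets. The forward pass stores the newly reached states of each cycle; the backward pass keeps only states that still lead to the goal in the remaining cycles. All names below are hypothetical, and a single resource bound is assumed.

```python
from itertools import combinations

def successors(state, preds, bound):
    """Next states obtained by starting 1..bound ready operations."""
    ready = [j for j, bit in enumerate(state)
             if bit == 0 and all(state[p] for p in preds.get(j, ()))]
    for k in range(1, min(bound, len(ready)) + 1):
        for chosen in combinations(ready, k):
            yield tuple(1 if j in chosen else bit for j, bit in enumerate(state))

def all_min_latency(n_ops, preds, bound):
    """Per-cycle sets of states that lie on some minimum latency schedule."""
    start, goal = (0,) * n_ops, (1,) * n_ops
    layers, seen = [{start}], {start}
    # Forward pass: save the newly reached states of each cycle.
    while goal not in layers[-1]:
        new = {t for s in layers[-1] for t in successors(s, preds, bound)} - seen
        if not new:
            return None              # goal is unreachable
        seen |= new
        layers.append(new)
    # Backward pass: keep states with a successor kept in the next cycle.
    kept = [set() for _ in layers]
    kept[-1] = {goal}
    for i in range(len(layers) - 2, -1, -1):
        kept[i] = {s for s in layers[i]
                   if any(t in kept[i + 1] for t in successors(s, preds, bound))}
    return kept
```

With three operations, operation 1 depending on operation 0, and two resources, the state that starts only the independent operation 2 in the first cycle is pruned, because no two-cycle schedule passes through it.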

24
All Minimum Latency Schedules

Described construction is exact.
Suitable heuristics are available and are
powerful, since they can use arbitrary subsets
of the potential schedules.

25
CDFG Representation
26
CDFGs Multiple Control Paths

Guard automata differentiate control paths
Before control operation scheduled

Control value unknown

After control operation scheduled

Guards are implemented as modified operation
automata

27
CDFGs Multiple Control Paths

All control paths form ensemble schedule
Possibly 2^c control paths to schedule
(non-looping case)
Dummy operation identifies when control path
terminates
Only one termination operation
Ensemble schedule need not be causal!
Need solution for each control path
(Completeness)
Need compatibility between paths whose control is
not resolved (Causality)
Solution validation algorithm
Validation is a path to path property for all
control paths in ensemble schedule
Fixed Point Iteration

28
CDFG Example

One green resource

Shortest paths

False termination

29
Validated CDFG Example

Validation algorithm ensures control paths don't
bifurcate before control value is known

30
Validated CDFG Example

Validation algorithm ensures control paths don't
bifurcate before control value is known
Pruned for all shortest paths as before

31
Validation Algorithm

Validation Proceeds on potential traces
Re-traverse Automata, Dynamically Modifying
Transition Relation based on current available
states in each time step. Allow guard computation
only for states with matching histories if the
guard is true or false.
Iterate until fixed point on all paths
Apply the following non-linear filter to each
transition

32
Selected CDFG Benchmarks
33
Large Benchmarks
34
Comparison of CPU Times
35
Required CPU Seconds
36
Construction for Looping DFGs

Use trick: 0/1 representation of the MA could be
interpreted as 2 mutually exclusive operand
productions
Schedule from known -> known -> known where each
0->1 or 1->0 transition requires a resource.
Since dependencies are on operands, add new
dependencies in 1->0 sense as well
Idea is to remove all transitions which do not
have complete set of known or known predecessors
for respective sense of operation
So -- get looping DFG automata as nearly same
automata as before
preserve efficient representation
Selection of Minimal Latency solutions is more
difficult

37
Loop construction resources

Resources: we now count both 0->1 and 1->0
transitions as requiring a resource.
Use Tuple BDD construction: at-most-k-bits-of-n
BDD
Despite exponential number of product terms, BDD
complexity O(bound x |V|)
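The gap between the number of product terms and the size of the decision diagram can be made plausible with a small count. Deciding "at most k of n bits set" only requires remembering the variable index and the number of ones seen so far, so a layered decision graph needs at most n x (k + 1) nodes. The sketch below counts those nodes; it is an upper bound on a layered diagram, not a reduced BDD built by a BDD package, and both function names are hypothetical.

```python
from math import comb

def at_most_k_nodes(n, k):
    """Count distinct (variable index, ones so far) decision nodes."""
    nodes = set()
    frontier = {0}                      # ones counted before variable 0
    for i in range(n):
        nxt = set()
        for ones in frontier:
            nodes.add((i, ones))
            nxt.add(ones)               # variable i is 0
            if ones + 1 <= k:
                nxt.add(ones + 1)       # variable i is 1, still within bound
        frontier = nxt
    return len(nodes)

def satisfying_assignments(n, k):
    """Number of assignments with at most k ones."""
    return sum(comb(n, j) for j in range(k + 1))
```

For n = 20 and k = 3 the layered graph has 74 nodes, while 1351 assignments satisfy the constraint.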

38
Example CA

State order (v1,v2,v3,v4)
Path 0,9,C,7,2,9,C,7,2,... is a valid schedule.
By construction, only 1 instance of any operator
can occur in a state.

39
Strategy to Find Maximal Throughput

CA automata construction simple
How to find closed subset of paths guaranteeing
optimal throughput
Could start from known initial state and prune
slow paths as before-- but this is not optimal!
Instead find all reachable states (without
resource bounds)
Use state set to prune unreachable transitions
from CA
Choose operator at random to be pinned (marked)
Propagate all states with chosen operator until
it appears again in same sense
Verify closure of constructed paths by Fixed
Point iteration
If set is empty -- add one clock to latency and
verify again
Result is maximal closed set of paths for which
optimal throughput is guaranteed

40
Maximal Throughput Example

DFG above has closed 3-cycle solution (2
resources)
However, average latency is 2.5 cycles:
(a,d) (b,e) (a,c) (b,d) (c,e) (a,d)
Requires 5 states to implement optimal throughput
instance
In general, it is possible that a k-cycle closed
solution may exist, even if no k-state solution
can be found
Current implementation finds all possible k-cycle
solutions
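The 2.5-cycle figure can be checked by counting: one period of the listed sequence has five states, and each of the five operations a-e appears twice, so two loop iterations complete every five cycles. The helper below is a sketch with hypothetical names; `period` holds the first five states from the slide, since the sixth repeats the first.

```python
from collections import Counter

def average_latency(states):
    """Cycles per iteration for one period of a repeating schedule."""
    counts = Counter(op for state in states for op in state)
    iterations = set(counts.values())
    if len(iterations) != 1:
        raise ValueError("operations do not all occur equally often")
    return len(states) / iterations.pop()

# One period of the schedule listed on the slide (two operations per state).
period = [("a", "d"), ("b", "e"), ("a", "c"), ("b", "d"), ("c", "e")]
```

Calling `average_latency(period)` divides 5 states by 2 iterations.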

41
EWF Looping Benchmarks
42
Synthetic Benchmarks

Over 100 synthetic benchmarks tested
Sizes 50 operator, 100 operator, randomly
assigned dependency chains, resources
32 had no causal schedule
35 had all maximum throughput schedules found in
15 minute timeout (1 minute Reachable States, 14
minute Fixed Point)
33 Timed Out
Analysis of timeout cases most included
disconnected independent sub-graphs
Trial partitioning of the Transition Relation
looks very promising on these cases (time/space
reduction nearly quadratic!)

43
Synthetic Loop Benchmarks
44
Schedule Exploration Loops

Idea: Use partial symbolic traversal to find
states bounding minimal latency paths
Latency-- Identify all paths completing cycle in
given number of steps
Repeatability-- Fixed Point Algorithm to
eliminate all paths which cannot repeat in given
latency
Validation-- Ensure all possible control paths
are present for each remaining path
Optimization-- Selection of Performance Objective

45
Kernel Execution Sequence Set

Path from Loop cut to first repeating states
Represents candidates for loop kernel

Loop Kernel
46
Repeatable Kernel Execution Sequence Set

Fixed-point prunes non-repeating states
Only repeatable loop kernels remain
Paths not all same length
Average latency < shortest Repeating Kernel

[Figure: loop cut and repeatable loop kernel]
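The flavour of the fixed-point pruning can be shown on an explicit graph. The sketch below is a simplification, not the slides' algorithm: it repeatedly drops states that have no successor left in the candidate set, so only states from which execution can continue indefinitely remain. The name `prune_non_repeating` and the `succ` dictionary of successor sets are assumptions.

```python
def prune_non_repeating(states, succ):
    """Greatest fixed point: drop states with no successor left in the set.

    `succ` maps each state to the set of its successor states.
    """
    alive = set(states)
    while True:
        keep = {s for s in alive if succ.get(s, set()) & alive}
        if keep == alive:
            return alive
        alive = keep
```

On a graph with a cycle A -> B -> C -> A and a dead-end branch C -> D -> E, the first iteration removes E, the second removes D, and the third confirms that A, B, and C form the fixed point.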
47
Validation I

Schedule Consists of bundle of compatible paths
for each possible future
Not Feasible to identify all schedules
Instead, eliminate all states which do not belong
to some ensemble schedule
Fragile since any further pruning requires
re-validation
Double fixed point

48
Validation II

Path Divergence -- Control Behavior
Ensure each path is part of some complete set for
each control outcome
Ensure that each set is Causal

49
Loop Cuts and Kernels

Method Covers all Conventional Loop
Transformations
Sequential Loop
Loop winding
Loop Pipelining

[Figure: loop cut and loop kernel shown for each transformation]
50
Results

Conventional Scheduling
100-500x speedup over ILP
Control Scheduling Complexity typically pseudo
polynomial in number of branching variables
Cyclic Scheduling
Reduced preamble complexity
Capacity 200-500 operands in exact
implementation
General Control Dominated Scheduling
Implicit formulation of all forms of CDFG
transformation
Exact Solutions with Millions of Control paths
Protocol Constrained Scheduling
Exact for small instances; needs sensible
pruning of domain

51
MIPS Model

SimpleScalar (MIPS IV superset) Model
Trace Probabilities from MediaBench
Hierarchical Model
Collection of Instruction Tasks in Flight
Each Instruction Task is Complete Behavioral
Model of Instruction Execution, including all
instruction types, hazards, controls, and
Contention for Physical Resources
Additional Sequential Protocols for Memory
Subsystem, both Fetch and Load/Store

52
Processor Composition

Ordered Fetch/Commit
3 Simultaneous Instruction Executions
Sequencing of Instructions separated from
pipeline
Out of Order Prefetch or Commit can be Modeled

53
PC update Speculative Fetch

Speculate Joins to allow early prefetch and
address computation

54
MIPS Transaction Dependencies
55
MIPS Results Constraints

Scenario A
1/2 cycle tasks, Single Bypass
2 cycle Pipelined Double Word Memory Fetch
2 cycle Pipelined Multiply
2R/1W Register File, 2 ALU's, 2 port Memory
Scenario B
2 cycle Memory Read/Write/Fetch
2R -1R/1W Register File, 1 ALU, 1 port Memory
Cache 1 cycle hit/3 cycle miss, Deferred Pipeline

56
MIPS Results Instruction Mix

Media Bench Tuning
88 reg-reg, reg-imm, br taken, load single
80 branch taken
35 Single Bypass Hazard
1 Multiple Bypass (Stall in model)
Two Sets of Priority Mixes
Mix 1 favors (reg-reg, reg-imm, br-taken)
Mix 2 favors (load-sw, br-taken)

57
MIPS Results Mix 1

Mix 1 favors reg-reg, reg-imm, and br-taken

58
MIPS Results Mix 2

Mix 2 Favors loads, reg-reg w. branches

59
Cache and I/O Protocol

For 3 instructions in flight: > 542,000 control
paths!
Schedules still exact: every optimal sequence is
constructed

60
Conclusions

NFA protocol modeling shown to be effective
representation for generalized scheduling problem
Efficiency of algorithms so far is comparable or
superior to any known exact technique
Potential for powerful heuristics based on
sub-set representation
First exact solutions for a wide variety of
generalized scheduling problems
