Title: Dynamic Hardware/Software Partitioning: A First Approach
1Dynamic Hardware/Software Partitioning: A First Approach
- Greg Stitt, Roman Lysecky, Frank Vahid
- Department of Computer Science and Engineering
- University of California, Riverside
- Also with the Center for Embedded Computer Systems at UC Irvine
2Introduction
- Dynamic optimizations are an increasing trend
- Examples
- Dynamo
- Dynamic software optimizations
- Transmeta Crusoe
- Dynamic code morphing
- Just In Time Compilation
- Interpreted languages
- Advantages
- Transparent optimizations
- No designer effort
- No tool restrictions
- Adapts to actual usage
3Introduction
- Drawbacks of current dynamic optimizations
- Currently limited to software optimizations
- Limited speedup (1.1x to 1.3x common)
- Alternatively, we could perform hw/sw partitioning
- Achieves large speedups (2x to 10x common)
- However, performing it as a dynamic optimization is not presently possible
4Introduction
- Ideally, we would perform hardware/software partitioning dynamically
- Transparent partitioning
- Supports all sw languages/tools
- Most partitioning approaches have complex tool flows
- Achieves better results than software optimizations
- >2x speedup, energy savings
- Adapts to actual usage
- Appropriate architecture required
- Requires a processor and configurable logic
5Introduction
- Microprocessor/FPGA single-chip platforms make partitioning more attractive
- More efficient communication, smaller size
- Higher performance, low power
- Examples
- Xilinx Virtex II Pro, Triscend E5/A7, Altera Excalibur, Atmel FPSLIC
- Makes dynamic hw/sw partitioning more feasible
- However, partitioning must be performed at binary level
[Figure: timeline labels 1990s and 2003]
6Introduction
- Binary-level hw/sw partitioning
- Binary is profiled and hardware candidates are determined
- Regions to be partitioned are decompiled into a control/data flow graph (CDFG)
- CDFG is synthesized to hardware
- Binary is updated to use hardware
- Many advantages over source-level partitioning
- Supports any language or software compiler
- No change in tools
- Better software size and performance estimation at binary level
- Enables dynamic hw/sw partitioning
7Dynamic Hw/Sw Partitioning
[Figure: repeated instruction labels: add]
8Dynamic Hw/Sw Partitioning
[Figure: repeated instruction labels: beq]
9Dynamic Hw/Sw Partitioning
[Figure: repeated instruction labels (add); Dynamic Partitioning Module]
10Dynamic Hw/Sw Partitioning
[Figure: repeated instruction labels (beq); Dynamic Partitioning Module]
11Dynamic Hw/Sw Partitioning
[Figure: Dynamic Partitioning Module; regions labeled SW]
12Dynamic Hw/Sw Partitioning
[Figure: Dynamic Partitioning Module; regions labeled HW; Frequent Loops]
13Dynamic Hw/Sw Partitioning
[Figure: Configurable Logic; Frequent Loops]
14Dynamic Partitioning Module
- Dynamic partitioning module executes partitioning tools on chip
- Profiler, partitioning compiler, synthesis, place & route
[Figure: tool flow with labels SW Source, Profiler, Partitioning Compiler, SW Binary, Synthesis, Place & Route, HW]
15Dynamic Partitioning Module
- Synthesis and place & route tools all moved on-chip
- These tools typically execute on powerful workstations
- Most people will cringe at the idea of moving these tools on-chip
- However, dynamic partitioning deals with small regions of code
- Typically, small innermost loops
- Therefore, we can develop lean tools that work specifically for these small loops
- Lean tools make on-chip execution possible
- Area overhead becoming less critical due to Moore's Law
16System Architecture
- Microprocessors
- MIPS (may be many)
- On-chip memory
- Configurable logic
- Dynamic partitioning module
17Dynamic Partitioning Module
- Dynamically detects frequent loops and then reimplements the loops in hardware running on the configurable logic
- Architectural components
- Profiler
- Additional processor and memory
- But SOCs may have dozens anyway
- Alternatively, we could share main processor
18Configurable Logic
- Greatly simplified in order to create lean place & route tools
- DMA used to access memory
- Two registers
- R0_Input stores data from memory
- R1_InOut stores temporary data and data to write back to memory
- Fabric
- Supports combinational logic
- Implies loops must have body implemented in single cycle (temporary restriction)
[Figure: DMA, R0_Input, R1_InOut, Configurable Logic Fabric]
19Configurable Logic Fabric
- Fabric
- 3-input 2-output LUTs surrounded by switch matrices
- Switch Matrix
- Connects a wire to the same channel on a different side
- LUT
- 3-input (8 word) 2-output SRAM
[Figure: Configurable Logic Fabric, Switch Matrix, LUT]
20Tool Overview
- Tool flow slightly different from standard partitioning flow
- Decompilation
- Binary modification
21Loop Profiling
- Non-intrusive profiler
- Monitors instruction bus
- Very little overhead
- Small cache (16 entries) and 2,300 logic gates
- Less than 1% power overhead
[Figure: frequent loop cache architecture with labels Microprocessor, L1 Memory, Frequent Loop Cache, Frequent Loop Cache Controller; signals addr, data, rd/wr, sbb, saturation]
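The profiling idea above (a small cache of counters updated when a short backward branch, "sbb", is seen on the instruction bus) can be modeled in a few lines. This is only a sketch under stated assumptions: the class name, the counter width, the eviction policy, and the halve-on-saturation policy are all hypothetical choices for illustration, not details taken from the slides.

```python
class FrequentLoopCache:
    """Toy software model of a small loop-profiling cache (hypothetical design)."""

    def __init__(self, entries=16, counter_bits=8):
        self.entries = entries                      # 16 entries, as on the slide
        self.max_count = (1 << counter_bits) - 1    # counter width is an assumption
        self.counts = {}                            # loop address -> count

    def on_short_backward_branch(self, target_addr):
        """Record one taken short backward branch to target_addr."""
        if target_addr not in self.counts:
            if len(self.counts) >= self.entries:
                # Assumed policy: evict the least frequently seen loop.
                victim = min(self.counts, key=self.counts.get)
                del self.counts[victim]
            self.counts[target_addr] = 0
        self.counts[target_addr] += 1
        if self.counts[target_addr] >= self.max_count:
            # Assumed saturation policy: halve all counters to keep their ratios.
            for addr in self.counts:
                self.counts[addr] >>= 1

    def most_frequent(self, n=1):
        """Return the n loop addresses with the highest counts."""
        return sorted(self.counts, key=self.counts.get, reverse=True)[:n]
```

Because the model only observes branch targets, it mirrors the non-intrusive nature of the profiler: nothing in the profiled program has to change.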
22Decompilation
- Decompilation recovers high-level information
- Creates optimized CDFG
- All instruction-set inefficiencies are removed
- Binary partitioning has been shown to achieve similar results to source-level partitioning for many applications
- Greg Stitt, Frank Vahid, ICCAD 2002
23DMA Configuration
- Maps memory accesses to our DMA architecture
- Reads/writes
- Increment/decrement address updates
- Single/block request modes
- Optimizes DFG for DMA
- Removes address calculations
- Removes loop counters/exit conditions
24Register Transfer Synthesis
- Maps DFG operations to hw library components
- Adders, Comparators, Multiplexors, Shifters
- Creates Boolean expression for each output bit in dataflow graph by replacing hw components with corresponding expressions
r4[0] = r1[0] xor r2[0], carry[0] = r1[0] and r2[0]
r4[1] = (r1[1] xor r2[1]) xor carry[0], carry[1] = . .
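As an illustration of expanding an adder component into per-bit Boolean expressions, the sketch below generates expression strings for a ripple-carry adder. The function name and the textual form of the carry expression are assumptions; it uses the standard full-adder equations, which may differ from what the authors' tool emits.

```python
def adder_expressions(dst, a, b, width):
    """Return {signal: expression} for dst = a + b as a ripple-carry adder."""
    exprs = {}
    carry = None  # no carry into bit 0
    for i in range(width):
        ai, bi = f"{a}[{i}]", f"{b}[{i}]"
        if carry is None:
            # Half adder for the least significant bit.
            exprs[f"{dst}[{i}]"] = f"{ai} xor {bi}"
            carry_expr = f"{ai} and {bi}"
        else:
            # Standard full adder for the remaining bits.
            exprs[f"{dst}[{i}]"] = f"({ai} xor {bi}) xor {carry}"
            carry_expr = f"({ai} and {bi}) or ({carry} and ({ai} xor {bi}))"
        carry = f"carry[{i}]"
        exprs[carry] = carry_expr
    return exprs


for name, expr in adder_expressions("r4", "r1", "r2", 2).items():
    print(f"{name} = {expr}")
```

The first three generated expressions have the same form as the slide's example for r4 = r1 + r2.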
25Logic Synthesis
- Optimizes Boolean equations from RT synthesis
- Large opportunity for logic minimization due to use of immediate values in the binary
- Simple on-chip 2-level logic minimization method
- Lysecky/Vahid DAC03, session 20.4 (9:45 Wed)
r2[0] = r1[0] xor 0 xor 0
r2[1] = r1[1] xor 0 xor carry[0]
r2[2] = r1[2] xor 1 xor carry[1]
r2[3] = r1[3] xor 0 xor carry[2]
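To show why immediate values leave so much room for simplification, here is a small sketch that folds the constant bits of an immediate into the adder equations. It performs plain constant propagation only; it is not the two-level minimization method cited above, and the function name and output format are hypothetical.

```python
def add_immediate_expressions(dst, src, imm, width):
    """Per-bit expressions for dst = src + imm, with constant bits folded in."""
    exprs = []
    carry = "0"  # the carry into bit 0 is the constant 0
    for i in range(width):
        a = f"{src}[{i}]"
        k = (imm >> i) & 1  # constant bit of the immediate
        if carry == "0":
            s = f"not {a}" if k else a          # a xor k xor 0
            carry = a if k else "0"             # a and k
        elif k == 0:
            s = f"{a} xor {carry}"              # a xor 0 xor c
            carry = f"({a} and {carry})"        # carry out when k = 0
        else:
            s = f"not ({a} xor {carry})"        # a xor 1 xor c
            carry = f"({a} or {carry})"         # carry out when k = 1
        exprs.append(f"{dst}[{i}] = {s}")
    return exprs


for line in add_immediate_expressions("r2", "r1", 4, 4):
    print(line)
```

For an immediate of 4 (binary 0100), the constant pattern in the equations above, the two low bits reduce to plain wires, bit 2 becomes an inverter, and bit 3 needs a single xor.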
26Technology Mapping
- Maps logic operations to 3-input, 2-output LUTs
- Traverse logic network and combine nodes to determine single-output LUTs
- Combine nodes to form two-output LUTs
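A 3-input, 2-output LUT can be viewed as an 8-word memory whose words hold two bits each. The sketch below builds such a table from two Boolean functions that share the same three inputs; a full adder (sum and carry) is a natural example of two single-output functions that pack into one two-output LUT. The helper names and the bit packing are assumptions made for illustration.

```python
from itertools import product


def build_lut(f0, f1):
    """Return an 8-word table; each word packs two output bits as (f1 << 1) | f0."""
    table = []
    # Iterating (c, b, a) in this order visits addresses 0..7, address = c*4 + b*2 + a.
    for c, b, a in product((0, 1), repeat=3):
        table.append(((f1(a, b, c) & 1) << 1) | (f0(a, b, c) & 1))
    return table


def lut_read(table, a, b, c):
    """Look up both outputs for inputs a, b, c; returns (out0, out1)."""
    word = table[(c << 2) | (b << 1) | a]
    return word & 1, (word >> 1) & 1


# Full adder: output 0 is the sum bit, output 1 is the carry out.
full_adder = build_lut(
    lambda a, b, c: a ^ b ^ c,
    lambda a, b, c: (a & b) | (c & (a ^ b)),
)
```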
27Placement
- Nodes along critical path are placed in single horizontal row
- Build dependencies between remaining nodes and placed nodes
- Use dependencies to place remaining nodes
- Either above or below placed nodes
28Routing
- Greedy algorithm
- At each switch matrix, choose direction to route
- Continue to route until reaching a switch matrix that is already in use
- Backtrack to previous switch matrix, and try another direction
- Place and route is the most complex task
- Currently working on improvements
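The greedy route-and-backtrack idea can be sketched as a depth-first search over a grid of switch matrices. Everything concrete here is an assumption for illustration: the grid model, treating a whole switch matrix as either free or in use, and choosing the next direction by Manhattan distance to the destination. The actual router and fabric are likely more detailed.

```python
def route(src, dst, used, width, height):
    """Greedy depth-first route from src to dst over a grid of switch matrices.

    used is a set of (x, y) matrices already occupied by earlier nets.
    Returns the path as a list of (x, y) tuples, or None if no route exists.
    """
    path = [src]
    visited = {src}

    def step(pos):
        if pos == dst:
            return True
        x, y = pos
        neighbours = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        # Greedy choice: try the direction that ends closest to dst first.
        neighbours.sort(key=lambda p: abs(p[0] - dst[0]) + abs(p[1] - dst[1]))
        for nxt in neighbours:
            nx, ny = nxt
            if not (0 <= nx < width and 0 <= ny < height):
                continue  # off the fabric
            if nxt in used or nxt in visited:
                continue  # already in use: try another direction
            visited.add(nxt)
            path.append(nxt)
            if step(nxt):
                return True
            path.pop()  # backtrack to the previous switch matrix
        return False

    return path if step(src) else None
```

For example, with matrices (1, 0) and (2, 0) already in use on a 4x2 grid, a route from (0, 0) to (3, 0) detours through the second row.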
29Bitfile Creation
- Combines placed-and-routed hardware description with DMA configuration into bitfile
- Used to initialize the configurable logic
30Binary Modification
- Updates the application binary in order to utilize the new hardware
- Loop replaced with jump to hw initialization code
- Wisconsin Architectural Research Tool Set (WARTS)
- EEL (Executable Editing Library)
- We assume memory is RAM or programmable ROM
loop:
    Load r2, 0(r1)
    Add r1, r1, 1
    Add r3, r3, r2
    Blt r1, 8, loop
after_loop:
    ..
- hw_init
- Initialize HW registers
- Enable HW
- Shutdown processor
- Woken up by HW interrupt
- Store any results
- Jump to after_loop
31Tool Statistics
- Executed on SimpleScalar
- Similar to a MIPS instruction set
- Used 60 MHz clock (like Triscend A7 device)
- Statistics
- Total run time of only 1.09 seconds
- Requires less than ½ megabyte of RAM
- Code size much smaller than standard synthesis tools
32Experiments
- Benchmark Information
- Powerstone (Brev, g3fax12)
- NetBench (url)
- Logic minimization kernel (logmin)
- Statistics
- 55% of total time spent in loops that are moved to hardware
- Ideal speedup of 2.8
- These loops were only 2.4% of the size of the original application
33Experiments
- Results
- Achieved average speedup of 2.6, close to ideal 2.8
- Hardware loops were 20X faster than software loops
- Even with simple architecture and tools, large speedups were achieved
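The link between the fraction of time spent in the partitioned loops and the overall speedup follows Amdahl's law. The sketch below uses hypothetical function names. Note that the 2.6 and 2.8 figures above are averages, presumably over the benchmarks, and the average of per-benchmark speedups generally differs from the speedup evaluated at the average loop fraction, so plugging in single averaged inputs will not necessarily reproduce them.

```python
def overall_speedup(loop_fraction, loop_speedup):
    """Amdahl's law: speedup when loop_fraction of the run time gets loop_speedup."""
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)


def ideal_speedup(loop_fraction):
    """Upper bound: the loops moved to hardware take zero time."""
    return 1.0 / (1.0 - loop_fraction)


# Example for one hypothetical application spending half its time in hot loops.
print(ideal_speedup(0.5))
print(overall_speedup(0.5, 20.0))
```

The gap between the two values shrinks as the hardware loop speedup grows, which is why a large loop speedup can bring the achieved result close to the ideal one.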
34Conclusion
- Dynamic hardware/software partitioning has advantages over other partitioning approaches
- Completely transparent
- Designers get performance/energy benefits of hw/sw partitioning by simply writing software
- Quality likely not as good as desktop CAD for some applications, so most suitable when transparency is critical (very often!)
- Achieved average speedup of 2.6
- Very close to ideal speedup of 2.8
- Future work
- More complex configurable logic fabric
- Designed in close conjunction with on-chip CAD tools
- Sequential logic and increased inputs/outputs
- Support larger hardware regions, not just simple loops
- Improved algorithms (especially place and route)
- Handle more complex memory access patterns