Dynamic Hardware/Software Partitioning: A First Approach

1
Dynamic Hardware/Software Partitioning: A First
Approach

Greg Stitt, Roman Lysecky, Frank Vahid
Department of Computer Science and Engineering
University of California, Riverside
Also with the Center for Embedded Computer
Systems at UC Irvine

2
Introduction

Dynamic optimizations an increasing trend
Examples
Dynamo
Dynamic software optimizations
Transmeta Crusoe
Dynamic code morphing
Just In Time Compilation
Interpreted languages
Advantages
Transparent optimizations
No designer effort
No tool restrictions
Adapts to actual usage

3
Introduction

Drawbacks of current dynamic optimizations
Currently limited to software optimizations
Limited speedup (1.1x to 1.3x common)
Alternatively, we could perform hw/sw
partitioning
Achieve large speedups (2x to 10x common)
However, hw/sw partitioning cannot presently be
performed dynamically

4
Introduction

Ideally, we would perform hardware/software
partitioning dynamically
Transparent partitioning
Supports all sw languages/tools
Most partitioning approaches have complex tool
flows
Achieves better results than software
optimizations
>2x speedup, energy savings
Adapts to actual usage
Appropriate architecture required
Requires a processor and configurable logic

5
Introduction

Microprocessor/FPGA single-chip platforms make
partitioning more attractive
More efficient communication, smaller size
Higher performance, low power
Examples
Xilinx Virtex II Pro, Triscend E5/A7, Altera
Excalibur, Atmel FPSLIC
Makes dynamic hw/sw partitioning more feasible
However, partitioning must be performed at binary
level

[Figure: platforms, 1990s and 2003]
6
Introduction

Binary-level hw/sw partitioning
Binary is profiled and hardware candidates are
determined
Regions to be partitioned are decompiled into
CDFG
CDFG is synthesized to hardware
Binary is updated to use hardware
Many advantages over source-level partitioning
Supports any language or software compiler
No change in tools
Better software size and performance estimation
at binary level
Enables dynamic hw/sw partitioning

7
Dynamic Hw/Sw Partitioning
[Figure: stream of add instructions]
8
Dynamic Hw/Sw Partitioning
[Figure: stream of beq instructions]
9
Dynamic Hw/Sw Partitioning
[Figure: stream of add instructions with the Dynamic Partitioning Module]
10
Dynamic Hw/Sw Partitioning
[Figure: stream of beq instructions with the Dynamic Partitioning Module]
11
Dynamic Hw/Sw Partitioning
[Figure: Dynamic Partitioning Module, regions labeled SW]
12
Dynamic Hw/Sw Partitioning
[Figure: Dynamic Partitioning Module, regions labeled HW, Frequent Loops]
13
Dynamic Hw/Sw Partitioning
[Figure: Configurable Logic, Frequent Loops]
14
Dynamic Partitioning Module

Dynamic partitioning module executes partitioning
tools on chip
Profiler, partitioning compiler, synthesis,
place & route

[Figure: tool flow. SW Source, Profiler, Partitioning Compiler, SW Binary, Synthesis, Place & Route, HW]
15
Dynamic Partitioning Module

Synthesis and place & route tools all moved
on-chip
These tools typically execute on powerful
workstations
Most people will cringe at the idea of moving these
tools on-chip
However, dynamic partitioning deals with small
regions of code
Typically, small innermost loops
Therefore, we can develop lean tools that work
specifically for these small loops
Lean tools make on-chip execution possible
Area overhead becoming less critical due to
Moore's Law

16
System Architecture

Microprocessors
MIPS (may be many)
On-chip memory
Configurable logic
Dynamic partitioning module

17
Dynamic Partitioning Module

Dynamically detects frequent loops and then
reimplements the loops in hardware running on the
configurable logic
Architectural components
Profiler
Additional processor and memory
But SOCs may have dozens anyways
Alternatively, we could share main processor

18
Configurable Logic

Greatly simplified in order to create lean place
& route tools
DMA used to access memory
Two registers
R0_Input stores data from memory
R1_InOut stores temporary data and data to write
back to memory
Fabric
Supports combinational logic
Implies loops must have body implemented in
single cycle (temporary restriction)

[Figure: DMA, R0_Input, R1_InOut, Configurable Logic Fabric]
19
Configurable Logic Fabric

Fabric
3-input 2-output LUTs surrounded by switch
matrices
Switch Matrix
Connect wire to same channel on different side
LUT
3-input (8 word) 2-output SRAM
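
A 3-input, 2-output LUT can be pictured as an 8-word memory addressed by its three inputs. The sketch below is a minimal Python model under that reading; the class name `Lut3x2` and the bit ordering of the address are assumptions made for illustration, not details from the slides. A one-bit full adder is used as the example because it has exactly three inputs and two outputs.

```python
class Lut3x2:
    """A 3-input, 2-output LUT modeled as an 8-word SRAM of 2-bit words."""

    def __init__(self):
        self.sram = [(0, 0)] * 8  # one 2-bit word per input combination

    def program(self, func):
        """Fill the SRAM from func(a, b, c) -> (out0, out1)."""
        for addr in range(8):
            a, b, c = (addr >> 2) & 1, (addr >> 1) & 1, addr & 1
            self.sram[addr] = func(a, b, c)

    def read(self, a, b, c):
        """The three inputs form the SRAM address; the stored word is the output."""
        return self.sram[(a << 2) | (b << 1) | c]


def full_adder(a, b, cin):
    """Sum and carry-out of a one-bit full adder."""
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))


lut = Lut3x2()
lut.program(full_adder)
print(lut.read(1, 1, 0))  # (0, 1): sum 0, carry 1
```

Under this model, configuring the fabric amounts to writing eight 2-bit words per LUT plus the switch-matrix settings.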

[Figure: Configurable Logic Fabric, Switch Matrix, LUT]
20
Tool Overview

Tool flow slightly different from standard
partitioning flow
Decompilation
Binary modification

21
Loop Profiling

Non-intrusive profiler
Monitors instruction bus
Very little overhead
Small cache (16 entries) and 2,300 logic gates
Less than 1% power overhead
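
As a rough software model of such a profiler, the sketch below counts taken backward branches in a 16-entry table keyed by branch target. Only the 16-entry size comes from the slide. The counter width, the eviction policy, and the halve-on-saturation step are assumptions chosen to keep the example small, and `FrequentLoopCache` here is a hypothetical Python class, not the hardware design.

```python
class FrequentLoopCache:
    """Counts taken backward branches, keyed by branch target address."""

    def __init__(self, entries=16, max_count=255):
        self.entries = entries        # 16 entries, as on the slide
        self.max_count = max_count    # assumed counter width (8 bits)
        self.counts = {}              # loop start address -> execution count

    def observe(self, pc, target, taken):
        """Call for every branch seen on the instruction bus."""
        if not taken or target >= pc:
            return                    # only taken backward branches mark loops
        if target not in self.counts:
            if len(self.counts) == self.entries:
                coldest = min(self.counts, key=self.counts.get)
                del self.counts[coldest]   # assumed policy: evict the coldest loop
            self.counts[target] = 0
        self.counts[target] += 1
        if self.counts[target] >= self.max_count:
            # Assumed saturation handling: halve every counter to keep relative order.
            for addr in self.counts:
                self.counts[addr] //= 2

    def hottest(self):
        """Start address of the most frequently executed loop, or None."""
        return max(self.counts, key=self.counts.get) if self.counts else None


cache = FrequentLoopCache()
for _ in range(100):
    cache.observe(pc=0x120, target=0x100, taken=True)   # hot loop
for _ in range(5):
    cache.observe(pc=0x220, target=0x200, taken=True)   # cold loop
cache.observe(pc=0x300, target=0x340, taken=True)       # forward branch, ignored
```

Because the model only watches branch addresses, it never has to stall or instrument the program, which is the sense in which the profiler is non-intrusive.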

[Figure: profiler architecture. Micro-processor, L1 Memory, Frequent Loop Cache, Frequent Loop Cache Controller; signals rd/wr, addr, data, sbb, saturation]
22
Decompilation

Decompilation recovers high-level information
Creates optimized CDFG
All instruction-set inefficiencies are removed
Binary partitioning has been shown to achieve
similar results to source-level partitioning for
many applications
Greg Stitt, Frank Vahid, ICCAD 2002

23
DMA Configuration

Maps memory accesses to our DMA architecture
Reads/writes
Increment/decrement address updates
Single/block request modes
Optimizes DFG for DMA
Removes address calculations
Removes loop counters/exit conditions
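
To make the effect of this step concrete, the sketch below models the summing loop used later in the Binary Modification slide twice: once as software, and once with the address update and exit test folded into a DMA descriptor. The `DmaConfig` fields are hypothetical names invented for this example; the slides list the DMA capabilities but not a descriptor format.

```python
from dataclasses import dataclass


@dataclass
class DmaConfig:
    """Hypothetical DMA descriptor: what to fetch and how the address moves."""
    base: int      # first address to access
    step: int      # +1 for increment, -1 for decrement
    count: int     # number of transfers (replaces the loop counter)
    write: bool    # False = read stream, True = write stream


def software_loop(mem, r1, r3):
    """Direct model of the loop: load, bump address, accumulate, compare."""
    while True:
        r2 = mem[r1]
        r1 += 1
        r3 += r2
        if not r1 < 8:
            return r3


def hardware_loop(mem, cfg, r3):
    """DMA streams the operands; the datapath keeps only the accumulate."""
    addr = cfg.base
    for _ in range(cfg.count):
        r0_input = mem[addr]   # DMA fills R0_Input
        r3 += r0_input         # the only operation left in the DFG
        addr += cfg.step
    return r3


mem = [3, 1, 4, 1, 5, 9, 2, 6]
cfg = DmaConfig(base=0, step=1, count=8, write=False)
assert software_loop(mem, 0, 0) == hardware_loop(mem, cfg, 0)
```

The address increment and the `Blt` comparison disappear from the datapath, leaving a single add to be synthesized.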

24
Register Transfer Synthesis

Maps DFG operations to hw library components
Adders, Comparators, Multiplexors, Shifters
Creates Boolean expression for each output bit in
dataflow graph by replacing hw components with
corresponding expressions

r4[0] = r1[0] xor r2[0], carry[0] = r1[0] and r2[0]
r4[1] = (r1[1] xor r2[1]) xor carry[0], carry[1] = . .
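
The per-bit expansion of an adder can be sketched as a small generator of Boolean expressions. The function name `adder_expressions` is hypothetical, and the carry expression for bits above 0 is the textbook full-adder carry, which the slide does not spell out.

```python
def adder_expressions(dst, a, b, width):
    """Per-bit Boolean expressions (as strings) for dst = a + b, ripple-carry style."""
    exprs = {}
    carry = None
    for i in range(width):
        half = f"({a}[{i}] xor {b}[{i}])"
        if carry is None:
            # Bit 0 has no carry-in, so it reduces to a half adder.
            exprs[f"{dst}[{i}]"] = f"{a}[{i}] xor {b}[{i}]"
            exprs[f"carry[{i}]"] = f"{a}[{i}] and {b}[{i}]"
        else:
            exprs[f"{dst}[{i}]"] = f"{half} xor {carry}"
            # Textbook full-adder carry (assumed; not shown on the slide).
            exprs[f"carry[{i}]"] = (
                f"({a}[{i}] and {b}[{i}]) or ({half} and {carry})"
            )
        carry = f"carry[{i}]"
    return exprs


for name, expr in adder_expressions("r4", "r1", "r2", 2).items():
    print(f"{name} = {expr}")
```

Each hardware library component would contribute a similar template, so the whole dataflow graph ends up as one Boolean expression per output bit.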
25
Logic Synthesis

Optimizes Boolean equations from RT synthesis
Large opportunity for logic minimization due to
use of immediate values in the binary
Simple on-chip 2-level logic minimization method
Lysecky/Vahid DAC'03, session 20.4 (9:45 Wed)

r2[0] = r1[0] xor 0 xor 0
r2[1] = r1[1] xor 0 xor carry[0]
r2[2] = r1[2] xor 1 xor carry[1]
r2[3] = r1[3] xor 0 xor carry[2]
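
The benefit of immediate operands can be illustrated with plain constant folding. The sketch below is not the two-level minimization method the slide refers to; it is a simpler stand-in, with hypothetical helper names, showing how much of an adder vanishes when one operand is the constant 4.

```python
def xor_(x, y):
    """XOR with constant folding; operands are 0, 1, or a signal name."""
    if x == 0:
        return y
    if y == 0:
        return x
    if x == 1 and y == 1:
        return 0
    if x == 1:
        return f"~{y}"
    if y == 1:
        return f"~{x}"
    return f"({x} ^ {y})"


def and_(x, y):
    """AND with constant folding."""
    if x == 0 or y == 0:
        return 0
    if x == 1:
        return y
    if y == 1:
        return x
    return f"({x} & {y})"


def or_(x, y):
    """OR with constant folding."""
    if x == 1 or y == 1:
        return 1
    if x == 0:
        return y
    if y == 0:
        return x
    return f"({x} | {y})"


def add_immediate(reg, imm, width):
    """Simplified per-bit sum expressions for reg + imm."""
    sums, carry = [], 0
    for i in range(width):
        a, b = f"{reg}[{i}]", (imm >> i) & 1
        half = xor_(a, b)
        sums.append(xor_(half, carry))
        carry = or_(and_(a, b), and_(carry, half))
    return sums


print(add_immediate("r1", 4, 4))
# ['r1[0]', 'r1[1]', '~r1[2]', '(r1[3] ^ r1[2])']
```

Two of the four output bits become plain wires and no carry chain survives below bit 2, which is the kind of reduction that makes the mapped hardware small.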
26
Technology Mapping

Maps logic operations to 3-input, 2-output LUTs
Traverse logic network and combine nodes to
determine single output LUTs
Combine nodes to form two output LUTs
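
One way to read these two steps is as greedy cone covering followed by pairing. The sketch below is an assumption about how such a mapper could work, not the authors' algorithm: `map_to_luts` absorbs fan-in nodes into a root while the cone has at most three inputs, and `pair_luts` packs single-output LUTs that share the same inputs into two-output LUTs. The network format (node name to list of fan-in names) is also hypothetical.

```python
def map_to_luts(network, outputs, k=3):
    """Greedy cone covering. Returns {root: sorted input list}, one LUT per root."""
    luts = {}
    todo = list(outputs)
    while todo:
        root = todo.pop()
        if root in luts or root not in network:
            continue  # already mapped, or a primary input
        leaves = list(network[root])
        grew = True
        while grew:
            grew = False
            for leaf in leaves:
                if leaf in network:  # internal node: try absorbing it
                    trial = [l for l in leaves if l != leaf]
                    trial += [x for x in network[leaf] if x not in trial]
                    if len(trial) <= k:
                        leaves, grew = trial, True
                        break
        luts[root] = sorted(leaves)
        todo.extend(leaves)  # remaining internal leaves need their own LUTs
    return luts


def pair_luts(luts):
    """Pack single-output LUTs with identical inputs into two-output LUTs."""
    by_inputs = {}
    for root, leaves in luts.items():
        by_inputs.setdefault(tuple(leaves), []).append(root)
    packed = []
    for leaves, roots in by_inputs.items():
        roots = sorted(roots)
        for i in range(0, len(roots), 2):
            packed.append((leaves, tuple(roots[i:i + 2])))
    return packed


# One-bit full adder as a gate network: node -> fan-in nodes.
network = {
    "t": ["a", "b"], "s": ["t", "cin"],
    "g": ["a", "b"], "p": ["t", "cin"], "cout": ["g", "p"],
}
luts = map_to_luts(network, ["s", "cout"])
packed = pair_luts(luts)
```

On this example the greedy pass produces three single-output LUTs, and pairing merges the two that share inputs a, b, cin. A smarter covering could fit the whole adder into one LUT, which the greedy pass misses.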

27
Placement

Nodes along critical path are placed in single
horizontal row
Build dependencies between remaining nodes and
placed nodes
Use dependencies to place remaining nodes
Either above or below placed nodes
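
A minimal sketch of this placement strategy follows, assuming the mapped design is a DAG given as node name to fan-in list. The coordinate scheme, the choice of the first placed neighbor's column, and the alternating above/below search are assumptions for illustration only.

```python
from itertools import count


def critical_path(fanins):
    """Longest chain of nodes in a DAG given as {node: [fan-in nodes]}."""
    best = {}

    def longest(n):
        if n not in best:
            preds = [longest(p) for p in fanins[n] if p in fanins]
            best[n] = max(preds, key=len, default=[]) + [n]
        return best[n]

    return max((longest(n) for n in fanins), key=len)


def place(fanins):
    """Return {node: (x, y)} with the critical path on row y = 0."""
    path = critical_path(fanins)
    pos = {node: (x, 0) for x, node in enumerate(path)}
    used = set(pos.values())
    for node in fanins:
        if node in pos:
            continue
        # Columns of already placed nodes that this node depends on or feeds.
        near = [pos[m][0] for m in pos
                if m in fanins[node] or node in fanins[m]]
        x = near[0] if near else 0
        for d in count(1):  # try one row above, one below, then two above, ...
            free = [(x, y) for y in (d, -d) if (x, y) not in used]
            if free:
                pos[node] = free[0]
                used.add(free[0])
                break
    return pos


fanins = {"n1": [], "n2": ["n1"], "n3": ["n2"], "n4": ["n1"], "n5": ["n1"]}
placement = place(fanins)
```

Keeping the critical path in one row means its signals cross switch matrices in a straight line, while off-path nodes sit in the same column as the node they connect to.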

28
Routing

Greedy algorithm
At each switch matrix, choose direction to route
Continue to route until reaching switch matrix
that is already in use
Backtrack to previous switch matrix, and try
another direction
Place and route is the most complex task
currently working on improvements
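
The greedy-with-backtracking idea can be sketched on a grid of switch matrices. Everything concrete here is assumed for illustration: the fabric is a plain width-by-height grid, each switch matrix is either free or in use, and the preferred direction is the one that most reduces the Manhattan distance to the destination.

```python
def route(src, dst, used, width, height):
    """Greedy depth-first route over a grid of switch matrices, or None."""
    path, visited = [src], {src}
    while path:
        x, y = path[-1]
        if (x, y) == dst:
            return path
        # Prefer the direction that most reduces distance to the destination.
        moves = sorted(
            [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)],
            key=lambda p: abs(p[0] - dst[0]) + abs(p[1] - dst[1]),
        )
        for nxt in moves:
            if (0 <= nxt[0] < width and 0 <= nxt[1] < height
                    and nxt not in used and nxt not in visited):
                visited.add(nxt)
                path.append(nxt)
                break
        else:
            path.pop()  # dead end: backtrack to the previous switch matrix
    return None


# The direct route from (0, 0) to (2, 0) is blocked at (1, 0).
detour = route((0, 0), (2, 0), used={(1, 0)}, width=3, height=3)
print(detour)  # [(0, 0), (0, 1), (1, 1), (2, 1), (2, 0)]
```

A caller would add each returned path to `used` before routing the next net, so later nets treat those switch matrices as occupied.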

29
Bitfile Creation

Combines placed-and-routed hardware description with
DMA configuration into bitfile
Used to initialize the configurable logic

30
Binary Modification

Updates the application binary in order to
utilize the new hardware
Loop replaced with jump to hw initialization code
Wisconsin Architectural Research Tool Set (WARTS)
EEL (Executable Editing Library)
We assume memory is RAM or programmable ROM

loop:
  Load r2, 0(r1)
  Add r1, r1, 1
  Add r3, r3, r2
  Blt r1, 8, loop
after_loop:
  ..

hw_init:
Initialize HW registers
Enable HW
Shutdown processor
Woken up by HW interrupt
Store any results
Jump to after_loop
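
The patching step can be sketched on a toy textual program. This is an illustration only: a real tool edits machine code (the slide names EEL for that), whereas the sketch below works on a list of strings, and the `Jump` mnemonic and comment lines in the stub are placeholders rather than real instructions.

```python
def patch_loop(program, loop_label, exit_label):
    """Overwrite the loop's first instruction with a jump to a hw_init stub."""
    patched = list(program)
    entry = patched.index(f"{loop_label}:") + 1
    patched[entry] = "  Jump hw_init"  # overwrite in place, so no addresses shift
    patched += [
        "hw_init:",
        "  ; initialize HW registers, enable HW",
        "  ; shut down processor, wake on HW interrupt",
        "  ; store any results",
        f"  Jump {exit_label}",
    ]
    return patched


program = [
    "loop:",
    "  Load r2, 0(r1)",
    "  Add r1, r1, 1",
    "  Add r3, r3, r2",
    "  Blt r1, 8, loop",
    "after_loop:",
]
patched = patch_loop(program, "loop", "after_loop")
```

Overwriting rather than inserting keeps every other address in the binary unchanged; the rest of the original loop body simply becomes unreachable.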

31
Tool Statistics

Executed on SimpleScalar
Similar to a MIPS instruction set
Used 60 MHz clock (like Triscend A7 device)
Statistics
Total run time of only 1.09 seconds
Requires less than ½ megabyte of RAM
Code size much smaller than standard synthesis
tools

32
Experiments

Benchmark Information
Powerstone (Brev, g3fax12)
NetBench (url)
Logic minimization kernel (logmin)
Statistics
55% of total time spent in loops that are moved
to hardware
Ideal speedup of 2.8
These loops were only 2.4% of the size of the
original application

33
Experiments

Results
Achieved average speedup of 2.6, close to ideal
2.8
Hardware loops were 20X faster than software
loops
Even with simple architecture and tools, large
speedups were achieved
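
The relation between loop time and overall speedup is Amdahl's law. The sketch below applies it to the averages quoted in the slides and to two made-up benchmarks. The per-benchmark fractions are hypothetical, and the reading that the reported 2.8 and 2.6 are means of per-benchmark speedups is an assumption; under that reading, they need not equal what the average 55% loop fraction yields on its own.

```python
def speedup(loop_fraction, loop_speedup=float("inf")):
    """Amdahl's law: overall speedup when only the loops get faster."""
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)


# Plugging the slide's averages straight into the formula.
ideal_from_average = speedup(0.55)          # loops take zero time
with_20x_loops = speedup(0.55, 20)          # loops run 20X faster

# Hypothetical per-benchmark loop fractions that also average to 55%.
fractions = [0.30, 0.80]
mean_of_ideals = sum(speedup(f) for f in fractions) / len(fractions)

print(round(ideal_from_average, 2), round(with_20x_loops, 2), round(mean_of_ideals, 2))
```

Averaging speedups and averaging loop fractions give different numbers, so per-benchmark data would be needed to reproduce the reported averages.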

34
Conclusion

Dynamic hardware/software partitioning has
advantages over other partitioning approaches
Completely transparent
Designers get performance/energy benefits of
hw/sw partitioning by simply writing software
Quality likely not as good as desktop CAD for
some applications, so most suitable when
transparency is critical (very often!)
Achieved average speedup of 2.6
Very close to ideal speedup of 2.8
Future work
More complex configurable logic fabric
Designed in close conjunction with on-chip CAD
tools
Sequential logic and increased inputs/outputs
Support larger hardware regions, not just simple
loops
Improved algorithms (especially place and route)
Handle more complex memory access patterns