Dynamic Hardware/Software Partitioning: A First Approach

Greg Stitt, Roman Lysecky, Frank Vahid*. Department of Computer Science and Engineering, University of California, Riverside. *Also with the Center for Embedded Computer ...

1
Dynamic Hardware/Software Partitioning: A First Approach
  • Greg Stitt, Roman Lysecky, Frank Vahid
  • Department of Computer Science and Engineering
  • University of California, Riverside
  • Also with the Center for Embedded Computer
    Systems at UC Irvine

2
Introduction
  • Dynamic optimizations an increasing trend
  • Examples
  • Dynamo
  • Dynamic software optimizations
  • Transmeta Crusoe
  • Dynamic code morphing
  • Just In Time Compilation
  • Interpreted languages
  • Advantages
  • Transparent optimizations
  • No designer effort
  • No tool restrictions
  • Adapts to actual usage

3
Introduction
  • Drawbacks of current dynamic optimizations
  • Currently limited to software optimizations
  • Limited speedup (1.1x to 1.3x common)
  • Alternatively, we could perform hw/sw
    partitioning
  • Achieve large speedups (2x to 10x common)
  • However, presently dynamic optimization not
    possible

4
Introduction
  • Ideally, we would perform hardware/software
    partitioning dynamically
  • Transparent partitioning
  • Supports all sw languages/tools
  • Most partitioning approaches have complex tool
    flows
  • Achieves better results than software
    optimizations
  • >2x speedup, energy savings
  • Adapts to actual usage
  • Appropriate architecture required
  • Requires a processor and configurable logic

5
Introduction
  • Microprocessor/FPGA single-chip platforms make
    partitioning more attractive
  • More efficient communication, smaller size
  • Higher performance, low power
  • Examples
  • Xilinx Virtex II Pro, Triscend E5/A7, Altera
    Excalibur, Atmel FPSLIC
  • Makes dynamic hw/sw partitioning more feasible
  • However, partitioning must be performed at binary
    level

6
Introduction
  • Binary-level hw/sw partitioning
  • Binary is profiled and hardware candidates are
    determined
  • Regions to be partitioned are decompiled into
    CDFG
  • CDFG is synthesized to hardware
  • Binary is updated to use hardware
  • Many advantages over source-level partitioning
  • Supports any language or software compiler
  • No change in tools
  • Better software size and performance estimation
    at binary level
  • Enables dynamic hw/sw partitioning

7
Dynamic Hw/Sw Partitioning
[Figure labels: add (repeated)]
8
Dynamic Hw/Sw Partitioning
[Figure labels: beq (repeated)]
9
Dynamic Hw/Sw Partitioning
[Figure labels: Dynamic Partitioning Module; add (repeated)]
10
Dynamic Hw/Sw Partitioning
[Figure labels: Dynamic Partitioning Module; beq (repeated)]
11
Dynamic Hw/Sw Partitioning
[Figure labels: Dynamic Partitioning Module; SW (repeated)]
12
Dynamic Hw/Sw Partitioning
[Figure labels: Dynamic Partitioning Module; HW (repeated); Frequent Loops]
13
Dynamic Hw/Sw Partitioning
[Figure labels: Configurable Logic; Frequent Loops]
14
Dynamic Partitioning Module
  • Dynamic partitioning module executes partitioning
    tools on chip
  • Profiler, partitioning compiler, synthesis,
    place and route

[Figure labels: SW Source, Profiler, Partitioning Compiler, SW Binary, Synthesis, Place and Route, HW]
15
Dynamic Partitioning Module
  • Synthesis and place-and-route tools all moved
    on-chip
  • These tools typically execute on powerful
    workstations
  • Most people will cringe at the idea of moving these
    tools on-chip
  • However, dynamic partitioning deals with small
    regions of code
  • Typically, small innermost loops
  • Therefore, we can develop lean tools that work
    specifically for these small loops
  • Lean tools make on-chip execution possible
  • Area overhead becoming less critical due to
    Moore's Law

16
System Architecture
  • Microprocessors
  • MIPS (may be many)
  • On-chip memory
  • Configurable logic
  • Dynamic partitioning module

17
Dynamic Partitioning Module
  • Dynamically detects frequent loops and then
    reimplements the loops in hardware running on the
    configurable logic
  • Architectural components
  • Profiler
  • Additional processor and memory
  • But SoCs may have dozens anyway
  • Alternatively, we could share main processor

18
Configurable Logic
  • Greatly simplified in order to create lean
    place-and-route tools
  • DMA used to access memory
  • Two registers
  • R0_Input stores data from memory
  • R1_InOut stores temporary data and data to write
    back to memory
  • Fabric
  • Supports combinational logic
  • Implies loops must have body implemented in
    single cycle (temporary restriction)

[Figure labels: DMA, R0_Input, R1_InOut, Configurable Logic Fabric]
19
Configurable Logic Fabric
  • Fabric
  • 3-input, 2-output LUTs surrounded by switch
    matrices
  • Switch Matrix
  • Connects a wire to the same channel on a different side
  • LUT
  • 3-input (8-word), 2-output SRAM

[Figure labels: Configurable Logic Fabric, Switch Matrix, LUT]
20
Tool Overview
  • Tool flow slightly different from standard
    partitioning flow
  • Decompilation
  • Binary modification

21
Loop Profiling
  • Non-intrusive profiler
  • Monitors instruction bus
  • Very little overhead
  • Small cache (16 entries) and 2,300 logic gates
  • Less than 1% power overhead

[Figure labels: Micro-processor, To L1 Memory, Frequent Loop Cache, Frequent Loop Cache Controller; signals: rd/wr, addr, data, sbb, saturation]
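
The profiler's behaviour can be sketched in software. This is a toy model only: the slide gives the 16-entry size and the `sbb` and `saturation` signal names, while the class name, the counter width, the eviction policy, and the halving on saturation are assumptions made for illustration, and `sbb` is read here as "short backward branch".

```python
class FrequentLoopCache:
    """Toy model of a small cache that counts short backward branches (sbb)."""

    def __init__(self, entries=16, max_count=255):
        self.entries = entries      # 16 entries, as stated on the slide
        self.max_count = max_count  # counter width is an assumption
        self.counts = {}            # branch target address -> count

    def observe_sbb(self, target_addr):
        """Record one taken short backward branch to target_addr."""
        if target_addr not in self.counts:
            if len(self.counts) >= self.entries:
                # Assumed policy: evict the least frequently seen loop.
                coldest = min(self.counts, key=self.counts.get)
                del self.counts[coldest]
            self.counts[target_addr] = 0
        self.counts[target_addr] += 1
        if self.counts[target_addr] >= self.max_count:
            # Assumed policy: on saturation, halve every counter so the
            # relative ordering of loops is preserved.
            for addr in self.counts:
                self.counts[addr] //= 2

    def hottest(self, n=1):
        """Return the n most frequently executed loop addresses."""
        return sorted(self.counts, key=self.counts.get, reverse=True)[:n]
```

Feeding the model one call per taken backward branch and then calling `hottest()` yields the candidate loops for hardware implementation.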
22
Decompilation
  • Decompilation recovers high-level information
  • Creates optimized CDFG
  • All instruction-set inefficiencies are removed
  • Binary partitioning has been shown to achieve
    similar results to source-level partitioning for
    many applications
  • Greg Stitt, Frank Vahid, ICCAD 2002

23
DMA Configuration
  • Maps memory accesses to our DMA architecture
  • Reads/writes
  • Increment/decrement address updates
  • Single/block request modes
  • Optimizes DFG for DMA
  • Removes address calculations
  • Removes loop counters/exit conditions

24
Register Transfer Synthesis
  • Maps DFG operations to hw library components
  • Adders, Comparators, Multiplexors, Shifters
  • Creates Boolean expression for each output bit in
    dataflow graph by replacing hw components with
    corresponding expressions

r4[0] = r1[0] xor r2[0], carry[0] = r1[0] and r2[0]
r4[1] = (r1[1] xor r2[1]) xor carry[0], carry[1] = ...
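
The per-bit expansion of an adder into Boolean expressions can be sketched as follows. This is a minimal illustration assuming a ripple-carry adder; the function name and the string output format are hypothetical, not taken from the authors' tool.

```python
def adder_bit_expressions(dst, a, b, width):
    """Return Boolean expression strings for dst = a + b, bit by bit."""
    exprs = []
    carry = None  # there is no carry into bit 0
    for i in range(width):
        x, y = f"{a}[{i}]", f"{b}[{i}]"
        if carry is None:
            total = f"{x} xor {y}"
            carry_out = f"{x} and {y}"
        else:
            total = f"({x} xor {y}) xor {carry}"
            carry_out = f"({x} and {y}) or (({x} xor {y}) and {carry})"
        exprs.append(f"{dst}[{i}] = {total}")
        exprs.append(f"carry[{i}] = {carry_out}")
        carry = f"carry[{i}]"
    return exprs


for line in adder_bit_expressions("r4", "r1", "r2", 2):
    print(line)
```

Each hardware library component would contribute a generator of this kind, so the dataflow graph ends up as one Boolean expression per output bit.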
25
Logic Synthesis
  • Optimizes Boolean equations from RT synthesis
  • Large opportunity for logic minimization due to
    use of immediate values in the binary
  • Simple on-chip 2-level logic minimization method
  • Lysecky/Vahid, DAC'03, session 20.4 (9:45 Wed)

r2[0] = r1[0] xor 0 xor 0
r2[1] = r1[1] xor 0 xor carry[0]
r2[2] = r1[2] xor 1 xor carry[1]
r2[3] = r1[3] xor 0 xor carry[2]
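
Why immediates help can be seen by folding the constant bits into the adder equations. The sketch below uses plain constant propagation, which is not the 2-level minimization method the slide cites; all function names are hypothetical. Constants are Python ints (0 or 1) and everything else is an expression string.

```python
def b_not(x):
    return 1 - x if isinstance(x, int) else f"not {x}"

def b_xor(x, y):
    if isinstance(x, int):
        return y if x == 0 else b_not(y)
    if isinstance(y, int):
        return x if y == 0 else b_not(x)
    return f"({x} xor {y})"

def b_and(x, y):
    if isinstance(x, int):
        return y if x == 1 else 0
    if isinstance(y, int):
        return x if y == 1 else 0
    return f"({x} and {y})"

def b_or(x, y):
    if isinstance(x, int):
        return y if x == 0 else 1
    if isinstance(y, int):
        return x if y == 0 else 1
    return f"({x} or {y})"

def add_immediate(src, imm, width):
    """Per-bit expressions for src + imm with the immediate folded in."""
    bits, carry = [], 0
    for i in range(width):
        a, k = f"{src}[{i}]", (imm >> i) & 1
        bits.append(b_xor(b_xor(a, k), carry))
        # carry-out = (a and k) or ((a xor k) and carry)
        carry = b_or(b_and(a, k), b_and(b_xor(a, k), carry))
    return bits


print(add_immediate("r1", 4, 4))
```

For an immediate of 4, the two low bits pass through unchanged and no carry logic is needed below bit 2, which is the kind of reduction the slide refers to.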
26
Technology Mapping
  • Maps logic operations to 3-input, 2-output LUTs
  • Traverse logic network and combine nodes to
    determine single output LUTs
  • Combine nodes to form two output LUTs
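
As an illustration of what a two-output LUT holds, the sketch below builds the 8-word contents of a 3-input, 2-output LUT. A full adder is used as the example because its sum and carry share the same three inputs; the function names and the address ordering are assumptions.

```python
from itertools import product

def lut_config(f0, f1):
    """Build the 8-word, 2-bit-wide contents of a 3-input, 2-output LUT."""
    table = []
    for a, b, c in product((0, 1), repeat=3):  # address = a*4 + b*2 + c
        table.append((f0(a, b, c) & 1, f1(a, b, c) & 1))
    return table

def lut_read(table, a, b, c):
    """Look up both outputs for one input combination."""
    return table[a * 4 + b * 2 + c]

# Two single-output functions of the same inputs combined into one LUT.
full_adder = lut_config(
    lambda a, b, c: a ^ b ^ c,                # sum
    lambda a, b, c: (a & b) | (c & (a ^ b)),  # carry-out
)
```

Combining two single-output nodes this way only works when together they need at most three distinct inputs, which is presumably what the second mapping pass checks.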

27
Placement
  • Nodes along critical path are placed in single
    horizontal row
  • Build dependencies between remaining nodes and
    placed nodes
  • Use dependencies to place remaining nodes
  • Either above or below placed nodes
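
One way to read the placement steps above is sketched below. The critical path occupies row 0, and every other node is put in the column of an already placed node it connects to, in the nearest free row above or below. The data layout, function names, and tie-breaking order are assumptions, not the authors' algorithm.

```python
def neighbours(node, edges):
    """Nodes that share an edge with `node`."""
    return [b if a == node else a for a, b in edges if node in (a, b)]

def place(critical_path, edges):
    """Return {node: (column, row)}; row 0 holds the critical path."""
    pos = {node: (col, 0) for col, node in enumerate(critical_path)}
    occupied = set(pos.values())
    remaining = {n for edge in edges for n in edge} - set(pos)
    while remaining:
        # Pick a node that depends on something already placed.
        for node in sorted(remaining):
            anchors = [n for n in neighbours(node, edges) if n in pos]
            if anchors:
                break
        else:
            raise ValueError("unconnected nodes cannot be placed")
        col = pos[anchors[0]][0]
        # Find the nearest row offset with a free cell above or below.
        offset = 1
        while (col, offset) in occupied and (col, -offset) in occupied:
            offset += 1
        row = offset if (col, offset) not in occupied else -offset
        pos[node] = (col, row)
        occupied.add((col, row))
        remaining.discard(node)
    return pos
```

For example, `place(["a", "b", "c"], [("x", "b"), ("y", "b")])` puts `x` and `y` directly above and below `b`.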

28
Routing
  • Greedy algorithm
  • At each switch matrix, choose direction to route
  • Continue to route until reaching switch matrix
    that is already in use
  • Backtrack to previous switch matrix, and try
    another direction
  • Place and route most complex task
  • Currently working on improvements
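
The greedy route-and-backtrack idea can be sketched on a grid of switch matrices. This is a simplified model: each switch matrix is treated as a single resource identified by `(x, y)`, and the "closest to the destination first" ordering is an assumption about how the direction is chosen.

```python
def route(src, dst, width, height, used):
    """Greedy depth-first route between switch matrices on a grid.

    `used` is a set of (x, y) switch matrices taken by other nets; it is
    updated in place with the matrices claimed by this route.
    Returns the path as a list of (x, y), or None if no route exists.
    """
    path, visited = [src], {src}
    while path:
        x, y = path[-1]
        if (x, y) == dst:
            used.update(path)
            return path
        # Greedy choice: try the neighbour nearest the destination first.
        options = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        options.sort(key=lambda p: abs(p[0] - dst[0]) + abs(p[1] - dst[1]))
        for nx, ny in options:
            free = (nx, ny) not in used and (nx, ny) not in visited
            if 0 <= nx < width and 0 <= ny < height and free:
                visited.add((nx, ny))
                path.append((nx, ny))
                break
        else:
            path.pop()  # dead end: backtrack to the previous switch matrix
    return None
```

Because visited matrices are never retried, the search always terminates, but the path it finds is not necessarily the shortest one.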

29
Bitfile Creation
  • Combines placed-and-routed hardware description with
    DMA configuration into bitfile
  • Used to initialize the configurable logic

30
Binary Modification
  • Updates the application binary in order to
    utilize the new hardware
  • Loop replaced with jump to hw initialization code
  • Wisconsin Architectural Research Tool Set (WARTS)
  • EEL (Executable Editing Library)
  • We assume memory is RAM or programmable ROM

loop:
  Load r2, 0(r1)
  Add r1, r1, 1
  Add r3, r3, r2
  Blt r1, 8, loop
after_loop:
  ..
  • hw_init
  • Initialize HW registers
  • Enable HW
  • Shutdown processor
  • Woken up by HW interrupt
  • Store any results
  • Jump to after_loop

31
Tool Statistics
  • Executed on SimpleScalar
  • Similar to a MIPS instruction set
  • Used 60 MHz clock (like Triscend A7 device)
  • Statistics
  • Total run time of only 1.09 seconds
  • Requires less than ½ megabyte of RAM
  • Code size much smaller than standard synthesis
    tools

32
Experiments
  • Benchmark Information
  • Powerstone (Brev, g3fax12)
  • NetBench (url)
  • Logic minimization kernel (logmin)
  • Statistics
  • 55% of total time spent in loops that are moved
    to hardware
  • Ideal speedup of 2.8
  • These loops were only 2.4% of the size of the
    original application

33
Experiments
  • Results
  • Achieved average speedup of 2.6, close to ideal
    2.8
  • Hardware loops were 20X faster than software
    loops
  • Even with simple architecture and tools, large
    speedups were achieved
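
The relation between loop speedup and overall speedup follows Amdahl's law. The sketch below assumes "ideal speedup" means the moved loops take zero time. The per-benchmark loop fractions are not given in the slides, so the two fractions in the example are hypothetical.

```python
def overall_speedup(loop_fraction, loop_speedup):
    """Amdahl's law: only loop_fraction of the run time is accelerated."""
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)

def ideal_speedup(loop_fraction):
    """Upper bound: the accelerated loops take no time at all."""
    return 1.0 / (1.0 - loop_fraction)

# Hypothetical per-benchmark loop fractions whose mean is 55%.
fractions = [0.30, 0.80]
mean_of_ideals = sum(ideal_speedup(f) for f in fractions) / len(fractions)

print(ideal_speedup(0.55))        # bound for the averaged fraction
print(mean_of_ideals)             # average of the per-benchmark bounds
print(overall_speedup(0.55, 20))  # with loops running 20X faster
```

Applying the formula to the averaged 55% gives a bound of about 2.2, below the slide's 2.8. One possible explanation is that 2.8 is an average of per-benchmark bounds, which can exceed the bound computed from the averaged fraction, as the hypothetical fractions show.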

34
Conclusion
  • Dynamic hardware/software partitioning has
    advantages over other partitioning approaches
  • Completely transparent
  • Designers get performance/energy benefits of
    hw/sw partitioning by simply writing software
  • Quality likely not as good as desktop CAD for
    some applications, so most suitable when
    transparency is critical (very often!)
  • Achieved average speedup of 2.6
  • Very close to ideal speedup of 2.8
  • Future work
  • More complex configurable logic fabric
  • Designed in close conjunction with on-chip CAD
    tools
  • Sequential logic and increased inputs/outputs
  • Support larger hardware regions, not just simple
    loops
  • Improved algorithms (especially place and route)
  • Handle more complex memory access patterns