Title: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis
1 Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis
- Greg Stitt
- Department of Electrical and Computer Engineering
- University of Florida
2 Introduction
- Improved performance enables new applications
- Past decade: MP3 players, portable game consoles, cell phones, etc.
- Future architectures: speech/image recognition, self-guiding cars, computational biology, etc.
3 Introduction
- FPGAs (Field Programmable Gate Arrays): implement custom circuits
- Speedups of 10x, 100x, even 1000x for scientific and embedded apps
- [Najjar 04] [He, Lu, Sun 05] [Levine, Schmit 03] [Prasanna 06] [Stitt, Vahid 05], ...
- But FPGAs are not mainstream
- Warp Processing goal: bring FPGAs into the mainstream
- Make FPGAs invisible
[Figure: Performance of uP vs. FPGA; FPGAs capable of large performance improvements]
4 Introduction: Hardware/Software Partitioning
C Code for FIR Filter
for (i = 0; i < 16; i++) y[i] += c[i] * x[i];
.. .. ..
for (i = 0; i < 128; i++) y[i] += c[i] * x[i];
.. .. ..
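The loop on this slide is the kind of kernel a hardware/software partitioner would consider moving to the FPGA. A minimal runnable sketch of it in C follows. The function name `fir_accumulate`, the `int` element type, and the explicit length parameter are assumptions made for illustration; the slide shows only the loop itself.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply-accumulate kernel as written on the slide: y[i] += c[i] * x[i].
 * fir_accumulate is a hypothetical name; int data is an assumption. */
void fir_accumulate(int *y, const int *c, const int *x, size_t n)
{
    assert(y && c && x);
    for (size_t i = 0; i < n; i++) {
        /* No iteration reads a value written by another iteration,
         * which is what makes the loop attractive for a parallel circuit. */
        y[i] += c[i] * x[i];
    }
}
```

Calling `fir_accumulate(y, c, x, 16)` or `fir_accumulate(y, c, x, 128)` corresponds to the two loop bounds shown on the slide.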
5 Introduction: High-level Synthesis
- Problem: describing a circuit using an HDL is time consuming and difficult
- Solution: high-level synthesis
- Create circuit from high-level code
- [Gupta, DeMicheli 92] [Camposano, Wolf 91] [Rabaey 96] [Gajski, Dutt 92]
- Allows developers to use a higher-level specification
- Potentially enables synthesis for software developers
6 Introduction: High-level Synthesis
for (i = 0; i < 16; i++) y[i] += c[i] * x[i];
7 Problems with High-Level Synthesis
- Problem: high-level synthesis is unattractive to software developers
- Requires specialized language
- SystemC, NapaC, HandelC, ...
- Requires specialized compiler
- Spark, ROCCC, CatapultC, ...
- Limited commercial success
- Software developers reluctant to change tools
[Figure: uP and FPGA]
8 Warp Processing: Invisible Synthesis
- Solution: make synthesis invisible
- 2 requirements
- Standard software tool flow
- Perform compilation before synthesis
- Hide synthesis tool
- Move synthesis on chip
- Similar to dynamic binary translation
- Transmeta
- But, translate to hw
9 Warp Processing: Invisible Synthesis
Warp processor looks like a standard uP but invisibly synthesizes hardware
[Figure: uP and FPGA]
10 Warp Processing: Invisible Synthesis
- Advantages
- Supports all languages, compilers, IDEs
- Supports synthesis of assembly code
- Supports synthesis of library code
- Also enables dynamic optimizations
11 Warp Processing Background: Basic Idea
Step 1: Initially, software binary loaded into instruction memory
[Figure: warp processor with Profiler, I Mem, µP, D, FPGA, and On-chip CAD]
12 Warp Processing Background: Basic Idea
Step 2: Microprocessor executes instructions in software binary
13 Warp Processing Background: Basic Idea
Step 3: Profiler monitors instructions and detects critical regions in binary
[Figure: Profiler and µP highlighted; instruction stream of repeated add and beq instructions]
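One plausible way to detect a critical region is to count how often each backward branch is taken, since a frequently taken backward branch marks a hot loop. The sketch below assumes that approach; the slides do not say how the profiler works internally. The names `profiler`, `loop_counter`, `profiler_observe`, and `profiler_hottest`, and the table size of 16, are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LOOPS 16  /* assumed size of the profiler's counter table */

typedef struct {
    uint32_t target;  /* loop start address (branch target) */
    uint32_t end;     /* address of the backward branch */
    uint32_t count;   /* number of times the backward branch was taken */
} loop_counter;

typedef struct {
    loop_counter table[MAX_LOOPS];
    size_t used;
} profiler;

/* Record one taken branch. Only backward branches (target <= pc)
 * are treated as loop iterations; forward branches are ignored. */
void profiler_observe(profiler *p, uint32_t pc, uint32_t target)
{
    assert(p);
    if (target > pc) return;
    for (size_t i = 0; i < p->used; i++) {
        if (p->table[i].end == pc && p->table[i].target == target) {
            p->table[i].count++;
            return;
        }
    }
    if (p->used < MAX_LOOPS) {  /* new loop; dropped if the table is full */
        p->table[p->used++] = (loop_counter){ target, pc, 1 };
    }
}

/* Return the most frequently iterated loop, or NULL if none was seen. */
const loop_counter *profiler_hottest(const profiler *p)
{
    assert(p);
    const loop_counter *best = NULL;
    for (size_t i = 0; i < p->used; i++) {
        if (!best || p->table[i].count > best->count) best = &p->table[i];
    }
    return best;
}
```

The address range from `target` to `end` of the hottest entry would then be the region handed to the on-chip CAD tools.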
14 Warp Processing Background: Basic Idea
Step 4: On-chip CAD reads in critical region
15 Warp Processing Background: Basic Idea
Step 5: On-chip CAD converts critical region into control data flow graph (CDFG)
[Figure: On-chip CAD labeled as Dynamic Part. Module (DPM)]
16 Warp Processing Background: Basic Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
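To see why a data flow graph leads to a parallel circuit, the sketch below represents operations as graph nodes and assigns each one the earliest level at which its operands are ready (ASAP scheduling). ASAP scheduling is a standard synthesis technique used here for illustration; the slides do not state which scheduling method the on-chip CAD uses. The types `dfg_op` and `dfg_node` and the function `dfg_asap` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_INPUT, OP_ADD, OP_MUL } dfg_op;

typedef struct {
    dfg_op op;
    int lhs, rhs;  /* indices of operand nodes; unused (-1) for inputs */
} dfg_node;

/* ASAP scheduling: inputs sit at level 0 and every operation is placed
 * one level after its latest operand. Nodes must be given in topological
 * order (operands before users). Writes one level per node and returns
 * the depth of the graph, i.e. the number of operation levels. */
int dfg_asap(const dfg_node *nodes, size_t n, int *level)
{
    assert(nodes && level);
    int depth = 0;
    for (size_t i = 0; i < n; i++) {
        if (nodes[i].op == OP_INPUT) {
            level[i] = 0;
            continue;
        }
        assert(nodes[i].lhs >= 0 && (size_t)nodes[i].lhs < i);
        assert(nodes[i].rhs >= 0 && (size_t)nodes[i].rhs < i);
        int a = level[nodes[i].lhs];
        int b = level[nodes[i].rhs];
        level[i] = 1 + (a > b ? a : b);
        if (level[i] > depth) depth = level[i];
    }
    return depth;
}
```

For two iterations of `y[i] += c[i] * x[i]`, both multiplies land on level 1 and both adds on level 2, so four operations fit in two levels. Operations on the same level can be built as separate hardware units that run at the same time.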
17 Warp Processing Background: Basic Idea
Step 7: On-chip CAD maps circuit onto FPGA
18 Warp Processing Background: Basic Idea
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to "warp" by an order of magnitude or more
Mov reg3, 0
Mov reg4, 0
loop:
// instructions that interact with FPGA
Ret reg4
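The binary update can be pictured as overwriting the start of the critical region with a jump to code that talks to the FPGA. The sketch below models instruction memory as an array of 32-bit words. The `OPCODE_JUMP` encoding, the function name `patch_region`, and the idea of a separate stub address are assumptions for illustration only; the slides do not describe the actual instruction set or patching scheme.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OPCODE_JUMP 0xE0000000u  /* hypothetical encoding of "jump to address" */

/* Overwrite the first instruction of the critical region with a jump to a
 * stub that interacts with the FPGA. Returns the displaced instruction so
 * the caller can restore the original software loop later. */
uint32_t patch_region(uint32_t *imem, size_t imem_words,
                      size_t region_start, size_t stub_addr)
{
    assert(imem);
    assert(region_start < imem_words && stub_addr < imem_words);
    assert(stub_addr <= 0x0FFFFFFFu);  /* must fit in the assumed address field */
    uint32_t old = imem[region_start];
    imem[region_start] = OPCODE_JUMP | (uint32_t)stub_addr;
    return old;
}
```

Keeping the displaced word makes the patch reversible, which matters if the hardware version ever has to be removed.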
19 Warp Processing Background: Basic Technology
- Challenge: CAD tools normally require powerful workstations
- Develop extremely efficient on-chip CAD tools
- Requires efficient synthesis
- Requires specialized FPGA, physical design tools (JIT FPGA compilation)
- [Lysecky FCCM05/DAC04], University of Arizona
[Figure: JIT FPGA compilation; 46x improvement, 30% perf. penalty]
20 Warp Processing: Initial Results
- Embedded applications
- Average speedup of 6.3x
- Achieved completely transparently
- Also, energy savings of 66%
21 Thread Warping: Overview
for (i = 0; i < 10; i++) thread_create( f, i );
- Multi-core platforms → multi-threaded apps
- OS schedules threads onto available µPs
- Remaining threads added to queue
- OS invokes on-chip CAD tools to create accelerators for f()
- Thread warping: use one core to create accelerator for waiting threads
- OS schedules threads onto accelerators (possibly dozens), in addition to µPs
- Very large speedups possible: parallelism at bit, arithmetic, and now thread level too
[Figure labels: Compiler, Binary, f(), OS, µP, FPGA, Performance]
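The first scheduling step on this slide, where threads fill the available µPs and the rest wait in a queue, can be sketched as below. The function name `schedule_threads` and the convention of marking a queued thread with -1 are hypothetical; the slides do not specify the OS scheduling policy.

```c
#include <assert.h>
#include <stddef.h>

/* Assign threads 0..num_threads-1 to cores in creation order. Threads that
 * do not fit are queued; these are the waiting threads for which an
 * accelerator could be synthesized. core_of[t] receives the core index,
 * or -1 if thread t is queued. Returns the number of queued threads. */
size_t schedule_threads(size_t num_threads, size_t num_cores, int *core_of)
{
    assert(core_of);
    size_t queued = 0;
    for (size_t t = 0; t < num_threads; t++) {
        if (t < num_cores) {
            core_of[t] = (int)t;
        } else {
            core_of[t] = -1;
            queued++;
        }
    }
    return queued;
}
```

With the slide's ten `thread_create` calls and three µPs, seven threads would wait in the queue under this policy.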
22 Speedup from Thread Warping
- But FPGA uses additional area
- So we also compare to systems with 8 to 64 ARM11 uPs
- FPGA size: 36 ARM11s
- 11x faster than 64-core system
- Simulation pessimistic, actual results likely better
23 Dynamic Enables Custom Communication
- Problem: best topology is application dependent
- NoC (Network on a Chip) provides communication between multiple cores
[Figure: App1 and App2 on µPs, each shown with Bus and Mesh topologies]
24 Dynamic Enables Custom Communication
[Figure: App1 and App2 on FPGA, each shown with Bus and Mesh topologies]
Warp processing can dynamically choose topology
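Choosing a topology at run time needs some estimate of how each option would perform for the observed traffic. The sketch below uses a deliberately simple cost model that is an assumption, not something stated in the slides: on a bus every transfer serializes, while on a mesh transfers overlap and the slowest core pair dominates. The names `topology` and `choose_topology` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { TOPO_BUS, TOPO_MESH } topology;

/* traffic[i * n + j] = words sent from core i to core j.
 * Cores are laid out row-major on a mesh that is `width` cores wide.
 * Toy cost model (an assumption, not from the slides):
 *   bus : all transfers serialize, one time unit per word
 *   mesh: transfers overlap; cost is the slowest pair, words * hop count */
topology choose_topology(const unsigned *traffic, size_t n, size_t width)
{
    assert(traffic && n > 0 && width > 0);
    unsigned long bus_cost = 0, mesh_cost = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            unsigned words = traffic[i * n + j];
            long dx = (long)(i % width) - (long)(j % width);
            long dy = (long)(i / width) - (long)(j / width);
            unsigned long hops = (unsigned long)(labs(dx) + labs(dy));
            bus_cost += words;
            if (words * hops > mesh_cost) mesh_cost = words * hops;
        }
    }
    return mesh_cost < bus_cost ? TOPO_MESH : TOPO_BUS;
}
```

Under this model, an application with many independent neighbor-to-neighbor transfers favors the mesh, while one dominated by a single long-distance transfer favors the bus, which matches the slide's point that the best topology depends on the application.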
25 Summary
- Warp processors
- Achieve performance advantages of FPGAs without any extra effort
- Invisible synthesis
- Allows designers to use existing tools/languages
- Enables dynamic hardware optimization
- Thread warping
- Dynamic synthesis of thread accelerators for multi-cores
- Custom communication
- Warp processing can adapt communication topology to the needs of an application or a particular workload
26 References
- Patent
- Warp Processor for Dynamic Hardware/Software Partitioning. F. Vahid, R. Lysecky, G. Stitt. Patent Pending, 2004.
- Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2002, pp. 164-170.
- Warp Processors. R. Lysecky, G. Stitt, and F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006, Volume 11, Number 3, pp. 659-681.
- Binary Synthesis. G. Stitt and F. Vahid. Accepted for publication in ACM Transactions on Design Automation of Electronic Systems (TODAES).
- Expandable Logic. G. Stitt, F. Vahid. Submitted to IEEE/ACM Conference on Design Automation (DAC), 2007.
- New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2005, pp. 547-554.
- Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005, pp. 285-290.
- A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. IEEE/ACM Design Automation and Test in Europe (DATE), 2005, pp. 396-397.
- Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky and F. Vahid. IEEE/ACM Conference on Design Automation (DAC), 2003, pp. 250-255.