Title: Integrating Post-programmability Into the High-level Synthesis Equation
1Integrating Post-programmability Into the
High-level Synthesis Equation
- Scott Mahlke
- Advanced Computer Architecture Laboratory
- University of Michigan
- Ann Arbor, MI USA
- This is work done by Kevin Fan and Manjunath
Kudlur at UM
2Application Engines Differentiate Consumer SoCs
Slide Courtesy of Synfora
3The HLS Equation
- What about programmability?
- How to deal with application changes?
- Time to market
4Substrate Determines Programmability
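[Figure: flexibility versus area or power efficiency across implementation substrates, spanning a factor of 100-1000. Substrates shown: embedded processor; DSP (e.g. TI 320CXX); reconfigurable processors (Maia); embedded FPGA; direct mapped hardware. Efficiency labels: 0.5-5 MOPS/mW, 10-100 MOPS/mW, 100-1000 MOPS/mW.]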
5How Much Programmability?
Just Enough!
6StreamRoller Approach
[Figure: an application broken into Loop 1, a "Frame Type?" decision, Loop 2, Loop 3, Loop 4, and Block 5.]
7LA Programmability Shortcomings
8Programmable Loop Accelerator
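[Figure: programmable loop accelerator datapath. Labeled elements: CRF, literals, point-to-point connections, bus, control memory, local memory, function units (including MEM and BR), control signals, and four RR blocks.]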
9Mapping New Loops onto a PLA
[Figure: mapping flow. Inputs: loop and machine description. Stages: move insertion, SMT scheduling, register allocation, control signals, plus an "Increment II" step.]
- Large search space, few solutions
- Op-centric approaches unable to find solutions
- Satisfiability Modulo Theories (SMT) formulation
solves linear and SAT constraints simultaneously
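
The "Increment II" step in the mapping flow can be illustrated with a small sketch. This is not the SMT formulation from the talk: it swaps in a brute-force search over start times so the example stays self-contained. The op names, latencies, FU counts, and the bounded search window are all hypothetical, and move insertion and register allocation are left out.

```python
from itertools import product

def resource_min_ii(ops, fu_count):
    """Resource lower bound on II: ceil(ops per FU type / units of that type)."""
    uses = {}
    for fu in ops.values():
        uses[fu] = uses.get(fu, 0) + 1
    return max(-(-n // fu_count[fu]) for fu, n in uses.items())

def try_schedule(ops, deps, fu_count, ii):
    """Brute-force stand-in for the solver: return start times or None."""
    names = list(ops)
    # Assumed search window: total latency plus one II worth of slack.
    horizon = sum(lat for _, _, lat, _ in deps) + ii
    for times in product(range(horizon), repeat=len(names)):
        t = dict(zip(names, times))
        # Dependence constraint: dst (dist iterations later) waits for src.
        if any(t[dst] + ii * dist < t[src] + lat for src, dst, lat, dist in deps):
            continue
        # Resource constraint: ops in the same slot (time mod II) share FUs.
        usage = {}
        for n in names:
            key = (ops[n], t[n] % ii)
            usage[key] = usage.get(key, 0) + 1
        if all(count <= fu_count[fu] for (fu, _), count in usage.items()):
            return t
    return None

def map_loop(ops, deps, fu_count, max_ii=8):
    """Start at the resource bound and increment II until a schedule exists."""
    ii = resource_min_ii(ops, fu_count)
    while ii <= max_ii:
        sched = try_schedule(ops, deps, fu_count, ii)
        if sched is not None:
            return ii, sched
        ii += 1  # the "Increment II" step
    return None

# Hypothetical loop body: op name -> FU type it needs.
OPS = {"ld": "MEM", "add": "ALU", "st": "MEM"}
# (src, dst, latency, iteration distance); the last entry is loop-carried.
DEPS = [("ld", "add", 2, 0), ("add", "st", 1, 0), ("st", "ld", 1, 1)]
FUS = {"MEM": 1, "ALU": 1}

print(map_loop(OPS, DEPS, FUS))
```

For this toy loop the resource bound is II = 2, but the loop-carried dependence rules out II = 2 and II = 3, so the driver settles on II = 4. On a real PLA the feasible region is far smaller because of the fixed interconnect and register constraints, which is the motivation the slide gives for handing the problem to an SMT solver.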
10Area Comparison 130nm Library
LA: single-function accelerator; PLA: programmable accelerator; OR1K: OR-1200 processor
11Power Comparison
1.0 = power of the single-function LA; OR1K-equiv = performance-equivalent processor
12Efficiency Comparison
[Chart: efficiency comparison, with levels marked at 200 MIPS/mW, 20 MIPS/mW, and 2 MIPS/mW.]
13Programmability Assessment
Number of algorithm perturbations tolerated while
maintaining the same performance
14Final Thoughts
- Programmability not an all or nothing issue
- Application accelerators need to be able to evolve
- HLS targeted design generalizations yield a highly customized, but semi-programmable ASIC
- Bottom line tradeoffs
  - PLA vs OR-1200: 4-34x more power efficient, 30x smaller
  - PLA vs ASIC: 2-9x worse power, 2x larger
- Cost breakdown
  - Addressable register storage and generalized FUs most costly
  - Interconnect extensions less costly
15For More Information
- Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability, K. Fan, H. Park, M. Kudlur, and S. Mahlke, Proc. 2008 International Symposium on Code Generation and Optimization, Apr. 2008, pp. 124-133.
- Orchestrating the Execution of Stream Programs on Multicore Platforms, M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun. 2008.
- http://cccp.eecs.umich.edu