Title: Processor Acceleration Through Automated Instruction Set Customization
1Processor Acceleration Through Automated
Instruction Set Customization
- Nathan Clark, Hongtao Zhong, Scott Mahlke
- Advanced Computer Architecture Lab
- University of Michigan, Ann Arbor
- December 3, 2003
2Motivation
- Cell phones, PDAs, digital cameras, etc. are everywhere
- High-performance yet low-power design point
- General core + ASIC solution
- Limited post-programmability
- General core + application-specific instructions (CFUs)
[Figure: blocks labeled CPU and CFU]
3What is a CFU?
- Combine multiple primitive operations
- Smaller code size, fewer RF reads
- Increases performance
[Figure: example with a group of operations labeled "CFU 1"]
4Automation is Key
- This is ¼ of the DFG for a single basic block of Blowfish
[Figure: DFG excerpt; visible node labels include 159 XOR, 164 SHR, 173 AND]
5Related Work
- Tensilica Xtensa
- Commercial example
- MIPS core manually constructed CFU
- Automatic instruction set synthesis is a mature field
- See paper for comparison of techniques
- Our contributions
- Novel technique for automatic CFU creation
- System to utilize CFUs in multiple applications
- Analysis of how effectively CFUs for one
application apply to other applications in the
same domain
6System Overview
- Synthesis
- Subgraph identification
- Discover candidates for CFUs
- Weed out what shouldn't be picked
- Selection
- Determine which candidates to use as CFUs
- Compilation
- Subgraph replacement
- Make use of the CFUs in a range of applications
7Subgraph Identification
- Grow subgraphs from seed nodes
- All nodes are seeds
- Most directions don't make sense
- How to decide where to grow?
- Make decisions using the same kinds of factors an architect would weigh
- Take 4 factors into consideration
- Criticality, Latency, Area, Input/Output
9Subgraph Identification
- The sum of these four factors determines the value of each direction
- This step is NOT picking CFUs; it only produces CFU candidates
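The guided growth described on these slides can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the adjacency-list graph encoding, the `score` callback, the `threshold` cutoff, and the `max_size` limit are all hypothetical names introduced here. The slides only state that every node is a seed and that the summed factors give each growth direction a value.

```python
from typing import Callable, Dict, FrozenSet, List, Set

Graph = Dict[str, List[str]]  # node -> neighbouring nodes (hypothetical DFG encoding)


def grow_candidates(
    graph: Graph,
    score: Callable[[FrozenSet[str], str], float],
    threshold: float,
    max_size: int,
) -> Set[FrozenSet[str]]:
    """Grow subgraphs from every seed node, following only promising directions."""
    candidates: Set[FrozenSet[str]] = set()
    worklist = [frozenset([node]) for node in graph]  # all nodes are seeds
    while worklist:
        sub = worklist.pop()
        if sub in candidates:
            continue  # already explored this subgraph
        candidates.add(sub)
        if len(sub) >= max_size:
            continue  # assumed external size limit
        # A "direction" is a neighbouring node that is not yet in the subgraph.
        frontier = {m for n in sub for m in graph[n] if m not in sub}
        for nxt in frontier:
            # Directions whose value falls below the cutoff are never explored,
            # which is what keeps the search from enumerating every subgraph.
            if score(sub, nxt) >= threshold:
                worklist.append(sub | {nxt})
    return candidates


# Tiny three-node chain a - b - c with a made-up scoring function.
demo_graph: Graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
demo = grow_candidates(
    demo_graph,
    score=lambda sub, nxt: 10.0 if nxt == "c" else 40.0,
    threshold=30.0,
    max_size=3,
)
print(sorted(sorted(s) for s in demo))
```

In this toy run, growth toward node `c` is always rejected, so no subgraph is ever extended in that direction; subgraphs containing `c` arise only from the seed at `c` itself.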
10Critical Path
- Combining operations on the critical path will shrink the longer dependence chains
- Maximize potential performance gain
- Slack is the number of cycles an operation is off the longest dependence path
- Wt: worked examples from the figure
10 / (0 + 1) = 10
10 / (2 + 1) = 3.33
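The two worked values on this slide are consistent with a weight of 10 / (slack + 1). The sketch below encodes that reading; the formula is inferred from the examples rather than stated in the surviving text, and the function name and `scale` parameter are assumptions made for illustration.

```python
def criticality_weight(slack_cycles: int, scale: float = 10.0) -> float:
    """Weight inferred from the slide's examples: scale / (slack + 1)."""
    if slack_cycles < 0:
        raise ValueError("slack cannot be negative")
    return scale / (slack_cycles + 1)


print(round(criticality_weight(0), 2))  # operation on the critical path
print(round(criticality_weight(2), 2))  # operation two cycles off the critical path
```

Under this reading, an operation on the critical path gets the full weight and the weight falls off quickly as slack increases.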
11Latency
- Growing toward low latency operations allows combination of more nodes in a cycle
- Maximize DFG compression
- Wt: worked examples from the figure
10 * 0.3 / 0.36 = 8.33
10 * 0.3 / 0.6 = 5
Opcode          Area   Cycles
(symbol lost)   1.00   0.30
(symbol lost)   0.12   0.06
<<, >>          0.01   0.00
(symbol lost)   0.16   0.09
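One way to read the two worked values: the subgraph currently takes 0.30 of a cycle, and growing it by an operation that takes 0.06 or 0.30 of a cycle gives 0.36 or 0.60. The sketch below encodes that interpretation. It is an assumption drawn from the numbers alone; the function name, its parameters, and the idea that latencies simply add are not confirmed by the slide.

```python
def latency_weight(current_latency: float, added_latency: float,
                   scale: float = 10.0) -> float:
    """scale * (latency before growth) / (latency after growth)."""
    grown = current_latency + added_latency  # assumes the new op is in series
    if grown <= 0:
        raise ValueError("grown subgraph must have positive latency")
    return scale * current_latency / grown


print(round(latency_weight(0.30, 0.06), 2))  # grow toward a 0.06-cycle operation
print(round(latency_weight(0.30, 0.30), 2))  # grow toward a 0.30-cycle operation
```

The cheaper the added operation, the closer the weight stays to its maximum, which matches the stated goal of packing more nodes into a cycle.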
12Area
- Want the most benefit for the least area
- Area is the sum of macrocell areas
- Wt: worked examples from the figure
10 * 0.5 / 0.5 = 10
10 * 0.5 / 1.5 = 3.33
Opcode          Area   Cycles
(symbol lost)   1.00   0.30
(symbol lost)   0.12   0.06
<<, >>          0.01   0.00
(symbol lost)   0.16   0.09
13Input/Output
- Want CFUs to use as few RF ports as possible
- Smaller encoding
- Allow growth of larger candidates
- Wt: worked examples from the figure
10 * 2 / (4 + 1) = 4
10 * 2 / (2 + 1) = 6.67
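Since the value of a growth direction is described as the sum of the four factor weights, the combination step can be sketched as below. The class and field names are hypothetical, and the inputs are simply the worked numbers from the four factor slides; the slides do not say that these particular values belong to the same edge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectionFactors:
    """The four per-direction weights named on the slides."""
    criticality: float
    latency: float
    area: float
    io: float

    def value(self) -> float:
        # The slides state that the sum of the factors gives the direction's value.
        return self.criticality + self.latency + self.area + self.io


# First and second worked example from each factor slide, respectively.
higher = DirectionFactors(criticality=10.0, latency=8.33, area=10.0, io=6.67)
lower = DirectionFactors(criticality=3.33, latency=5.0, area=3.33, io=4.0)
print(round(higher.value(), 2), round(lower.value(), 2))
```

With each weight capped at 10 in the worked examples, a direction's value under this reading cannot exceed 40.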
14Example
[Figure sequence, slides 14-23: step-by-step growth of a candidate subgraph over a DFG that includes << and >> operations. Values shown on slide 14: 28.5, 35, 30.8, 37.5, 28.5, 37.5. Slide 15: 28.5, 35, 30.8, 28.5, 40, 33.5. Slide 16: 28.5, 35, 30.8, 28.5, 36, 36. Slides 17-23 carry no recoverable text.]
24Finished: Met External Constraints
25Set of Candidates
[Figure: the resulting set of candidate subgraphs]
26Avoids Exponential Explosion
27Greedy Selection Heuristic
- Use estimates of performance improvement / cost
Initial estimates:
Subgraph Number   Value   Cost   Ops
1                 20      4      (3,4),(6,8)
2                 6       1      (1,3,7)
N                 9       5      (1,7)
Updated estimates:
Subgraph Number   Value   Cost   Ops
1                 10      4      (6,8)
2                 6       1      (1,3,7)
N                 0       5      (none)
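One plausible reading of the two tables is that subgraph 2 is picked first because it has the best value-to-cost ratio, after which every occurrence that shares an operation with it stops counting toward the other subgraphs' values. The sketch below implements that reading. The `Candidate` type, the even split of value across occurrences, and the `budget` parameter are assumptions made for illustration, not details given on the slide.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Candidate:
    name: str
    value_per_use: float         # assumed gain for each place the subgraph occurs
    cost: float                  # estimated hardware cost (must be positive)
    uses: List[Tuple[int, ...]]  # operation ids covered by each occurrence


def live_uses(c: Candidate, covered: Set[int]) -> List[Tuple[int, ...]]:
    """Occurrences whose operations were not claimed by an earlier pick."""
    return [u for u in c.uses if covered.isdisjoint(u)]


def current_value(c: Candidate, covered: Set[int]) -> float:
    return c.value_per_use * len(live_uses(c, covered))


def greedy_select(cands: List[Candidate], budget: float) -> List[str]:
    """Repeatedly take the affordable candidate with the best value/cost ratio."""
    covered: Set[int] = set()
    chosen: List[str] = []
    remaining = list(cands)
    while remaining:
        affordable = [c for c in remaining
                      if c.cost <= budget and current_value(c, covered) > 0]
        if not affordable:
            break
        best = max(affordable, key=lambda c: current_value(c, covered) / c.cost)
        for use in live_uses(best, covered):
            covered.update(use)  # these operations now belong to `best`
        chosen.append(best.name)
        budget -= best.cost
        remaining.remove(best)
    return chosen


table = [
    Candidate("1", 10.0, 4.0, [(3, 4), (6, 8)]),
    Candidate("2", 6.0, 1.0, [(1, 3, 7)]),
    Candidate("N", 9.0, 5.0, [(1, 7)]),
]
print(greedy_select(table, budget=5.0))
```

With operations 1, 3, and 7 claimed by subgraph 2, this model gives subgraph 1 a remaining value of 10 and subgraph N a value of 0, matching the updated table.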
28Compiler Replacement
- Multiple applications can utilize CFUs
- VFLib pattern matcher [Cor 99]
[Figure: blocks labeled Instruction Synthesis, CFU Description, and Compiler]
29Experimental Setup
- Implemented in the Trimaran toolset
- Baseline machine: 1 Int, 1 Flt, 1 Br, 1 Mem per cycle
- CFUs use Int issue slot
- CFU latency/area generated as sum of each individual macrocell
- Pipeline latches were added if CFU latency > 1 clock cycle
- 300 MHz clock assumed
- No branch or memory instructions in CFUs
- Four application domains tested
- Audio, Encryption, Image, Network
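The latency and area estimate described in this setup can be sketched as a pair of sums. This is a rough illustration only: the tuple encoding of a macrocell, the treatment of all macrocells as one serial chain, and the rule that one pipeline stage is needed per started clock cycle are assumptions introduced here.

```python
import math
from typing import Iterable, NamedTuple, Tuple

CLOCK_MHZ = 300.0  # clock assumed in the experimental setup


class CfuEstimate(NamedTuple):
    area: float        # sum of macrocell areas, in the table's relative units
    cycles: float      # sum of macrocell latencies, as a fraction of a cycle
    stages: int        # pipeline stages needed (latches go between stages)
    latency_ns: float  # combinational latency at the assumed clock


def estimate_cfu(macrocells: Iterable[Tuple[float, float]]) -> CfuEstimate:
    """Each macrocell is an (area, cycles) pair for one primitive operation."""
    cells = list(macrocells)
    area = sum(a for a, _ in cells)
    cycles = sum(c for _, c in cells)
    stages = max(1, math.ceil(cycles))         # more than one cycle needs latches
    latency_ns = cycles * 1000.0 / CLOCK_MHZ   # one cycle is 1000 / MHz nanoseconds
    return CfuEstimate(area, cycles, stages, latency_ns)


# Four chained operations, each with area 1.00 and latency 0.30 cycles.
est = estimate_cfu([(1.00, 0.30)] * 4)
print(est.stages, round(est.area, 2), round(est.latency_ns, 2))
```

Summing every macrocell's latency is pessimistic for subgraphs with parallel branches; a longest-path calculation would be the natural refinement if the subgraph's shape is known.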
30Native Encryption Results
31Encryption Cross Compile
32Generalizing CFUs
Subsumed (Multiple Paths)
Wildcards (Multiple Nodes)
[Figure: example subgraphs with inputs IN_1 and IN_2, a >> operation, and constants 0x8, 0x0, and 0xF; the remaining operator symbols are not recoverable]
33Effects of Generalization
[Figure: chart of speedup]
34Conclusions
- Developed two-phase instruction set synthesis system
- Guide function removes bad candidates
- Greedy selection heuristic
- Substantial speedups can be attained with very little die impact
- Subsumed subgraphs and wildcarding increase cross-application effectiveness
Domain         Encryption   Network   Image   Audio
Ave. Speedup   1.61         1.38      1.16    1.66
35Questions?
http://cccp.eecs.umich.edu
36Backup slides
37Individual Factors - Blowfish
38Individual Factors - Djpeg
39Selection
- Uses estimates of performance improvement
- Greedy Heuristic used