1 A Distributed Control Path Architecture for VLIW Processors
- Hongtao Zhong, Kevin Fan, Scott Mahlke, and Michael Schlansker
- Advanced Computer Architecture Laboratory, University of Michigan
- HP Laboratories
2 Motivation
- VLIW Scaling Problem
- Centralized resource
- Highly ported structures
- Wire delays
[Figure: centralized VLIW with Instruction Fetch/Decode, Register File, and FUs]
3 Multicluster VLIW
- Distribute register files
- Cluster function units
- Distribute data caches
- Clusters communicate through an interconnection network
- Used in TI C6x, Lx/ST200, Analog Devices TigerSHARC
[Figure: Cluster 0 and Cluster 1, each with a Register File and FUs, joined by an interconnection network and fed by a single Instruction Fetch/Decode]
4 Control Path Scaling Problem
- Larger I-cache
- Latency
- Long wires for control signals distribution
- Code compression
- Hardware cost, power
- Grow quadratically with the number of FUs
[Figure: PC, I-cache holding operations A through G and X, align/shift network, and IR holding A, B, NOP, NOP]
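The code compression that the align/shift network supports can be sketched roughly as follows. This is a minimal illustration, not the encoding used by any particular processor: the `(slot, op, last)` tuple format and the names `compress` and `expand` are assumptions made for this example. NOPs are dropped from the stored stream, and the expansion step plays the role of the align/shift network by routing each stored operation back to its FU slot.

```python
from typing import List, Tuple

NOP = "NOP"

# Hypothetical compressed format: each stored op is (slot, name, last),
# where `last` marks the final op of a MultiOp. NOPs are normally not stored.
CompressedOp = Tuple[int, str, bool]


def compress(multiops: List[List[str]]) -> List[CompressedOp]:
    """Drop NOPs and tag each remaining op with its FU slot."""
    stream: List[CompressedOp] = []
    for word in multiops:
        ops = [(slot, op) for slot, op in enumerate(word) if op != NOP]
        if not ops:
            # An all-NOP MultiOp still needs one entry to mark the cycle.
            ops = [(0, NOP)]
        for i, (slot, op) in enumerate(ops):
            stream.append((slot, op, i == len(ops) - 1))
    return stream


def expand(stream: List[CompressedOp], width: int) -> List[List[str]]:
    """Model of the align/shift step: rebuild fixed-width MultiOps."""
    words: List[List[str]] = []
    current = [NOP] * width
    for slot, op, last in stream:
        current[slot] = op
        if last:
            words.append(current)
            current = [NOP] * width
    return words


code = [["A", "B", NOP, NOP], [NOP, NOP, "C", NOP], [NOP] * 4]
packed = compress(code)
restored = expand(packed, width=4)
```

In this model any stored operation may need to reach any slot, which gives some intuition for why the routing hardware becomes expensive as the number of FUs grows.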
5 Straightforward Approach
- Distribute I-fetch, similar in spirit to the distribution of the data path
- Local communication of controls
- Reduce latency, hardware cost, power
- Used in Multiflow Trace 14/300 processors
[Figure: distributed I-fetch with per-cluster I-cache, IR, Register Files, and FUs connected by an interconnection network; PC supplied to each I-cache]
6 DVLIW Approach
- Simple distribution has problems
- Doesn't support code compression
- PC still a centralized resource
[Figure: DVLIW with per-cluster I-cache, align/shift logic, IR, Register Files, and FUs on an interconnection network; the single PC is replaced by per-cluster PC0 and PC1]
7 DVLIW Execution Model
- Clusters execute in lock-step
- When one cluster stalls, all clusters stall
- Clusters collectively execute one thread
- Each cluster runs an instruction stream
- Compiler orchestrates the execution of streams
- Compiler manages communication
- Lightweight synchronization
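The lock-step model above can be sketched as a toy simulator. This is an illustration under assumptions: the `Cluster` class, the `run_lockstep` function, and the `stalls` set of `(cycle, cluster_id)` pairs are hypothetical names, and each stream is assumed to hold one entry per logical MultiOp so that all streams have the same length.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Cluster:
    stream: List[str]            # this cluster's own instruction stream
    pc: int = 0                  # this cluster's own program counter
    trace: List[str] = field(default_factory=list)


def run_lockstep(clusters: List[Cluster], stalls: Set[Tuple[int, int]]) -> int:
    """Run all clusters in lock-step and return the total cycle count.

    `stalls` lists (cycle, cluster_id) pairs in which a cluster stalls
    (hypothetical input); a stall in any cluster holds every cluster.
    """
    length = len(clusters[0].stream)
    assert all(len(c.stream) == length for c in clusters)
    cycle = 0
    while clusters[0].pc < length:
        stalled = any((cycle, i) in stalls for i in range(len(clusters)))
        if not stalled:
            for c in clusters:
                c.trace.append(c.stream[c.pc])
                c.pc += 1        # each cluster advances its own PC
        cycle += 1
    return cycle


c0 = Cluster(["add", "ld", "br"])
c1 = Cluster(["mul", "st", "br"])
cycles = run_lockstep([c0, c1], stalls={(1, 1)})
```

Because every cluster advances or waits together, the per-cluster PCs always point at parts of the same logical MultiOp, even though no PC is shared.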
8 DVLIW Benefits
- Completely decentralized architecture
- Distributed data path
- Distributed control path
- Supports arbitrary code compression
- Exploiting ILP on multi-core style system
- Good for embedded applications
- Low cost
- Compiler support
9 DVLIW Architecture
[Figure: DVLIW architecture with four VLIW clusters (0 to 3) linked to each other and to a banked L2; each cluster contains an L1 I-cache, PC and next-PC logic, align/shift logic, IR, register files, FUs, br_target, and an L1 D-cache]
10 Code Organization
- Code for each cluster is consecutive in memory
- Operations in the same MultiOp are stored in different memory locations
- Each cluster computes its own next PC
[Figure: code layout in Conventional VLIW (single PC) versus DVLIW (PC0, PC1)]
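The difference between the two layouts can be sketched as below. This is a simplified model under assumptions: operations are treated as unit-size memory entries, every cluster is assumed to have at least one operation in every MultiOp, and the function names are hypothetical.

```python
from typing import List, Tuple

# A MultiOp is modeled as one list of ops per cluster.
MultiOp = List[List[str]]


def conventional_layout(multiops: List[MultiOp]) -> List[str]:
    """Single PC: all clusters' ops of one MultiOp are adjacent in memory."""
    memory: List[str] = []
    for word in multiops:
        for cluster_ops in word:
            memory.extend(cluster_ops)
    return memory


def dvliw_layout(
    multiops: List[MultiOp], num_clusters: int
) -> Tuple[List[List[str]], List[List[int]]]:
    """One consecutive region per cluster.

    Returns the per-cluster regions and, for each cluster, the local
    address at which each MultiOp starts (the values its PC takes).
    """
    regions: List[List[str]] = [[] for _ in range(num_clusters)]
    starts: List[List[int]] = [[] for _ in range(num_clusters)]
    for word in multiops:
        for cid, cluster_ops in enumerate(word):
            starts[cid].append(len(regions[cid]))
            regions[cid].extend(cluster_ops)
    return regions, starts


program = [
    [["A", "B"], ["X"]],       # MultiOp 0: cluster 0 has A, B; cluster 1 has X
    [["C"], ["Y", "Z"]],       # MultiOp 1: cluster 0 has C; cluster 1 has Y, Z
]
flat = conventional_layout(program)
regions, starts = dvliw_layout(program, num_clusters=2)
```

In the DVLIW layout each cluster's next PC depends only on how many operations that cluster fetched, so PC0 and PC1 advance by different amounts in the same cycle.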
11 Branch Mechanism
- Maintain correct execution order
- All clusters transfer control at the same cycle
- All clusters branch to the same logical multiop
- Unbundled branch in HPL-PD
Branch
- PBR btr1, TARGET: each cluster specifies its own target
- CMPP pr0, (x100)?: broadcast to all clusters
- BR btr1, pr0: replicated in each cluster
12 Branch Handling Example
Conventional VLIW:
    pbr btr1, BB2
    cmpp pr0, (x100)?
    br btr1, pr0
DVLIW, Cluster 0:
    pbr btr1, BB2
    cmpp pr0, (x100)?
    bcast pr0
    br btr1, pr0
DVLIW, Cluster 1:
    pbr btr1, BB2
    .
    .
    br btr1, pr0
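The unbundled branch sequence can be sketched as a small model. This is an illustration only: `branch_step` and its parameters are hypothetical, the branch condition is passed in as an opaque callable because the comparison itself is not recoverable from the slide, and the addresses are made up.

```python
from typing import Callable, List


def branch_step(
    fallthrough: List[int],
    targets: List[int],
    condition: Callable[[], bool],
) -> List[int]:
    """Return the next PC of every cluster after one replicated branch.

    fallthrough[i]: cluster i's next sequential address
    targets[i]:     cluster i's local address of the same logical target block
    condition:      evaluated once, on a single cluster
    """
    btr = list(targets)                  # pbr: each cluster loads its own target
    pr0 = condition()                    # cmpp: computed on one cluster only
    received = [pr0] * len(targets)      # bcast: predicate sent to all clusters
    # br: replicated in each cluster, so all clusters decide in the same cycle
    return [
        btr[i] if received[i] else fallthrough[i]
        for i in range(len(targets))
    ]


taken = branch_step([0x10, 0x40], [0x80, 0x9C], lambda: True)
not_taken = branch_step([0x10, 0x40], [0x80, 0x9C], lambda: False)
```

The clusters end up at different addresses, but since each target is that cluster's copy of the same block, they all arrive at the same logical MultiOp.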
13 Sleep Mode
- Idle blocks after distribution
- Put cluster into sleep mode
- Compiler managed
- Save energy
- Reduce code size
- Mode change happens at block boundary
[Figure: block sequences for Cluster 0 and Cluster 1, with BR operations at block ends and SLEEP and WAKE operations]
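A rough way to see the effect of sleep mode is to count instruction fetches per cluster. The sketch below is an accounting model under assumptions: `fetch_counts`, `block_cycles`, and `active` are hypothetical names, a cluster is assumed to fetch once per cycle while awake, and the cost of the SLEEP and WAKE operations themselves is ignored.

```python
from typing import List, Tuple


def fetch_counts(
    block_cycles: List[int], active: List[List[bool]]
) -> Tuple[List[int], List[int]]:
    """Count fetch cycles per cluster without and with sleep mode.

    block_cycles[b]: cycles spent in block b
    active[c][b]:    True if cluster c has useful work in block b
    Mode changes happen only at block boundaries, so a cluster is
    either awake or asleep for a whole block.
    """
    without_sleep = [sum(block_cycles) for _ in active]
    with_sleep = [
        sum(cycles for cycles, busy in zip(block_cycles, row) if busy)
        for row in active
    ]
    return without_sleep, with_sleep


baseline, sleeping = fetch_counts(
    block_cycles=[4, 6, 3],
    active=[[True, True, True], [True, False, True]],
)
```

In this example the second cluster skips the six cycles of the block in which it has no work, which is where both the energy and the code size savings come from.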
14 Experimental Setup
- Trimaran toolset
- Processor configuration
- 4 clusters, 2 INT, 1 FP, 1 MEM, 1 BR per cluster
- 16K L1 I-cache total
- Perfect data cache assumed
- Power Model
- Verilog for instruction align/shift logic
- Wire model
- Cacti cache model
- 21 benchmarks from MediaBench and SPECINT2000
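The processor configuration above can be written down as a small data structure to make the total machine width explicit. The class and field names are hypothetical and do not correspond to any Trimaran configuration format; "16K" is read here as 16 KB, which is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ClusterConfig:
    int_units: int = 2
    fp_units: int = 1
    mem_units: int = 1
    br_units: int = 1

    @property
    def issue_width(self) -> int:
        """Number of function units in one cluster."""
        return self.int_units + self.fp_units + self.mem_units + self.br_units


@dataclass(frozen=True)
class MachineConfig:
    clusters: int = 4
    cluster: ClusterConfig = field(default_factory=ClusterConfig)
    l1_icache_total_kb: int = 16     # "16K" read as 16 KB (assumption)
    perfect_dcache: bool = True

    @property
    def total_fus(self) -> int:
        """Function units across the whole machine."""
        return self.clusters * self.cluster.issue_width


machine = MachineConfig()
```

With these numbers each cluster issues up to 5 operations per cycle and the whole machine up to 20.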
15 Change in Global Communication Bits
[Figure: change in global communication bits for MediaBench and SPECINT benchmarks]
16 Normalized Energy Consumption on Control Path
Control path energy = (align/shift logic energy) + (wire energy) + (I-cache energy)
40% saving
67% saving
80% saving
21% saving
17 Normalized Code Size
- Baseline: Conventional VLIW with compressed encoding
- Traditional method (single PC): 7x increase
- DVLIW: 40% increase
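As a quick consistency check, assume the figures above are normalized code sizes, so that the single-PC method is 7.0 times the baseline and DVLIW is 1.4 times the baseline. This reading is an assumption, not something the slide states. Under it, the ratio between the two distributed schemes is:

```latex
\frac{\text{single-PC code size}}{\text{DVLIW code size}}
  = \frac{7.0}{1.4}
  = 5
```

This matches the 5x code size reduction vs. simple distribution quoted in the result summary.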
18 Result Summary
- DVLIW benefits
- Order of magnitude reduction in global communication
- 40% savings in control path energy
- 5x code size reduction vs. simple distribution
- Small overhead for ILP execution on CMP
- 3% increase in execution cycles
- 4% increase in I-cache stalls
19 Conclusions
- DVLIW removes the last centralized resource in a multicluster VLIW
- Fully distributed control path
- Scalable architecture
- More energy efficient
- Stylized CMP architecture
- Exploit ILP
- Multiple instruction streams
- Compiler orchestrated
20 Thank You
- For more information
- http://cccp.eecs.umich.edu