Title: Core-Selectability in Chip-Multiprocessors
1. Core-Selectability in Chip-Multiprocessors
- Hashem H. Najaf-abadi
- Niket K. Choudhary
- Eric Rotenberg
2. Dividing the Design: A definition
- Processing cores
- Interconnection: all levels of cache, the interconnect, and ports to memory and I/O
3. What This Talk Is About
- How to improve the performance of a CMP by enabling exploitation of the full potential of the interconnection
  - The interconnect is not fully utilized by all workloads
- By improving the processing cores
  - If the interconnect is already fully utilized, there is nothing to gain here
4-5. The Provisioning Factor: Balance in provisioned resources
6. The Underutilization Factor: The interconnect is not fully utilized by all applications
- Workloads that depend most on the interconnect have a louder say in what constitutes a well-provisioned design
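To make the utilization argument concrete, here is a minimal Python sketch of a simple bottleneck model. The model itself is an assumption, not something stated in the talk: each workload generates interconnect traffic at some demand rate, the achieved rate is capped by the interconnect capacity, and utilization is achieved rate divided by capacity. The workload names and numbers are hypothetical.

```python
# Simple bottleneck model (an assumption, not from the talk): the bandwidth a
# workload achieves on the interconnect is capped by the interconnect capacity.

def utilization(demand_gbps: float, capacity_gbps: float) -> float:
    """Fraction of interconnect capacity a workload actually uses."""
    return min(demand_gbps, capacity_gbps) / capacity_gbps

# Hypothetical traffic demands (GB/s) and capacity, for illustration only.
CAPACITY_GBPS = 40.0
DEMANDS_GBPS = {"stream-like": 55.0, "mixed": 20.0, "compute-bound": 4.0}

for name, demand in DEMANDS_GBPS.items():
    print(f"{name:14s} {utilization(demand, CAPACITY_GBPS):.0%}")
```

In this toy setup only the stream-like workload saturates the interconnect, so its needs drive what "well-provisioned" means; the other two leave capacity idle.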
7. The One-Size-Fits-All Factor: A single solution has limited performance
- Changing these trade-offs will improve performance for some workloads and degrade it for others.
He's not much for conversation.
But if he were, it would be a conversation about saving you execution time.
8-10. The Shrinking Factor: Progressively less die area for the cores
- Better return on increasing the interconnection resources
[Figure: die area for the cores (y-axis 10 to 100) versus year (1990 to 2010), for the Intel 8086, 8088, 80286, Intel386, Intel 486DX, Pentium, Pentium III, Pentium IV, and Core Duo; the IBM Power3, Power4, Power5, and Power6; and Niagara-1 and Niagara-2]
11-12. The Diversity Factor: Diversity can be provided in the core designs
- Single core design: optimized for all workloads
- Heterogeneous cores: optimized per workload
13-14. Core-Selectability
- Core-Selectability: optimized for the workload
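The selection idea can be sketched as a hypothetical Python model: each interconnect port is shared by several core designs (here named after Core-U, Core-A, and Core-B from the evaluation), only one design per port is active at a time, and the active one is chosen per workload. The class and method names, the "lowest estimated execution time" policy, and the estimates themselves are illustrative assumptions; how such estimates would be obtained is outside this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

DESIGNS = ["Core-U", "Core-A", "Core-B"]  # core names from the evaluation

@dataclass
class PortSlot:
    """Hypothetical model of one interconnect port shared by several core designs.

    Only one design per slot is active at a time (an assumption of this
    sketch), so a slot runs at most one core no matter how many it contains.
    """
    designs: List[str]
    active: Optional[str] = None

    def select(self, est_time: Dict[str, float]) -> str:
        """Activate the design with the lowest estimated execution time."""
        self.active = min(self.designs, key=lambda d: est_time[d])
        return self.active

# Every slot holds the same set of designs, so the CMP stays homogeneous
# at the slot level.
cmp_slots = [PortSlot(list(DESIGNS)) for _ in range(4)]

# Hypothetical normalized execution-time estimates (illustrative only).
EST = {
    "workload-1": {"Core-U": 1.00, "Core-A": 0.92, "Core-B": 1.10},
    "workload-2": {"Core-U": 1.00, "Core-A": 1.15, "Core-B": 0.88},
}

# Assign one workload per slot and let each slot pick its best design.
for slot, (name, est) in zip(cmp_slots, EST.items()):
    print(name, "->", slot.select(est))
```

With these made-up estimates, workload-1 activates Core-A and workload-2 activates Core-B, while the remaining slots stay idle.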
15. Recap
Port sharing:
- can improve performance without increasing power density
- results in a homogeneous design
- can reduce verification effort by splitting up the workload space
16. Core-Selectability: Remains homogeneous at a high level
[Figure: CMP]
17. Empirical Evaluation: Based on FabScalar
- A library of synthesized implementations of different configurations of the microarchitectural units of a contemporary superscalar processor.
18. The Selection of Cores

Parameter          Core-U   Core-A   Core-B
Fetch stages       4        3        5
Decode stages      1        1        1
Retire stages      2        2        2
Issue width        3        2        5
ROB size           512      1024     512
Issue-window size  64       128      32
Clock period       0.6 ns   0.6 ns   0.6 ns
[Figure: normalized execution time]
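The core table can be encoded as data, for example to drive a simulator sweep. This is a minimal Python sketch; the field and function names are our own, and only the values come from the table. It also makes explicit that, at the shared 0.6 ns clock period (about 1.67 GHz), the three cores differ only in fetch depth, issue width, ROB size, and issue-window size.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class CoreConfig:
    name: str
    fetch_stages: int
    decode_stages: int
    retire_stages: int
    issue_width: int
    rob_size: int
    iwindow_size: int
    clock_period_ns: float

    @property
    def freq_ghz(self) -> float:
        """Clock frequency implied by the clock period (1 / ns = GHz)."""
        return 1.0 / self.clock_period_ns

# Values copied from the core-selection table.
CORES = [
    CoreConfig("Core-U", 4, 1, 2, 3, 512, 64, 0.6),
    CoreConfig("Core-A", 3, 1, 2, 2, 1024, 128, 0.6),
    CoreConfig("Core-B", 5, 1, 2, 5, 512, 32, 0.6),
]

def differing_params(cores):
    """Names of the parameters whose values differ across the given cores."""
    names = [f.name for f in fields(CoreConfig) if f.name != "name"]
    return [n for n in names if len({getattr(c, n) for c in cores}) > 1]

print(differing_params(CORES))
print(f"{CORES[0].freq_ghz:.2f} GHz")
```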
19. On Individual Benchmarks
[Figure: normalized execution time]
20. The Effect of Selectability
[Figure: normalized execution time]
21. Under Different Task Arrival Patterns
[Figure: average task turnaround time for (a) normal traffic and (b) bursty traffic]
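As a sketch of how average turnaround time is measured, and why bursty arrivals inflate it, here is a small Python model of first-come-first-served dispatch onto identical cores. It is not the simulator used in the talk: the dispatch policy, the identical cores, and the example arrival patterns are all assumptions.

```python
import heapq

def avg_turnaround(arrivals, service, num_cores):
    """Average turnaround (completion - arrival) under FCFS dispatch.

    Tasks are taken in arrival order (ties keep their input order) and each
    starts on whichever core becomes free first.
    """
    free_at = [0.0] * num_cores  # time at which each core next becomes free
    heapq.heapify(free_at)
    total = 0.0
    for arrive, work in sorted(zip(arrivals, service), key=lambda t: t[0]):
        start = max(arrive, heapq.heappop(free_at))
        finish = start + work
        heapq.heappush(free_at, finish)
        total += finish - arrive
    return total / len(arrivals)

# Same total work, different arrival patterns (hypothetical).
normal = [0, 1, 2, 3]
bursty = [0, 0, 0, 0]
print(avg_turnaround(normal, [1] * 4, 1))
print(avg_turnaround(bursty, [1] * 4, 1))
```

On one core, the evenly spaced tasks each finish one time unit after arriving (average 1.0), while the burst queues up behind itself (average 2.5); adding a second core brings the bursty average down to 1.5.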
22. Overhead of Reconfigurability

Issue-queue size   Wakeup delay   Select delay   Wakeup-select delay   Reconfig. delay
16                 0.55 ns        0.54 ns        1.09 ns               1.55 ns
32                 0.63 ns        0.59 ns        1.38 ns               1.89 ns
64                 0.67 ns        0.65 ns        1.62 ns               2.10 ns
128                0.82 ns        0.76 ns        2.00 ns               2.30 ns
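The cost of reconfigurability can be read off the reconfigurability table by comparing its last two columns. A small Python sketch, assuming (our reading, not stated on the slide) that the Reconfig. delay column is the wakeup-select delay of the reconfigurable version of the same issue queue:

```python
# Rows copied from the table: (issue-queue size, wakeup-select delay,
# reconfig. delay), both delays in ns.
ROWS = [
    (16, 1.09, 1.55),
    (32, 1.38, 1.89),
    (64, 1.62, 2.10),
    (128, 2.00, 2.30),
]

def reconfig_overhead(rows):
    """Map issue-queue size to (added delay in ns, added delay as a fraction
    of the non-reconfigurable wakeup-select delay)."""
    return {size: (rc - ws, (rc - ws) / ws) for size, ws, rc in rows}

for size, (added, frac) in reconfig_overhead(ROWS).items():
    print(f"IQ {size:3d}: +{added:.2f} ns ({frac:.0%})")
```

Under that reading, the added delay is 0.30 to 0.51 ns, roughly 15% to 42% of the wakeup-select delay, with the relative overhead shrinking as the queue grows.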
23. Implementation of Port Sharing
- 26 ps of added propagation delay
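For rough scale only, the added port-sharing delay can be compared with the 0.6 ns clock period of the evaluated cores. Whether this delay actually lies on a critical path is not stated on the slide, so this is an assumption-laden comparison, not a measured cycle-time penalty.

```python
PORT_SHARING_DELAY_PS = 26.0  # added propagation delay, from the slide
CLOCK_PERIOD_PS = 600.0       # 0.6 ns clock period of Core-U, Core-A, Core-B

# Fraction of one clock cycle taken by the added delay (rough scale only).
fraction = PORT_SHARING_DELAY_PS / CLOCK_PERIOD_PS
print(f"{fraction:.1%} of a 0.6 ns cycle")
```

26 ps out of 600 ps is about 4.3% of a cycle.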
24. Overhead of Reconfigurability
- With reconfigurability, the change is implemented within a core, with complex coupling between pipeline stages.
- With Core-Selectability, the change is implemented at the core level, with less complex coupling between the core and the interconnect.
25. It's as if he knows you like to save execution time.