1. Inherently Lower-Power High-Performance Superscalar Architectures
Paper Review
- Rami Abielmona
- Prof. Maitham Shams
- 95.575
- March 4, 2002
2. Flynn's Classifications (1972) [1]
- SISD (Single Instruction stream, Single Data stream): conventional sequential machines
  - The program being executed is the instruction stream, and the data operated on is the data stream
- SIMD (Single Instruction stream, Multiple Data streams): vector machines
  - Processors execute the same program, but operate on different data streams
- MIMD (Multiple Instruction streams, Multiple Data streams): parallel machines
  - Independent processors execute different programs, using unique data streams
- MISD (Multiple Instruction streams, Single Data stream): systolic array machines
  - A common data structure is manipulated by separate processors, each executing a different instruction stream (program)
3. Pipelined Execution
- An effective way of organizing concurrent activity in a computer system
- Makes it possible to execute instructions concurrently
- The maximum throughput of a pipelined processor is one instruction per clock cycle
- Figure 1 shows a two-stage pipeline, with buffer B1 receiving new information at the end of each clock cycle (a toy simulation follows)
Figure 1 (courtesy of [2])
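To make the steady-state throughput claim concrete, here is a minimal sketch (our construction, not from the paper or [2]); `simulate_two_stage` is a hypothetical helper that models buffer B1 sitting between the two stages.

```python
# Minimal two-stage pipeline sketch (illustrative only, not from the paper).
# Stage 1 fetches into buffer B1; stage 2 executes what B1 held last cycle.
# Once the pipeline fills, one instruction completes every clock cycle.

def simulate_two_stage(instructions):
    b1 = None                      # inter-stage buffer B1
    completed = 0
    cycles = 0
    fetch_queue = list(instructions)
    while fetch_queue or b1 is not None:
        cycles += 1
        # Stage 2: execute whatever B1 received at the end of the last cycle
        if b1 is not None:
            completed += 1
        # Stage 1: B1 receives the next instruction at the end of this cycle
        b1 = fetch_queue.pop(0) if fetch_queue else None
    return completed, cycles

done, cycles = simulate_two_stage(range(100))
print(done, cycles)  # 100 instructions in 101 cycles -> ~1 instruction/cycle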
4. Superscalar Processors
- Processors equipped with multiple execution units, in order to handle several instructions in parallel [11]
- The maximum throughput is greater than one instruction per cycle (multiple issue)
- The baseline architecture is shown in Figure 2 [3]
Figure 2 (courtesy of [3])
5. Important Terminology [2], [4]
- Issue Width: how many instructions are issued per cycle
- Issue Window: the last n entries of the instruction buffer
- Register File: a set of n-byte, dual-read, single-write banks of registers
- Register Renaming: a technique used to prevent stalling the processor on false data dependencies between instructions (see the sketch after this list)
- Instruction Steering: a technique used to send decoded instructions to the appropriate memory banks
- Memory Disambiguation Unit: a mechanism for enforcing data dependencies through memory
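As a quick illustration of register renaming, here is a hedged sketch (our construction; `rename`, `free_list`, and `map_table` are hypothetical names, and real hardware also recycles physical registers at commit):

```python
# Hedged sketch of register renaming (illustrative; names are hypothetical).
# Each architectural destination gets a fresh physical register, so false
# (WAR/WAW) dependencies between instructions disappear.

free_list = [f"p{i}" for i in range(32)]   # free physical registers
map_table = {}                              # architectural -> physical mapping

def rename(dest, srcs):
    renamed_srcs = [map_table.get(s, s) for s in srcs]  # read current mappings
    new_phys = free_list.pop(0)                         # allocate fresh register
    map_table[dest] = new_phys                          # update the mapping
    return new_phys, renamed_srcs

# Two writes to r1: after renaming they target different physical registers,
# so the second write no longer conflicts with readers of the first.
print(rename("r1", ["r2", "r3"]))   # ('p0', ['r2', 'r3'])
print(rename("r1", ["r4", "r1"]))   # ('p1', ['r4', 'p0'])
```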
6. Motivations and Objectives
- To analyze the high-end processor market for BOTH power and area-performance trade-offs (not previously done)
- To propose a superscalar architecture that achieves a power reduction without compromising performance
- The analysis is carried out on the structures whose energy dissipation grows with increasing issue width:
  - Register rename logic
  - Instruction issue window
  - Memory disambiguation unit
  - Data bypass mechanism
  - Multiported register file
  - Instruction and data caches
7. Energy Models [5]
- Model 1: Multiported RAM (a worked example follows this list)
  - Access energy (read or write): E_access = E_decode + E_array + E_SA + E_ctl,SA + E_pre + E_control
  - Word-line energy: E_wl = Vdd^2 * N_bits * (C_gate * W_pass,r * (2*N_write + N_read) + W_pitch * C_metal)
  - Bit-line energy: E_bl = Vdd * M_margin * V_sense * C_bl,read * N_bits
- Model 2: CAM (Content-Addressable Memory)
  - Uses IW write word lines and IW write bit-line pairs
- Model 3: Pipeline latches and clocking tree
  - Assumes a balanced clocking tree (less power dissipation than grids)
  - Assumes a lower-power, single-phase clocking scheme
  - Near-minimum transistor sizes are used in the latches
- Model 4: Functional units
  - E_avg = E_const + N_change * E_change
  - Energy complexity is independent of issue width
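The RAM word-line and bit-line terms can be evaluated directly. In the sketch below, only the formula structure comes from the slide; every numeric constant is a placeholder assumption, not a value from [5]:

```python
# Worked sketch of Model 1's word-line and bit-line energy terms, as given on
# this slide. All numeric values below are placeholders for illustration only;
# the real coefficients come from the technology data in Zyuban's thesis [5].

VDD = 3.3            # supply voltage (V), matching the 0.35-um results
N_BITS = 64          # bits per word
C_GATE = 2e-15       # gate cap per unit pass-transistor width (F/um), placeholder
W_PASS_R = 1.0       # read pass-transistor width (um), placeholder
N_WRITE, N_READ = 1, 2   # write and read ports
W_PITCH = 1.5        # cell pitch per port (um), placeholder
C_METAL = 0.2e-15    # wire capacitance per um (F/um), placeholder
M_MARGIN = 2.0       # sense-margin multiplier, placeholder
V_SENSE = 0.1        # bit-line swing needed by the sense amp (V), placeholder
C_BL_READ = 50e-15   # read bit-line capacitance (F), placeholder

# E_wl = Vdd^2 * N_bits * (C_gate * W_pass,r * (2*N_write + N_read) + W_pitch * C_metal)
e_wordline = VDD**2 * N_BITS * (C_GATE * W_PASS_R * (2*N_WRITE + N_READ)
                                + W_PITCH * C_METAL)

# E_bl = Vdd * M_margin * V_sense * C_bl,read * N_bits
e_bitline = VDD * M_MARGIN * V_SENSE * C_BL_READ * N_BITS

print(f"word-line energy ~ {e_wordline:.3e} J, bit-line energy ~ {e_bitline:.3e} J")
```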
8. Preliminary Simulation Results: E ∝ (IW)^γ
- Wrote own simulator incorporating all the developed energy models (based on the SimpleScalar tool set)
- Ran simulations for 5 superscalar designs, with IW ranging from 4 to 16
- Results show that total committed energy increases with IW, as wider processors rely on deeper speculation to exploit ILP
- Energy/instruction grows linearly for all structures except the functional units (FUs)
- Results were obtained for a 0.35-micron, Vdd = 3.3 V technology, with which the FUs scale well; however, RAMs, CAMs and long wires do not scale well, and thus have to be LOW-POWER structures
Structure → energy growth parameter γ (where energy/instruction ∝ (IW)^γ):
- Register rename logic: γ ≈ 1.1
- Instruction issue window: γ ≈ 1.9
- Memory disambiguation unit: γ ≈ 1.5
- Multiported register file: γ ≈ 1.8
- Data bypass mechanism: γ ≈ 1.6
- Functional units: γ ≈ 0.1
- All caches: γ ≈ 0.7
Table 1
9. Problem Formulation
- Energy-Delay Product
  - E × D = energy/operation × cycles/operation
  - E × D = (energy/cycle) / IPC^2
  - E × D = IPC × (IW)^γ / IPC^2 = (IW)^γ / IPC
  - With IPC = (IW)^a: E × D = (IW)^(γ-a) = (IPC)^((γ-a)/a)
- Problem Definition
  - If a = 1, then E × D = (IPC)^(γ-1) = (IW)^(γ-1)
  - If a = 0.5, then E × D = (IPC)^(2γ-1) = (IW)^(γ-1/2)
  - Since Table 1 gives γ > 1 for the critical structures, E × D degrades as IW grows: new techniques are needed to achieve more ILP than conventional superscalar design provides (see the numeric check below)
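A quick numeric check of this algebra (our construction; the γ values are Table 1's, and the two values of `a` are the slide's scenarios):

```python
# Numeric check of the slide's energy-delay algebra: with energy/instruction
# ~ IW**gamma and IPC ~ IW**a, the E*D product scales as IW**(gamma - a).
# gamma values come from Table 1; the values of a are modeling assumptions.

GAMMA = {  # energy growth parameters from Table 1
    "register rename logic": 1.1,
    "instruction issue window": 1.9,
    "memory disambiguation unit": 1.5,
    "multiported register file": 1.8,
    "data bypass mechanism": 1.6,
    "functional units": 0.1,
    "all caches": 0.7,
}

def ed_growth(gamma, a, iw):
    """Relative E*D of an IW-wide design vs. a scalar (IW=1) baseline."""
    return iw ** (gamma - a)

for a in (1.0, 0.5):  # ideal vs. square-root IPC scaling
    for name, g in GAMMA.items():
        print(f"a={a}: {name}: E*D grows {ed_growth(g, a, 8):.1f}x at IW=8")
```

Structures with γ greater than a (the issue window, register file, and bypass network in particular) make the energy-delay product strictly worse as issue width grows, which is exactly the motivation for decentralization.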
10. Intermediary Recap
- We have discussed:
  - Superscalar processor design and terminology
  - Energy modeling of microarchitecture structures
  - Analysis of the energy-delay metric
  - Preliminary simulation results
- We will introduce:
  - General solution methodology
  - Previous decentralization schemes
  - Proposed strategy
  - Simulation results of the multicluster architecture
  - Conclusions
11. General Design Solution
- Decentralization of the microarchitecture
  - Replace the tightly coupled CPU with a set of clusters, each capable of superscalar processing
  - Can ideally reduce γ to zero, given good cluster partitioning techniques
- The solution introduces the following issues:
  - Additional paths for intercluster communication
  - The need for cluster assignment algorithms
  - Interaction of the clusters with a common memory system
12. Previous Decentralized Solutions
- Limited Connectivity VLIWs [6]: the RF is partitioned into banks; every operation specifies a destination bank
- Multiscalar Architecture [7]: PEs are organized in a circular chain; the RF is decentralized
- Trace Window Architecture [8]: the RF and issue window are partitioned; all instructions must be buffered
- Multicluster Architecture [9]: the RF, issue window and FUs are decentralized; a special instruction is used for intercluster communication
- Dependence-Based Architecture [10]: contains instruction dispatch intelligence
- 2-cluster Alpha 21264 Processor [11]: both clusters contain a copy of the RF
13. Proposed Multicluster Architecture (1)
- Instead of a tightly coupled CPU, the proposed architecture involves a set of clusters, each containing:
  - an instruction issue window
  - a local physical register file
  - a set of execution units
  - a local memory disambiguation unit
  - one bank of the interleaved data cache
- Refer to Figure 3 on the next slide
14. Proposed Multicluster Architecture (2)
Figure 3
15. Multicluster Architecture Details
- Register Renaming and Instruction Steering
  - Each cluster is provided with a local physical RF
  - A Global Map Table maintains the mapping between architectural and physical registers
- Cluster Assignment Algorithm (see the sketch below)
  - Tries to minimize:
    - intercluster register dependencies
    - the delay through the cluster assignment logic
  - The whole-graph solution is NP-complete, so near-optimal solutions are devised by a divide-and-conquer method
- Intercluster Communication
  - The Remote Access Window (RAW) is used for remote RF calls
  - The Remote Access Buffer (RAB) is used to keep the remote source operand
  - A one-cycle penalty is incurred for a remote RF access
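For intuition only, here is a greedy toy version of cluster assignment (our construction; the paper's actual algorithm is the divide-and-conquer approximation mentioned above, which also accounts for the assignment-logic delay):

```python
# Hedged sketch of a greedy cluster-assignment heuristic (illustrative only).
# It captures the objective on this slide: steer each instruction to the
# cluster that already produces most of its source operands, so intercluster
# register dependencies (and remote RF penalties) are minimized.

N_CLUSTERS = 2
producer = {}                 # physical register -> cluster that writes it
load = [0] * N_CLUSTERS       # instructions assigned to each cluster so far

def assign(dest, srcs):
    # Count how many source operands each cluster produces locally.
    votes = [0] * N_CLUSTERS
    for s in srcs:
        if s in producer:
            votes[producer[s]] += 1
    # Prefer the cluster with the most local sources; tie-break on load.
    best = max(range(N_CLUSTERS), key=lambda c: (votes[c], -load[c]))
    producer[dest] = best
    load[best] += 1
    return best

print(assign("p1", []))        # no sources: goes to the least-loaded cluster
print(assign("p2", ["p1"]))    # follows its producer, avoiding a remote access
```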
16. Multicluster Architecture Details (Cont'd)
- Memory Dataflow
  - A centralized memory disambiguation unit does not scale with increasing issue width and bigger load/store windows
  - Proposed scheme: every cluster is provided with a local load/store window that is hardwired to a particular data cache bank
  - A bank predictor was developed to cope with not knowing, at the decode stage, which cluster the instruction will be routed to (see the sketch below)
- Stack Pointer (SP) References
  - An eager mechanism was realized for handling SP references
  - On a new reference to the SP, an entry is allocated in the RAB
  - Upon instruction completion, results are written into the RF and the RAB
  - The RAB entry is not freed after an instruction reads its contents
  - The RAB entry is freed only when a new SP reference commits
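The slide does not describe the predictor's internals, so the sketch below is a hypothetical stand-in: a minimal PC-indexed, last-outcome predictor that only demonstrates the mechanism of predicting a bank at decode and verifying it once the address resolves.

```python
# Minimal PC-indexed, last-outcome bank predictor sketch (our construction;
# the paper's predictor may differ). Predict the cache bank at decode, verify
# when the address is computed, and pay a transfer penalty on a misprediction.

TABLE_SIZE = 256
table = [0] * TABLE_SIZE       # predicted bank per PC hash entry

def predict_bank(pc):
    return table[pc % TABLE_SIZE]

def update_bank(pc, actual_bank):
    mispredicted = table[pc % TABLE_SIZE] != actual_bank
    table[pc % TABLE_SIZE] = actual_bank   # remember the last bank used
    return mispredicted

# Decode-time prediction, execute-time verification:
pc = 0x400123
guess = predict_bank(pc)
miss = update_bank(pc, actual_bank=1)      # address resolved to bank 1
print(guess, miss)                          # 0 True -> intercluster transfer
```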
17. Results and Analysis
- A single address transfer bus is sufficient for handling intercluster address transfers
- A single bus is used to handle intercluster data transfers arising from bank mispredictions
- 4-6 RAB entries are used for low power
- 2 extra RAB entries are sufficient for SP references
- The SP eager mechanism reduces intercluster traffic by 20% and improves performance by 3%
- The multicluster architecture showed 20% better performance than the best centralized configurations, with a 50% reduction in power dissipation
18. Conclusions
- Main Result of Work
  - Using this architecture allows the development of high-performance processors while keeping the microarchitecture energy-efficient, as demonstrated by the energy-delay product
- Main Contribution of Work
  - A methodology for energy-efficiency analysis was derived for use with the next generation of high-performance, decentralized superscalar processors
- Other Major Contributions
  - Opened analysts' eyes to the 3-D IPC-area-energy space
  - Proposed a roadmap for future high-performance, low-power microprocessor development
  - Coined the "energy-efficient family" concept, composed of equally optimal energy-efficient configurations
19. References (1)
- [1] M.J. Flynn, "Very High-Speed Computing Systems," Proceedings of the IEEE, vol. 54, pp. 1901-1909, December 1966.
- [2] C. Hamacher, Z. Vranesic and S. Zaky, Computer Organization, fifth edition, McGraw-Hill, New York, 2002.
- [3] V. Zyuban and P. Kogge, "Inherently Lower-Power High-Performance Superscalar Architectures," IEEE Transactions on Computers, vol. 50, no. 3, pp. 268-285, March 2001.
- [4] E. Rotenberg, "AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors," Proc. 29th Fault-Tolerant Computing Symp., June 1999.
- [5] V. Zyuban, "Inherently Lower-Power High-Performance Superscalar Architectures," PhD thesis, Univ. of Notre Dame, Mar. 2000.
- [6] R. Colwell et al., "A VLIW Architecture for a Trace Scheduling Compiler," IEEE Trans. Computers, vol. 37, no. 8, pp. 967-979, Aug. 1988.
- [7] M. Franklin and G.S. Sohi, "The Expandable Split Window Paradigm for Exploiting Fine-Grain Parallelism," Proc. 19th Ann. Int'l Symp. Computer Architecture, May 1992.
20. References (2)
- [8] S. Vajapeyam and T. Mitra, "Improving Superscalar Instruction Dispatch and Issue by Exploiting Dynamic Code Sequences," Proc. 24th Ann. Int'l Symp. Computer Architecture, June 1997.
- [9] K. Farkas, P. Chow, N. Jouppi and Z. Vranesic, "The Multicluster Architecture: Reducing Cycle Time through Partitioning," Proc. 30th Ann. Int'l Symp. Microarchitecture, Dec. 1997.
- [10] S. Palacharla, N. Jouppi and J. Smith, "Complexity-Effective Superscalar Processors," Proc. 24th Ann. Int'l Symp. Computer Architecture, pp. 206-218, June 1997.
- [11] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, New York, 1993.
21. Questions/Comments