Title: 12th Lecture: Technological Trends and Future Processor Alternatives
- Today's lecture
- Microprocessors today
- Trends and principles in the Giga Chip Era
- Future Processor Alternatives
Microprocessors today (Y2K)
- Chip technology 2000/01
  - 0.18-µm CMOS technology, 10 to 100 M transistors per chip, 600 MHz to 1.4 GHz cycle rate
- Example processors
  - Intel Pentium III: 7.5 M transistors, 0.18 µm (Bi-)CMOS, up to 1 GHz
  - Intel Pentium 4: ?? transistors, 0.18 µm (Bi-)CMOS, up to 1.4 GHz, uses a trace cache!
  - Intel IA-64 Itanium (already announced for 2000?): 0.18 µm CMOS, 800 MHz; successor McKinley (announced for 2001): > 1 GHz
  - Alpha 21364: 100 M transistors, 0.18 µm CMOS (1.5 volt, 100 watt), ? MHz
  - HAL SPARC: uses trace cache and value prediction
  - Alpha 21464: will be 4-way simultaneously multithreaded
  - Sun MAJC: will be a two-processor chip, each processor a 4-way block-multithreaded VLIW
5. Future processors to use fine-grain parallelism

5.1 Trends and principles in the Giga Chip Era
- Forecasting the effects of technology is hard:
  - "Everything that can be invented has been invented." (US Commissioner of Patents, 1899)
  - "I think there is a world market for about five computers." (Thomas J. Watson Sr., IBM founder, 1943)
Microprocessors tomorrow (Y2K-2012)
- Moore's Law: the number of transistors per chip doubles about every two years.
- SIA (Semiconductor Industry Association) prognosis, 1998
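The doubling rule can be turned into a quick back-of-the-envelope projection. A minimal sketch, assuming the 10 M-transistor, year-2000 baseline quoted on the previous slide (the function name and baseline are illustrative):

```python
# Project transistor counts under Moore's Law: doubling every two years.
def transistors(year, base_year=2000, base_count=10e6):
    """Transistor count projected from a baseline of 10 M in 2000."""
    return base_count * 2 ** ((year - base_year) / 2)

for y in (2000, 2006, 2012):
    print(f"{y}: {transistors(y) / 1e6:.0f} M transistors")
# 2000: 10 M, 2006: 80 M, 2012: 640 M
```

The 2012 figure of several hundred million transistors is in line with the 100 M transistors already quoted above for the Alpha 21364.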
Design Challenges
- Performance is determined by three factors:
  - increasing clock speed,
  - the amount of work that can be performed per cycle,
  - and the number of instructions needed to perform a task.
- Today's general trend toward more complex designs is opposed by the wiring delay within the processor chip as the main technological problem.
- Higher clock rates with sub-quarter-micron designs? On-chip interconnecting wires cause a significant portion of the delay time in circuits.
- Especially global interconnects within a processor chip cause problems at higher clock rates.
- Maintaining the integrity of a signal as it moves from one end of the chip to the other becomes more difficult.
- Copper metallization is worth a 20 to 30 % reduction in wiring delay.
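The 20 to 30 % figure is consistent with copper's lower resistivity. A rough estimate, under the simplifying assumption that RC wire delay scales linearly with the metal's resistivity (textbook resistivity values, idealized model):

```python
# Rough estimate of the wiring-delay benefit of copper over aluminum,
# assuming wire delay is proportional to the metal's resistivity.
rho_al = 2.7e-8  # ohm*m, aluminum (textbook value)
rho_cu = 1.7e-8  # ohm*m, copper (textbook value)

reduction = 1 - rho_cu / rho_al
print(f"delay reduction: {reduction:.0%}")  # about 37 % in this idealized model
```

Real processes see somewhat less benefit because barrier layers add effective resistance, which fits the 20 to 30 % quoted above.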
Application and Economy-Related Trends
- Applications
  - generally user interactions like video, audio, voice recognition, speech processing, and 3D graphics,
  - large data sets and huge databases,
  - large data mining applications,
  - transaction processing,
  - huge EDA applications like CAD/CAM software,
  - virtual reality computer games,
  - signal processing and real-time control.
- Colwell (Intel): the real threat for processor designers is shipping 30 million CPUs only to discover they are imperfect and cause a recall.
- Economies of scale
  - Fabrication plants now cost about $2 billion, a factor of ten more than a decade ago. Manufacturers can only sustain such development costs if larger markets with greater economies of scale emerge.
  - → Workloads will concentrate on the human-computer interface.
  - → Multimedia workloads will grow and influence architectures.
Architectural Challenges and Implications
- Preserve object-code compatibility (may be avoided by a virtual machine that targets run-time ISAs).
- It is necessary to find ways of expressing and exposing more parallelism to the processor. It is doubtful whether enough ILP is available.
- Buses will probably scale; expect much wider buses in the future.
- Memory bottleneck: memory latency may be solved by a combination of technological improvements in memory-chip technology and by applying advanced memory-hierarchy techniques (other authors disagree).
- Power consumption for mobile computers and appliances.
- Soft errors caused by cosmic rays or gamma radiation may be faced with fault-tolerant design throughout the chip.
Possible Solutions
- a focus of processor chips on particular market segments
- multimedia pushes desktop personal computers, while high-end microprocessors will serve specialized applications
- integrate functionalities into systems on a chip
- partition a microprocessor into a client-chip part that focuses on general user interaction, enhanced by server-chip parts that are tailored for special applications
- a CPU core that works like a large ASIC block and that allows system developers to instantiate various devices on a chip with a simple CPU core
- reconfigurable on-chip parts that adapt to application requirements.
- Functional partitioning becomes more important!
Future Processor Architecture Principles
- Speed-up of a single-threaded application - today
  - Trace cache
  - Superspeculative
  - Advanced superscalar
- Speed-up of multi-threaded applications (lecture 13)
  - Chip multiprocessors (CMPs)
  - Simultaneous multithreading
- Speed-up of a single-threaded application by multithreading (lecture 14)
  - Multiscalar processors
  - Trace processors
  - DataScalar
- Exotics (lecture 15)
  - Processor-in-memory (PIM) or intelligent RAM (IRAM)
  - Reconfigurable
  - Asynchronous
Processor Techniques that Speed up Single-threaded Applications
- Trace cache: tries to fetch from dynamic instruction sequences instead of from the static code in the I-cache.
- Advanced superscalar processors: scale current designs up to issuing 16 or 32 instructions per cycle.
- Superspeculative processors: enhance wide-issue superscalar performance by speculating aggressively at every point.
The Trace Cache
- The trace cache is a new paradigm for caching instructions.
- A trace cache is a special I-cache that captures dynamic instruction sequences, in contrast to the I-cache, which contains static instruction sequences.
- Like the I-cache, the trace cache is accessed using the starting address of the next block of instructions.
- Unlike the I-cache, it stores logically contiguous instructions in physically contiguous storage.
- A trace cache line stores a segment of the dynamic instruction trace across multiple, potentially taken branches.
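The lookup behavior described above can be sketched in a few lines. A minimal model, assuming a direct mapping from starting PC to trace line; the class name, line size, and instruction mnemonics are illustrative, not from any real design:

```python
# Minimal trace-cache sketch: lines are indexed by the starting PC and
# hold a dynamically observed instruction sequence that may span
# taken branches.
class TraceCache:
    def __init__(self, line_size=16):
        self.line_size = line_size   # max instructions per trace line
        self.lines = {}              # starting PC -> list of instructions

    def lookup(self, start_pc):
        """Return the stored trace for this fetch address, or None on a miss."""
        return self.lines.get(start_pc)

    def fill(self, start_pc, dynamic_instrs):
        """Write a finalized trace segment (as built by the fill unit)."""
        self.lines[start_pc] = dynamic_instrs[:self.line_size]

tc = TraceCache()
# Dynamic stream: a block ending in a taken branch, then its target block.
tc.fill(0x400, ["ld", "add", "beq 0x480", "mul", "st"])
print(tc.lookup(0x400))   # one line supplies instructions across the branch
```

The point of the `fill` step is exactly the "logically contiguous in physically contiguous storage" property: the branch target's instructions sit in the same line as the branch itself.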
The Trace Cache (2)
- Each line stores a snapshot, or trace, of the dynamic instruction stream.
- The trace construction is off the critical path.
- As a group of instructions is processed, it is latched into the fill unit.
- The fill unit maximizes the size of the segment and finalizes a segment when the segment can be expanded no further.
- The number of instructions within a trace is limited by the trace cache line size.
- Finalized segments are written into the trace cache.
- Instructions can be sent from the trace cache into the reservation stations (??) without having to undergo a large amount of processing and rerouting.
- It is still under research whether the instructions in the trace cache are
  - fetched but not yet decoded,
  - decoded but not yet renamed,
  - or decoded and partly renamed.
- Trace cache placement in the microarchitecture depends on this decision.
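The latch/finalize behavior of the fill unit can be sketched as follows. A simplified model, assuming the only finalization criterion is the line-size limit (real fill units also finalize on, e.g., branch-count limits; all names here are illustrative):

```python
# Fill-unit sketch: latch processed instruction groups and finalize a
# segment once it cannot grow further (here: once the line-size limit
# would be exceeded).
class FillUnit:
    def __init__(self, line_size=16):
        self.line_size = line_size
        self.segment = []
        self.finalized = []            # segments ready for the trace cache

    def latch(self, group):
        """Add a processed group; finalize first if the group would not fit."""
        if len(self.segment) + len(group) > self.line_size:
            self.finalize()
        self.segment.extend(group)

    def finalize(self):
        if self.segment:
            self.finalized.append(self.segment)
            self.segment = []

fu = FillUnit(line_size=4)
for group in (["ld", "add"], ["beq"], ["mul", "st"]):
    fu.latch(group)
fu.finalize()
print(fu.finalized)   # [['ld', 'add', 'beq'], ['mul', 'st']]
```

Because this bookkeeping runs as instructions retire, it stays off the critical fetch path, as noted above.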
The Trace Cache - Performance
- Three applications from the SPECint95 benchmarks are simulated on a 16-wide-issue machine with perfect branch prediction (see the Patt paper).
5.3 Superspeculative Processors
- Idea: instructions generate many highly predictable result values in real programs → speculate on source operand values and begin execution without waiting for the result of the previous instruction. Speculate about true data dependences!
- Reasons for the existence of value locality:
  - Due to register spill code, the reuse distance of many shared values is very short in processor cycles. Many stores do not even make it out of the store queue before their values are needed again.
  - Input sets often contain data with little variation (e.g., sparse matrices or text files with white spaces).
  - A compiler often generates run-time constants due to error checking, switch statement evaluation, and virtual function calls.
  - The compiler also often loads program constants from memory rather than using immediate operands.
- See M. H. Lipasti, J. P. Shen: Superspeculative Microarchitecture for Beyond AD 2000. IEEE Computer, Sept. 1997, pp. 59-66.
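The simplest way to exploit this value locality is a last-value predictor: predict that an instruction will produce the same result as last time. A minimal sketch (table organization and names are illustrative, not Lipasti and Shen's actual design):

```python
# Last-value predictor sketch: exploit value locality by predicting that
# an instruction produces the same result it produced last time.
class LastValuePredictor:
    def __init__(self):
        self.table = {}   # instruction PC -> last observed result

    def predict(self, pc):
        """Speculative result for this instruction, or None if untrained."""
        return self.table.get(pc)

    def update(self, pc, actual):
        """Called at validation time; returns True if the speculation held."""
        correct = self.table.get(pc) == actual
        self.table[pc] = actual
        return correct

vp = LastValuePredictor()
vp.update(0x10, 0)          # e.g. a spilled value that is reloaded unchanged
print(vp.predict(0x10))     # 0 - forwarded speculatively to consumers
print(vp.update(0x10, 0))   # True: the prediction is validated
```

Consumers of the predicted value start executing immediately; the `update` check corresponds to the validation that must happen before the speculative result can become permanent.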
Strong- vs. Weak-dependence Model
- Strong-dependence model for program execution: a total instruction ordering of a sequential program.
  - Two instructions are identified as either dependent or independent, and when in doubt, dependences are pessimistically assumed to exist.
  - Dependences are never allowed to be violated and are enforced during instruction processing.
  - To date, most machines enforce such dependences in a rigorous fashion.
  - This traditional model is overly rigorous and unnecessarily restricts available parallelism.
- Weak-dependence model:
  - specifies that dependences can be temporarily violated during instruction execution as long as recovery can be performed prior to affecting the permanent machine state.
  - Advantage: the machine can speculate aggressively and temporarily violate the dependences. The machine can exceed the performance limit imposed by the strong-dependence model.
Implementation of a Weak-dependence Model
- The front-end engine assumes the weak-dependence model and is highly speculative, predicting instructions in order to speculate aggressively past them.
- The back-end engine still uses the strong-dependence model to validate the speculations, recover from misspeculation, and provide history and guidance information to the speculative engine.
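The front-end/back-end split can be illustrated with a toy speculate-validate-recover loop. A deliberately simplified sketch (a real machine recovers by squashing in-flight instructions, not by calling a function twice; all names are illustrative):

```python
# Weak-dependence sketch: the front end runs ahead with a predicted
# operand value; the back end validates against the real value and
# re-executes on a mispredict, so permanent state is never corrupted.
def speculative_execute(op, predicted_operand, real_operand):
    result = op(predicted_operand)           # front end: speculate past the dependence
    if predicted_operand != real_operand:    # back end: validate the speculation
        result = op(real_operand)            # recovery: squash and re-execute
        return result, "misspeculated"
    return result, "speculation ok"

inc = lambda x: x + 1
print(speculative_execute(inc, 41, 41))   # (42, 'speculation ok')
print(speculative_execute(inc, 41, 99))   # (100, 'misspeculated')
```

Either way the committed result is correct; the weak-dependence model only changes when the work is done, not what finally reaches the permanent machine state.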
Superflow Processor (Lipasti and Shen, 1997)
- The Superflow processor speculates on:
  - instruction flow: two-phase branch predictor combined with trace cache
  - register data flow: predict the register values and dependences between instructions
    - source operand value prediction
    - constant value prediction
    - value stride prediction: speculate on constant, incremental increases in operand values
    - dependence prediction: predicts inter-instruction dependences
  - memory data flow: prediction of load values, of load addresses, and alias prediction
- Superflow simulations: 7.3 IPC for the SPEC95 integer suite, up to 9 instructions per cycle when 32 instructions are potentially issued per cycle.
- With dependence and value prediction, a three-cycle issue nearly matches the performance of a single-cycle dispatch.
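Value stride prediction, listed above as one Superflow ingredient, can be sketched analogously to the last-value predictor: predict the next result as the last value plus the last observed increment. A minimal illustrative model (table layout and names are assumptions, not the paper's design):

```python
# Stride value predictor sketch: predict the next result of an
# instruction as last value + last observed stride, which captures
# constant, incremental increases (e.g. loop counters, array addresses).
class StridePredictor:
    def __init__(self):
        self.last = {}     # pc -> last result
        self.stride = {}   # pc -> last observed stride

    def predict(self, pc):
        if pc in self.last and pc in self.stride:
            return self.last[pc] + self.stride[pc]
        return None

    def update(self, pc, actual):
        if pc in self.last:
            self.stride[pc] = actual - self.last[pc]
        self.last[pc] = actual

sp = StridePredictor()
for addr in (0x1000, 0x1004, 0x1008):   # e.g. an address register stepped by 4
    sp.update(0x20, addr)
print(hex(sp.predict(0x20)))            # 0x100c
```

A last-value predictor is the special case stride = 0, so the two schemes are often combined in one table.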
Superflow Processor Proposal