Title: 12th Lecture: Technological Trends and Future Processor Alternatives
- Today's lecture
- Microprocessors today
- Trends and principles in the Giga Chip Era
- Future Processor Alternatives
Microprocessors today (Y2K)
- Chip technology 2000/01
  - 0.18-µm CMOS technology, 10 to 100 M transistors per chip, 600 MHz to 1.4 GHz cycle rate
- Example processors
  - Intel Pentium III: 7.5 M transistors, 0.18 µm (Bi-)CMOS, up to 1 GHz
  - Intel Pentium 4: ?? transistors, 0.18 µm (Bi-)CMOS, up to 1.4 GHz, uses a trace cache!
  - Intel IA-64 Itanium (already announced for 2000?): 0.18 µm CMOS, 800 MHz; successor McKinley (announced for 2001): > 1 GHz
  - Alpha 21364: 100 M transistors, 0.18 µm CMOS (1.5 volt, 100 watt), ? MHz
  - HAL SPARC: uses trace cache and value prediction
  - Alpha 21464: will be 4-way simultaneously multithreaded
  - Sun MAJC: will be a two-processor chip, each processor a 4-way block-multithreaded VLIW
5. Future processors to use fine-grain parallelism

5.1 Trends and principles in the Giga Chip Era
- Forecasting the effects of technology is hard:
  - "Everything that can be invented has been invented." (US Commissioner of Patents, 1899)
  - "I think there is a world market for about five computers." (Thomas J. Watson Sr., IBM founder, 1943)
Microprocessors tomorrow (Y2K-2012)
- Moore's Law: the number of transistors per chip doubles about every two years.
- SIA (Semiconductor Industry Association) prognosis, 1998
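The doubling rule can be turned into a quick back-of-the-envelope projection. A minimal sketch, assuming the 10 M-transistor, year-2000 baseline quoted on the previous slide (the function name and baseline are illustrative):

```python
# Project transistor counts under Moore's Law: doubling every two years.
def transistors(year, base_year=2000, base_count=10e6):
    """Transistor count projected from a baseline of 10 M in 2000."""
    return base_count * 2 ** ((year - base_year) / 2)

for y in (2000, 2006, 2012):
    print(f"{y}: {transistors(y) / 1e6:.0f} M transistors")
# 2000: 10 M, 2006: 80 M, 2012: 640 M
```

The 2012 figure of several hundred million transistors is in line with the 100 M transistors already quoted above for the Alpha 21364.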
Design Challenges
- Performance is determined by three factors:
  - increasing clock speed,
  - the amount of work that can be performed per cycle,
  - and the number of instructions needed to perform a task.
- Today's general trend toward more complex designs is opposed by the wiring delay within the processor chip as the main technological problem.
- Higher clock rates with sub-quarter-micron designs? On-chip interconnecting wires cause a significant portion of the delay time in circuits.
- Especially global interconnects within a processor chip cause problems at higher clock rates.
- Maintaining the integrity of a signal as it moves from one end of the chip to the other becomes more difficult.
- Copper metallization is worth a 20 to 30 % reduction in wiring delay.
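The 20 to 30 % figure is consistent with copper's lower resistivity. A rough estimate, under the simplifying assumption that RC wire delay scales linearly with the metal's resistivity (textbook resistivity values, idealized model):

```python
# Rough estimate of the wiring-delay benefit of copper over aluminum,
# assuming wire delay is proportional to the metal's resistivity.
rho_al = 2.7e-8  # ohm*m, aluminum (textbook value)
rho_cu = 1.7e-8  # ohm*m, copper (textbook value)

reduction = 1 - rho_cu / rho_al
print(f"delay reduction: {reduction:.0%}")  # about 37 % in this idealized model
```

Real processes see somewhat less benefit because barrier layers add effective resistance, which fits the 20 to 30 % quoted above.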
Application and Economy-Related Trends
- Applications
  - generally user interactions like video, audio, voice recognition, speech processing, and 3D graphics,
  - large data sets and huge databases,
  - large data mining applications,
  - transaction processing,
  - huge EDA applications like CAD/CAM software,
  - virtual reality computer games,
  - signal processing and real-time control.
- Colwell (Intel): the real threat for processor designers is shipping 30 million CPUs only to discover they are imperfect and cause a recall.
- Economies of scale
  - Fabrication plants now cost about $2 billion, a factor of ten more than a decade ago. Manufacturers can only sustain such development costs if larger markets with greater economies of scale emerge.
  - → Workloads will concentrate on the human-computer interface.
  - → Multimedia workloads will grow and influence architectures.
Architectural Challenges and Implications
- Preserve object-code compatibility (may be avoided by a virtual machine that targets run-time ISAs).
- It is necessary to find ways of expressing and exposing more parallelism to the processor. It is doubtful whether enough ILP is available.
- Buses will probably scale; expect much wider buses in the future.
- Memory bottleneck: memory latency may be solved by a combination of technological improvements in memory-chip technology and by applying advanced memory-hierarchy techniques (other authors disagree).
- Power consumption for mobile computers and appliances.
- Soft errors caused by cosmic rays or gamma radiation may be faced with fault-tolerant design throughout the chip.
Possible Solutions
- a focus of processor chips on particular market segments
- multimedia pushes desktop personal computers, while high-end microprocessors will serve specialized applications
- integrate functionalities into systems on a chip
- partition a microprocessor into a client-chip part that focuses on general user interaction, enhanced by server-chip parts that are tailored for special applications
- a CPU core that works like a large ASIC block and that allows system developers to instantiate various devices on a chip with a simple CPU core
- reconfigurable on-chip parts that adapt to application requirements.
- Functional partitioning becomes more important!
Future Processor Architecture Principles
- Speed-up of a single-threaded application - today
  - Trace cache
  - Superspeculative
  - Advanced superscalar
- Speed-up of multi-threaded applications (lecture 13)
  - Chip multiprocessors (CMPs)
  - Simultaneous multithreading
- Speed-up of a single-threaded application by multithreading (lecture 14)
  - Multiscalar processors
  - Trace processors
  - DataScalar
- Exotics (lecture 15)
  - Processor-in-memory (PIM) or intelligent RAM (IRAM)
  - Reconfigurable
  - Asynchronous
Processor Techniques that Speed up Single-threaded Applications
- Trace cache: tries to fetch from dynamic instruction sequences instead of from the static code in the I-cache.
- Advanced superscalar processors: scale current designs up to issuing 16 or 32 instructions per cycle.
- Superspeculative processors: enhance wide-issue superscalar performance by speculating aggressively at every point.
The Trace Cache
- The trace cache is a new paradigm for caching instructions.
- A trace cache is a special I-cache that captures dynamic instruction sequences, in contrast to the I-cache, which contains static instruction sequences.
- Like the I-cache, the trace cache is accessed using the starting address of the next block of instructions.
- Unlike the I-cache, it stores logically contiguous instructions in physically contiguous storage.
- A trace cache line stores a segment of the dynamic instruction trace across multiple, potentially taken branches.
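The lookup behavior described above can be sketched in a few lines. A minimal model, assuming a direct mapping from starting PC to trace line; the class name, line size, and instruction mnemonics are illustrative, not from any real design:

```python
# Minimal trace-cache sketch: lines are indexed by the starting PC and
# hold a dynamically observed instruction sequence that may span
# taken branches.
class TraceCache:
    def __init__(self, line_size=16):
        self.line_size = line_size   # max instructions per trace line
        self.lines = {}              # starting PC -> list of instructions

    def lookup(self, start_pc):
        """Return the stored trace for this fetch address, or None on a miss."""
        return self.lines.get(start_pc)

    def fill(self, start_pc, dynamic_instrs):
        """Write a finalized trace segment (as built by the fill unit)."""
        self.lines[start_pc] = dynamic_instrs[:self.line_size]

tc = TraceCache()
# Dynamic stream: a block ending in a taken branch, then its target block.
tc.fill(0x400, ["ld", "add", "beq 0x480", "mul", "st"])
print(tc.lookup(0x400))   # one line supplies instructions across the branch
```

The point of the `fill` step is exactly the "logically contiguous in physically contiguous storage" property: the branch target's instructions sit in the same line as the branch itself.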
The Trace Cache (2)
- Each line stores a snapshot, or trace, of the dynamic instruction stream.
- The trace construction is off the critical path.
- As a group of instructions is processed, it is latched into the fill unit.
- The fill unit maximizes the size of the segment and finalizes a segment when the segment can be expanded no further.
- The number of instructions within a trace is limited by the trace cache line size.
- Finalized segments are written into the trace cache.
- Instructions can be sent from the trace cache into the reservation stations (??) without having to undergo a large amount of processing and rerouting.
- It is still under research whether the instructions in the trace cache are
  - fetched but not yet decoded,
  - decoded but not yet renamed,
  - or decoded and partly renamed.
- Trace cache placement in the microarchitecture depends on this decision.
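The latch/finalize behavior of the fill unit can be sketched as follows. A simplified model, assuming the only finalization criterion is the line-size limit (real fill units also finalize on, e.g., branch-count limits; all names here are illustrative):

```python
# Fill-unit sketch: latch processed instruction groups and finalize a
# segment once it cannot grow further (here: once the line-size limit
# would be exceeded).
class FillUnit:
    def __init__(self, line_size=16):
        self.line_size = line_size
        self.segment = []
        self.finalized = []            # segments ready for the trace cache

    def latch(self, group):
        """Add a processed group; finalize first if the group would not fit."""
        if len(self.segment) + len(group) > self.line_size:
            self.finalize()
        self.segment.extend(group)

    def finalize(self):
        if self.segment:
            self.finalized.append(self.segment)
            self.segment = []

fu = FillUnit(line_size=4)
for group in (["ld", "add"], ["beq"], ["mul", "st"]):
    fu.latch(group)
fu.finalize()
print(fu.finalized)   # [['ld', 'add', 'beq'], ['mul', 'st']]
```

Because this bookkeeping runs as instructions retire, it stays off the critical fetch path, as noted above.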
The Trace Cache - Performance
- Three applications from the SPECint95 benchmarks are simulated on a 16-wide-issue machine with perfect branch prediction (see the Patt paper).
5.3 Superspeculative Processors
- Idea: instructions generate many highly predictable result values in real programs → speculate on source operand values and begin execution without waiting for the result of the previous instruction. Speculate about true data dependences!
- Reasons for the existence of value locality:
  - Due to register spill code, the reuse distance of many shared values is very short in processor cycles. Many stores do not even make it out of the store queue before their values are needed again.
  - Input sets often contain data with little variation (e.g., sparse matrices or text files with white spaces).
  - A compiler often generates run-time constants due to error checking, switch statement evaluation, and virtual function calls.
  - The compiler also often loads program constants from memory rather than using immediate operands.
- See M. H. Lipasti, J. P. Shen: Superspeculative Microarchitecture for Beyond AD 2000. IEEE Computer, Sept. 1997, pp. 59-66.
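The simplest way to exploit this value locality is a last-value predictor: predict that an instruction will produce the same result as last time. A minimal sketch (table organization and names are illustrative, not Lipasti and Shen's actual design):

```python
# Last-value predictor sketch: exploit value locality by predicting that
# an instruction produces the same result it produced last time.
class LastValuePredictor:
    def __init__(self):
        self.table = {}   # instruction PC -> last observed result

    def predict(self, pc):
        """Speculative result for this instruction, or None if untrained."""
        return self.table.get(pc)

    def update(self, pc, actual):
        """Called at validation time; returns True if the speculation held."""
        correct = self.table.get(pc) == actual
        self.table[pc] = actual
        return correct

vp = LastValuePredictor()
vp.update(0x10, 0)          # e.g. a spilled value that is reloaded unchanged
print(vp.predict(0x10))     # 0 - forwarded speculatively to consumers
print(vp.update(0x10, 0))   # True: the prediction is validated
```

Consumers of the predicted value start executing immediately; the `update` check corresponds to the validation that must happen before the speculative result can become permanent.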
Strong- vs. Weak-dependence Model
- Strong-dependence model for program execution: a total instruction ordering of a sequential program.
  - Two instructions are identified as either dependent or independent, and when in doubt, dependences are pessimistically assumed to exist.
  - Dependences are never allowed to be violated and are enforced during instruction processing.
  - To date, most machines enforce such dependences in a rigorous fashion.
  - This traditional model is overly rigorous and unnecessarily restricts available parallelism.
- Weak-dependence model:
  - specifies that dependences can be temporarily violated during instruction execution as long as recovery can be performed prior to affecting the permanent machine state.
  - Advantage: the machine can speculate aggressively and temporarily violate the dependences. The machine can exceed the performance limit imposed by the strong-dependence model.
Implementation of a Weak-dependence Model
- The front-end engine assumes the weak-dependence model and is highly speculative, predicting instructions in order to speculate aggressively past them.
- The back-end engine still uses the strong-dependence model to validate the speculations, recover from misspeculation, and provide history and guidance information to the speculative engine.
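The front-end/back-end split can be illustrated with a toy speculate-validate-recover loop. A deliberately simplified sketch (a real machine recovers by squashing in-flight instructions, not by calling a function twice; all names are illustrative):

```python
# Weak-dependence sketch: the front end runs ahead with a predicted
# operand value; the back end validates against the real value and
# re-executes on a mispredict, so permanent state is never corrupted.
def speculative_execute(op, predicted_operand, real_operand):
    result = op(predicted_operand)           # front end: speculate past the dependence
    if predicted_operand != real_operand:    # back end: validate the speculation
        result = op(real_operand)            # recovery: squash and re-execute
        return result, "misspeculated"
    return result, "speculation ok"

inc = lambda x: x + 1
print(speculative_execute(inc, 41, 41))   # (42, 'speculation ok')
print(speculative_execute(inc, 41, 99))   # (100, 'misspeculated')
```

Either way the committed result is correct; the weak-dependence model only changes when the work is done, not what finally reaches the permanent machine state.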
Superflow Processor (Lipasti and Shen, 1997)
- The Superflow processor speculates on:
  - instruction flow: two-phase branch predictor combined with trace cache
  - register data flow: predict the register values and dependences between instructions
    - source operand value prediction
    - constant value prediction
    - value stride prediction: speculate on constant, incremental increases in operand values
    - dependence prediction: predicts inter-instruction dependences
  - memory data flow: prediction of load values, of load addresses, and alias prediction
- Superflow simulations: 7.3 IPC for the SPEC95 integer suite, up to 9 instructions per cycle when 32 instructions are potentially issued per cycle.
- With dependence and value prediction, a three-cycle issue nearly matches the performance of a single-cycle dispatch.
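Value stride prediction, listed above as one Superflow ingredient, can be sketched analogously to the last-value predictor: predict the next result as the last value plus the last observed increment. A minimal illustrative model (table layout and names are assumptions, not the paper's design):

```python
# Stride value predictor sketch: predict the next result of an
# instruction as last value + last observed stride, which captures
# constant, incremental increases (e.g. loop counters, array addresses).
class StridePredictor:
    def __init__(self):
        self.last = {}     # pc -> last result
        self.stride = {}   # pc -> last observed stride

    def predict(self, pc):
        if pc in self.last and pc in self.stride:
            return self.last[pc] + self.stride[pc]
        return None

    def update(self, pc, actual):
        if pc in self.last:
            self.stride[pc] = actual - self.last[pc]
        self.last[pc] = actual

sp = StridePredictor()
for addr in (0x1000, 0x1004, 0x1008):   # e.g. an address register stepped by 4
    sp.update(0x20, addr)
print(hex(sp.predict(0x20)))            # 0x100c
```

A last-value predictor is the special case stride = 0, so the two schemes are often combined in one table.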
Superflow Processor Proposal