1. High Performance Computer Architecture Challenges
Rajeev Balasubramonian, School of Computing, University of Utah
2. Dramatic Clock Speed Improvements!!
Intel Pentium 4: 3.2 GHz
The 1st Intel processor: 108 kHz
3. Clock Speed = Performance?
- The Intel Pentium4 has a higher clock speed than the IBM Power4; does the Pentium4 execute your program faster?
4. Clock Speed = Performance?
- The Intel Pentium4 has a higher clock speed than the IBM Power4; does the Pentium4 execute your program faster?
[Figure: two cases comparing completed instructions against clock ticks over time]
5. Performance = Clock Speed x Parallelism
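A minimal sketch of this relationship, comparing the two processors from the previous slides. The clock speeds follow the slides, but the instruction count and the IPC (instructions per cycle, a proxy for parallelism) values are illustrative assumptions only:

```python
# Execution time = instructions / (IPC * clock frequency).
# All numbers below are illustrative assumptions, not measured values.

def exec_time_seconds(instructions, ipc, clock_hz):
    """Time to run a program given its instruction count, the achieved
    parallelism (IPC), and the clock frequency."""
    return instructions / (ipc * clock_hz)

program = 1e9  # hypothetical 1-billion-instruction program

# Higher clock, lower assumed parallelism
pentium4 = exec_time_seconds(program, ipc=1.0, clock_hz=3.2e9)
# Lower clock, higher assumed parallelism
power4 = exec_time_seconds(program, ipc=2.5, clock_hz=1.3e9)

print(f"Pentium 4: {pentium4:.3f} s, Power4: {power4:.3f} s")
# The lower-clocked chip can finish first if its parallelism is high enough.
```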
6. What About Parallelism?
7. Dramatic Clock Speed Improvements!!
Intel Pentium 4: 3.2 GHz
The 1st Intel processor: 108 kHz
8. The Basic Pipeline
Consider an automobile assembly line with four stages, each taking 1 day: a new car rolls out every day.
[Figure: the same assembly line split into shorter stages, so that a new car rolls out every half day]
In each case, it takes 4 days to build a car, but more stages mean more parallelism and less time between cars (a quick calculation follows below).
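A back-of-the-envelope check of the assembly-line numbers above; splitting the 4 days of work into 8 half-day stages is an assumption used to match the "every half day" case:

```python
# Throughput of an ideal pipeline: the latency per item stays fixed,
# but the time between finished items shrinks as stages get shorter.

def pipeline(total_work_days, num_stages):
    stage_time = total_work_days / num_stages   # time between finished cars
    latency = total_work_days                   # time to build any one car
    return stage_time, latency

for stages in (4, 8):  # 4 one-day stages vs. 8 half-day stages (assumed split)
    gap, latency = pipeline(total_work_days=4, num_stages=stages)
    print(f"{stages} stages: a car every {gap} day(s), each car takes {latency} days")
```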
9. What Determines Clock Speed?
- Clock speed is a function of the work done in each stage; in the earlier examples, the "clock speeds" were 1 car/day and 2 cars/day
- Similarly, it takes plenty of work to execute an instruction, and this work is broken into stages
[Figure: the execution of a single instruction divided into pipeline stages]
10. What Determines Clock Speed?
- Clock speed is a function of the work done in each stage; in the earlier examples, the "clock speeds" were 1 car/day and 2 cars/day
- Similarly, it takes plenty of work to execute an instruction, and this work is broken into stages
250 ps of work per stage gives a 4 GHz clock speed (checked in the one-liner below).
[Figure: the execution of a single instruction divided into pipeline stages]
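The 250 ps figure maps directly to the clock rate, since the clock period must cover the work in one stage:

```python
# Clock frequency is the reciprocal of the per-stage delay.
stage_delay_s = 250e-12             # 250 ps of work per pipeline stage
clock_hz = 1.0 / stage_delay_s
print(f"{clock_hz / 1e9:.1f} GHz")  # -> 4.0 GHz
```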
11. Clock Speed Improvements
- Why have we seen such dramatic improvements in clock speed?
- Work has been broken up into more stages: early Intel chips executed work equivalent to approximately 56 logic gates per stage, while today's chips execute about 12 logic gates' worth of work
- Transistors have been becoming faster: as technology improves, we can draw smaller and smaller transistors/gates on a chip, and that improves their speed (doubles every 5-6 years)
12. Will these Improvements Continue?
- Transistors will continue to shrink and become faster for at least 10 more years
- Each pipeline stage is already pretty small; improvements from this factor will cease
- If clock speed improvements stagnate, should we turn our focus to parallelism?
13. Microprocessor Blocks
[Block diagram: Branch Predictor, L1 Instr Cache, Decode/Rename, Issue Logic, four ALUs, Register File, L1 Data Cache, L2 Cache]
14. Innovations: Branch Predictor
Improve prediction accuracy by detecting frequent patterns (a predictor sketch follows below).
[Block diagram: the same microprocessor blocks as in slide 13]
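As one concrete illustration of pattern-based prediction (a generic textbook scheme, not necessarily the one used in any processor named here), a 2-bit saturating-counter predictor; the table size and indexing are arbitrary assumptions:

```python
# A minimal 2-bit saturating-counter branch predictor (illustrative only).
# Counter values: 0,1 -> predict not-taken; 2,3 -> predict taken.

class BimodalPredictor:
    def __init__(self, entries=1024):
        self.counters = [2] * entries          # start weakly taken
        self.entries = entries

    def predict(self, pc):
        return self.counters[pc % self.entries] >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# A loop branch that is taken 9 times and then falls through:
bp, correct = BimodalPredictor(), 0
outcomes = [True] * 9 + [False]
for outcome in outcomes:
    correct += (bp.predict(0x400) == outcome)
    bp.update(0x400, outcome)
print(f"{correct}/{len(outcomes)} correct")  # the predictor learns the frequent case
```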
15. Innovations: Out-of-order Issue
Out-of-order issue: if later instructions do not depend on earlier ones, execute them first (see the sketch below).
[Block diagram: the same microprocessor blocks as in slide 13]
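A toy model of the idea, assuming a simplified two-entry instruction window where each instruction lists the registers it reads and writes; real issue logic is far more involved (renaming, memory dependences, multiple issue per cycle):

```python
# Toy out-of-order issue: each cycle, issue the oldest instruction whose
# source registers are ready. A long-latency load keeps r1 unavailable for
# a few cycles, so the younger, independent multiply issues first.

ready_regs = {"r2", "r5", "r6"}              # values already available
window = [
    ("add r3, r1, r1", {"r1"}, "r3"),        # waits for the load's result (r1)
    ("mul r4, r5, r6", {"r5", "r6"}, "r4"),  # independent: can issue early
]

issue_order = []
for cycle in range(1, 6):
    if cycle == 4:
        ready_regs.add("r1")                 # the load finally returns its data
    for instr in list(window):
        name, sources, dest = instr
        if sources <= ready_regs:            # all source operands are ready
            issue_order.append((cycle, name))
            ready_regs.add(dest)
            window.remove(instr)
            break                            # one issue per cycle in this toy

print(issue_order)  # the mul issues before the older, dependent add
```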
16. Innovations: Superscalar Architectures
Multiple ALUs increase execution bandwidth.
[Block diagram: the same microprocessor blocks as in slide 13]
17. Innovations: Data Caches
2K papers on caches: efficient data layout, stride prefetching, etc. (a prefetching sketch follows below).
[Block diagram: the same microprocessor blocks as in slide 13]
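As an example of the stride prefetching idea mentioned above (a generic sketch, not any particular processor's mechanism): if a load repeatedly misses at addresses separated by a constant stride, fetch the next address in that pattern before it is demanded.

```python
# Minimal stride prefetcher sketch (illustrative only).
# Track the last address and stride seen by each load (keyed by PC);
# when the same stride repeats, prefetch the next address in the pattern.

prefetch_table = {}  # pc -> (last_addr, last_stride)

def observe_access(pc, addr):
    """Returns an address worth prefetching, or None."""
    last_addr, last_stride = prefetch_table.get(pc, (None, None))
    stride = addr - last_addr if last_addr is not None else None
    prefetch_table[pc] = (addr, stride)
    if stride is not None and stride == last_stride:
        return addr + stride          # confident: same stride twice in a row
    return None

# A load walking an array with a 64-byte stride:
for addr in range(0x1000, 0x1200, 64):
    hint = observe_access(pc=0x400, addr=addr)
    if hint is not None:
        print(f"prefetch 0x{hint:x}")
```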
18. Summary
- Historically, computer engineers have focused on performance
- Performance is a function of clock speed and parallelism
- As technology improves, clock speeds will improve, although at a slower rate
- Parallelism has been gradually improving, and plenty of low-hanging fruit has been picked
19. Outline
- Recent Microprocessor History
- Current Trends and Challenges
- Solutions for Handling these Challenges
20. Trend I: An Opportunity
- Transistors on a chip have been doubling every two years (Moore's Law)
- In the past, transistors have been used for out-of-order logic, large caches, etc.
- In the future, transistors can be employed for multiple processors on a single chip
21. Chip Multiprocessors (CMP)
- The IBM Power4 has two processors on a die
- Sun has announced the 8-processor Niagara
[Diagram: a CMP with processors P1-P4 sharing an L2 cache]
22. The Challenge
- Nearly every chip will have multiple processors, but where are the threads?
- Some applications will truly benefit: they can be easily decomposed into threads
- Some applications are inherently sequential; can we execute speculative threads to speed up these programs? (open problem!)
23. Trend II: Power Consumption
- Power ∝ a · f · C · V², where a is the activity factor, f is frequency, C is capacitance, and V is voltage
- Every new chip has higher frequency, more transistors (higher C), and slightly lower voltage; the net result is an increase in power consumption (see the sketch below)
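A small numerical sketch of that net effect; the per-generation scaling factors are assumptions for illustration, not data from the slides:

```python
# Relative dynamic power across a hypothetical process generation,
# using Power ∝ a * f * C * V^2. All scaling factors are illustrative.

def relative_power(a=1.0, f=1.0, c=1.0, v=1.0):
    return a * f * c * v ** 2

old = relative_power()
new = relative_power(f=1.4,   # higher clock frequency
                     c=1.3,   # more transistors switching
                     v=0.9)   # slightly lower supply voltage
print(f"power grows by {new / old:.2f}x")  # ~1.47x despite the lower voltage
```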
24. Scary Slide!
- Power density cannot be allowed to increase at current rates (Source: Borkar et al., Intel)
25. Impact of Power Increases
- Well, UtahPower sends you fatter bills every month
- To maintain a constant chip temperature, the heat produced on a chip has to be dissipated away; every additional watt increases the cooling cost of a chip by approximately $4!!
- If the temperature of a chip rises, the power dissipated also increases (almost exponentially): a vicious cycle!
26. Trend III: Wire Delays
- As technology improves, logic gates shrink: their speed increases and clock speeds improve
- As logic gates shrink, wires shrink too; unfortunately, their speed improves only marginally
- In relative terms, future chips will have fast transistors/gates and slow wires
- Computation is cheap, communication is expensive!
27. Impact of Wire Delays
- Crossing the chip used to take one cycle
- In the future, crossing the chip can take up to 30 cycles
- Many structures on a chip are wire-constrained (register file, cache); their access times slow down, so throughput decreases as instructions sit around waiting for values
- Long wires also consume power
28. Trend IV: Soft Errors
- High-energy particles constantly collide with objects and deposit charge
- Transistors are becoming smaller and on-chip voltages are being lowered, so it doesn't take much to toggle the state of a transistor
- The frequency of this occurrence is projected to increase by nine orders of magnitude over a 20-year period (a quick sense of that rate is sketched below)
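To put that projection in perspective, the implied annual growth factor (the nine-orders/20-years figure is from the slide; the per-year framing is just arithmetic):

```python
# Nine orders of magnitude over 20 years, expressed per year.
annual_factor = 10 ** (9 / 20)
print(f"soft-error rate grows ~{annual_factor:.1f}x per year")  # ~2.8x
```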
29. Impact of Soft Errors
- When a particle strike occurs, the component is not rendered permanently faulty; only the value it contains is erroneous
- Hence, this is termed a transient fault or soft error
- The error propagates when other instructions read this faulty value
- This is already a problem for mission-critical apps (space, defense, highly-available servers) and may soon be a problem in other domains
30. Summary of Trends
- More transistors, more processors on a single chip
- High power consumption
- Long wire delays
- Frequent soft errors
- We are attempting to exploit transistors to increase parallelism; in light of the above challenges, we'd be happy to even preserve parallelism
31. Transistors and Wire Delays
- Bring in a large window of instructions so you can find high parallelism
- Distribute instructions across processors so that communication is minimized
[Figure: a window of instructions mapped across multiple processors]
32. Difficult Branches
- Mispredicted branches result in poor parallelism and wasted work (power)
- Solution: when you arrive at a fork, take both directions; execute on low-frequency units to control power dissipation levels
[Figure: instructions from both branch paths spread across processors]
33. Thermal Emergencies
- Heterogeneous units allow you to reduce cooling costs
- If a chip's peak power is 110W, allow enough cooling to handle a 100W average and save $40/chip!
- If the application starts consuming more than 100W and temperature starts to rise, start favoring the low-power processor cores; intelligent management allows you to make forward progress even in a thermal emergency (a control-loop sketch follows below)
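A very simplified sketch of such a management policy; the temperature thresholds, the two core types, and the periodic control loop are all assumptions made for illustration:

```python
# Toy thermal-emergency policy: run on the fast core until the chip gets hot,
# then shift work to a low-power core so forward progress continues.
# Thresholds and the heating/cooling model are illustrative assumptions.

FAST_CORE, SLOW_CORE = "fast", "low-power"
EMERGENCY_C, SAFE_C = 85.0, 75.0

def choose_core(current_core, temp_c):
    if temp_c >= EMERGENCY_C:
        return SLOW_CORE     # thermal emergency: favor the cooler core
    if temp_c <= SAFE_C:
        return FAST_CORE     # comfortably cool: go back to full speed
    return current_core      # hysteresis: avoid ping-ponging between cores

core, temp = FAST_CORE, 70.0
for step in range(8):
    temp += 4.0 if core == FAST_CORE else -3.0   # crude heating/cooling model
    core = choose_core(core, temp)
    print(f"step {step}: {temp:.0f} C, running on the {core} core")
```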
34. Handling Long Wire Delays
- Wires can be designed to have different properties
- Knob 1: wire width and spacing; fat wires are faster, but have low bandwidth
35. Handling Wire Capacitance
- Knob 2: wires have repeaters/buffers; many, large buffers give low delay but high power consumption
36. Mapping Data to Wires
- We can optimize wires for delay, bandwidth, or power
- Different data transfers on a chip have different latency and bandwidth needs; an intelligent mapping of data to wires can improve performance and lower power consumption
37. Handling Soft Errors
- Errors can be detected and corrected by providing redundancy: execute two copies of a program (perhaps on a CMP) and compare results
- Note that this doubles power consumption!
[Figure: a leading thread and a trailing thread running on separate cores]
38. Handling Soft Errors
- The trailing thread is capable of higher performance than the leading thread, but there's no point catching up; hence, artificially slow the trailing thread by lowering its frequency, which lowers power dissipation (a sketch of the redundant-execution idea follows below)
- The trailing thread never fetches data from memory and never guesses at branches
[Figure: leading thread with peak throughput of 1 BIPS, trailing thread with peak throughput of 2 BIPS]
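A minimal sketch of redundant execution with result comparison; the comparison granularity, the retry-on-mismatch recovery, and the fault-injection step are assumptions for illustration, not the mechanism of any specific processor:

```python
# Run the same computation twice and compare results before committing.
# A disagreement signals a transient (soft) error; re-execute to recover.
# Purely illustrative: real designs compare at instruction or store granularity.

import random

def compute(x, inject_fault=False):
    result = x * x + 1
    if inject_fault:
        result ^= 1 << random.randrange(16)  # flip one bit, like a particle strike
    return result

def redundant_execute(x):
    while True:
        leading = compute(x, inject_fault=(random.random() < 0.3))
        trailing = compute(x)                # redundant copy of the work
        if leading == trailing:
            return leading                   # results agree: safe to commit
        print("mismatch detected, re-executing")

print(redundant_execute(12))  # always prints 145, even if a fault was injected
```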
39. Summary of Solutions
- Heterogeneous wires and processors
- Instructions and data have different needs; map them to appropriate wires and processors
- Note how these solutions target multiple issues simultaneously: slow wires, many transistors, soft errors, power/thermal emergencies
40. Conclusions
- Performance has improved because of clock speed and parallelism advances
- Clock speed improvements will continue, but at a slower rate
- Parallelism is on a downward trend because of technology trends and because the low-hanging fruit has been picked
- We must find creative ways to preserve or even improve parallelism in the future