Title: Advanced Topic: High Performance Processors
1. Advanced Topic: High Performance Processors
2. High Performance Processor Design Techniques
- Main idea: exploit as much parallelism and hide as much overhead as possible
- Instruction-level parallelism
  - Scoreboarding
  - Reservation stations (Tomasulo algorithm)
  - Dynamic branch prediction
  - Speculation architecture
  - Multiple instruction issue (superscalar processors)
- Vector processors
- Digital signal processors
3. Hardware Approach to Instruction Parallelism
- Why in hardware, at run time?
  - Works when true dependences can't be known at compile time
  - Simpler compiler
  - Code compiled for one machine runs well on another
- Key idea: allow instructions behind a stall to proceed
  - DIVD F0,F2,F4
  - ADDD F10,F0,F8
  - SUBD F12,F8,F14
- Enables out-of-order execution => out-of-order completion
- The ID stage checked for both structural and data hazards
4. Three Generic Data Hazards
- Many high-performance processors have multiple execution units and execute multiple instructions at the same time. These processors must handle three types of data hazards:
- Read After Write (RAW): InstrI followed by InstrJ; InstrJ tries to read an operand before InstrI writes it
- Write After Read (WAR): InstrI followed by InstrJ; InstrJ tries to write an operand before InstrI reads it
  - Gets the wrong operand
  - Can't happen in the simple 5-stage pipeline, because:
    - All instructions take 5 stages, and
    - Reads are always in stage 2, and
    - Writes are always in stage 5
- Write After Write (WAW): InstrI followed by InstrJ; InstrJ tries to write an operand before InstrI writes it
  - Leaves the wrong result (InstrI's, not InstrJ's)
  - Can't happen in the DLX 5-stage pipeline, because:
    - All instructions take 5 stages, and
    - Writes are always in stage 5
5. Scoreboarding
- The scoreboard dates to the CDC 6600 in 1963
- Out-of-order execution divides the ID stage:
  - 1. Issue: decode instructions, check for structural hazards
  - 2. Read operands: wait until no data hazards, then read operands
- A scoreboard allows an instruction to execute whenever 1 and 2 hold, without waiting for prior instructions
- CDC 6600: in-order issue, out-of-order execution, out-of-order commit (also called completion)
6. Scoreboard Implications
- Out-of-order completion => WAR and WAW hazards?
- Solutions for WAR:
  - Queue both the operation and copies of its operands
  - Read registers only during the Read Operands stage
- For WAW, must detect the hazard and stall until the other instruction completes
- Need multiple instructions in the execution phase => multiple execution units or pipelined execution units
- The scoreboard keeps track of dependencies and the state of operations
- The scoreboard replaces ID, EX, WB with 4 stages
7. Four Stages of Scoreboard Control
- 1. Issue: decode instructions, check for structural hazards (ID1)
  - If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structures. If a structural or WAW hazard exists, the instruction issue stalls, and no further instructions issue until these hazards are cleared.
- 2. Read operands: wait until no data hazards, then read operands (ID2)
  - A source operand is available if no earlier issued active instruction is going to write it. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.
8. Four Stages of Scoreboard Control (cont.)
- 3. Execution: operate on operands (EX)
  - The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
- 4. Write result: finish execution (WB)
  - Once the scoreboard is aware that the functional unit has completed execution, it checks for WAR hazards. If there are none, it writes the result; if there is a WAR hazard, it stalls the instruction.
- Example:
  - DIVD F0,F2,F4
  - ADDD F10,F0,F8
  - SUBD F8,F8,F14
- The CDC 6600 scoreboard would stall SUBD until ADDD reads its operands
9. Three Parts of the Scoreboard
- 1. Instruction status: which of the 4 steps the instruction is in
- 2. Functional unit status: indicates the state of the functional unit (FU); 9 fields for each functional unit:
  - Busy: indicates whether the unit is busy or not
  - Op: operation to perform in the unit (e.g., + or −)
  - Fi: destination register
  - Fj, Fk: source-register numbers
  - Qj, Qk: functional units producing source registers Fj, Fk
  - Rj, Rk: flags indicating when Fj, Fk are ready
- 3. Register result status: indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register
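The bookkeeping above can be sketched in a few lines of Python. This is an illustrative model only (the names and structures are ours, not the CDC 6600's actual implementation): the 9-field functional-unit status record, the register-result status table, and the Issue-stage check for structural and WAW hazards.

```python
from dataclasses import dataclass

@dataclass
class FUStatus:
    busy: bool = False  # Busy: is the unit in use?
    op: str = ""        # Op: operation to perform (e.g., "ADDD")
    Fi: str = ""        # destination register
    Fj: str = ""        # source register 1
    Fk: str = ""        # source register 2
    Qj: str = ""        # FU producing Fj ("" if none pending)
    Qk: str = ""        # FU producing Fk
    Rj: bool = True     # Fj ready?
    Rk: bool = True     # Fk ready?

def can_issue(fu, dest, reg_result):
    """Stage 1 (Issue): stall on a structural hazard (unit busy) or a
    WAW hazard (some unit is already set to write the same destination)."""
    return not fu.busy and dest not in reg_result

# Register-result status after DIVD F0,F2,F4 issues to the divider:
reg_result = {"F0": "DIV"}
adder = FUStatus()
print(can_issue(adder, "F10", reg_result))  # True: ADDD F10,F0,F8 may issue
print(can_issue(adder, "F0", reg_result))   # False: a write to F0 would be WAW
```

The RAW check happens later, in Read Operands, by waiting until Rj and Rk are set; issue itself only guards structural and WAW hazards.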
10. Detailed Scoreboard Pipeline Control
11. Scoreboard Example
12. Scoreboard Example: Cycle 1
13. Scoreboard Example: Cycle 2
14. Scoreboard Example: Cycle 3
15. Scoreboard Example: Cycle 4
16. Scoreboard Example: Cycle 5
17. Scoreboard Example: Cycle 6
18. Scoreboard Example: Cycle 7
19. Scoreboard Example: Cycle 8a
20. Scoreboard Example: Cycle 8b
21. Scoreboard Example: Cycle 9
- Read operands for MULT and SUBD? Issue ADDD?
22. Scoreboard Example: Cycle 11
23. Scoreboard Example: Cycle 13
24. Scoreboard Example: Cycle 14
25. Scoreboard Example: Cycle 15
26. Scoreboard Example: Cycle 16
27. Scoreboard Example: Cycle 17
28. Scoreboard Example: Cycle 18
29. Scoreboard Example: Cycle 20
30. Scoreboard Example: Cycle 21
31. Scoreboard Example: Cycle 22
32. Scoreboard Example: Cycle 61
33. Scoreboard Example: Cycle 62
34. CDC 6600 Scoreboard
- Speedup of 1.7 on compiled code and 2.5 by hand, BUT slow memory (no cache) limits the benefit
- Limitations of the 6600 scoreboard:
  - No forwarding hardware
  - Limited to instructions in a basic block (small window)
  - Small number of functional units (structural hazards), especially integer/load-store units
  - Does not issue on structural hazards
  - Waits out WAR hazards
  - Prevents WAW hazards
35. Another Dynamic Approach: the Tomasulo Algorithm
- For the IBM 360/91, about 3 years after the CDC 6600 (1966)
- Goal: high performance without special compilers
- Differences between the IBM 360 and CDC 6600 ISAs:
  - IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600
  - IBM has 4 FP registers vs. 8 in the CDC 6600
- Why study it? It led to the Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, ...
36. Tomasulo Algorithm vs. Scoreboard
- Control and buffers distributed with the functional units (FUs) vs. centralized in the scoreboard
  - FU buffers, called reservation stations, hold the pending operands
- Registers in instructions are replaced by values or by pointers to reservation stations (RS); this is called register renaming
  - Avoids WAR and WAW hazards
  - More reservation stations than registers, so it can do optimizations compilers can't
- Results go to FUs from RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs
- Loads and stores are treated as FUs with RSs as well
- Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue
37. Tomasulo Organization
- (figure: FP registers, FP op queue, load buffer, store buffer, Common Data Bus, FP add reservation stations, FP multiply reservation stations)
38. Reservation Station Components
- Op: operation to perform in the unit (e.g., + or −)
- Vj, Vk: values of the source operands
  - Store buffers have a V field: the result to be stored
- Qj, Qk: reservation stations producing the source operands (value to be written)
  - Note: no ready flags as in the scoreboard; Qj, Qk = 0 => ready
  - Store buffers have only Qi, for the RS producing the result
- Busy: indicates the reservation station or FU is busy
- Register result status: indicates which functional unit will write each register, if one exists; blank when no pending instruction will write that register
39. Three Stages of the Tomasulo Algorithm
- 1. Issue: get an instruction from the FP op queue
  - If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renaming the registers).
- 2. Execution: operate on operands (EX)
  - When both operands are ready, execute; if not ready, watch the Common Data Bus for the result
- 3. Write result: finish execution (WB)
  - Write the result on the Common Data Bus to all awaiting units; mark the reservation station available
- A normal data bus: data + destination (a "go to" bus)
- The Common Data Bus: data + source (a "come from" bus)
  - 64 bits of data + 4 bits of functional-unit source address
  - A unit writes its operand if the tag matches the functional unit it expects the result from
  - Does the broadcast
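Stage 1 can be sketched as a few lines of Python (the Vj/Vk/Qj/Qk field names follow the slide; the dictionaries standing in for the register file and register-status table are our own simplification, not the IBM 360/91 hardware): each operand is captured either as a value or as a tag naming the reservation station that will produce it, and the destination register is renamed to the new station.

```python
def issue(op, dest, src1, src2, regs, reg_status, rs_name):
    """Tomasulo issue: build a reservation-station entry and rename dest."""
    rs = {"op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
    for v, q, src in (("Vj", "Qj", src1), ("Vk", "Qk", src2)):
        if src in reg_status:
            rs[q] = reg_status[src]   # still being produced: record the RS tag
        else:
            rs[v] = regs[src]         # available: copy the value now (kills WAR)
    reg_status[dest] = rs_name        # rename the destination (kills WAW)
    return rs

regs = {"F2": 2.0, "F4": 4.0, "F6": 6.0}
status = {}                           # no pending register writes yet
r1 = issue("DIV", "F0", "F2", "F4", regs, status, "Div1")
r2 = issue("ADD", "F10", "F0", "F6", regs, status, "Add1")
print(r2["Qj"])   # 'Div1': the add watches the CDB for Div1's broadcast
```

Because operands are copied at issue and results arrive by tag over the CDB, later writes to F0 can never disturb this add: that is the renaming the slide describes.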
40. Tomasulo Example: Cycle 0
41. Tomasulo Example: Cycle 1
42. Tomasulo Example: Cycle 2
- Note: unlike the 6600, multiple loads can be outstanding
43. Tomasulo Example: Cycle 3
- Note: register names are removed (renamed) in the reservation stations; MULT has issued (vs. the scoreboard)
- Load1 is completing; what is waiting for Load1?
44. Tomasulo Example: Cycle 4
- Load2 is completing; what is waiting for it?
45. Tomasulo Example: Cycle 5
46. Tomasulo Example: Cycle 6
- Issue ADDD here (vs. the scoreboard)?
47. Tomasulo Example: Cycle 7
- Add1 is completing; what is waiting for it?
48. Tomasulo Example: Cycle 8
49. Tomasulo Example: Cycle 9
50. Tomasulo Example: Cycle 10
- Add2 is completing; what is waiting for it?
51. Tomasulo Example: Cycle 11
- Write the result of ADDD here (vs. the scoreboard)?
52. Tomasulo Example: Cycle 12
- Note: all the quick instructions have completed already
53. Tomasulo Example: Cycle 13
54. Tomasulo Example: Cycle 14
55. Tomasulo Example: Cycle 15
- Mult1 is completing; what is waiting for it?
56. Tomasulo Example: Cycle 16
- Note: just waiting for the divide
57. Tomasulo Example: Cycle 55
58. Tomasulo Example: Cycle 56
- Mult2 is completing; what is waiting for it?
59. Tomasulo Example: Cycle 57
- Again: in-order issue, out-of-order execution and completion
60. Compare to the Scoreboard at Cycle 62
- Why does it take longer on the scoreboard/6600?
61. Tomasulo vs. Scoreboard (IBM 360/91 vs. CDC 6600)
- Pipelined functional units vs. multiple functional units
  - (6 load, 3 store, 3 +, 2 ×/÷) vs. (1 load/store, 1 +, 2 ×, 1 ÷)
- Window size: 14 instructions vs. 5 instructions
- No issue on structural hazard: same
- WAR: renaming avoids it vs. stall completion
- WAW: renaming avoids it vs. stall completion
- Results broadcast from the FU vs. write/read registers
- Control: reservation stations vs. central scoreboard
62. Tomasulo Drawbacks
- Complexity
  - Delays of the 360/91, MIPS 10000, IBM 620?
- Many associative stores (CDB) at high speed
- Performance limited by the Common Data Bus
  - Multiple CDBs => more FU logic for parallel associative stores
63. Dynamic Branch Prediction
- Performance = f(accuracy, cost of misprediction)
- A Branch History Table uses the lower bits of the PC address to index a table of 1-bit values
  - Says whether or not the branch was taken last time
  - No address check
- Problem: in a loop, a 1-bit BHT causes two mispredictions (the average loop runs 9 iterations before exit):
  - At the end of the loop, when it exits instead of looping as before
  - The first time through the loop on the next pass, when it predicts exit instead of looping
64. Dynamic Branch Prediction
- Solution: a 2-bit scheme that changes the prediction only after two successive mispredictions (Figure 4.13, p. 264)
  - Red: stop, not taken
  - Green: go, taken
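A minimal model of one 2-bit saturating counter (a sketch, not any particular machine's BHT entry) shows why a loop branch now costs only one misprediction per loop execution instead of two:

```python
# Each BHT entry: states 0-1 predict not taken, 2-3 predict taken;
# the prediction flips only after two consecutive misses.
class TwoBitCounter:
    def __init__(self, state=3):
        self.state = state               # 0..3 saturating counter

    def predict(self):
        return self.state >= 2           # True = predict taken

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A loop branch: taken 9 times, then not taken at loop exit.
c = TwoBitCounter(state=3)               # warmed up: strongly taken
misses = 0
for taken in [True] * 9 + [False]:
    if c.predict() != taken:
        misses += 1
    c.update(taken)
print(misses)   # 1: only the exit mispredicts
```

After the exit, the counter drops from 3 to 2 and still predicts taken, so the next entry into the loop predicts correctly; a 1-bit scheme would miss there a second time.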
65. Branch History Table Accuracy
- Mispredict because either:
  - Wrong guess for that branch, or
  - Got the branch history of the wrong branch when indexing the table
- With a 4096-entry table, programs vary from 1% mispredictions (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
- 4096 entries is about as good as an infinite table (in the Alpha 21164)
66. Correlating Branches
- Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch
- Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table
- In general, an (m,n) predictor records the last m branches to select among 2^m history tables, each with n-bit counters
  - The old 2-bit BHT is then a (0,2) predictor
67. Correlating Branches
- (2,2) predictor
- The behavior of the two most recent branches selects among four predictions for the next branch, updating just that prediction
68. Selective History Predictor
- (figure) An 8K x 2-bit selector table, indexed by the branch address, chooses between two predictors ("choose non-correlator" vs. "choose correlator"):
  - Non-correlator: 8096 x 2-bit counters indexed by the branch address (values 11/10 predict taken, 01/00 not taken)
  - Correlator: 2048 x 4 x 2-bit counters, selected by 2 bits of global history (11 taken ... 00 not taken)
69. Accuracy of Different Schemes (Figure 4.21, p. 272)
- (figure) Frequency of mispredictions, from 0% up to 18%, comparing a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT
70. Need the Address at the Same Time as the Prediction
- Branch Target Buffer (BTB): the address of the branch indexes a table giving the prediction AND the branch target address (if taken)
  - Note: must check for a branch match now, since we can't use the wrong branch's address (Figure 4.22, p. 273)
- Return-instruction addresses are predicted with a stack
- (figure: branch prediction — taken or not taken — plus the predicted PC)
71. Dynamic Branch Prediction Summary
- Branch History Table: 2 bits for loop accuracy
- Correlation: recently executed branches are correlated with the next branch
- Branch Target Buffer: includes the branch-address prediction
- Predicated execution can reduce the number of branches and of mispredicted branches
72. Speculation
- Speculation: allow an instruction to execute without any consequences (including exceptions) if the branch is not actually taken (HW undo); called boosting
- Combine branch prediction with dynamic scheduling to execute before branches are resolved
- Separate speculative bypassing of results from real bypassing of results
- When an instruction is no longer speculative, write the boosted results (instruction commit) or discard them
- Execute out of order but commit in order, to prevent any irrevocable action (state update or exception) until the instruction commits
73. Hardware Support for Speculation
- Need an HW buffer for the results of uncommitted instructions: the reorder buffer
  - 3 fields: instruction, destination, value
  - The reorder buffer can be an operand source => more registers, like RSs
  - Use the reorder-buffer number instead of the reservation station when execution completes
  - Supplies operands between execution completion and commit
  - Once an instruction commits, its result is put into the register
  - Instructions commit in order
- As a result, it's easy to undo speculated instructions on mispredicted branches or on exceptions
- (figure: reorder buffer, FP op queue, FP registers, reservation stations, two FP adders)
74. Four Steps of the Speculative Tomasulo Algorithm
- 1. Issue: get an instruction from the FP op queue
  - If a reservation station and a reorder-buffer slot are free, issue the instruction and send the operands and the reorder-buffer number for the destination (this stage is sometimes called dispatch)
- 2. Execution: operate on operands (EX)
  - When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute; checks RAW (sometimes called issue)
- 3. Write result: finish execution (WB)
  - Write the result on the Common Data Bus to all awaiting FUs and to the reorder buffer; mark the reservation station available.
- 4. Commit: update the register with the reorder-buffer result
  - When the instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer (this stage is sometimes called graduation)
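The commit step can be sketched as follows (an illustrative model with our own field names, not the actual hardware): entries retire strictly from the head of the reorder buffer, and a mispredicted branch at the head squashes every younger, still-speculative entry.

```python
from collections import deque

def commit(rob, regs):
    """Step 4 (Commit): retire ready entries in order from the ROB head;
    a mispredicted branch flushes everything behind it."""
    while rob and rob[0]["ready"]:
        e = rob.popleft()
        if e.get("mispredict"):
            rob.clear()               # squash all younger (speculative) work
            break
        regs[e["dest"]] = e["value"]  # architectural state updated only here

regs = {}
rob = deque([
    {"dest": "F0", "value": 1.0, "ready": True},
    {"dest": None, "ready": True, "mispredict": True},  # branch, guessed wrong
    {"dest": "F4", "value": 9.9, "ready": True},        # speculative: discarded
])
commit(rob, regs)
print(regs)   # {'F0': 1.0} — F4 is never written
```

Because registers (and memory) are touched only at the ROB head, no irrevocable action happens before commit, which is exactly what makes the undo on a misprediction or exception cheap.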
75. Renaming Registers
- A common variation of the speculative design
- The reorder buffer keeps instruction information but not the result
- Extend the register file with extra renaming registers to hold speculative results
- A rename register is allocated at issue; the result goes into the rename register on execution completion; the rename register is copied into the real register at commit
- Operands are read either from the register file (real or speculative) or via the Common Data Bus
- Advantage: operands always come from a single source (the extended register file)
76. Issuing Multiple Instructions per Cycle
- Two variations:
- Superscalar: a varying number of instructions per cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo)
  - IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000
- (Very) Long Instruction Word ((V)LIW): a fixed number of instructions (4-16) scheduled by the compiler; operations put into wide templates
  - Joint HP/Intel agreement in 1999/2000?
  - Intel Architecture-64 (IA-64): 64-bit addresses
  - Style: Explicitly Parallel Instruction Computer (EPIC)
- Anticipated success led to the use of Instructions Per Clock cycle (IPC) instead of CPI
77. Issuing Multiple Instructions per Cycle
- Superscalar DLX: 2 instructions, 1 FP + 1 anything else
  - Fetch 64 bits per clock cycle; integer on the left, FP on the right
  - Can only issue the 2nd instruction if the 1st instruction issues
  - More ports on the FP registers, to do an FP load and an FP op in a pair
- Type / pipe stages:
  - Int. instruction  IF ID EX MEM WB
  - FP instruction    IF ID EX MEM WB
  - Int. instruction     IF ID EX MEM WB
  - FP instruction       IF ID EX MEM WB
  - Int. instruction        IF ID EX MEM WB
  - FP instruction          IF ID EX MEM WB
- The 1-cycle load delay expands to 3 instructions in the superscalar:
  - The instruction in the right half can't use the loaded value, nor can the instructions in the next slot
78. Loop Unrolling in Superscalar
- Integer instruction      FP instruction      Clock cycle
- Loop: LD F0,0(R1)                            1
-       LD F6,-8(R1)                           2
-       LD F10,-16(R1)     ADDD F4,F0,F2       3
-       LD F14,-24(R1)     ADDD F8,F6,F2       4
-       LD F18,-32(R1)     ADDD F12,F10,F2     5
-       SD 0(R1),F4        ADDD F16,F14,F2     6
-       SD -8(R1),F8       ADDD F20,F18,F2     7
-       SD -16(R1),F12                         8
-       SD -24(R1),F16                         9
-       SUBI R1,R1,40                          10
-       BNEZ R1,LOOP                           11
-       SD -32(R1),F20                         12
- Unrolled 5 times to avoid delays (1 more due to SS)
- 12 clocks, or 2.4 clocks per iteration (1.5X)
79. Multiple Issue Challenges
- While the integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
  - Exactly 50% FP operations
  - No hazards
- If more instructions issue at the same time, decode and issue get harder
  - Even a 2-scalar machine must examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue
- VLIW: trade instruction space for simple decoding
  - The long instruction word has room for many operations
  - By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
  - E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
    - 16 to 24 bits per field => 7 x 16 = 112 bits to 7 x 24 = 168 bits wide
  - Needs a compiling technique that schedules across several branches
80. Loop Unrolling in VLIW
- Memory ref 1      Memory ref 2      FP op 1          FP op 2          Int. op/branch   Clock
- LD F0,0(R1)       LD F6,-8(R1)                                                         1
- LD F10,-16(R1)    LD F14,-24(R1)                                                       2
- LD F18,-32(R1)    LD F22,-40(R1)    ADDD F4,F0,F2    ADDD F8,F6,F2                     3
- LD F26,-48(R1)                      ADDD F12,F10,F2  ADDD F16,F14,F2                   4
-                                     ADDD F20,F18,F2  ADDD F24,F22,F2                   5
- SD 0(R1),F4       SD -8(R1),F8      ADDD F28,F26,F2                                    6
- SD -16(R1),F12    SD -24(R1),F16                                                       7
- SD -32(R1),F20    SD -40(R1),F24                                      SUBI R1,R1,48    8
- SD -0(R1),F28                                                         BNEZ R1,LOOP     9
- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
- Average: 2.5 ops per clock, 50% efficiency
- Note: need more registers in VLIW (15 vs. 6 in SS)
81. Trace Scheduling
- Parallelism across IF branches vs. LOOP branches
- Two steps:
  - Trace selection
    - Find a likely sequence of basic blocks (a trace) forming a long, statically or profile-predicted sequence of straight-line code
  - Trace compaction
    - Squeeze the trace into few VLIW instructions
    - Need bookkeeping code in case the prediction is wrong
- The compiler undoes a bad guess (discards values in registers)
- Subtle compiler bugs mean a wrong answer rather than merely poorer performance; there are no hardware interlocks
82. Advantages of HW (Tomasulo) vs. SW (VLIW) Speculation
- HW determines address conflicts
- HW has better branch prediction
- HW maintains a precise exception model
- HW does not execute bookkeeping instructions
- Works across multiple implementations
- SW speculation makes the HW much easier to design
83. Superscalar vs. VLIW
- Superscalar:
  - Smaller code size
  - Binary compatibility across generations of hardware
- VLIW:
  - Simplified hardware for decoding and issuing instructions
  - No interlock hardware (the compiler checks?)
  - More registers, but simplified hardware for register ports (multiple independent register files?)
84. Dynamic Scheduling in Superscalar
- Dependencies stop instruction issue
- Code compiled for an old version will run poorly on the newest version
  - May want code to vary depending on how superscalar the machine is
- How to issue two instructions and keep in-order instruction issue for Tomasulo?
  - Assume 1 integer + 1 floating point
  - 1 Tomasulo control for integer, 1 for floating point
- Issue at 2X the clock rate, so that issue remains in order
- Only FP loads might cause a dependency between integer and FP issue:
  - Replace the load reservation station with a load queue; operands must be read in the order they are fetched
  - A load checks addresses in the store queue to avoid a RAW violation
  - A store checks addresses in the load queue to avoid WAR and WAW
  - Called a decoupled architecture
85. Performance of Dynamic Superscalar Scheduling
- Iteration no.  Instruction     Issues  Executes  Writes result  (clock-cycle numbers)
- 1   LD F0,0(R1)      1   2   4
- 1   ADDD F4,F0,F2    1   5   8
- 1   SD 0(R1),F4      2   9
- 1   SUBI R1,R1,8     3   4   5
- 1   BNEZ R1,LOOP     4   5
- 2   LD F0,0(R1)      5   6   8
- 2   ADDD F4,F0,F2    5   9   12
- 2   SD 0(R1),F4      6   13
- 2   SUBI R1,R1,8     7   8   9
- 2   BNEZ R1,LOOP     8   9
- 4 clocks per iteration; only 1 FP instruction per iteration
- Branches and decrements still take 1 issue clock cycle each
- How to get more performance?
86. Software Pipelining
- Observation: if iterations of a loop are independent, then we can get more ILP by taking instructions from different iterations
- Software pipelining reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in SW)
87. Software Pipelining Example
- Before: unrolled 3 times
  - 1  LD F0,0(R1)
  - 2  ADDD F4,F0,F2
  - 3  SD 0(R1),F4
  - 4  LD F6,-8(R1)
  - 5  ADDD F8,F6,F2
  - 6  SD -8(R1),F8
  - 7  LD F10,-16(R1)
  - 8  ADDD F12,F10,F2
  - 9  SD -16(R1),F12
  - 10 SUBI R1,R1,24
  - 11 BNEZ R1,LOOP
- After: software pipelined
  - 1  SD 0(R1),F4      ; stores M[i]
  - 2  ADDD F4,F0,F2    ; adds to M[i-1]
  - 3  LD F0,-16(R1)    ; loads M[i-2]
  - 4  SUBI R1,R1,8
  - 5  BNEZ R1,LOOP
- (figure: the SW pipeline overlaps ops from different iterations over time, vs. repeated fill/drain in loop unrolling)
- Symbolic loop unrolling:
  - Maximize the result-use distance
  - Less code space than unrolling
  - Fill and drain the pipe only once per loop, vs. once per unrolled iteration in loop unrolling
88. Limits to Multi-Issue Machines
- Inherent limitations of ILP:
  - 1 branch in 5 instructions: how to keep a 5-way VLIW busy?
  - Latencies of units: many operations must be scheduled
  - Need about (pipeline depth x no. of functional units) independent instructions
- Difficulties in building the HW:
  - Easy: more instruction bandwidth
  - Easy: duplicate FUs to get parallel execution
  - Hard: increase ports to the register file (bandwidth)
    - The VLIW example needs 7 read and 3 write ports for the integer registers, and 5 read and 3 write for the FP registers
  - Harder: increase ports to memory (bandwidth)
  - Superscalar decoding: impact on clock rate and pipeline depth?
89. Limits to Multi-Issue Machines
- Limitations specific to either the superscalar or VLIW implementation:
  - Decode/issue in superscalar: how wide is practical?
  - VLIW code size: unrolled loops plus wasted fields in the VLIW
    - IA-64 compresses dependent instructions, but is still larger
  - VLIW lock step => 1 hazard stalls all instructions
    - IA-64 not lock step? Dynamic pipeline?
  - VLIW binary compatibility: IA-64 promises binary compatibility
90. Limits to ILP
- Conflicting studies of the amount:
  - Benchmarks (vectorized Fortran FP vs. integer C programs)
  - Hardware sophistication
  - Compiler sophistication
- How much ILP is available using existing mechanisms with increasing HW budgets?
- Do we need to invent new HW/SW mechanisms to stay on the processor performance curve?
91. Limits to ILP
- Initial HW model here: MIPS compilers
- Assumptions for an ideal/perfect machine to start:
  - 1. Register renaming: infinite virtual registers, and all WAW and WAR hazards are avoided
  - 2. Branch prediction: perfect; no mispredictions
  - 3. Jump prediction: all jumps perfectly predicted => a machine with perfect speculation and an unbounded buffer of instructions available
  - 4. Memory-address alias analysis: addresses are known, and a store can be moved before a load provided the addresses are not equal
- 1-cycle latency for all instructions; an unlimited number of instructions issued per clock cycle
92. Intel/HP Explicitly Parallel Instruction Computer (EPIC)
- 3 instructions in 128-bit groups; a field determines whether the instructions are dependent or independent
  - Smaller code size than old VLIW, larger than x86/RISC
  - Groups can be linked to show independence of more than 3 instructions
- 64 integer registers + 64 floating-point registers
  - Not separate files per functional unit as in old VLIW
- Hardware checks dependencies (interlocks => binary compatibility over time)
- Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?
- IA-64: name of the instruction set architecture; EPIC is the style
  - Merced: name of the first implementation (1999/2000?)
- LIW = EPIC?
93. Dynamic Scheduling in the PowerPC 604 and Pentium Pro
- Both: in-order issue, out-of-order execution, in-order commit
- The Pentium Pro is more like a scoreboard, since its control is central rather than distributed
94. Dynamic Scheduling in the PowerPC 604 and Pentium Pro
- Parameter                              PPC 604      PPro
- Max. instructions issued/clock         4            3
- Max. instr. completing exec./clock     6            5
- Max. instr. committed/clock            6            3
- Window (instrs in reorder buffer)      16           40
- Number of reservation stations         12           20
- Number of rename registers             8 int/12 FP  40
- No. integer functional units (FUs)     2            2
- No. floating-point FUs                 1            1
- No. branch FUs                         1            1
- No. complex integer FUs                1            0
- No. memory FUs                         1            1 load + 1 store
- Q: How to pipeline 1- to 17-byte x86 instructions?
95. Dynamic Scheduling in the Pentium Pro
- The PPro doesn't pipeline 80x86 instructions directly
- The PPro decode unit translates the Intel instructions into 72-bit micro-operations (similar to DLX instructions)
- Sends the micro-operations to the reorder buffer and reservation stations
- Takes 1 clock cycle to determine the length of the 80x86 instructions, plus 2 more to create the micro-operations
- 12-14 clocks in the total pipeline (roughly 3 state machines)
- Many instructions translate to 1 to 4 micro-operations
- Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations
96. Problems with Instruction-Level Parallelism
- Limits to conventional exploitation of ILP:
  - 1) Pipelined clock rate: at some point, each increase in clock rate has a corresponding CPI increase (branches, other hazards)
  - 2) Instruction fetch and decode: at some point, it's hard to fetch and decode more instructions per clock cycle
  - 3) Cache hit rate: some long-running (scientific) programs have very large data sets accessed with poor locality; others have continuous data streams (multimedia) and hence poor locality
97. Alternative Model: Vector Processing
- Vector processors have high-level operations that work on linear arrays of numbers: "vectors"
- SCALAR (1 operation): add r3, r1, r2
- VECTOR (N operations): add.vv v3, v1, v2
98. Properties of Vector Processors
- Each result is independent of the previous result => long pipelines; the compiler ensures there are no dependencies => high clock rate
- Vector instructions access memory with a known pattern => highly interleaved memory => memory latency amortized over e.g. 64 elements => no (data) caches required! (Do use an instruction cache)
- Reduces branches and branch problems in pipelines
- A single vector instruction implies lots of work (a loop) => fewer instruction fetches
99. Styles of Vector Architectures
- Memory-memory vector processors: all vector operations are memory to memory
- Vector-register processors: all vector operations are between vector registers (except load and store)
  - The vector equivalent of load-store architectures
  - Includes all vector machines since the late 1980s: Cray, Convex, Fujitsu, Hitachi, NEC
  - We assume vector-register machines for the rest of these lectures
100. Components of a Vector Processor
- Vector registers: fixed-length banks, each holding a single vector
  - At least 2 read ports and 1 write port
  - Typically 8-32 vector registers, each holding 64-128 64-bit elements
- Vector functional units (FUs): fully pipelined, start a new operation every clock
  - Typically 4 to 8 FUs: FP add, FP multiply, FP reciprocal (1/X), integer add, logical, shift; may have multiples of the same unit
- Vector load-store units (LSUs): fully pipelined units to load or store a vector; may have multiple LSUs
- Scalar registers: single elements for FP scalars or addresses
- A crossbar to connect the FUs, LSUs, and registers
101. Memory Operations
- Load/store operations move groups of data between registers and memory
- Three types of addressing:
  - Unit stride
    - Fastest
  - Non-unit (constant) stride
  - Indexed (gather-scatter)
    - The vector equivalent of register indirect
    - Good for sparse arrays of data
    - Increases the number of programs that vectorize
102. Example of Vector Instructions (Y = a x X + Y)
- Assuming vectors X and Y are of length 64; scalar vs. vector:
- Vector:
  - LD F0,a         ; load scalar a
  - LV V1,Rx        ; load vector X
  - MULTS V2,F0,V1  ; vector-scalar multiply
  - LV V3,Ry        ; load vector Y
  - ADDV V4,V2,V3   ; add
  - SV Ry,V4        ; store the result
- Scalar:
  - LD F0,a
  - ADDI R4,Rx,512      ; last address to load
  - loop: LD F2,0(Rx)   ; load X(i)
  - MULTD F2,F0,F2      ; a x X(i)
  - LD F4,0(Ry)         ; load Y(i)
  - ADDD F4,F2,F4       ; a x X(i) + Y(i)
  - SD F4,0(Ry)         ; store into Y(i)
  - ADDI Rx,Rx,8        ; increment index to X
  - ADDI Ry,Ry,8        ; increment index to Y
  - SUB R20,R4,Rx       ; compute bound
  - BNZ R20,loop        ; check if done
- 578 (2 + 9 x 64) vs. 321 (1 + 5 x 64) operations (1.8X)
- 578 (2 + 9 x 64) vs. 6 instructions (96X)
- 64-operation vectors + no loop overhead; also 64X fewer pipeline hazards
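The same kernel in executable form (Python standing in for DLX here; the list comprehension plays the role of the LV/MULTS/ADDV/SV sequence, i.e. one "operation" per element but a single expressed instruction):

```python
def daxpy_scalar(a, X, Y):
    """The scalar loop: one load/multiply/add/store per element."""
    for i in range(len(X)):
        Y[i] = a * X[i] + Y[i]

def daxpy_vector(a, X, Y):
    """One whole-vector operation: no loop branches, no index updates."""
    return [a * x + y for x, y in zip(X, Y)]

X = [1.0] * 64
Y = [2.0] * 64
print(daxpy_vector(3.0, X, Y)[0])   # 5.0
```

The work per element is identical; the vector form simply eliminates the 64 iterations of branch, bound check, and index increments that make up most of the 578 scalar operations.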
103. Vector Surprise
- Use vectors for inner-loop parallelism (no surprise)
  - One dimension of an array: A[0,0], A[0,1], A[0,2], ...
  - Think of the machine as, say, 32 vector registers, each with 64 elements
  - 1 instruction updates 64 elements of 1 vector register
- And for outer-loop parallelism!
  - 1 element from each column: A[0,0], A[1,0], A[2,0], ...
  - Think of the machine as 64 virtual processors (VPs), each with 32 scalar registers! (a multithreaded processor)
  - 1 instruction updates 1 scalar register in 64 VPs
- The hardware is identical; these are just 2 compiler perspectives
104. Virtual Processor Vector Model
- Vector operations are SIMD (single instruction, multiple data) operations
- Each element is computed by a virtual processor (VP)
- The number of VPs is given by the vector length
  - The vector control register
105. Vector Architectural State
106. Vector Implementation
- Vector register file:
  - Each register is an array of elements
  - The size of each register determines the maximum vector length
  - A vector length register determines the vector length for a particular operation
- Multiple parallel execution units = "lanes" (sometimes called pipelines or pipes)
107. Vector Terminology: 4 lanes, 2 vector functional units (figure)
108. Vector Execution Time
- Time = f(vector length, data dependencies, structural hazards)
- Initiation rate: the rate at which an FU consumes vector elements (= the number of lanes; usually 1 or 2 on the Cray T-90)
- Convoy: a set of vector instructions that can begin execution in the same clock (no structural or data hazards)
- Chime: the approximate time for a vector operation
- m convoys take m chimes; if each vector length is n, then they take approximately m x n clock cycles (ignores overhead; a good approximation for long vectors)
- 4 convoys, 1 lane, VL = 64 => 4 x 64 = 256 clocks (or 4 clocks per result)
109. Example of Vector Instruction Start-up Time
- Start-up time: the pipeline latency (depth of the FU pipeline); another source of overhead
- Operation / start-up penalty (from the CRAY-1):
  - Vector load/store: 12
  - Vector multiply: 7
  - Vector add: 6
- Assume convoys don't overlap; vector length = n:
  - Convoy        Start     1st result   Last result
  - 1. LV         0         12           11+n  (12+n-1)
  - 2. MULV, LV   12+n      12+n+12      23+2n (load start-up)
  - 3. ADDV       24+2n     24+2n+6      29+3n (wait for convoy 2)
  - 4. SV         30+3n     30+3n+12     41+4n (wait for convoy 3)
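The table's arithmetic can be checked with a short script (a sketch under the slide's assumptions: convoys do not overlap, each produces one result per clock after its start-up penalty, and convoy 2 is charged the load's 12-cycle start-up rather than the multiply's 7):

```python
def convoy_times(penalties, n):
    """penalties: start-up cycles of each convoy, in order.
    Returns (start, first_result, last_result) clock numbers per convoy."""
    start, rows = 0, []
    for p in penalties:
        first = start + p            # first result once the pipeline fills
        last = start + p + n - 1     # then one result per clock
        rows.append((start, first, last))
        start = last + 1             # next convoy waits for this one
    return rows

# LV; MULV+LV (load start-up); ADDV; SV, with n = 64:
for row in convoy_times([12, 12, 6, 12], 64):
    print(row)
# The last convoy finishes at clock 41 + 4n = 297 for n = 64.
```

Compared with the 4 x 64 = 256 cycles of the pure chime model, the extra 41 cycles are exactly the accumulated start-up penalties, which is why the chime approximation is only good for long vectors.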
110. Why Start-up Time for Each Vector Instruction?
- Why not overlap the start-up time of back-to-back vector instructions?
- Cray machines were built from many ECL chips operating at high clock rates; hard to do?
- The Berkeley vector design (T0) didn't know it wasn't supposed to overlap, so it has no start-up times for its functional units (except load)
111. Vector Load/Store Units and Memories
- Start-up overheads are usually longer for LSUs
- The memory system must sustain (no. of lanes x word) per clock cycle
- Many vector processors use banks (vs. simple interleaving):
  - 1) To support multiple loads/stores per cycle => multiple banks, addressed independently
  - 2) To support non-sequential accesses (see soon)
- Note: the number of memory banks must exceed the memory latency, to avoid stalls
  - m banks => m words per memory latency of l clocks
  - If m < l, then there is a gap in the memory pipeline:
    - clock: 0 ... l  l+1  l+2 ... l+m-1  l+m ... 2l
    - word:  --  ... 0  1    2   ...  m-1    --  ... m
  - May have 1024 banks in SRAM
112. Vector Length
- What to do when the vector length is not exactly 64?
- A vector-length register (VLR) controls the length of any vector operation, including a vector load or store (it cannot be greater than the length of the vector registers)
- do 10 i = 1, n
- 10  Y(i) = a * X(i) + Y(i)
- We don't know n until runtime! What if n > the Maximum Vector Length (MVL)?
113. Strip Mining
- Suppose the vector length > the Maximum Vector Length (MVL)?
- Strip mining: generation of code such that each vector operation is done for a size <= the MVL
- The 1st loop does the short piece (n mod MVL); all the rest use VL = MVL:
-     low = 1
-     VL = (n mod MVL)          /* find the odd-size piece */
-     do 1 j = 0, (n / MVL)     /* outer loop */
-     do 10 i = low, low+VL-1   /* runs for length VL */
-       Y(i) = a*X(i) + Y(i)    /* main operation */
- 10  continue
-     low = low + VL            /* start of next vector */
-     VL = MVL                  /* reset the length to max */
- 1   continue
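The same strip-mining structure in executable form (a sketch: `strip_mine` is our name, and we index from 0 rather than Fortran's 1; each yielded pair is the VLR setting for one trip through the vector loop):

```python
def strip_mine(n, mvl):
    """Yield (low, VL) pairs covering elements 0..n-1 in vector strips:
    the odd-size piece (n mod MVL) first, then full-MVL strips."""
    low = 0
    vl = n % mvl or mvl          # odd-size piece, or a full strip if n % mvl == 0
    while low < n:
        yield low, vl
        low += vl
        vl = mvl                 # every later strip uses the maximum length

print(list(strip_mine(200, 64)))   # [(0, 8), (8, 64), (72, 64), (136, 64)]
```

Doing the short piece first means the VLR is set once to the odd size and then stays at MVL, so the steady-state loop body never re-tests the vector length.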
114. Common Vector Metrics