Title: OMSE 510: Computing Foundations 4: The CPU
1 OMSE 510 Computing Foundations 4: The CPU!
- Chris Gilmore <grimjack@cs.pdx.edu>
- Systems Software Lab
- Portland State University/OMSE
2 Today
- Caches
- DLX Assembly
- CPU Overview
3 Introduction to RISC
- Reduced Instruction Set Computer
- 1975: John Cocke, IBM 801
  - IBM started working on a RISC-type computer in 1975 without calling it by this name
  - Used as an I/O processor for IBM mainframes
- Patterson and Hennessy
  - RISC was first introduced by Patterson and Ditzel in 1980
  - Produced the first RISC chips in the early 1980s
  - RISC I and RISC II from Berkeley, and MIPS from Stanford
4 RISC Chips
- RISC II
  - Had 39 instructions, 2 addressing modes, and 3 data types: 234 combinations
  - Compared to the VAX: 304 instructions, 16 addressing modes, 14 data types: 68,096 combinations
- Found that
  - Compiled programs were ~30% larger than on CISC (VAX 11/780)
  - Ran up to 5 times faster than the 68000
  - Assembler-to-compiler ratio (execution time of the assembler program divided by the execution time of the compiled version): ratio < 50% for CISC, ~90% for RISC
5 RISC Definition
- 1. Single cycle operation
- 2. Load / store design
- 3. Hardwired control unit
- 4. Few instructions and addressing modes
- 5. Fixed instruction format
- 6. More compile time effort to avoid pipeline
penalties
6 Disadvantages of CISC
- Large, complicated, and time-consuming instruction set
- Complex CU to decode and execute
- Not necessarily faster than a sequence of several RISC instructions
- Complexity of the CISC CU
  - A large number of design errors
  - Longer design time
- Too large a choice for the compiler
  - Very difficult to design the optimal compiler
  - Does not always yield the most efficient code
- Specialized to fit certain HLL instructions
  - May be redundant for another HLL
- Relatively low cost/benefit factor
7 The Advantage of RISC
- RISC and VLSI realization
  - Relatively small and simple CU hardware: control occupies 6% of the chip in RISC I and 10% in RISC II, versus 68% in the MC68020
  - Higher chance of fitting other features on a chip
  - Can fit a large number of CPU registers
    - Enhances throughput and HLL support
  - Increases the regularization factor
8 The Advantage of RISC
- RISC and computing speed
  - Faster decoding: small instruction set, few addressing modes, fixed instruction format
  - Reduced memory accesses: a large number of CPU registers permits register-to-register operations
  - Faster parameter passing: register windows in RISC I and RISC II
  - Streamlined instruction handling
    - All instructions have the same length
    - All execute in one cycle
    - Suitable for pipelined implementation
9 The Advantage of RISC
- RISC and design costs and reliability
  - Shorter design time and reduced overall design costs
  - Reduces the probability that the end product will be obsolete
  - Reduced number of design errors
  - Virtual memory management enhancement: instructions do not cross word boundaries and cannot wind up on two separate pages
10 The Advantage of RISC
- RISC and HLL support
  - Shorter and simpler compiler: usually only a single choice rather than several choices as in CISC
  - Large number of CPU registers: more efficient code optimization
  - Fast parameter passing between procedures (register windows)
  - Reduced burden on the compiler writer
11 The Disadvantage and Criticism of RISC (80s)
- RISC code tends to be longer
  - Extra burden on the machine and assembly language programmer
  - Several instructions required per single CISC instruction
  - More memory locations for their storage
- Floating-point support and VMM support were weak points
12 RISC Characteristics
- Pipelined operation
- Compiler responsible for pipeline conflict resolution
  - Delayed branch
  - Delayed load
13 Question 1: Why do microcoding?
- If simple instructions could execute at a very high clock rate
- If you could even write compilers to produce microinstructions
- If most programs use simple instructions and addressing modes
- If microcode is kept in RAM instead of ROM so as to fix bugs
- If the same memory used for control memory could be used instead as a cache for macroinstructions
- Then why not skip instruction interpretation by a microprogram and simply compile directly into the lowest language of the machine? (Microprogramming is overkill when the ISA matches the datapath 1-1.)
14 Pipelining is Natural!
- Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes
15 Sequential Laundry
(Figure: timeline from 6 PM to midnight; each load uses the washer for 30 minutes, the dryer for 40, and the folder for 20, one load strictly after another.)
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?
16 Pipelined Laundry: Start work ASAP
(Figure: same timeline, but the loads overlap: load 2 enters the washer while load 1 is in the dryer, and so on.)
- Pipelined laundry takes 3.5 hours for 4 loads
17 Pipelining Lessons
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously using different resources
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
- Stall for dependences
18 Execution Cycle
Obtain instruction from program storage
Determine required actions and instruction size
Locate and obtain operand data
Compute result value or status
Deposit results in storage for later use
Determine successor instruction
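Those six steps are exactly the fetch-decode-execute loop. As a minimal sketch, here is a software interpreter for a made-up toy ISA in C (every opcode and encoding below is invented purely for illustration):

#include <stdint.h>
#include <stdio.h>

enum { OP_ADD = 0, OP_LOAD = 1, OP_HALT = 2 };   /* hypothetical toy ISA */

int main(void) {
    uint16_t mem[256] = {0};
    uint16_t reg[8]   = {0};
    uint16_t pc = 0;

    /* Toy program: r1 = mem[100]; r2 = r1 + r1; halt */
    mem[100] = 21;
    mem[0] = (uint16_t)((OP_LOAD << 12) | (1 << 8) | 100);          /* load r1,[100] */
    mem[1] = (uint16_t)((OP_ADD  << 12) | (2 << 8) | (1 << 4) | 1); /* add r2,r1,r1  */
    mem[2] = (uint16_t)(OP_HALT << 12);

    for (;;) {
        uint16_t ir = mem[pc];                   /* 1. obtain instruction          */
        uint16_t op = ir >> 12;                  /* 2. determine required actions  */
        uint16_t rd = (ir >> 8) & 7;
        if (op == OP_HALT) break;
        if (op == OP_LOAD) {
            uint16_t addr = ir & 0xFF;           /* 3. locate and obtain operand   */
            reg[rd] = mem[addr];                 /* 4-5. compute and deposit       */
        } else {                                 /* OP_ADD */
            uint16_t rs = (ir >> 4) & 7, rt = ir & 7;
            reg[rd] = (uint16_t)(reg[rs] + reg[rt]);
        }
        pc++;                                    /* 6. determine successor instr   */
    }
    printf("r2 = %u\n", reg[2]);                 /* prints 42 */
    return 0;
}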
19 The Five Stages of Load
(Figure: a Load occupies one pipeline stage per cycle, cycles 1 through 5.)
- Ifetch (Instruction Fetch): fetch the instruction from the instruction memory
- Reg/Dec: register fetch and instruction decode
- Exec: calculate the memory address
- Mem: read the data from the data memory
- Wr: write the data back to the register file
20 Note: These 5 stages were there all along!
Fetch
Decode
Execute
Memory
Write-back
21 Pipelining
- Improve performance by increasing throughput
- Ideal speedup is the number of stages in the pipeline. Do we achieve this?
22 Basic Idea
- What do we need to add to split the datapath into stages?
23 Graphically Representing Pipelines
- Can help with answering questions like:
  - How many cycles does it take to execute this code?
  - What is the ALU doing during cycle 4?
- Use this representation to help understand datapaths
24 Conventional Pipelined Execution Representation
(Figure: pipeline stages versus time, one row per instruction in program order.)
25 Single Cycle, Multiple Cycle, vs. Pipeline
(Figure: clock timing for the three implementations. Single-cycle: one long cycle per instruction, sized for the slowest instruction (Load), so a Store wastes part of the cycle. Multiple-cycle: Load, Store, and R-type each take several short cycles (cycles 1-10 shown). Pipeline: Load, Store, and R-type overlap, with a new instruction starting every cycle.)
26 Why Pipeline?
- Suppose we execute 100 instructions
- Single-cycle machine
  - 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
- Multicycle machine
  - 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
- Ideal pipelined machine
  - 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
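A quick check of that arithmetic in C (the numbers are the slide's own; nothing else is assumed):

#include <stdio.h>

int main(void) {
    int n = 100;                              /* instructions */
    double single = 45.0 * 1.0 * n;           /* one 45 ns cycle per instruction   */
    double multi  = 10.0 * 4.6 * n;           /* 4.6 cycles of 10 ns on average    */
    double piped  = 10.0 * (1.0 * n + 4);     /* 1 CPI plus 4 cycles to drain      */
    printf("single=%.0f ns  multi=%.0f ns  pipelined=%.0f ns\n",
           single, multi, piped);             /* 4500, 4600, 1040 */
    return 0;
}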
27 Why Pipeline? Because we can!
(Figure: Inst 0 through Inst 4 shown against clock cycles, each instruction starting one cycle after the previous one.)
28 Can pipelining get us into trouble?
- Yes: pipeline hazards
  - Structural hazards: attempt to use the same resource two different ways at the same time
    - E.g., a combined washer/dryer would be a structural hazard, as would the folder being busy doing something else (watching TV)
  - Control hazards: attempt to make a decision before the condition is evaluated
    - E.g., washing football uniforms and needing to get the proper detergent level: we must see the result after the dryer before the next load can go in
    - Branch instructions
  - Data hazards: attempt to use an item before it is ready
    - E.g., one sock of a pair is in the dryer and one is in the washer: we can't fold until the sock gets from the washer through the dryer
    - An instruction depends on the result of a prior instruction still in the pipeline
- Can always resolve hazards by waiting
  - Pipeline control must detect the hazard
  - And take action (or delay action) to resolve it
29 Single Memory is a Structural Hazard
(Figure: Load followed by Instr 1-4 in the pipeline; in one cycle the Load's Mem stage and a later instruction's fetch both need the single memory.)
- Detection is easy in this case! (Right-half highlight means read, left-half means write.)
30 Structural Hazards limit performance
- Example: if 1.3 memory accesses per instruction and only one memory access per cycle, then
  - average CPI >= 1.3
  - otherwise the memory resource is more than 100% utilized
31 Control Hazard Solution 1: Stall
- Stall: wait until the decision is clear
- Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow
- Move the decision to the end of decode
  - Saves 1 cycle per branch
32 Control Hazard Solution 2: Predict
- Predict: guess one direction, then back up if wrong
- Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right ~50% of the time)
- Need to "squash" and restart the following instruction if wrong
- Produces a CPI on branches of (1 x 0.5 + 2 x 0.5) = 1.5
- Total CPI might then be 1.5 x 0.2 + 1 x 0.8 = 1.1 (20% branches)
- More dynamic scheme: keep a history per branch (~90% right)
33 Control Hazard Solution 3: Delayed Branch
- Delayed branch: redefine branch behavior (the branch takes place after the next instruction)
- Impact: 0 clock cycles per branch instruction if we can find an instruction to put in the slot (~50% of the time)
- As we launch more instructions per clock cycle, less useful
34 Delayed/Predicted Branch
- Where to get instructions to fill the branch delay slot?
  - From before the branch instruction
  - From the target address: only valuable when the branch is taken
  - From the fall-through: only valuable when the branch is not taken
  - Cancelling branches allow more slots to be filled
- Compiler effectiveness for a single branch delay slot
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots are useful in computation
  - So about 50% (60% x 80%) of slots are usefully filled
- Delayed-branch downside: 7-8 stage pipelines and multiple instructions issued per clock (superscalar)
35 Data Hazard on r1
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
36 Data Hazard on r1
- Dependences backwards in time are hazards
(Figure: add r1,r2,r3 writes r1 in WB; sub r4,r1,r3, and r6,r1,r7, and or r8,r1,r9 read r1 in ID during earlier cycles, so their reads happen before the write; xor r10,r1,r11 reads r1 after the write and is safe. Stages shown: IF, ID/RF, EX, MEM, WB.)
37 Data Hazard Solution
- Forward the result from one stage to another
- The "or" is OK if we define register read/write properly (write in the first half of the cycle, read in the second half)
(Figure: same code as the previous slide, with forwarding paths carrying add's result from its EX/MEM stages straight into the ALU inputs of sub and and.)
38 Forwarding (or Bypassing): What about Loads?
- Dependences backwards in time are hazards
- Can't solve these with forwarding alone
- Must delay/stall the instruction dependent on the load
(Figure: lw r1,0(r2) produces r1 at the end of its MEM stage, but sub r4,r1,r3 needs it at the start of its EX stage one cycle earlier.)
39 Forwarding (or Bypassing): What about Loads?
- Dependences backwards in time are hazards
- Can't solve these with forwarding alone
- Must delay/stall the instruction dependent on the load
(Figure: the same code with a one-cycle stall inserted between lw r1,0(r2) and sub r4,r1,r3, after which forwarding covers the remaining distance.)
40 Detecting Control Signals
41 Conflicts/Problems
- The I-cache and D-cache are accessed in the same cycle: it helps to implement them separately
- Registers are read and written in the same cycle: easy to deal with if register read/write time equals cycle time/2 (else, use bypassing)
- The branch target changes only at the end of the second stage: what do you do in the meantime?
- Data between stages get latched into registers (overhead that increases latency per instruction)
42 Control Hazards
- Simple techniques to handle control hazard stalls:
  - For every branch, introduce a stall cycle (note: every 6th instruction is a branch!)
  - Assume the branch is not taken and start fetching the next instruction; if the branch is taken, we need hardware to cancel the effect of the wrong-path instruction
  - Fetch the next instruction (branch delay slot) and execute it anyway; if the instruction turns out to be on the correct path, useful work was done; if the instruction turns out to be on the wrong path, hopefully program state is not lost
43 Slowdowns from Stalls
- Perfect pipelining with no hazards: an instruction completes every cycle (total cycles = num instructions) => speedup = increase in clock speed = num pipeline stages
- With hazards and stalls, some cycles (= stall time) go by during which no instruction completes, and then the stalled instruction completes
- Total cycles = number of instructions + stall cycles
- Slowdown because of stalls = 1 / (1 + stall cycles per instr); a small numeric sketch follows
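A minimal C illustration of those two formulas; the 20%-branches-with-one-stall-each mix is an assumed example, not a number from the slides:

#include <stdio.h>

int main(void) {
    double instructions = 100.0;
    double stalls_per_instr = 0.2;   /* assumed: 20% branches, 1 stall cycle each */
    double total_cycles = instructions + stalls_per_instr * instructions;
    double fraction_of_ideal = 1.0 / (1.0 + stalls_per_instr);
    printf("total cycles = %.0f, speed relative to ideal = %.2f\n",
           total_cycles, fraction_of_ideal);   /* 120 cycles, 0.83 */
    return 0;
}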
44 Control and Datapath: Split state diagram into 5 pieces
- Fetch: IR <- Mem[PC]; PC <- PC + 4
- Decode: A <- R[rs]; B <- R[rt]
- Execute: S <- A + B, or S <- A + SX, or S <- A or ZX; for branches: if Cond, PC <- PC + SX
- Memory: M <- Mem[S], or Mem[S] <- B
- Write-back: R[rd] <- S, or R[rd] <- M, or R[rt] <- S
(Figure: datapath with PC, Next PC, Inst. Mem, IR, Reg. File, Exec, Mem Access, Data Mem, and Equal blocks.)
45 Three Generic Data Hazards
- InstrI followed by InstrJ
- Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it
46 Three Generic Data Hazards
- InstrI followed by InstrJ
- Write After Read (WAR): InstrJ tries to write an operand before InstrI reads it
  - InstrI gets the wrong operand
- Can't happen in the DLX 5-stage pipeline because
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5
47 Three Generic Data Hazards
- InstrI followed by InstrJ
- Write After Write (WAW): InstrJ tries to write an operand before InstrI writes it
  - Leaves the wrong result (InstrI's, not InstrJ's)
- Can't happen in the DLX 5-stage pipeline because
  - All instructions take 5 stages, and
  - Writes are always in stage 5
- Can have WAR and WAW in more complicated pipes
48 Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e - f; assuming a, b, c, d, e, and f are in memory.

Slow code:
    LW   Rb,b
    LW   Rc,c
    ADD  Ra,Rb,Rc
    SW   a,Ra
    LW   Re,e
    LW   Rf,f
    SUB  Rd,Re,Rf
    SW   d,Rd

Fast code:
    LW   Rb,b
    LW   Rc,c
    LW   Re,e
    ADD  Ra,Rb,Rc
    LW   Rf,f
    SW   a,Ra
    SUB  Rd,Re,Rf
    SW   d,Rd

In the fast version no load is immediately followed by a use of its result, so the load delay slots are filled with useful work instead of stalls.
49 Summary: Pipelining
- Reduce CPI by overlapping many instructions
  - Average throughput of approximately 1 CPI with a fast clock
- Utilize capabilities of the datapath
  - Start the next instruction while working on the current one
  - Limited by the length of the longest stage (plus fill/flush)
  - Detect and resolve hazards
- What makes it easy?
  - All instructions are the same length
  - Just a few instruction formats
  - Memory operands appear only in loads and stores
- What makes it hard?
  - Structural hazards: suppose we had only one memory
  - Control hazards: need to worry about branch instructions
  - Data hazards: an instruction depends on a previous instruction
50 Some Issues for your consideration
- Won't be tested
- We'll talk about modern processors and what's really hard
  - Exception handling
  - Trying to improve performance with out-of-order execution, etc.
  - Trying to get CPI < 1 (superscalar execution)
51 Superscalar Execution
- Throwing more hardware at the problem
- Instruction level parallelism (ILP)
- Multiple functional units
  - E.g., multiple ALUs
  - Add a, b, c
  - Add d, e, f
  - The two adds are independent, so they can issue together
- Can get CPI < 1!
52 Out-of-order execution
- Idea: it's best if we keep all functional units busy
- Can sometimes reorder computations to take advantage of functional units that are otherwise idle
- Automatically do reordering like we did 4 slides ago!
53 Register Renaming
- Internally rename registers to allow for better ILP
  - Add a, b, c
  - Add b, c, d
  - The second add writes b while the first still needs to read it (a WAR hazard); renaming the second b removes the dependence (a C sketch follows)
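A minimal C analogue of that renaming, with a, b, c, d standing in for registers (the variable names are illustrative only):

#include <stdio.h>

int a, b = 1, c = 2, d = 3;

/* Before renaming: WAR hazard on b between the two statements. */
void before(void) {
    a = b + c;    /* reads b             */
    b = c + d;    /* writes b: WAR on b  */
}

/* After renaming: the second writer gets a fresh name, b2, so the
   two statements no longer conflict and may execute in parallel. */
int after(void) {
    int b2;
    a = b + c;
    b2 = c + d;   /* later consumers of "b" are pointed at b2 */
    return b2;
}

int main(void) {
    printf("%d\n", after());  /* prints 5 */
    return 0;
}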
54 Hyperthreading/Multicore
- Hyperthreading
  - >1 virtual CPU per physical CPU
- Multi-core
  - >1 actual CPU per die
55 Integrated Circuits Costs
IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield
Die cost = Wafer cost / (Dies per wafer x Die yield)
Dies per wafer = pi x (Wafer_diam / 2)^2 / Die_Area - pi x Wafer_diam / sqrt(2 x Die_Area) - Test dies
Die yield = Wafer yield x (1 + Defects_per_unit_area x Die_Area / alpha)^(-alpha)
- Die cost goes roughly with (die area)^4; a sketch of the model follows
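A small C sketch of that cost model. The wafer diameter (15 cm) and the clustering parameter alpha = 3 are assumptions used for illustration; they are not stated on the slide, but they roughly reproduce the 386DX row of the next slide:

#include <math.h>
#include <stdio.h>

static const double PI = 3.14159265358979;

double dies_per_wafer(double wafer_diam_cm, double die_area_cm2, double test_dies) {
    double r = wafer_diam_cm / 2.0;
    return PI * r * r / die_area_cm2
         - PI * wafer_diam_cm / sqrt(2.0 * die_area_cm2)
         - test_dies;
}

double die_yield(double wafer_yield, double defects_per_cm2,
                 double die_area_cm2, double alpha) {
    return wafer_yield * pow(1.0 + defects_per_cm2 * die_area_cm2 / alpha, -alpha);
}

int main(void) {
    /* 386DX: 0.43 cm2 die, 1.0 defects/cm2, $900 wafer (see next slide) */
    double dpw  = dies_per_wafer(15.0, 0.43, 0.0);   /* ~360  */
    double dy   = die_yield(1.0, 1.0, 0.43, 3.0);    /* ~0.67 */
    double cost = 900.0 / (dpw * dy);                /* ~$3.7, i.e. ~$4 */
    printf("dies/wafer=%.0f yield=%.2f die cost=$%.2f\n", dpw, dy, cost);
    return 0;
}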
56 Real World Examples
Chip          Metal   Line width  Wafer     Defects  Area    Dies/  Yield  Die
              layers  (um)        cost ($)  /cm2     (mm2)   wafer  (%)    cost ($)
386DX           2       0.90        900      1.0       43     360    71       4
486DX2          3       0.80       1200      1.0       81     181    54      12
PowerPC 601     4       0.80       1700      1.3      121     115    28      53
HP PA 7100      3       0.80       1300      1.0      196      66    27      73
DEC Alpha       3       0.70       1500      1.2      234      53    19     149
SuperSPARC      3       0.70       1700      1.6      256      48    13     272
Pentium         3       0.80       1500      1.5      296      40     9     417
- From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15
57 Midterm Questions
- Examples
  - List and describe 3 types of DRAM
  - What are the relative advantages/disadvantages of RISC vs. CISC?
  - Why do we have a memory hierarchy?
  - Using your choice of assembly, write a (commented) routine that computes the nth Fibonacci number.
  - Why do CPUs have registers?
  - Describe how a 3-disk RAID-5 system works
58 Midterm Questions
- More examples
  - What are the differences between a synchronous and an asynchronous bus? What are the relative advantages/disadvantages?
  - List and describe techniques to improve cache miss rate, reduce cache miss penalty, and reduce cache hit time
59 Topics for further study
- The following slides will not be covered in class or on tests.
60 Multicycle Instructions
61 Effects of Multicycle Instructions
- Structural hazards if the unit is not fully pipelined (divider)
- Frequent RAW hazard stalls
- Potentially multiple writes to the register file in a cycle
- WAW hazards because of out-of-order instr completion
- Imprecise exceptions because of o-o-o instr completion
62 Precise Exceptions
- On an exception:
  - Must save the PC of the instruction where the program must resume
  - All instructions after that PC that might be in the pipeline must be converted to NOPs (other instructions continue to execute and may raise exceptions of their own)
  - Temporary program state not in memory (in other words, registers) has to be stored in memory
  - Potential problems if a later instruction has already modified memory or registers
- A processor that fulfils all the above conditions is said to provide precise exceptions (useful for debugging and, of course, correctness)
63 Dealing with these Effects
- Multiple writes to the register file: increase the number of ports, stall one of the writers during ID, or stall one of the writers during WB (the stall will propagate)
- WAW hazards: detect the hazard during ID and stall the later instruction
- Imprecise exceptions: buffer the results if they complete early, or save more pipeline state so that you can return to exactly the same state that you left at
64 ILP
- Instruction-level parallelism: overlap among instructions: pipelining or multiple instruction execution
- What determines the degree of ILP?
  - Dependences: property of the program
  - Hazards: property of the pipeline
65 Types of Dependences
- Data dependences: an instr produces a result for another (true dependence; results in RAW hazards in a pipeline)
- Name dependences: two instrs use the same names (anti- and output dependences; result in WAR and WAW hazards in a pipeline)
- Control dependences: an instruction's execution depends on the result of a branch; re-ordering should preserve exception behavior and dataflow
- (A C illustration of the three classes follows)
66 An Out-of-Order Processor Implementation
(Figure: branch prediction and instr fetch feed an instr fetch queue; decode & rename sends Instr 1-6 into the reorder buffer (ROB) with temporaries T1-T6 alongside the register file R1-R32; renamed instructions wait in the issue queue (IQ) for one of three ALUs; results are written to the ROB and tags broadcast to the IQ.)

Original code:        Renamed code:
R1 <- R1 + R2         T1 <- R1 + R2
R2 <- R1 + R3         T2 <- T1 + R3
BEQZ R2               BEQZ T2
R3 <- R1 + R2         T4 <- T1 + T2
R1 <- R3 + R2         T5 <- T4 + T2
67 Design Details - I
- Instructions enter the pipeline in order
- No need for branch delay slots if prediction happens in time
- Instructions leave the pipeline in order: all instructions that enter also get placed in the ROB; the process of an instruction leaving the ROB (in order) is called commit; an instruction commits only if it and all instructions before it have completed successfully (without an exception)
- To preserve precise exceptions, a result is written into the register file only when the instruction commits; until then, the result is saved in a temporary register in the ROB
68 Design Details - II
- Instructions get renamed and placed in the issue queue; some operands are available (T1-T6; R1-R32), while others are being produced by instructions in flight (T1-T6)
- As instructions finish, they write results into the ROB (T1-T6) and broadcast the operand tag (T1-T6) to the issue queue; instructions now know if their operands are ready
- When a ready instruction issues, it reads its operands from T1-T6 and R1-R32 and executes (out-of-order execution)
- Can you have WAW or WAR hazards? By using more names (T1-T6), name dependences can be avoided
69 Design Details - III
- If instr-3 raises an exception, wait until it reaches the top of the ROB; at this point, R1-R32 contain results for all instructions up to instr-3; save registers, save the PC of instr-3, and service the exception
- If a branch is a mispredict, flush all instructions after the branch and start on the correct path; mispredicted instrs will not have updated registers (the branch cannot commit until it has completed, and the flush happens as soon as the branch completes)
- Potential problems?
70 Managing Register Names
Temporary values are stored in the register file and not the ROB
- Logical registers: R1-R32
- Physical registers: P1-P64
- At the start, R1-R32 can be found in P1-P32; instructions stop entering the pipeline when P64 is assigned

R1 <- R1 + R2         P33 <- P1 + P2
R2 <- R1 + R3         P34 <- P33 + P3
BEQZ R2               BEQZ P34
R3 <- R1 + R2         P35 <- P33 + P34

What happens on commit?
71 The Commit Process
- On commit, no copy is required
- The register map table is updated: the committed value of R1 is now in P33 and not P1; on an exception, P33 is copied to memory and not P1
- An instruction in the issue queue need not modify its input operand when the producer commits
- When instruction-1 commits, we no longer have any use for P1; it is put in a free pool and a new instruction can now enter the pipeline => for every instr that commits, a new instr can enter the pipeline => the number of in-flight instrs is a constant = the number of extra (rename) registers
72 The Alpha 21264 Out-of-Order Implementation
(Figure: same organization as the earlier out-of-order pipeline, but with a register map table (R1 -> P1, R2 -> P2, ...) at decode & rename and a physical register file P1-P64; results are written to the regfile and tags broadcast to the IQ.)

Original code:        Renamed code:
R1 <- R1 + R2         P33 <- P1 + P2
R2 <- R1 + R3         P34 <- P33 + P3
BEQZ R2               BEQZ P34
R3 <- R1 + R2         P35 <- P33 + P34
R1 <- R3 + R2         P36 <- P35 + P34
73 Lecture 11: Advanced Static ILP
- Topics: loop unrolling, software pipelining (Section 4.4)
74 Loop Dependences
- If a loop only has dependences within an iteration, the loop is considered parallel => multiple iterations can be executed together so long as the order within an iteration is preserved
- If a loop has dependences across iterations, it is not parallel, and these dependences are referred to as loop-carried
- Not all loop-carried dependences imply a lack of parallelism
- Parallel loops are especially desirable in a multiprocessor system
75 Examples

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

No dependences

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}

S2 depends on S1 in the same iteration; S1 depends on S1 from the prev iteration; S2 depends on S2 from the prev iteration

for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}

S1 depends on S2 from the prev iteration

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i-3] + s;         /* S1 */

S1 depends on S1 from 3 prev iterations; referred to as a recursion; dependence distance 3 => limited parallelism
76 Finding Dependences: the GCD Test
- Do A[a*i + b] and A[c*i + d] refer to the same element?
- Restrict ourselves to affine array indices (expressible as a*i + b, where i is the loop index and a and b are constants); example of a non-affine index: x[y[i]]
- For a dependence to exist, there must be two indices j and k, both within the loop bounds, such that
  - a*j + b = c*k + d
  - a*j - c*k = d - b
  - Let G = GCD(a,c)
  - Then (a*j/G - c*k/G) = (d-b)/G
- If (d-b)/G is not an integer, the initial equality cannot be true (a C sketch follows)
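A minimal sketch of the test in C; like the test itself, it ignores the loop bounds and is conservative (it can only prove independence, never dependence):

#include <stdio.h>

/* Greatest common divisor by Euclid's algorithm. */
static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* GCD test for A[a*i+b] vs A[c*i+d]: returns 0 when a dependence is
   provably impossible, 1 when one may exist. */
static int may_depend(int a, int b, int c, int d) {
    return (d - b) % gcd(a, c) == 0;
}

int main(void) {
    printf("%d\n", may_depend(2, 3, 2, 0)); /* A[2i+3] vs A[2i]: prints 0   */
    printf("%d\n", may_depend(4, 1, 2, 3)); /* A[4i+1] vs A[2i+3]: prints 1 */
    return 0;
}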
77 Static vs. Dynamic ILP

Original loop:
Loop:  L.D    F0, 0(R1)      ; F0 = array element
       ADD.D  F4, F0, F2     ; add scalar in F2
       S.D    F4, 0(R1)      ; store result
       DADDUI R1, R1, -8     ; decrement address pointer
       BNE    R1, R2, Loop   ; branch if R1 != R2

Statically unrolled loop:
Loop:  L.D    F0, 0(R1)
       L.D    F6, -8(R1)
       L.D    F10, -16(R1)
       L.D    F14, -24(R1)
       ADD.D  F4, F0, F2
       ADD.D  F8, F6, F2
       ADD.D  F12, F10, F2
       ADD.D  F16, F14, F2
       S.D    F4, 0(R1)
       S.D    F8, -8(R1)
       DADDUI R1, R1, -32
       S.D    F12, 16(R1)
       BNE    R1, R2, Loop
       S.D    F16, 8(R1)

Large-window dynamic out-of-order processor (the original loop, as fetched repeatedly):
L.D    F0, 0(R1)
ADD.D  F4, F0, F2
S.D    F4, 0(R1)
DADDUI R1, R1, -8
BNE    R1, R2, Loop
L.D    F0, 0(R1)
ADD.D  F4, F0, F2
S.D    F4, 0(R1)
DADDUI R1, R1, -8
BNE    R1, R2, Loop
L.D    F0, 0(R1)
ADD.D  F4, F0, F2
S.D    F4, 0(R1)
..
78 Dynamic ILP

Instruction stream as fetched (the same five instructions repeat; four iterations shown):
L.D    F0, 0(R1)
ADD.D  F4, F0, F2
S.D    F4, 0(R1)
DADDUI R1, R1, -8
BNE    R1, R2, Loop
(... three more identical iterations ...)

Renamed:
L.D    F0, 0(R1)
ADD.D  F4, F0, F2
S.D    F4, 0(R1)
DADDUI R3, R1, -8
BNE    R3, R2, Loop
L.D    F6, 0(R3)
ADD.D  F8, F6, F2
S.D    F8, 0(R3)
DADDUI R4, R3, -8
BNE    R4, R2, Loop
L.D    F10, 0(R4)
ADD.D  F12, F10, F2
S.D    F12, 0(R4)
DADDUI R5, R4, -8
BNE    R5, R2, Loop
L.D    F14, 0(R5)
ADD.D  F16, F14, F2
S.D    F16, 0(R5)
DADDUI R6, R5, -8
BNE    R6, R2, Loop
79 Dynamic ILP
(Same renamed instruction stream as the previous slide.)

Cycle of issue for each instruction, per iteration (L.D, ADD.D, S.D, DADDUI, BNE):
Iteration 1:  1  3  6  1  3
Iteration 2:  2  4  7  2  4
Iteration 3:  3  5  8  3  5
Iteration 4:  4  6  9  4  6
80 Loop Pipeline
(Figure: the five instructions of each iteration (L.D, ADD.D, S.D, DADDUI, BNE) overlapped across iterations, each iteration starting one cycle after the previous one, forming a pipeline of loop iterations.)
81 Statically Unrolled Loop
Loop:  L.D    F0, 0(R1)
       L.D    F6, -8(R1)
       L.D    F10, -16(R1)
       L.D    F14, -24(R1)
       L.D    F18, -32(R1)
       ADD.D  F4, F0, F2
       L.D    F22, -40(R1)
       ADD.D  F8, F6, F2
       L.D    F26, -48(R1)
       ADD.D  F12, F10, F2
       L.D    F30, -56(R1)
       ADD.D  F16, F14, F2
       L.D    F34, -64(R1)
       ADD.D  F20, F18, F2
       S.D    F4, 0(R1)
       L.D    F38, -72(R1)
       ADD.D  F24, F22, F2
       S.D    F8, -8(R1)
       S.D    F12, 16(R1)
       S.D    F16, 8(R1)
       DADDUI R1, R1, -32
       S.D
       BNE    R1, R2, Loop
       S.D
82 Static vs. Dynamic
(Figure: two plots of new iterations completed over time. Dynamic ILP: after a ramp-up, one new iteration completes every cycle. Static ILP: within each unrolled loop body an iteration completes every cycle for a while, with start-up and wind-down gaps between bodies.)
- What if I doubled the number of resources in each processor?
- What if I unrolled the loop and executed it on a dynamic ILP processor?
83 Static vs. Dynamic
- Dynamic: because of the loop index, at most one iteration can start every cycle; even fewer if there are resource constraints; in other words, we have a pipeline that has a throughput of one iteration per cycle!
- Static: by eliminating the loop index, each iteration is independent => as many loops can start in a cycle as there are resources; however, after a while, we don't start any more iterations; thus, loop unrolling provides a brief steady state, where an iteration starts/finishes every cycle, and the rest is start-up/wind-down for each unrolled loop
84 Software Pipeline?!
(Figure: the same overlapped schedule as the Loop Pipeline slide; reading one cycle vertically, an L.D, an ADD.D, an S.D, a DADDUI, and a BNE from different iterations execute together: the loop body itself can be rewritten to contain exactly that mix.)
85 Software Pipelining

Original loop:
Loop:  L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       S.D    F4, 0(R1)
       DADDUI R1, R1, -8
       BNE    R1, R2, Loop

Software-pipelined loop:
Loop:  S.D    F4, 16(R1)   ; store for an earlier iteration
       ADD.D  F4, F0, F2   ; add for the next element
       L.D    F0, 0(R1)    ; load for a later iteration
       DADDUI R1, R1, -8
       BNE    R1, R2, Loop

- Advantages: achieves nearly the same effect as loop unrolling, but without the code expansion; an unrolled loop may have inefficiencies at the start and end of each iteration, while a sw-pipelined loop is almost always in steady state; a sw-pipelined loop can also be unrolled to reduce loop overhead
- Disadvantages: does not reduce loop overhead; may require more registers
86 Loop Dependences
- If a loop only has dependences within an iteration, the loop is considered parallel => multiple iterations can be executed together so long as the order within an iteration is preserved
- If a loop has dependences across iterations, it is not parallel, and these dependences are referred to as loop-carried
- Not all loop-carried dependences imply a lack of parallelism
- Parallel loops are especially desirable in a multiprocessor system
87 Examples

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

No dependences

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}

S2 depends on S1 in the same iteration; S1 depends on S1 from the prev iteration; S2 depends on S2 from the prev iteration

for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}

S1 depends on S2 from the prev iteration

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i-3] + s;         /* S1 */

S1 depends on S1 from 3 prev iterations; referred to as a recursion; dependence distance 3 => limited parallelism
88 Constructing Parallel Loops
If loop-carried dependences are not cyclic (S1 depending on S1 is cyclic), loops can be restructured to be parallel

for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}

S1 depends on S2 from the prev iteration

A[1] = A[1] + B[1];
for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];      /* S3 */
    A[i+1] = A[i+1] + B[i+1];  /* S4 */
}
B[101] = C[100] + D[100];

S4 depends on S3 of the same iteration
89 Finding Dependences: the GCD Test
- Do A[a*i + b] and A[c*i + d] refer to the same element?
- Restrict ourselves to affine array indices (expressible as a*i + b, where i is the loop index and a and b are constants); example of a non-affine index: x[y[i]]
- For a dependence to exist, there must be two indices j and k, both within the loop bounds, such that
  - a*j + b = c*k + d
  - a*j - c*k = d - b
  - Let G = GCD(a,c)
  - Then (a*j/G - c*k/G) = (d-b)/G
- If (d-b)/G is not an integer, the initial equality cannot be true
90 Predication
- A branch within a loop can be problematic to schedule
- Control dependences are a problem because of the need to re-fetch on a mispredict
- For short loop bodies, control dependences can be converted to data dependences by using predicated/conditional instructions
91 Predicated or Conditional Instructions
- The instruction has an additional operand that determines whether the instr completes or gets converted into a no-op
- Example: lwc R1, 0(R2), R3 (load-word-conditional) will load the word at address (R2) into R1 if R3 is non-zero; if R3 is zero, the instruction becomes a no-op
- Replaces a control dependence with a data dependence (branches disappear); may need register copies for the condition or for values used by both directions (a C sketch follows)

if (R1 == 0)              R7 = !R1 ; R8 = R2
    R2 = R2 + R4          R2 = R2 + R4  (predicated on R7)
else                      R6 = R3 + R5  (predicated on R1)
    R6 = R3 + R5          R4 = R8 + R3  (predicated on R1)
R4 = R2 + R3
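The same if-conversion written out in C with explicit predicates; r1..r8 are plain variables standing in for registers, and the final line for the R7 path is implied by the original code but not shown on the slide:

int r1, r2, r3, r4, r5, r6;

void if_convert(void) {
    int r7 = !r1;              /* predicate for the then-path           */
    int r8 = r2;               /* copy of the old r2 for the else-path  */
    if (r7) r2 = r2 + r4;      /* R2 = R2 + R4 (predicated on R7)       */
    if (r1) r6 = r3 + r5;      /* R6 = R3 + R5 (predicated on R1)       */
    if (r1) r4 = r8 + r3;      /* R4 = R8 + R3 (predicated on R1)       */
    if (r7) r4 = r2 + r3;      /* then-path R4; implied, not on slide   */
}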
92 Complications
- Each instruction has one more input operand: more register ports/bypassing
- If the branch condition is not known, the instruction stalls (remember, these are in-order processors)
- Some implementations allow the instruction to continue without the branch condition and squash/complete later in the pipeline: wasted work
- Increases register pressure and activity on the functional units
- Does not help if the br-condition takes a while to evaluate
93 Support for Speculation
- In general, when we re-order instructions, register renaming can ensure we do not violate register data dependences
- However, we need hardware support
  - to ensure that an exception is raised at the correct point
  - to ensure that we do not violate memory dependences
(Figure: a sequence st ... br ... ld, where the ld may be hoisted above the br and the st.)
94 Detecting Exceptions
- Some exceptions require that the program be terminated (memory protection violation), while other exceptions require execution to resume (page faults)
- For a speculative instruction, in the latter case, servicing the exception only implies potential performance loss
- In the former case, you want to defer servicing the exception until you are sure the instruction is not speculative
- Note that a speculative instruction needs a special opcode to indicate that it is speculative
95 Program-Terminate Exceptions
- When a speculative instruction experiences an exception, instead of servicing it, it writes a special NotAThing value (NAT) in the destination register
- If a non-speculative instruction reads a NAT, it flags the exception and the program terminates (it may not be desirable that the error is caused by an array access, but the core-dump happens two procedures later)
- Alternatively, an instruction (the sentinel) in the speculative instruction's original location checks the register value and initiates recovery
96 Memory Dependence Detection
- If a load is moved before a preceding store, we must ensure that the store writes to a non-conflicting address; else, the load has to re-execute
- When the speculative load issues, it stores its address in a table (the Advanced Load Address Table, ALAT, in the IA-64)
- If a store finds its address in the ALAT, it indicates that a violation occurred for that address
- A special instruction (the sentinel) in the load's original location checks to see if the address had a violation and re-executes the load if necessary (a toy sketch follows)
97 Dynamic vs. Static ILP
- Static ILP:
  - (+) The compiler finds parallelism => no scoreboarding => higher clock speeds and lower power
  - (+) The compiler knows what is next => better global schedule
  - (-) The compiler can not react to dynamic events (cache misses)
  - (-) Can not re-order instructions unless you provide hardware and extra instructions to detect violations (eats into the low complexity/power argument)
  - (-) Static branch prediction is poor => even statically scheduled processors use hardware branch predictors
  - (-) Building an optimizing compiler is easier said than done
- A comparison of the Alpha, Pentium 4, and Itanium (statically scheduled IA-64 architecture) shows that the Itanium is not much better in terms of performance, clock speed, or power
98 Summary
- Topics: scheduling, loop unrolling, software pipelining, predication, violations while re-ordering instructions
- Static ILP is a great approach for handling embedded domains
- For the high-performance domain, designers have added many frills, bells, and whistles to eke out additional performance, while compromising power/complexity