Title: CSC 4250 Computer Architectures
CSC 4250 Computer Architectures
- October 20, 2006
- Chapter 3. Instruction-Level Parallelism and Its Dynamic Exploitation
One More Example on Tomasulo's Algorithm
- L.D F0,0(R0)
- ADD.D F0,F0,F2
- MUL.D F0,F0,F4
- ADD.D F0,F0,F2
- MUL.D F0,F0,F4
- S.D F0,0(R0)
- ADD.D F0,F4,F2
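A minimal Python sketch, not part of the original slides, that lists the register hazards in the sequence above. The tuple encoding of each instruction and the name `find_hazards` are illustrative assumptions; only floating-point registers are tracked, and memory dependences through 0(R0) are ignored.

```python
# Hypothetical encoding: (opcode, destination FP register or None, source FP registers).
PROGRAM = [
    ("L.D",   "F0", []),
    ("ADD.D", "F0", ["F0", "F2"]),
    ("MUL.D", "F0", ["F0", "F4"]),
    ("ADD.D", "F0", ["F0", "F2"]),
    ("MUL.D", "F0", ["F0", "F4"]),
    ("S.D",   None, ["F0"]),          # the store reads F0 and writes memory
    ("ADD.D", "F0", ["F4", "F2"]),
]

def find_hazards(program):
    """Return a list of (kind, earlier_index, later_index, register) tuples."""
    last_writer = {}   # register -> index of the most recent instruction writing it
    readers = {}       # register -> indices that have read it since that write
    found = []
    for j, (_, dest, sources) in enumerate(program):
        for reg in sources:
            if reg in last_writer:
                found.append(("RAW", last_writer[reg], j, reg))
        if dest is not None:
            for i in readers.get(dest, []):
                found.append(("WAR", i, j, dest))
            if dest in last_writer:
                found.append(("WAW", last_writer[dest], j, dest))
        for reg in sources:
            readers.setdefault(reg, []).append(j)
        if dest is not None:
            last_writer[dest] = j
            readers[dest] = []   # start a fresh reader list for the new value
    return found

for kind, earlier, later, reg in find_hazards(PROGRAM):
    print(f"{kind} on {reg}: instruction {earlier} -> instruction {later}")
```

Under these assumptions the final ADD.D has no RAW dependence on any earlier instruction. It only has a WAR hazard with the store and a WAW hazard with the second MUL.D, both on F0, which is the kind of name dependence that register renaming is meant to remove.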
IBM 360 Assembly Language
- Only two operands. Advantage? Disadvantage?
- Example
- L.D F0,0(R0)
- ADD.D F0,F2
- MUL.D F0,F4
- ADD.D F0,F2
- MUL.D F0,F4
- S.D F0,0(R0)
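As a rough illustration of the two-operand form, the Python sketch below rewrites each two-operand arithmetic instruction from the example into its three-operand equivalent. The function name `expand_two_operand` is hypothetical, and the MIPS-style mnemonics are taken from the slide rather than from real IBM 360 assembly.

```python
def expand_two_operand(instruction):
    """Rewrite 'OP Fd,Fs' as 'OP Fd,Fd,Fs'; loads and stores pass through unchanged."""
    opcode, operands = instruction.split(maxsplit=1)
    if opcode in ("L.D", "S.D"):
        return instruction
    dest, src = operands.split(",")
    # In the two-operand form the first operand is both a source and the destination.
    return f"{opcode} {dest},{dest},{src}"

two_operand_example = [
    "L.D F0,0(R0)",
    "ADD.D F0,F2",
    "MUL.D F0,F4",
    "ADD.D F0,F2",
    "MUL.D F0,F4",
    "S.D F0,0(R0)",
]

for line in two_operand_example:
    print(expand_two_operand(line))
```

One consequence visible here: every arithmetic result overwrites one of its own sources, so an instruction such as `ADD.D F0,F4,F2` has no single two-operand equivalent that leaves F4 intact.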
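Figure 0.1

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F0,0(R0) | √ |  |  |
| ADD.D F0,F0,F2 |  |  |  |
| MUL.D F0,F0,F4 |  |  |  |
| ADD.D F0,F0,F2 |  |  |  |
| MUL.D F0,F0,F4 |  |  |  |
| S.D F0,0(R0) |  |  |  |
| ADD.D F0,F4,F2 |  |  |  |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load |  |  |  |  | 0 + Reg[R0] |
| Add1 | No |  |  |  |  |  |  |
| Add2 | No |  |  |  |  |  |  |
| Add3 | No |  |  |  |  |  |  |
| Mult1 | No |  |  |  |  |  |  |
| Mult2 | No |  |  |  |  |  |  |
| Store1 | No |  |  |  |  |  |  |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Load1 |  |  |  |  |  |  |  |  |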
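Figure 0.2

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F0,0(R0) | √ | √ |  |
| ADD.D F0,F0,F2 | √ |  |  |
| MUL.D F0,F0,F4 |  |  |  |
| ADD.D F0,F0,F2 |  |  |  |
| MUL.D F0,F0,F4 |  |  |  |
| S.D F0,0(R0) |  |  |  |
| ADD.D F0,F4,F2 |  |  |  |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load |  |  |  |  | 0 + Reg[R0] |
| Add1 | Yes | Add |  | Reg[F2] | Load1 |  |  |
| Add2 | No |  |  |  |  |  |  |
| Add3 | No |  |  |  |  |  |  |
| Mult1 | No |  |  |  |  |  |  |
| Mult2 | No |  |  |  |  |  |  |
| Store1 | No |  |  |  |  |  |  |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Add1 |  |  |  |  |  |  |  |  |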
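Figure 0.3

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F0,0(R0) | √ | √ |  |
| ADD.D F0,F0,F2 | √ |  |  |
| MUL.D F0,F0,F4 | √ |  |  |
| ADD.D F0,F0,F2 |  |  |  |
| MUL.D F0,F0,F4 |  |  |  |
| S.D F0,0(R0) |  |  |  |
| ADD.D F0,F4,F2 |  |  |  |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load |  |  |  |  | 0 + Reg[R0] |
| Add1 | Yes | Add |  | Reg[F2] | Load1 |  |  |
| Add2 | No |  |  |  |  |  |  |
| Add3 | No |  |  |  |  |  |  |
| Mult1 | Yes | Mult |  | Reg[F4] | Add1 |  |  |
| Mult2 | No |  |  |  |  |  |  |
| Store1 | No |  |  |  |  |  |  |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult1 |  |  |  |  |  |  |  |  |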
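Figure 0.4

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F0,0(R0) | √ | √ |  |
| ADD.D F0,F0,F2 | √ |  |  |
| MUL.D F0,F0,F4 | √ |  |  |
| ADD.D F0,F0,F2 | √ |  |  |
| MUL.D F0,F0,F4 |  |  |  |
| S.D F0,0(R0) |  |  |  |
| ADD.D F0,F4,F2 |  |  |  |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load |  |  |  |  | 0 + Reg[R0] |
| Add1 | Yes | Add |  | Reg[F2] | Load1 |  |  |
| Add2 | Yes | Add |  | Reg[F2] | Mult1 |  |  |
| Add3 | No |  |  |  |  |  |  |
| Mult1 | Yes | Mult |  | Reg[F4] | Add1 |  |  |
| Mult2 | No |  |  |  |  |  |  |
| Store1 | No |  |  |  |  |  |  |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Add2 |  |  |  |  |  |  |  |  |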
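Figure 0.5

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F0,0(R0) | √ | √ |  |
| ADD.D F0,F0,F2 | √ |  |  |
| MUL.D F0,F0,F4 | √ |  |  |
| ADD.D F0,F0,F2 | √ |  |  |
| MUL.D F0,F0,F4 | √ |  |  |
| S.D F0,0(R0) |  |  |  |
| ADD.D F0,F4,F2 |  |  |  |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load |  |  |  |  | 0 + Reg[R0] |
| Add1 | Yes | Add |  | Reg[F2] | Load1 |  |  |
| Add2 | Yes | Add |  | Reg[F2] | Mult1 |  |  |
| Add3 | No |  |  |  |  |  |  |
| Mult1 | Yes | Mult |  | Reg[F4] | Add1 |  |  |
| Mult2 | Yes | Mult |  | Reg[F4] | Add2 |  |  |
| Store1 | No |  |  |  |  |  |  |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult2 |  |  |  |  |  |  |  |  |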
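Figure 0.6

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F0,0(R0) | √ | √ |  |
| ADD.D F0,F0,F2 | √ |  |  |
| MUL.D F0,F0,F4 | √ |  |  |
| ADD.D F0,F0,F2 | √ |  |  |
| MUL.D F0,F0,F4 | √ |  |  |
| S.D F0,0(R0) | √ |  |  |
| ADD.D F0,F4,F2 |  |  |  |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load |  |  |  |  | 0 + Reg[R0] |
| Add1 | Yes | Add |  | Reg[F2] | Load1 |  |  |
| Add2 | Yes | Add |  | Reg[F2] | Mult1 |  |  |
| Add3 | No |  |  |  |  |  |  |
| Mult1 | Yes | Mult |  | Reg[F4] | Add1 |  |  |
| Mult2 | Yes | Mult |  | Reg[F4] | Add2 |  |  |
| Store1 | Yes | Store |  |  |  | Mult2 | 0 + Reg[R0] |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Mult2 |  |  |  |  |  |  |  |  |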
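Figure 0.7

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F0,0(R0) | √ | √ |  |
| ADD.D F0,F0,F2 | √ |  |  |
| MUL.D F0,F0,F4 | √ |  |  |
| ADD.D F0,F0,F2 | √ |  |  |
| MUL.D F0,F0,F4 | √ |  |  |
| S.D F0,0(R0) | √ |  |  |
| ADD.D F0,F4,F2 | √ |  |  |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load |  |  |  |  | 0 + Reg[R0] |
| Add1 | Yes | Add |  | Reg[F2] | Load1 |  |  |
| Add2 | Yes | Add |  | Reg[F2] | Mult1 |  |  |
| Add3 | Yes | Add | Reg[F4] | Reg[F2] |  |  |  |
| Mult1 | Yes | Mult |  | Reg[F4] | Add1 |  |  |
| Mult2 | Yes | Mult |  | Reg[F4] | Add2 |  |  |
| Store1 | Yes | Store |  |  |  | Mult2 | 0 + Reg[R0] |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Add3 |  |  |  |  |  |  |  |  |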
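Figure 0.8

| Instruction | Issue | Execute | Write Result |
|---|---|---|---|
| L.D F0,0(R0) | √ | √ |  |
| ADD.D F0,F0,F2 | √ |  |  |
| MUL.D F0,F0,F4 | √ |  |  |
| ADD.D F0,F0,F2 | √ |  |  |
| MUL.D F0,F0,F4 | √ |  |  |
| S.D F0,0(R0) | √ |  |  |
| ADD.D F0,F4,F2 | √ | √ | √ |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load |  |  |  |  | 0 + Reg[R0] |
| Add1 | Yes | Add |  | Reg[F2] | Load1 |  |  |
| Add2 | Yes | Add |  | Reg[F2] | Mult1 |  |  |
| Add3 | No |  |  |  |  |  |  |
| Mult1 | Yes | Mult |  | Reg[F4] | Add1 |  |  |
| Mult2 | Yes | Mult |  | Reg[F4] | Add2 |  |  |
| Store1 | Yes | Store |  |  |  | Mult2 | 0 + Reg[R0] |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi |  |  |  |  |  |  |  |  |  |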
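The issue-stage bookkeeping traced in Figures 0.1 through 0.7 can be sketched in Python as follows. This is a simplified model under stated assumptions, not the full algorithm: a free reservation station is always available, no instruction has written its result yet, and the names `issue_all` and `STATION_CLASS` are invented for this illustration.

```python
# Which pool of reservation stations each opcode uses (names follow the figures).
STATION_CLASS = {"L.D": "Load", "ADD.D": "Add", "MUL.D": "Mult", "S.D": "Store"}

# Hypothetical encoding: (opcode, destination FP register or None, source FP registers).
PROGRAM = [
    ("L.D",   "F0", []),
    ("ADD.D", "F0", ["F0", "F2"]),
    ("MUL.D", "F0", ["F0", "F4"]),
    ("ADD.D", "F0", ["F0", "F2"]),
    ("MUL.D", "F0", ["F0", "F4"]),
    ("S.D",   None, ["F0"]),
    ("ADD.D", "F0", ["F4", "F2"]),
]

def issue_all(program):
    """Issue every instruction in order; return the station rows and the final Qi map."""
    qi = {}            # register -> station that will produce its next value
    used = {}          # station class -> how many stations have been handed out
    rows = []
    for opcode, dest, sources in program:
        cls = STATION_CLASS[opcode]
        used[cls] = used.get(cls, 0) + 1
        name = f"{cls}{used[cls]}"
        # A pending producer yields a tag (Q field); otherwise the value is read (V field).
        operands = [qi.get(reg, f"Reg[{reg}]") for reg in sources]
        rows.append((name, opcode, operands))
        if dest is not None:
            qi[dest] = name   # later readers of dest now wait on this station
    return rows, qi

rows, qi = issue_all(PROGRAM)
for name, opcode, operands in rows:
    print(name, opcode, operands)
print("Qi:", qi)
```

With these assumptions the sketch reproduces the tags of Figure 0.7: each instruction in the dependent chain waits on the station issued just before it, while Add3 reads Reg[F4] and Reg[F2] directly and so is free to execute ahead of the others, as Figure 0.8 shows.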
Modified Loop-Based Example
- Loop: L.D F0,0(R1)
- MUL.D F0,F0,F2
- ADD.D F0,F0,F4
- S.D F0,0(R1)
- DADDIU R1,R1,-8
- BNE R1,R2,Loop
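To make the loop's effect concrete, here is a hedged Python rendering of what it computes: each 8-byte element is multiplied by F2 and then has F4 added. Modeling memory as a dictionary keyed by byte address, and the function name `run_loop`, are assumptions made for this sketch; branch delay slots are ignored.

```python
def run_loop(memory, r1, r2, f2, f4):
    """Apply x = x * f2 + f4 to each element, walking R1 down in steps of 8 until it equals R2."""
    while True:
        f0 = memory[r1]      # L.D    F0,0(R1)
        f0 = f0 * f2         # MUL.D  F0,F0,F2
        f0 = f0 + f4         # ADD.D  F0,F0,F4
        memory[r1] = f0      # S.D    F0,0(R1)
        r1 -= 8              # DADDIU R1,R1,-8
        if r1 == r2:         # BNE    R1,R2,Loop falls through once R1 equals R2
            break
    return memory

# Hypothetical data: three doubles at byte addresses 8, 16, and 24.
memory = {8: 1.0, 16: 2.0, 24: 3.0}
print(run_loop(memory, r1=24, r2=0, f2=2.0, f4=1.0))
```

Each pass through the loop body touches a different address but reuses the name F0, which is why the figures that follow show separate reservation stations for each iteration.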
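Figure 0.1. One active iteration of loop

| Instruction | Iteration | Issue | Execute | Write Result |
|---|---|---|---|---|
| L.D F0,0(R1) | 1 | √ | √ |  |
| MUL.D F0,F0,F2 | 1 | √ |  |  |
| ADD.D F0,F0,F4 | 1 | √ |  |  |
| S.D F0,0(R1) | 1 | √ |  |  |
| L.D F0,0(R1) | 2 |  |  |  |
| MUL.D F0,F0,F2 | 2 |  |  |  |
| ADD.D F0,F0,F4 | 2 |  |  |  |
| S.D F0,0(R1) | 2 |  |  |  |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load |  |  |  |  | Reg[R1] |
| Load2 | No |  |  |  |  |  |  |
| Add1 | Yes | Add |  | Reg[F4] | Mult1 |  |  |
| Add2 | No |  |  |  |  |  |  |
| Mult1 | Yes | Mult |  | Reg[F2] | Load1 |  |  |
| Mult2 | No |  |  |  |  |  |  |
| Store1 | Yes | Store |  |  |  | Add1 | Reg[R1] |
| Store2 | No |  |  |  |  |  |  |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Add1 |  |  |  |  |  |  |  |  |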
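Figure 0.2. Two active iterations of loop

| Instruction | Iteration | Issue | Execute | Write Result |
|---|---|---|---|---|
| L.D F0,0(R1) | 1 | √ | √ |  |
| MUL.D F0,F0,F2 | 1 | √ |  |  |
| ADD.D F0,F0,F4 | 1 | √ |  |  |
| S.D F0,0(R1) | 1 | √ |  |  |
| L.D F0,0(R1) | 2 | √ | √ |  |
| MUL.D F0,F0,F2 | 2 | √ |  |  |
| ADD.D F0,F0,F4 | 2 | √ |  |  |
| S.D F0,0(R1) | 2 | √ |  |  |

| Name | Busy | Op | Vj | Vk | Qj | Qk | A |
|---|---|---|---|---|---|---|---|
| Load1 | Yes | Load |  |  |  |  | Reg[R1] |
| Load2 | Yes | Load |  |  |  |  | Reg[R1] - 8 |
| Add1 | Yes | Add |  | Reg[F4] | Mult1 |  |  |
| Add2 | Yes | Add |  | Reg[F4] | Mult2 |  |  |
| Mult1 | Yes | Mult |  | Reg[F2] | Load1 |  |  |
| Mult2 | Yes | Mult |  | Reg[F2] | Load2 |  |  |
| Store1 | Yes | Store |  |  |  | Add1 | Reg[R1] |
| Store2 | Yes | Store |  |  |  | Add2 | Reg[R1] - 8 |

| Field | F0 | F2 | F4 | F6 | F8 | F10 | F12 | ... | F30 |
|---|---|---|---|---|---|---|---|---|---|
| Qi | Add2 |  |  |  |  |  |  |  |  |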
Dynamic Branch Prediction
- Static branch prediction in Appendix A
- Branch prediction buffer: a small memory indexed by the lower portion of the address of the branch instruction. The memory contains a bit that says whether the branch was recently taken or not.
- The prediction bit may have been placed there by another branch whose address has the same low-order bits.
Figure 3.14. A Branch Prediction Buffer
- Use the 4 low-order address bits of the branch (word address) to choose a row.
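A small Python sketch of a 1-bit branch prediction buffer may help here. It assumes 4-byte instructions (so the word address is the byte address shifted right by 2), a 16-row buffer selected by the 4 low-order word-address bits, and an initial prediction of not taken; the class name `OneBitPredictor` and the example addresses are hypothetical.

```python
class OneBitPredictor:
    """One prediction bit per row, indexed by low-order bits of the branch word address."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.taken = [False] * (1 << index_bits)   # assumed initial state: not taken

    def index(self, pc):
        # Drop the 2 byte-offset bits to get the word address, then keep the low bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.taken[self.index(pc)]

    def update(self, pc, was_taken):
        # The row simply remembers the most recent outcome; no tag is checked.
        self.taken[self.index(pc)] = was_taken

predictor = OneBitPredictor()
predictor.update(0x1000, True)       # a branch at 0x1000 is taken
print(predictor.index(0x1000), predictor.index(0x1040))
print(predictor.predict(0x1040))     # a different branch reads the same row
```

Because the buffer has no tags, the branches at 0x1000 and 0x1040 in this example share row 0, so the second branch inherits a prediction bit that the first one wrote.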
Nested Loops
- Loop1: L.D F2,1600(R1)
- DADDIU R2,R0,80
- Loop2: L.D F0,1000(R2)
- ADD.D F0,F0,F2
- S.D F0,1000(R2)
- DADDIU R2,R2,-8
- BNEZ R2,Loop2
- DADDIU R1,R1,-8
- BNEZ R1,Loop1
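The inner branch `BNEZ R2,Loop2` is taken nine times and then falls through once each time the outer loop body runs. The Python sketch below counts how often a 1-bit predictor mispredicts that branch. The number of outer iterations is not given on the slide, so `outer_iterations=5` is a made-up value, and the initial prediction of not taken is also an assumption.

```python
def inner_branch_outcomes(outer_iterations, r2_start=80, step=8):
    """Return the taken/not-taken history of BNEZ R2,Loop2 across outer iterations."""
    outcomes = []
    for _ in range(outer_iterations):
        r2 = r2_start                # DADDIU R2,R0,80
        while True:
            r2 -= step               # DADDIU R2,R2,-8
            taken = r2 != 0          # BNEZ   R2,Loop2
            outcomes.append(taken)
            if not taken:
                break
    return outcomes

def one_bit_mispredictions(outcomes, initial=False):
    """A 1-bit predictor always predicts the previous outcome of the branch."""
    prediction, misses = initial, 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken
    return misses

history = inner_branch_outcomes(outer_iterations=5)
print(len(history), one_bit_mispredictions(history))
```

In this model the 1-bit predictor misses twice per execution of the inner loop, once on the final fall-through and again on the first iteration of the next execution, even though the branch is taken 90% of the time.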
Figure 3.7. States in 2-bit Prediction Scheme
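One common way to realize a 2-bit scheme is a saturating counter, sketched below in Python. The exact state transitions drawn in the figure may differ from this variant, so treat it as an assumption; the function name `two_bit_mispredictions` and the five-repetition test pattern are likewise illustrative.

```python
def two_bit_mispredictions(outcomes, state=0):
    """Count misses for one branch using a 2-bit saturating counter.

    States 0 and 1 predict not taken; states 2 and 3 predict taken.
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

# Pattern of the inner branch in the Nested Loops slide:
# nine taken outcomes followed by one not-taken, repeated five times.
pattern = ([True] * 9 + [False]) * 5
print(two_bit_mispredictions(pattern))
print(two_bit_mispredictions(pattern, state=3))
```

After warm-up, this counter mispredicts only the single fall-through in each group of ten, because one not-taken outcome moves the state from 3 to 2 and the prediction stays "taken".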
Figure 3.8. Prediction Accuracy of 4096-entry 2-bit Prediction Buffer for SPEC89 Benchmarks
Figure 3.9. Prediction Accuracy of 4096-entry 2-bit Prediction Buffer versus an infinite 2-bit Prediction Buffer for SPEC89