Title: Instruction Set Architecture (ISA)
1. Instruction Set Architecture (ISA)
The instruction set is the interface layer between software (above it) and hardware (below it):
  software
  instruction set
  hardware
2. Interface Design
- A good interface:
  - Lasts through many implementations (portability, compatibility)
  - Is used in many different ways (generality)
  - Provides convenient functionality to higher levels
  - Permits an efficient implementation at lower levels
[Diagram: one interface serving many uses over time, across implementations 1, 2, and 3]
3. Evolution of Instruction Sets
- Single Accumulator (EDSAC 1950)
- Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
- Separation of Programming Model from Implementation:
  - High-level Language Based (B5000 1963)
  - Concept of a Family (IBM 360 1964)
- General Purpose Register Machines:
  - Complex Instruction Sets (VAX, Intel 432 1977-80)
  - Load/Store Architecture (CDC 6600, Cray-1 1963-76)
    - RISC (MIPS, SPARC, HP-PA, IBM RS6000, ... 1987)
4. Evolution of Instruction Sets
- Major advances in computer architecture are typically associated with landmark instruction set designs
  - Example: stack vs. GPR (System/360)
- Design decisions must take into account:
  - technology
  - machine organization
  - programming languages
  - compiler technology
  - operating systems
- And they in turn influence these
5. What Influences ISA Design?
- The need to refer to values / memory
- Registers
- Main memory
- But possibly
- Cache?
- Values on a stack?
- Why do these choices exist?
6. Addressing Modes
- Register:          add R4, R3
- Immediate:         add R4, #3
- Displacement:      add R4, 100(R3)
- Register indirect: add R4, (R3)
- Indexed:           add R4, (R1 + R2)
- Direct (absolute): add R4, (1001)
- Memory indirect:   add R4, @(R3)
- Autoincrement:     add R4, (R3)+
- Autodecrement:     add R4, -(R3)
- Scaled:            add R4, 100(R2)[R3]
(A sketch of how each mode forms its operand follows.)
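A minimal sketch, not from the slides, of how each addressing mode above produces the source operand for "add R4, <operand>". The register values, memory contents, and scale factor are illustrative assumptions.

```python
# Toy machine state (assumed values, for illustration only).
regs = {"R1": 8, "R2": 3, "R3": 100, "R4": 0}
mem = {100: 42, 200: 300, 300: 9, 1001: 5}

def operand(mode, **kw):
    """Return the source operand value for 'add R4, <operand>'."""
    if mode == "register":          # add R4, R3          -> value in R3
        return regs[kw["r"]]
    if mode == "immediate":         # add R4, #3          -> the constant itself
        return kw["imm"]
    if mode == "displacement":      # add R4, 100(R3)     -> Mem[100 + R3]
        return mem[kw["disp"] + regs[kw["r"]]]
    if mode == "register_indirect": # add R4, (R3)        -> Mem[R3]
        return mem[regs[kw["r"]]]
    if mode == "indexed":           # add R4, (R1 + R2)   -> Mem[R1 + R2]
        return mem[regs[kw["r1"]] + regs[kw["r2"]]]
    if mode == "direct":            # add R4, (1001)      -> Mem[1001]
        return mem[kw["addr"]]
    if mode == "memory_indirect":   # add R4, @(R3)       -> Mem[Mem[R3]]
        return mem[mem[regs[kw["r"]]]]
    if mode == "autoincrement":     # add R4, (R3)+       -> Mem[R3], then R3 += d
        val = mem[regs[kw["r"]]]
        regs[kw["r"]] += kw.get("d", 4)
        return val
    if mode == "autodecrement":     # add R4, -(R3)       -> R3 -= d, then Mem[R3]
        regs[kw["r"]] -= kw.get("d", 4)
        return mem[regs[kw["r"]]]
    if mode == "scaled":            # add R4, 100(R2)[R3] -> Mem[100 + R2 + R3*d]
        return mem[kw["disp"] + regs[kw["r2"]] + regs[kw["r3"]] * kw.get("d", 4)]
    raise ValueError(mode)

print(operand("displacement", disp=100, r="R3"))   # Mem[200] -> 300
```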
7A "Typical" RISC
- 32-bit fixed format instruction (3 formats)
- 32 32-bit GPR (R0 contains zero, DP take pair)
- 3-address, reg-reg arithmetic instruction
- Single address mode for load/store base
displacement - no indirection
- Simple branch conditions
- Delayed branch
see SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM
PowerPC, CDC 6600, CDC 7600, Cray-1,
Cray-2, Cray-3
8. Example: MIPS
Register-Register:
  bits 31-26: Op | 25-21: Rs1 | 20-16: Rs2 | 15-11: Rd | 10-0: Opx
Register-Immediate:
  bits 31-26: Op | 25-21: Rs1 | 20-16: Rd | 15-0: immediate
Branch:
  bits 31-26: Op | 25-21: Rs1 | 20-16: Rs2/Opx | 15-0: immediate
Jump / Call:
  bits 31-26: Op | 25-0: target
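A minimal sketch, my own illustration rather than part of the slides, of extracting the fields of these three formats from a 32-bit word with shifts and masks. Field positions follow the layout above.

```python
def bits(word, hi, lo):
    """Extract bits hi..lo (inclusive, LSB-numbered) of a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode(word):
    return {
        "op": bits(word, 31, 26),
        # Register-Register: Op | Rs1 | Rs2 | Rd | Opx
        "rr": dict(rs1=bits(word, 25, 21), rs2=bits(word, 20, 16),
                   rd=bits(word, 15, 11), opx=bits(word, 10, 0)),
        # Register-Immediate / Branch: Op | Rs1 | Rd (or Rs2/Opx) | immediate
        "ri": dict(rs1=bits(word, 25, 21), rd=bits(word, 20, 16),
                   imm=bits(word, 15, 0)),
        # Jump / Call: Op | target
        "j": dict(target=bits(word, 25, 0)),
    }

# Example: an R-type word with rs1=2, rs2=3, rd=4, opx=32
word = (2 << 21) | (3 << 16) | (4 << 11) | 32
print(decode(word)["rr"])   # {'rs1': 2, 'rs2': 3, 'rd': 4, 'opx': 32}
```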
9. Warts: x86
- Floating-point co-processor design
- Complex string-move instructions
  - Used in practice
- Self-modifying code
- Condition registers
10. Pipelining: It's Natural!
- Laundry example
  - Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
  - Washer takes 30 minutes
  - Dryer takes 40 minutes
  - Folder takes 20 minutes
11. Sequential Laundry
[Figure: timeline from 6 PM to midnight; each of the four loads occupies its 30 + 40 + 20 minutes back to back]
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?
12. Pipelined Laundry: Start Work ASAP
[Figure: timeline starting at 6 PM; washer, dryer, and folder work on different loads at the same time]
- Pipelined laundry takes 3.5 hours for 4 loads
13. Pipelining Lessons
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
[Figure: pipelined laundry timeline from 6 PM onward]
(A timing calculation for the laundry example follows.)
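A minimal sketch of the arithmetic behind the laundry numbers, assuming the pipeline advances at the rate of its slowest stage (the 40-minute dryer); my own check, not part of the slides.

```python
stages = [30, 40, 20]          # washer, dryer, folder (minutes)
loads = 4

sequential = loads * sum(stages)                    # 4 * 90 = 360 min = 6 hours
# Pipelined: the first load passes through every stage; each later load
# finishes one slowest-stage time after the previous one.
pipelined = sum(stages) + (loads - 1) * max(stages) # 90 + 3*40 = 210 min = 3.5 hours

print(sequential / 60, pipelined / 60)              # 6.0  3.5
```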
14. Computer Pipelines
- Execute billions of instructions, so throughput is what matters
- DLX desirable features: all instructions the same length, registers located in the same place in the instruction format, memory operands only in loads or stores
15. 5 Steps of DLX Datapath (Figure 3.1, Page 130)
[Figure: datapath stages Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back, with the IR and LMD registers]
16. Fetch / Decode
- Instruction Fetch (IF):
  - IR <- Mem[PC]
  - NPC <- PC + 4
- Decode / register fetch (ID):
  - A <- Regs[IR6..10]
  - B <- Regs[IR11..15]
  - Imm <- IR16..31
17. Execute Step
- Memory reference:
  - ALUOutput <- A + Imm
  - Calculates the effective address of the memory operation
- Reg-Reg ALU:
  - ALUOutput <- A func B
- Reg-Imm ALU:
  - ALUOutput <- A op Imm
- Branch:
  - ALUOutput <- NPC + Imm
  - Cond <- (A op 0)
18. Memory Access
- Memory reference:
  - LMD <- Mem[ALUOutput]
  - or, Mem[ALUOutput] <- B
- Branch:
  - if (Cond) PC <- ALUOutput else PC <- NPC
19. Writeback
- Reg-Reg:
  - Regs[IR16..20] <- ALUOutput
- Reg-Imm:
  - Regs[IR11..15] <- ALUOutput
- Load:
  - Regs[IR11..15] <- LMD
(A sketch stepping one instruction through these five steps follows.)
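A minimal sketch, not from the slides, walking one register-register ADD through the five DLX steps using the same temporaries (IR, NPC, A, B, Imm, ALUOutput). The bit-field helper and the register/memory contents are assumptions made for the example; fields are numbered from the MSB as on the slides (IR6..10 etc.).

```python
def ir_field(w, first, last):
    """Bits first..last of a 32-bit word, numbered from the MSB (bit 0 = MSB)."""
    width = last - first + 1
    return (w >> (31 - last)) & ((1 << width) - 1)

regs = [0] * 32
regs[2], regs[3] = 5, 7
imem = {0: (2 << 21) | (3 << 16) | (4 << 11)}   # encodes "add r4, r2, r3"
pc = 0

# IF
ir = imem[pc]
npc = pc + 4
# ID: register fetch / immediate extract
a = regs[ir_field(ir, 6, 10)]
b = regs[ir_field(ir, 11, 15)]
imm = ir_field(ir, 16, 31)
# EX: reg-reg ALU
alu_output = a + b
# MEM: nothing to do for an ALU instruction
# WB
regs[ir_field(ir, 16, 20)] = alu_output

print(regs[4])   # 12
```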
20. Non-Pipelined Implementation
- Branch and store instructions require four cycles
- All others require five cycles
- Assumes memory access is immediate; otherwise it's slower
- Alternatively, we could have implemented the machine with a single long clock cycle
  - No one would do this: it requires duplication of shared units / information
21. Pipelined DLX Datapath (Figure 3.4, Page 137)
[Figure: the five stages — Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, Write Back — separated by pipeline registers]
- Data stationary control
  - local decode for each instruction phase / pipeline stage
22. Pipelined Implementation
  I:   IF  ID  EX  MEM WB
  I1:      IF  ID  EX  MEM WB
  I2:          IF  ID  EX  MEM WB
  I3:              IF  ID  EX  MEM WB
23. Pipeline Latches
- Each instruction is active in only a single pipeline stage at a time
- The pipeline latches can also be used to simplify testing / debugging
- Latches add overhead, though
  - But some latch designs let us overlap computation and latch overhead
24. Visualizing Pipelining Resources (Figure 3.3, Page 133)
[Figure: instructions in program order vs. time (clock cycles), showing when each one uses the instruction memory and the data memory]
25. It's Not That Easy for Computers
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  - Structural hazards: HW cannot support this combination of instructions (a single person to fold and put clothes away)
  - Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock)
  - Control hazards: caused by pipelining of branches and other instructions that change the PC; the common solution is to stall the pipeline until the hazard resolves, creating bubbles in the pipeline
26. One Memory Port / Structural Hazards (Figure 3.6, Page 142)
[Figure: a Load followed by Instr 1-4 vs. time (clock cycles); with a single memory port, the Load's data-memory access and a later instruction's fetch need the memory in the same cycle]
27. One Memory Port / Structural Hazards (Figure 3.7, Page 143)
[Figure: the same sequence with a stall (bubble) inserted so Instr 3's fetch waits for the memory port]
28. Structural Hazards
- How do you avoid them?
  - Duplicate resources
  - Pipeline resources
- Why would they exist?
  - Cost: e.g., duplicating the memory interface is expensive
  - Latency: it may be better to avoid pipelining to reduce the latency of a specific operation
    - Example: the CDC 7600 and MIPS R2010 FPU chose shorter latency rather than fully pipelined FP operations
    - Typically FMUL is pipelined, but not, e.g., FDIV
29. Speedup Equation for Pipelining
- CPI_pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
- Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI) x (Clock Cycle_unpipelined / Clock Cycle_pipelined)
- For Ideal CPI = 1:
  Speedup = Pipeline depth / (1 + Pipeline stall CPI) x (Clock Cycle_unpipelined / Clock Cycle_pipelined)
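A minimal sketch, my own helper rather than part of the slides, of the ideal-CPI-1 form of the speedup formula above.

```python
def pipeline_speedup(depth, stall_cpi, clock_unpipelined, clock_pipelined):
    """Speedup of the pipelined machine, assuming ideal CPI = 1."""
    return depth / (1.0 + stall_cpi) * (clock_unpipelined / clock_pipelined)

# No stalls and an unchanged cycle time: speedup equals the pipeline depth.
print(pipeline_speedup(depth=5, stall_cpi=0.0,
                       clock_unpipelined=1.0, clock_pipelined=1.0))   # 5.0
```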
30. Example: Dual-Port vs. Single-Port
- Machine A: dual-ported memory
- Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Assume loads are 40% of executed instructions
- SpeedUpA = Pipeline Depth / (1 + 0) x (clock_unpipe / clock_pipe)
           = Pipeline Depth
- SpeedUpB = Pipeline Depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05))
           = (Pipeline Depth / 1.4) x 1.05
           = 0.75 x Pipeline Depth
- SpeedUpA / SpeedUpB = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33
- Machine A is 1.33 times faster
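A minimal sketch checking the arithmetic above, assuming every load stalls one cycle on the single-ported machine; my own restatement, not from the slides.

```python
depth = 5                      # any pipeline depth; it cancels in the ratio
load_fraction = 0.40
clock_speedup_b = 1.05

speedup_a = depth / (1 + 0.0)                            # no structural stalls
speedup_b = depth / (1 + load_fraction * 1) * clock_speedup_b

print(speedup_a / speedup_b)   # ~1.33: machine A is about 1.33x faster
```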
31. Data Hazard on R1 (Figure 3.9, Page 147)
[Figure: the instructions below vs. time (clock cycles), with stages IF, ID/RF, EX, MEM, WB; the later instructions read R1 before the ADD writes it back]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
32. Data Hazards
- SUB and AND read the old value of R1
  - And, depending on previous instructions, they may read different old values
- OR may read the proper value if reads occur after writes within the register-file access (major / minor clocks)
- Only XOR would read the proper value
- Not deterministic: interrupts affect timing
- But people have tried exposed pipelines
  - MIPS
  - Intel i860
33. Three Generic Data Hazards
- InstrI followed by InstrJ
- Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it
34. Three Generic Data Hazards
- InstrI followed by InstrJ
- Write After Read (WAR): InstrJ tries to write an operand before InstrI reads it
  - InstrI gets the wrong operand
- Can't happen in the DLX 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5
35. Three Generic Data Hazards
- InstrI followed by InstrJ
- Write After Write (WAW): InstrJ tries to write an operand before InstrI writes it
  - Leaves the wrong result (InstrI's, not InstrJ's)
- Can't happen in the DLX 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
- Could happen if WB for an ALU instruction happened in the MEM stage, or if a MEM access took two cycles
- We'll see WAR and WAW in later, more complicated pipes
(A small classifier for the three hazard types follows.)
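A minimal sketch, my own illustration rather than part of the slides, classifying RAW / WAR / WAW hazards between an earlier instruction I and a later instruction J from the registers each reads and writes. The instruction representation is an assumption made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Instr:
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)

def hazards(i: Instr, j: Instr):
    """Return the hazard types J creates against the earlier instruction I."""
    kinds = set()
    if j.reads & i.writes:
        kinds.add("RAW")   # J reads something I has not written yet
    if j.writes & i.reads:
        kinds.add("WAR")   # J overwrites something I still needs to read
    if j.writes & i.writes:
        kinds.add("WAW")   # I's later-completing write could overwrite J's result
    return kinds

add_ = Instr(reads={"r2", "r3"}, writes={"r1"})   # add r1, r2, r3
sub_ = Instr(reads={"r1", "r3"}, writes={"r4"})   # sub r4, r1, r3
print(hazards(add_, sub_))   # {'RAW'}
```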
36. Forwarding to Avoid Data Hazard (Figure 3.10, Page 149)
[Figure: the same sequence vs. time (clock cycles); the ADD's ALU result is forwarded directly to the later instructions' EX stages]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
37. HW Change for Forwarding (Figure 3.20, Page 161)
38. Data Hazard Even with Forwarding (Figure 3.12, Page 153)
[Figure: the sequence below vs. time (clock cycles); the loaded value is not available until the end of the LW's MEM stage]
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9
39. Data Hazards Requiring Stalls
- LW doesn't have the data until the end of cycle 4 (the MEM cycle for LW)
- SUB needs the data by the beginning of that cycle
- Thus, we can't completely eliminate the hazard
- The easiest thing to do is use a pipeline interlock to force a stall
40. Data Hazard Even with Forwarding (Figure 3.13, Page 154)
[Figure: the same sequence with a one-cycle stall inserted so the loaded value can be forwarded to SUB]
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9
41. Prior to Stall
  LW   IF  ID  EX  MEM WB
  SUB      IF  ID  EX  MEM WB
  AND          IF  ID  EX  MEM WB
  OR               IF  ID  EX  MEM WB
42. With Stall
  LW   IF  ID  EX     MEM  WB
  SUB      IF  ID     stall EX  MEM WB
  AND          IF     stall ID  EX  MEM WB
  OR                  stall IF  ID  EX  MEM
43. Example
- Suppose that 30% of instructions are loads and that ½ the time, the instruction following the load depends on the load value
- If this hazard creates a single-cycle delay, how much faster is the ideal pipelined machine?
- CPI = 0.7 x 1 + 0.3 x 1.5 = 1.15
- So, the ideal machine is 15% faster
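A minimal sketch of the arithmetic above; my own check, not part of the slides.

```python
load_fraction = 0.30
dependent_fraction = 0.5     # of loads, followed by a dependent instruction

# Loads cost 1 cycle plus a 1-cycle stall half the time: 1.5 cycles on average.
cpi = (1 - load_fraction) * 1 + load_fraction * (1 + dependent_fraction * 1)
print(cpi)            # 1.15
print(cpi / 1.0)      # the ideal (stall-free) machine is 1.15x, i.e. 15%, faster
```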
44. Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code (two load-use stalls):
  LW   Rb,b
  LW   Rc,c
  ADD  Ra,Rb,Rc     <- stall: Rc not ready
  SW   a,Ra
  LW   Re,e
  LW   Rf,f
  SUB  Rd,Re,Rf     <- stall: Rf not ready
  SW   d,Rd
Fast code (loads moved up; no stalls):
  LW   Rb,b
  LW   Rc,c
  LW   Re,e
  ADD  Ra,Rb,Rc
  LW   Rf,f
  SW   a,Ra
  SUB  Rd,Re,Rf
  SW   d,Rd
(A small checker for load-use stalls in such sequences follows.)
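A minimal sketch, my own illustration rather than part of the slides, counting load-use stalls in a straight-line sequence under the one-cycle load delay assumed for the DLX pipeline. Instructions are represented here as (op, dest, sources) tuples.

```python
def load_use_stalls(code):
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        # A stall is needed when an instruction uses the result of the
        # immediately preceding load.
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls

slow = [("LW", "Rb", ["b"]), ("LW", "Rc", ["c"]), ("ADD", "Ra", ["Rb", "Rc"]),
        ("SW", "a", ["Ra"]), ("LW", "Re", ["e"]), ("LW", "Rf", ["f"]),
        ("SUB", "Rd", ["Re", "Rf"]), ("SW", "d", ["Rd"])]
fast = [("LW", "Rb", ["b"]), ("LW", "Rc", ["c"]), ("LW", "Re", ["e"]),
        ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", ["f"]), ("SW", "a", ["Ra"]),
        ("SUB", "Rd", ["Re", "Rf"]), ("SW", "d", ["Rd"])]
print(load_use_stalls(slow), load_use_stalls(fast))   # 2 0
```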
45. How Common Are Load Stalls?
46. Implementing Load Interlocks
- Software: insert NOPs
- Hardware:
  - Is the load destination a source for the subsequent instruction?
  - Two possible source registers in the subsequent instruction
  - Have to check all possible instruction formats!
(A sketch of the hardware check follows.)
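A minimal sketch, my own illustration rather than part of the slides, of the interlock check in ID. Which register fields are actually sources depends on the instruction format, which is why every format has to be checked; the format-to-sources mapping below is a simplified assumption.

```python
SOURCE_FIELDS = {
    "reg-reg": ("rs1", "rs2"),
    "reg-imm": ("rs1",),
    "load":    ("rs1",),
    "store":   ("rs1", "rs2"),   # base register and the value being stored
    "branch":  ("rs1",),
}

def needs_interlock(load_rd, next_format, next_fields):
    """True if the instruction behind a load reads the load's destination."""
    return any(next_fields.get(f) == load_rd
               for f in SOURCE_FIELDS.get(next_format, ()))

# lw r1, 0(r2) followed by sub r4, r1, r6 -> must stall one cycle
print(needs_interlock(1, "reg-reg", {"rs1": 1, "rs2": 6}))   # True
```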
47. Control Hazard on Branches: Three-Stage Stall
48. Branch Stall Impact
- If CPI = 1 and 30% of instructions are branches, a 3-cycle stall => new CPI = 1.9!
- Two-part solution:
  - Determine whether the branch is taken or not sooner (in ID), AND
  - Compute the taken-branch address earlier
- DLX branches test whether a register is = 0 or != 0
- DLX solution:
  - Move the zero test to the ID/RF stage
  - Add an adder to calculate the new PC in the ID/RF stage
  - 1 clock cycle penalty for a branch versus 3
- Data hazard: stall if the branch depends on the result of an immediately prior ALU operation, e.g.:
  ADD R1, R2, R3
  BEQZ R1, foo
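A minimal sketch of the CPI arithmetic above; my own check, not part of the slides.

```python
branch_fraction = 0.30
print(1 + branch_fraction * 3)   # 1.9 with a 3-cycle branch stall
print(1 + branch_fraction * 1)   # 1.3 once the branch resolves in ID (1-cycle penalty)
```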
49. Alternatives
- Figuring out that it's a branch:
  - pre-decode the branch
- Computing the condition:
  - use condition codes
  - but the condition needs to be computed early enough
- Computing the address:
  - don't use relative branches
50. Pipelined DLX Datapath (Figure 3.22, Page 163)
[Figure: the five stages — Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, Write Back — with the branch decision moved into ID]
- This is the correct 1-cycle-latency implementation! (needs a mux)
51. Four Branch Hazard Alternatives
- #1: Stall until the branch direction is clear
- #2: Predict Branch Not Taken
  - Execute successor instructions in sequence
  - Squash instructions in the pipeline if the branch is actually taken
  - Advantage of late pipeline state update
  - 47% of DLX branches are not taken on average
  - PC+4 is already calculated, so use it to get the next instruction
- #3: Predict Branch Taken
  - 53% of DLX branches are taken on average
  - But we haven't calculated the branch target address yet in DLX
    - DLX still incurs a 1-cycle branch penalty
    - On other machines the branch target is known before the outcome
52. Four Branch Hazard Alternatives
- #4: Delayed Branch
  - Define the branch to take place AFTER a following instruction:
      branch instruction
      sequential successor 1
      sequential successor 2
      ........
      sequential successor n      <- branch delay of length n
      branch target if taken
  - A 1-slot delay allows proper decision and branch target address in the 5-stage pipeline
  - DLX uses this
53. Delayed Branch
- Where to get instructions to fill the branch delay slot?
  - From before the branch instruction
  - From the target address: only valuable when the branch is taken
  - From the fall-through path: only valuable when the branch is not taken
- Canceling branches allow more slots to be filled
54. Delayed Branch
- Compiler effectiveness for a single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots are useful in computation
  - About 50% (60% x 80%) of slots are usefully filled
- Problems:
  - Exposes pipeline design to the user
  - Increased pipeline depth -> need more slots
  - Increased issue width -> need more slots
55. Evaluating Branch Alternatives
  Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
  Stall pipeline       3                1.42   3.5                      1.0
  Predict taken        1                1.14   4.4                      1.26
  Predict not taken    1                1.09   4.5                      1.29
  Delayed branch       0.5              1.07   4.6                      1.31
- Conditional & unconditional branches = 14% of instructions; 65% of them change the PC
(A small CPI recomputation follows.)
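A minimal sketch reproducing the CPI column above from CPI = 1 + branch frequency x average branch penalty; my own check, not part of the slides.

```python
branch_freq = 0.14
taken_fraction = 0.65

cpi_stall     = 1 + branch_freq * 3                    # ~1.42
cpi_taken     = 1 + branch_freq * 1                    # ~1.14
cpi_not_taken = 1 + branch_freq * taken_fraction * 1   # ~1.09 (penalty only when taken)
cpi_delayed   = 1 + branch_freq * 0.5                  # ~1.07
print(cpi_stall, cpi_taken, cpi_not_taken, cpi_delayed)
```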
56. Hardware / Software
- Compiler-based static branch prediction
  - Use machine learning to guess the branch direction
- Profile-based prediction
  - Run the program several times
  - Record the behavior across runs
  - Assume the past predicts the future
(A sketch of profile-based prediction follows.)
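A minimal sketch of the profile-based idea above, my own illustration rather than part of the slides: record each branch's outcomes over profiling runs, then statically predict the majority direction.

```python
from collections import defaultdict

profile = defaultdict(lambda: [0, 0])        # branch PC -> [not-taken count, taken count]

def record(branch_pc, taken):
    profile[branch_pc][1 if taken else 0] += 1

def predict(branch_pc):
    not_taken, taken = profile[branch_pc]
    return taken >= not_taken                # majority vote from the profile

# Profiling runs observed this branch taken twice and not taken once:
for outcome in (True, True, False):
    record(0x400100, outcome)
print(predict(0x400100))    # True -> statically predict "taken"
```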
57. Complexity - Exceptions
- Synchronous vs. asynchronous
  - e.g., page faults vs. I/O completion
- User-requested vs. coerced
  - e.g., an O/S transition vs. a page fault
- User-maskable vs. unmaskable
- Within vs. between instructions
  - one word of a multi-word operation causes a fault
- Resume vs. terminate
58. Complexity - Exceptions
- Restartable
  - the machine provides a mechanism to restart program execution
- Precise
  - all instructions prior to the excepting one are committed; all following are not committed
59. Exceptions - Ordering
- Consider exceptions arising in the MEM and IF stages:
  - MEM because of an invalid access
  - IF also because of an access fault
- The IF fault may occur (in time) before the MEM fault, but the MEM fault belongs to the earlier instruction and needs to be reported first
- How? Pipeline the exception state and raise exceptions at WB
(A sketch of carrying exception state down the pipeline follows.)
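A minimal sketch, my own illustration of "pipeline the exception state": each instruction carries an exception flag through the pipeline and nothing is raised until WB, so faults come out in program order even when a younger instruction faults earlier in time. The instruction representation is an assumption for the example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def run(program):
    """program: list of dicts like {'name': 'i1', 'faults_in': 'MEM' or None}."""
    status = [{"instr": p, "exception": None} for p in program]
    for cycle in range(len(program) + len(STAGES) - 1):
        for i, slot in enumerate(status):
            stage_index = cycle - i                 # instruction i entered IF at cycle i
            if 0 <= stage_index < len(STAGES):
                stage = STAGES[stage_index]
                if slot["instr"].get("faults_in") == stage:
                    slot["exception"] = stage       # record it; do not raise yet
                if stage == "WB" and slot["exception"]:
                    print(f"cycle {cycle}: raise {slot['exception']} fault "
                          f"for {slot['instr']['name']}")

run([{"name": "i1", "faults_in": "MEM"},   # older instruction, faults later in time
     {"name": "i2", "faults_in": "IF"}])   # younger instruction, faults earlier
# i1's fault is raised at its WB first, then i2's: program order is preserved
```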
60. Exceptions
- We'll soon read a seminal paper on handling precise exceptions
- Another alternative is to use exception barriers or trap barriers
- Precise exceptions may be more than is needed by many programs
- We can allow the compiler / program to specify trap barriers; this may allow better execution
61. NetBurst(TM) Micro-architecture Pipeline vs. P6
[Figure: P6 pipeline, introduced at 733 MHz in 0.18 µ, vs. the NetBurst pipeline, introduced at >= 1.4 GHz in 0.18 µ]
- Hyper-pipelined technology enables industry-leading performance and clock rate
62. Hyper-Pipelined Technology
63. Pipelining Summary
- Just overlap tasks; easy if the tasks are independent
- Speedup <= pipeline depth; if ideal CPI is 1, then:
  Speedup = Pipeline Depth / (1 + Pipeline stall CPI) x (Clock Cycle Unpipelined / Clock Cycle Pipelined)
- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  - Control: delayed branch, prediction