Title: William Stallings Computer Organization and Architecture 5th Edition
1William Stallings Computer Organization and Architecture, 5th Edition
- Chapter 11
- CPU Structure and Function
2Topics
- Processor Organization
- Register Organization
- Instruction Cycle
- Instruction Pipelining
- The Pentium Processor
3CPU Structure
- The CPU must:
- Fetch instructions
- Interpret instructions
- Fetch data
- Process data
- Write data
- The CPU needs a small internal memory
4CPU With Systems Bus
5CPU Internal Structure
6Registers
- CPU must have some working space (temporary storage)
- Called registers
- Number and function vary between processor designs
- One of the major design decisions
- Top level of memory hierarchy
- Two categories
- User-visible registers
- Control and status registers
7User Visible Registers
- General Purpose
- Data
- Address
- Condition Codes
8User Visible Registers
- General Purpose Registers
- May be true general purpose
- May be restricted
- Data registers
- Accumulator register
- Addressing registers
- Segment pointers
- Index registers
- Stack Pointer
9General or Special?
- Make them general purpose
- Increase flexibility and programmer options
- Increase instruction size and complexity
- Make them specialized
- Smaller (faster) instructions
- Less flexibility
- The trend seems to be toward the use of specialized registers.
10How Many GP Registers?
- Between 8 and 32
- Fewer registers mean more memory references
- More registers do not noticeably reduce memory references
11How big?
- Large enough to hold the largest address
- Large enough to hold most data types
- Often possible to combine two data registers
- C programming
- double a
- long int a
12Condition Code Registers
- Condition codes are bits set by the CPU hardware as the result of operations.
- Sets of individual bits
- e.g. result of last operation was zero
- At least partially visible to the user
- Can be read (implicitly) by programs
- e.g. Jump if zero
- Cannot (usually) be set by programs
13Control and Status Registers
- Program Counter (PC)
- Instruction Register (IR)
- Memory Address Register (MAR)
- Memory Buffer Register (MBR)
- Revision: what do these all do?
14Program Status Word (PSW)
- A set of bits; includes condition codes
- Sign of last result
- Zero
- Carry
- Equal
- Overflow
- Interrupt enable/disable
- Supervisor
15Program Status Word - Example
- Motorola 68000's PSW
- System byte (bits 15-8): Trace Mode T (bit 15), Supervisor Status S (bit 13), Interrupt Mask I2 I1 I0 (bits 10-8)
- User byte (bits 7-0): condition codes X (bit 4), N (bit 3), Z (bit 2), V (bit 1), C (bit 0)
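The 68000 layout above can be unpacked with a few shifts and masks. The following is a minimal Python sketch, not part of the original slides; the names `FLAG_BITS` and `decode_sr` are illustrative assumptions, and the bit positions are the standard 68000 status register positions.

```python
# Bit positions of single-bit fields in the Motorola 68000 status register.
FLAG_BITS = {"T": 15, "S": 13, "X": 4, "N": 3, "Z": 2, "V": 1, "C": 0}


def decode_sr(sr):
    """Split a 16-bit status register value into its named fields."""
    fields = {name: (sr >> bit) & 1 for name, bit in FLAG_BITS.items()}
    fields["I"] = (sr >> 8) & 0b111  # three-bit interrupt mask I2 I1 I0
    return fields


if __name__ == "__main__":
    # Example value: supervisor mode, interrupt mask 7, Z flag set.
    print(decode_sr(0x2704))
```

A single word holds both the privileged system byte and the user-visible condition codes, which is why the PSW is described as a set of bits rather than a number.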
16Other Registers
- May have registers pointing to
- Process control blocks (PCB) (see O/S)
- Interrupt vectors (see O/S)
- N.B. CPU design and operating system design are closely linked
17Example Register Organizations
18Instruction Cycle
An instruction cycle includes the following subcycles: fetch, execute, and interrupt
19Indirect Addressing Cycle
- May require memory access to fetch operands
- Indirect addressing requires more memory accesses
- Can be thought of as an additional instruction subcycle
20Instruction Cycle with Indirect
21Instruction Cycle State Diagram
22Data Flow (Instruction Fetch)
- Depends on CPU design
- In general
- Fetch
- PC contains address of next instruction
- Address moved to MAR
- Address placed on address bus
- Control unit requests memory read
- Result placed on data bus, copied to MBR, then to IR
- Meanwhile PC incremented by 1
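The fetch data flow can be traced with a toy register model. This is a minimal Python sketch under stated assumptions: registers are entries in a dictionary, memory is a list of instruction strings, the buses are not modelled, and the PC increment is done after the read although the slide describes it as happening meanwhile. The function name `fetch_cycle` is hypothetical.

```python
def fetch_cycle(regs, memory):
    """Perform one instruction fetch, updating the registers in place."""
    regs["MAR"] = regs["PC"]            # address moved to MAR
    regs["MBR"] = memory[regs["MAR"]]   # memory read, result copied to MBR
    regs["IR"] = regs["MBR"]            # then copied to IR
    regs["PC"] += 1                     # PC incremented by 1
    return regs


if __name__ == "__main__":
    memory = ["LOAD 5", "ADD 6", "STORE 7"]
    regs = {"PC": 0, "MAR": None, "MBR": None, "IR": None}
    fetch_cycle(regs, memory)
    print(regs)
```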
23Data Flow (Fetch Diagram)
24Data Flow (Indirect Cycle)
- IR is examined
- If indirect addressing, indirect cycle is performed
- Rightmost N bits of MBR transferred to MAR
- Control unit requests memory read
- Result (address of operand) moved to MBR
- Instruction format: op-code | address
25Data Flow (Indirect Diagram)
26Data Flow (Execute Cycle)
- May take many forms
- Depends on instruction being executed
- May include
- Memory read/write
- Input/Output
- Register transfers
- ALU operations
27Data Flow (Interrupt Cycle)
- Current PC saved to allow resumption after interrupt
- Contents of PC copied to MBR
- Special memory location (e.g. stack pointer) loaded to MAR
- MBR written to memory
- PC loaded with address of interrupt handling routine
- Next instruction (first of interrupt handler) can be fetched
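The interrupt data flow can be sketched the same way. This minimal Python example assumes a dictionary of registers, a list as memory, and a stack that grows downward from `SP`; the name `interrupt_cycle` and the stack direction are assumptions made for illustration, not details from the slides.

```python
def interrupt_cycle(regs, memory, handler_address):
    """Save the PC on the stack and redirect execution to the handler."""
    regs["MBR"] = regs["PC"]            # contents of PC copied to MBR
    regs["SP"] -= 1                     # assumed: stack grows downward
    regs["MAR"] = regs["SP"]            # save location loaded to MAR
    memory[regs["MAR"]] = regs["MBR"]   # MBR written to memory
    regs["PC"] = handler_address        # PC loaded with handler address
    return regs


if __name__ == "__main__":
    memory = [0] * 16
    regs = {"PC": 5, "SP": 16, "MAR": None, "MBR": None}
    interrupt_cycle(regs, memory, handler_address=12)
    print(regs, memory[15])
```

After the call, the next fetch cycle reads the first instruction of the handler, and the saved PC is available in memory for the return.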
28Data Flow (Interrupt Diagram)
29Pipelining
- Laundry Example
- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes
30Sequential Laundry
[Figure: timeline from 6 PM to midnight, tasks in order; each of the four loads takes 30 + 40 + 20 minutes, one after another]
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?
31Pipelined Laundry
[Figure: timeline from 6 PM to midnight, tasks in order; the four loads overlap]
- Pipelined laundry takes 3.5 hours for 4 loads
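The 6 hour and 3.5 hour figures can be checked with a short calculation. This is a minimal Python sketch assuming each machine handles one load at a time, loads are processed in order, and a finished load can wait between machines; the function names are hypothetical.

```python
def sequential_minutes(stage_times, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_times)


def pipelined_minutes(stage_times, loads):
    """Finish time when stages overlap but each stage holds one load."""
    free_at = [0] * len(stage_times)   # time each stage next becomes free
    for _ in range(loads):
        ready = 0                      # time this load leaves the previous stage
        for stage, duration in enumerate(stage_times):
            start = max(ready, free_at[stage])
            free_at[stage] = start + duration
            ready = free_at[stage]
    return free_at[-1]


if __name__ == "__main__":
    stages = [30, 40, 20]              # washer, dryer, folder (minutes)
    print(sequential_minutes(stages, 4) / 60, "hours sequential")
    print(pipelined_minutes(stages, 4) / 60, "hours pipelined")
```

The pipelined total is dominated by the 40 minute dryer: one wash to fill the pipeline, four dryer slots, and one fold to drain it.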
32Pipelining Lessons(1)
- Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
- Pipeline rate limited by slowest pipeline stage
- Multiple tasks operating simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill pipeline and time to drain it reduce speedup
33Pipelining Lessons(2)
34Instruction Pipelining
- Similar to assembly line in manufacturing plants
- Products at various stages can be worked on simultaneously
- → Performance improved
- First attempt: 2 stages
- Fetch
- Execution
35Prefetch
- Fetch accesses main memory
- Execution usually does not access main memory
- Can fetch next instruction during execution of current instruction
- Called instruction prefetch
- Ideally instruction cycle time would be halved
- (if duration of fetch = duration of execution)
36Improved Performance(1)
- But performance is not doubled
- Fetch usually shorter than execution
- Conditional branch makes next address unknown
37Two Stage Instruction Pipeline
38Improved Performance (2)
- Reduce time loss due to branching by guessing
- Prefetch the instruction after the branch instruction
- If the branch is not taken
- use the prefetched instruction
- else
- discard the prefetched instruction
- fetch new instruction
39Pipelining
- Add more stages to improve performance
Instruction Cycle State Diagram
40Pipelining
- More stages → more speedup
- FI: Fetch instruction
- DI: Decode instruction
- CO: Calculate operands
- FO: Fetch operands
- EI: Execute instruction
- WO: Write result
- Various stages are of nearly equal duration
- Overlap these operations
41Timing of Pipeline
42Speedup of Pipelining (1)
- 9 instructions, 6 stages
- w/o pipelining: __ time units
- w/ pipelining: __ time units
- speedup = _____
- Q: 100 instructions, 6 stages, speedup = ____
- Q: ∞ instructions, k stages, speedup = ____
- Can you prove it (formally)?
43Pipelining - Discussion
- Assume all stages are needed in one instruction
- e.g., LOAD: WO not needed
- Assume all stages can be performed in parallel
- e.g., FI, FO, and WO → memory conflicts
- Timing is set up assuming all stages are needed by each instruction
- → Simplify pipeline hardware
- Assuming no conditional branch instructions
44Limitation by Branching
- Conditional branch instructions can invalidate several instruction prefetches
- In our example (see next slide)
- Instruction 3 is a conditional branch to instruction 15
- Next instruction's address won't be known till instruction 3 is executed (at time unit 7)
- → pipeline must be cleared
- No instruction is finished from time units 9 to 12
- → performance penalty
45Branch in a Pipeline
47Limitation by Data Dependencies
- Data needed by current instruction may depend on a previous instruction that is still in pipeline
- E.g., A ← B + C
- D ← A + E
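The dependency in this example is a read-after-write: the second instruction reads A before the first has necessarily written it. A minimal Python sketch of detecting such pairs follows; it assumes each instruction is represented as a (destination, sources) tuple, and the name `raw_hazards` is hypothetical. The arithmetic operator plays no part in the check.

```python
def raw_hazards(instructions):
    """Return (producer, consumer, name) for each read-after-write pair."""
    hazards = []
    for i, (dest_i, _) in enumerate(instructions):
        for j in range(i + 1, len(instructions)):
            dest_j, sources_j = instructions[j]
            if dest_i in sources_j:
                hazards.append((i, j, dest_i))
            if dest_j == dest_i:
                break  # a later write redefines the name, stop scanning
    return hazards


if __name__ == "__main__":
    program = [("A", ("B", "C")),   # A <- B op C
               ("D", ("A", "E"))]   # D <- A op E
    print(raw_hazards(program))
```

Whether a detected pair actually stalls the pipeline depends on how far apart the two instructions are and on the pipeline depth, which this sketch does not model.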
48Limitation by stage overhead
- Ideally, more stages, more speedup
- However,
- more overhead in moving data between buffers
- more overhead in preparation and delivery functions
- more complex circuit for pipeline hardware
49Pipeline Performance
- Cycle time: $\tau = \max_i[\tau_i] + d = \tau_m + d$, for $1 \le i \le k$
- $\tau_m$ = maximum stage delay
- $k$ = number of stages
- $d$ = time delay of a latch
50Pipeline Performance
[Figure: pipelined datapath with stages Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back]
51Pipeline Performance
- Time to execute n instructions without pipelining: $T_1 = nk\tau$
- Time to execute n instructions with pipelining: $T_k = [k + (n-1)]\tau$
- Speedup: $S_k = T_1 / T_k = nk\tau / ([k + (n-1)]\tau) = nk / [k + (n-1)]$
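The two timing formulas are easy to evaluate numerically. This is a minimal Python sketch assuming every stage takes one cycle of length tau and there are no branches or stalls; the function names are hypothetical.

```python
def unpipelined_time(n, k, tau=1):
    """Time for n instructions when each uses all k stages in turn."""
    return n * k * tau


def pipelined_time(n, k, tau=1):
    """k cycles to complete the first instruction, then one per instruction."""
    return (k + (n - 1)) * tau


def speedup(n, k):
    """Ratio of unpipelined to pipelined time; tau cancels out."""
    return unpipelined_time(n, k) / pipelined_time(n, k)


if __name__ == "__main__":
    for n in (4, 9, 100, 10_000):
        print(n, "instructions, 6 stages:", round(speedup(n, 6), 3))
```

As n grows, the k - 1 cycles spent filling the pipeline matter less and the speedup approaches k.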
52Pipeline Performance
- Speedup of k-stage pipelining compared to without pipelining
- Q: ∞ instructions, k stages, speedup = ____
54Example
- A 4-stage pipeline, each stage taking Δt, executes 4 instructions. Find the speedup.
[Figure: space-time diagram, instructions 1 2 3 4 passing through the four stages]
55Example
- $T_1 = nk\tau = 4 \times 4\,\Delta t = 16\,\Delta t$
- $T_k = [k + (n-1)]\tau = [4 + (4-1)]\,\Delta t = 7\,\Delta t$
- $S_k = T_1 / T_k = 16/7$
[Figure: space-time diagram of 4 tasks passing through 4 pipeline stages, one stage per time unit t; the time axis runs from 1 to 7]
[Figure: space-time diagram of 4 tasks on a 4-stage pipeline with stage times t, t, 3t and t; the time axis runs from 1 to 15]
64Dealing with Branches
- Multiple Streams
- Prefetch Branch Target
- Loop buffer
- Branch prediction
- Delayed branching
65Multiple Streams
- Have two pipelines
- Prefetch each branch into a separate pipeline
- Use appropriate pipeline
- Problems
- Leads to register and memory contention delays
- Multiple branches (i.e., additional branches entering the pipelines before the original branch decision is made) lead to further pipelines being needed
- Can improve performance anyway
- e.g., IBM 370/168
66Prefetch Branch Target
- Target of branch is prefetched in addition to instructions following branch
- Keep target until branch is executed
- Used by IBM 360/91
67Loop Buffer (1)
- Small, very fast memory
- Maintained by fetch (IF) stage of pipeline
- Contains the n most recently fetched instructions in sequence
- If a branch is to be taken
- Hardware checks whether the target is in buffer
- If YES then
- next instruction is fetched from the buffer
- else fetch from memory
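The hit-or-fetch behaviour of a loop buffer can be illustrated with a small model. This minimal Python sketch keeps the most recently fetched instructions in a first-in, first-out table keyed by address; real loop buffers hold a contiguous run of addresses, so the class `LoopBuffer` and its fields are simplifying assumptions.

```python
from collections import OrderedDict


class LoopBuffer:
    """Holds the most recently fetched instructions, keyed by address."""

    def __init__(self, size):
        self.size = size
        self.entries = OrderedDict()
        self.memory_fetches = 0

    def fetch(self, address, memory):
        if address in self.entries:            # target is in the buffer
            return self.entries[address]
        self.memory_fetches += 1               # otherwise fetch from memory
        instruction = memory[address]
        self.entries[address] = instruction
        if len(self.entries) > self.size:
            self.entries.popitem(last=False)   # drop the oldest entry
        return instruction


if __name__ == "__main__":
    memory = {address: f"I{address}" for address in range(4)}
    buffer = LoopBuffer(size=8)
    for _ in range(10):                        # a 4-instruction loop, 10 times
        for address in range(4):
            buffer.fetch(address, memory)
    print(buffer.memory_fetches, "memory fetches")
```

When the buffer is at least as large as the loop, only the first iteration goes to memory; when it is smaller than the loop, this simple replacement policy gives no benefit.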
68Loop Buffer (2)
- Reduces memory access time
- Very good for small loops or jumps
- If buffer is big enough to contain entire loop, instructions in the loop need to be fetched from memory only once, at the first iteration
- Used by CRAY-1
69Loop Buffer Diagram
70Branch Prediction (1)
- Predict whether a branch will be taken
- If the prediction is right
- → No branch penalty
- If the prediction is wrong
- → Empty pipeline
- Fetch correct instruction
- → Branch penalty
71Branch Prediction (2)
- Prediction techniques
- Static
- Predict never taken
- Predict always taken
- Predict by opcode
- Dynamic
- Taken/not taken switch
- Branch history table
72Branch Prediction (3)
- Predict never taken
- Assume that jump will not happen
- Always fetch next instruction
- 68020 and VAX 11/780
- VAX will not prefetch after branch if a page fault would result (O/S v CPU design)
- Predict always taken
- Assume that jump will happen
- Always fetch target instruction
- Branches are taken more than 50% of the time
73Branch Prediction (4)
- Predict by Opcode
- Some instructions are more likely to result in a jump than others
- Decision based on the opcode of the branch instruction.
- Can get up to 75% success
74Branch Prediction (5)
- Taken/Not taken switch
- One or more bits can be associated with each conditional branch instruction that reflect the recent history of the instruction.
- Good for loops
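75Taken/Not Taken Switch Example
- Using 1 history bit (1 = taken, 0 = not taken)
- Prediction  Outcome
- 1  1
- 1  1
- 1  1
- 1  0
- 0  1
- 1  1
- 1  1
- 1  0
- 0  1
- 1  1
- 1  1
- 1  0
- Using 2 history bits
- Prediction  Outcome  Previous outcome
- 1  1  1
- 1  1  1
- 1  1  1
- 1  0  1
- 1  1  0
- 1  1  1
- 1  1  1
- 1  0  1
- 1  1  0
- 1  1  1
- 1  1  1
- 1  0  1
- Loop pattern on the slide: for (...) xxxxx, repeated three times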
76Branch Prediction State Diagram
77Branch Prediction State Diagram
78Branch Prediction State Diagram
79Branch Prediction Flowchart
- Taken/Not taken switch
- Use two bits recording history
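The benefit of a second history bit shows up on a loop branch that is taken three times and then falls through. This minimal Python sketch compares a 1-bit scheme with a 2-bit saturating counter; the saturating counter is a common variant swapped in for illustration and may differ in detail from the state diagram on the slides. Function names and initial states are assumptions.

```python
def mispredictions_one_bit(outcomes, last=True):
    """Predict that the branch repeats its most recent outcome."""
    misses = 0
    for taken in outcomes:
        if last != taken:
            misses += 1
        last = taken                   # remember only the latest outcome
    return misses


def mispredictions_two_bit(outcomes, counter=3):
    """2-bit saturating counter: predict taken when counter >= 2."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


if __name__ == "__main__":
    loop = [True, True, True, False] * 3   # loop branch run three times
    print("1 bit :", mispredictions_one_bit(loop))
    print("2 bits:", mispredictions_two_bit(loop))
```

With one bit, each loop exit causes a second miss when the loop is re-entered; with two bits, a single not-taken outcome does not flip the prediction.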
81Dealing With Branches
- Branch history table
- Branch target buffer
82Dealing With Branches
- Branch history table
- A small cache memory associated with the instruction fetch stage of the pipeline.
- There are three elements
- The address of a branch instruction
- Target instruction / target address
- The state of use of the instruction
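The three elements of a table entry map naturally onto a keyed lookup. This is a minimal Python sketch assuming an unbounded table indexed by branch address, a stored target address, and a 2-bit saturating counter as the state; the class name, method names and initial state are hypothetical, and a real table would have a fixed size and a replacement policy.

```python
class BranchHistoryTable:
    """Maps a branch address to [target address, 2-bit state]."""

    def __init__(self):
        self.entries = {}

    def predict(self, branch_address, fall_through):
        """Return the address the fetch stage should use next."""
        entry = self.entries.get(branch_address)
        if entry is not None and entry[1] >= 2:
            return entry[0]            # predicted taken: use stored target
        return fall_through            # no entry, or predicted not taken

    def update(self, branch_address, target, taken):
        """Record the actual outcome once the branch has executed."""
        entry = self.entries.setdefault(branch_address, [target, 1])
        entry[0] = target
        entry[1] = min(entry[1] + 1, 3) if taken else max(entry[1] - 1, 0)


if __name__ == "__main__":
    table = BranchHistoryTable()
    print(table.predict(100, fall_through=101))   # unknown branch
    table.update(100, target=200, taken=True)
    print(table.predict(100, fall_through=101))   # now predicted taken
```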
83Predict Never Taken Strategy
- See Figure 11.17
84Dealing With Branches
85Branch History Table Strategy
86Dealing With Branches
87Delayed Branch
- Rearrange instructions
- Do not take jump until you have to
88Normal and Delayed Branch
Address  Normal      Delayed     Optimized
100      LOAD X,A    LOAD X,A    LOAD X,A
101      ADD 1,A     ADD 1,A     JUMP 105
102      JUMP 105    JUMP 106    ADD 1,A
103      ADD A,B     NOOP        ADD A,B
104      SUB C,B     ADD A,B     SUB C,B
105      STORE A,Z   SUB C,B     STORE A,Z
106                  STORE A,Z
89Use of Delayed Branch
90Use of Delayed Branch
91Use of Delayed Branch
92Use of Delayed Branch
93Scheduling the Branch-delay Slot
- ADD R1,R2,R3
- IF R2=0 THEN
- Delay slot
Becomes
94Scheduling the Branch-delay Slot
- SUB R4,R5,R6
- ADD R1,R2,R3
- IF R1=0 THEN
- Delay slot
Becomes
- SUB R4,R5,R6
- ADD R1,R2,R3
- IF R1=0 THEN
- SUB R4,R5,R6
95Scheduling the Branch-delay Slot
- ADD R1,R2,R3
- IF R1=0 THEN
- Delay slot
- SUB R4,R5,R6
Becomes
- ADD R1,R2,R3
- IF R1=0 THEN
- SUB R4,R5,R6
96Intel 80486 Pipelining(1)
- Fetch
- From cache or external memory
- Put in one of two 16-byte prefetch buffers
- Fill buffer with new data as soon as old data consumed
- Average 5 instructions fetched per load
- Independent of other stages to keep buffers full
97Intel 80486 Pipelining(2)
- Decode stage 1
- Opcode and address-mode info
- At most first 3 bytes of instruction
- Can direct D2 stage to get rest of instruction
- Decode stage 2
- Expand opcode into control signals
- Computation of complex address modes
98Intel 80486 Pipelining(3)
- Execute
- ALU operations, cache access, register update
- Writeback
- Update registers and flags
- Results sent to cache and bus interface write buffers
99 80486 Instruction Pipeline
100Pentium 4 Registers
101EFLAGS Register
102Control Registers
103MMX Register Mapping
- MMX uses several 64-bit data types
- Uses 3-bit register address fields
- 8 registers
- No MMX-specific registers
- Aliasing to lower 64 bits of existing floating point registers
104MMX Register Mapping Diagram
105Pentium Interrupt Processing
- Interrupts: generated by hardware at random times
- Maskable
- Nonmaskable
- Exceptions: generated from software and provoked by the execution of an instruction
- Processor detected
- Programmed
106Pentium Interrupt Processing
- Interrupt vector table
- Each interrupt type assigned a number
- Index to vector table
- 256 32-bit interrupt vectors
- 5 priority classes
107Pentium Interrupt Processing(1)
- Interrupt Handling
- 1. If the transfer involves a change of privilege level, the current stack segment register and the current extended stack pointer (ESP) register are pushed onto the stack
- 2. EFLAGS register is pushed onto the stack
108Pentium Interrupt Processing(2)
- 3. Interrupt (IF) and trap (TF) flags are cleared
- 4. CS pointer and IP are pushed
- 5. Error code is pushed
- 6. The interrupt vector contents are fetched and loaded into the CS and IP or EIP registers
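The six handling steps can be walked through in a toy model. This minimal Python sketch assumes the stack is a list, the CPU state is a dictionary, and each vector table entry is a (CS, EIP) pair; it also assumes SS and ESP are pushed only on a privilege change and the error code only when one is supplied. The function and parameter names are hypothetical; the IF and TF positions are bits 9 and 8 of EFLAGS.

```python
IF_FLAG = 1 << 9   # interrupt enable flag in EFLAGS
TF_FLAG = 1 << 8   # trap flag in EFLAGS


def handle_interrupt(cpu, stack, vector_table, number,
                     error_code=None, privilege_change=False):
    """Apply handling steps 1 to 6 to the toy CPU state, in order."""
    if privilege_change:                       # step 1 (assumed conditional)
        stack.append(cpu["SS"])
        stack.append(cpu["ESP"])
    stack.append(cpu["EFLAGS"])                # step 2
    cpu["EFLAGS"] &= ~(IF_FLAG | TF_FLAG)      # step 3
    stack.append(cpu["CS"])                    # step 4
    stack.append(cpu["EIP"])
    if error_code is not None:                 # step 5 (assumed conditional)
        stack.append(error_code)
    cpu["CS"], cpu["EIP"] = vector_table[number]   # step 6
    return cpu


if __name__ == "__main__":
    cpu = {"SS": 0x10, "ESP": 0x1000, "EFLAGS": 0x302, "CS": 0x08, "EIP": 0x400}
    stack = []
    handle_interrupt(cpu, stack, {14: (0x08, 0x9000)}, 14,
                     error_code=2, privilege_change=True)
    print([hex(value) for value in stack], hex(cpu["EIP"]))
```

EFLAGS is pushed before IF and TF are cleared, so the return path restores the interrupted program's flag settings.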