Week 5 Lecture slides

Slides: 77
Provided by: coscBrock5

1
Cosc 3P92
  • Week 5 Lecture slides

Voters quickly forget what a man says. Richard
M. Nixon (1913-1994) Former U.S. President
2
Hardware components MIC (overview)
  • MAR and MDR are registers which latch the
    addresses and data prior to processing

3
Hardware components MIC (overview)
  • MAR holds word addresses 0, 1, 2, 3, each naming a
    4-byte word.
  • The MAR value is shifted 2 bits left to form the
    byte address.
  • Causes the words starting at bytes 0, 4, 8, 12 to
    be addressed.
  • Words are therefore aligned.
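
The 2-bit shift can be sketched in a few lines of Python. This is only an illustration: it assumes MAR holds a word address and that the address bus is 32 bits wide, and the function name is made up for the example.

```python
WORD_SIZE_BYTES = 4      # one word = 32 bits = 4 bytes
ADDRESS_BITS = 32        # assumed width of the address bus


def mar_to_byte_address(mar: int) -> int:
    """Shift a word address left by 2 bits to get the byte address of that word."""
    mask = (1 << ADDRESS_BITS) - 1   # bits shifted out past the bus width are dropped
    return (mar << 2) & mask


# Word 0 -> byte 0, word 1 -> byte 4, word 2 -> byte 8, word 3 -> byte 12
for word in range(4):
    print(word, "->", mar_to_byte_address(word))
```

Because the two low-order address bits are always zero, every access made this way starts on a 4-byte boundary.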

4
Hardware components MIC (overview)
  • Each micro instruction controls
  • register enables
  • bus enables
  • ALU
  • Memory
  • Next Micro instruction address

5
Hardware components MIC (overview)
6
Memory control
  • MAR - memory address register
  • CPU writes here the address of the memory location to read or write
  • MBR - memory buffer register
  • contains data for write or read
  • both act as latches to hold addr, data until
    memory finished using them.

7
Control unit
  • main functions of a control unit
  • - instruction interpretation
  • - instruction sequencing
  • the control unit is a finite-state machine.

8
Execution unit
  • An execution unit consists of
  • a register section
  • an ALU
  • some dedicated hardware or firmware

9
Data transfer within a CPU
  • A single-bus architecture
  • To compute R2 <- R0 + R1
  • 1. A <- R0,
  • 2. B <- R1,
  • 3. R2 <- A + B

10
Data transfer within a CPU
  • A two-bus architecture
  • To compute R2 <- R0 + R1
  • 1. Buffer <- R0 + R1 (via Bus A and Bus B),
  • 2. R2 <- Buffer (via either Bus A or Bus B).

11
Data transfer within a CPU
  • A three-bus architecture
  • To compute R2 <- R0 + R1
  • 1. R2 <- R0 + R1 (via Bus A, Bus B and Bus C).

12
Design of control units
  • Hardwired approach
  • The control unit is treated as a synchronous
    (i.e., clocked) sequential circuit and is
    implemented as a hardwired state machine.

13
Microprogramming
  • Use of memory to implement the control unit
  • Instructions are implemented as sequences of
    microinstructions stored in control memory
  • Each machine language instruction is interpreted
    by circuitry, and executed using sequences of
    microprogram instructions
  • Micro-programs are much like assembled code,
    except
  • direct mapping between instruction fields and
    hardware components of the CPU.
  • control fields are specified.
  • timing is critical; parallelism can be exploited.

14
Microprogramming
  • What is being controlled?
  • data paths: inter-register connections
  • control points: hardware enabling lines which
    govern register-to-register communications
  • idea is that we can control the operation of ALU
    and micro-control unit using combinations of
    control fields encoded in micro-instructions

15
Microprogramming
  • Each control point specifies a micro-operation
  • All micro operations which may be executed in
    parallel can be specified in a single micro
    instruction.
  • Factors which determine parallel operations.
  • Buses must only have 1 input active at a time.
  • Registers can be either read/written
  • Not both at the same time.

16
Microprogramming
  • Basic microinstruction formats (see overheads)

17
Data path
  • 32-bit registers (none are user-accessible)
  • B bus: main input to ALU
  • C bus: from ALU back to registers
  • H reg: contains other operand for ALU
  • loaded by performing null op on data, and sending
    it to H

18
Data path
  • ALU control: 6 control lines
  • shifter: 2 control lines
  • 1. logical shift left 8 bits
  • 2. arithmetic shift right 1 bit

19
Data path timing
  • Four sub-cycles
  • 1. control signals set up (w)
  • 2. registers loaded on B bus (x)
  • 3. ALU and shifter (y)
  • 4. results available to registers on C (z)

20
Data path timing
  • These are implicit sub-cycles; they rely on
    timing of previous steps
  • Only real clock signals used
  • falling edge of clock (starts the cycle)
  • rising edge (loading from C in step 4)
  • ALU is continually processing all intermediate
    values it sees. Its output only makes sense at
    the appropriate time above (after 3)
  • Can operate and save a register in 1 clock cycle
  • load PC to B
  • inc
  • save to PC

21
Memory again
  • 2 memory buffers
  • 32-bit port: MAR, MDR (read, write)
  • word addresses
  • 8-bit MBR
  • low byte from PC (read only)
  • byte addresses
  • can be loaded signed, unsigned onto B bus
  • reads into MBR are called fetches
  • control
  • black arrow: enable from C bus
  • white arrow: enable onto B bus
  • 2 bus control
  • out B
  • in C
  • out B / in C
  • none

22
Memory again
  • MAR aligned to words (32 bits, 4 bytes) (Fig. 4.4)
  • Memory is available 2 cycles from when read was
    initiated
  • avail. at end of 2nd cycle, so 3rd cycle can use
    them

23
Microinstructions
  • 29 signals for data path
  • 1. 9 signals to control C bus output into
    registers
  • 2. 9 signals to enable registers onto B bus
  • 3. 8 signals for ALU, shifter functions
  • 4. 2 signals for memory W/R via MAR/MDR
  • 5. 1 signal for memory fetch via PC/MBR
  • Issues
  • may load more than 1 reg from C (9 bits)
  • but never load more than 1 reg onto B (4 bits;
    encoding will force this) -> 4 signals.
  • Need 2 more fields for determining next m.i.
  • NextAddr (9 bits, addr space of 512)
  • conditional jumps (3 bits)

24
Microinstructions
  • Fields
  • Addr: address of next micro-instruction
  • JAM: determines how next m.i. is selected
  • ALU: ALU, shifter control
  • C: which registers are written from C bus
  • Mem: memory functions
  • B: B bus source (encoded)
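
A small Python sketch of how these six fields could be packed into one microinstruction word. The widths are taken from the signal counts in these notes (Addr 9, JAM 3, ALU 6 + 2 shifter = 8, C 9, Mem 2 + 1 = 3, B 4), which add up to 36 bits. The bit order (Addr in the most significant bits) and the function names are assumptions made for the example.

```python
# (field name, width in bits), most significant field first.
# The ordering is an assumption for illustration.
FIELDS = [("addr", 9), ("jam", 3), ("alu", 8), ("c", 9), ("mem", 3), ("b", 4)]
WORD_BITS = sum(width for _, width in FIELDS)   # 36


def pack(values: dict) -> int:
    """Pack field values into a single integer; missing fields default to 0."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word: int) -> dict:
    """Split a packed word back into its fields."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


mi = {"addr": 0x1FF, "jam": 0b001, "alu": 0x3C, "c": 0b000000100, "mem": 0b010, "b": 4}
print(WORD_BITS, unpack(pack(mi)) == mi)
```

The encoded 4-bit B field is the reason the word needs fewer bits than the 29 raw data-path signals plus 12 sequencing bits would suggest.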

25
Example micro-architecture Mic-1
26
Example microarchitecture Mic-1
  • sequencer executes microinstructions
  • Two tasks
  • set control signals for system
  • determine next m.i. to execute
  • control store contains m.i. for interpreting ISA
    instns.
  • each m.i. is a 36-bit word, as in Fig. 4.5
  • each m.i specifies its successor
  • MPC: MicroProgram Counter
  • 9-bit address of next m.i. to execute
  • MIR: MicroInstruction Register
  • 36-bit m.i. being executed
  • Note that bits in MIR may directly control other
    parts of the circuit
  • eg. the C field

27
Mic-1 operation cycle
  • Basic ALU cycle
  • 1. set up the inputs to the ALU
  • 2. let the ALU do its computation
  • 3. store the results
  • Clock cycles for Mic-1
  • 1. MIR enabled (during subcycle w)
  • 2. MIR signals control data path (B bus; note H
    always enabled) (subcycle x)
  • 3. B and H inputs are stable, and ALU computes
    output; shifter finishes; N, Z bits stable
    (subcycle y)
  • 4. shifter, N, Z outputs loaded from C bus into
    registers
  • rising clock edge determines end
  • MIR is reloaded and calculated at this point as
    well
  • Memory read is initiated at end too
  • Note that all the above will complete in 1 cycle
  • microinstructions can specify all these
    operations in parallel

28
Mic-1 sequencing
  • First, 9-bit next addr field copied into MPC
  • JAM inspected
  • 000: use MPC as it is
  • if JAMN (or JAMZ) is set, then the N bit (or Z) is
    ORed with the high bit of MPC
  • hence next address is either MPC, or MPC with
    its high bit set to 1
  • JMPC set: MBR byte is ORed with low byte of NextAddr
    field
  • permits multiway jumps
  • can quickly branch to instn for just-loaded
    opcodes (ie. opcode number = address in control
    store!)
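
The sequencing rules above can be sketched as a small Python function. This is an illustration only: the function and parameter names are made up, and the three JAM bits are passed as separate booleans.

```python
def next_mpc(next_addr, jam_n=False, jam_z=False, jmpc=False, n=False, z=False, mbr=0):
    """Compute the 9-bit address of the next microinstruction.

    next_addr: 9-bit NextAddr field of the current m.i.
    jam_n, jam_z, jmpc: the three JAM bits
    n, z: ALU status bits; mbr: the 8-bit MBR value
    """
    high = (next_addr >> 8) & 1
    # JAMN/JAMZ: OR the selected status bit into the high bit of MPC.
    if (jam_n and n) or (jam_z and z):
        high = 1
    low = next_addr & 0xFF
    # JMPC: OR the MBR byte into the low 8 bits (multiway jump on the opcode).
    if jmpc:
        low |= mbr & 0xFF
    return (high << 8) | low


print(hex(next_mpc(0x092)))                          # JAM = 000: unchanged
print(hex(next_mpc(0x092, jam_z=True, z=True)))      # branch taken: high bit set
print(hex(next_mpc(0x000, jmpc=True, mbr=0x60)))     # dispatch on opcode 0x60
```

With NextAddr = 0 and JMPC set, the next address is simply the opcode value in MBR, which is why the first microinstruction of each opcode sits at the control-store address equal to that opcode.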

29
Microinstructions and notation
  • As in assembler programming, helps to use
    higher-level notation instead of raw numeric m.i.
    fields
  • can specify everything that happens in 1 clock
    cycle
  • permits parallelism, eg. prefetch next instns
  • Notation: high-level, but directly translatable
    to single m.i.s
  • Examples
  • SP = SP + 1: incr SP by 1
  • MDR = SP: copy SP into MDR
  • MDR = SP + H; rd: add SP and H, save in MDR, and
    initiate a read
  • SP = MDR = SP + 1: incr SP, load into both MDR, SP

30
Microinstructions and notation
  • Memory takes 2 cycles
  • MAR = SP; rd: initiates a read that assigns the
    value into MDR
  • (another instn)
  • memory ready now!
  • next addresses: assume it is the labeled next
    m.i. after current one (unless a conditional
    jump)
  • if (Z) goto L1 else goto L2: sets JAMZ
  • L1 and L2 have the same low 8 bits (set by assembler)
  • Summary of legal operations on operands

31
Example M.I. implementation IJVM
  • A stack-based virtual machine for which Mic-1 is
    designed to implement.
  • All instructions access the stack; no general
    registers are used by compiler
  • eg. parameter passing 4.8
  • eg. arithmetic 4.9
  • Recall
  • JVM instruction formats 5.15
  • Java memory usage, registers 4.10
  • Complete instruction set 4.11
  • Example translated code 4.14

32
(No Transcript)
33
JVM Instruction Formats
34
Memory area of IJVM
35
IJVM Instruction Set
36
Translating Java to IJVM
37
Implementation (cont)
  • See overheads (book page 234-236)
  • Note
  • each m.i. contains address of next instn
  • micro-assembler labels all instns appropriately,
    and must put them in right control store
    addresses (equiv. to opcode)
  • the sequenced instns may reside in any free area
    of control store! Microassembler auto sets next
    address fields.
  • only explicit gotos will override this
    sequencing
  • Two parts
  • 1. fetch next byte for next instn (done at Main1)
  • 2. branch to that opcode address and carry out
    instruction
  • Fetching instructions (Main1)
  • PC always points to next instruction in Java
    application program
  • can be reset by branches (see goto5, T, F,...)
  • When Main1 is executed, the next opcode is assumed
    ready. The fetch at Main1 is for the next opcode. Hence
    instns must fetch it if necessary (eg. see bipush2)

38
Implementation (cont)
  • Example 1: iadd (pop 2 words from stack, push
    their sum)
  • iadd1: reads next-to-top word in stack (TOS
    register already contains top of stack word);
    bumps down the SP for writing result
  • iadd2: sets TOS ready for addition (put in H)
  • iadd3: add next-to-top value (read in iadd1) to
    H, update TOS, save result in MDR for writing
  • Example 2: dup (copy top stack word and push
    it)
  • dup1: incr SP pointer, copy to MAR
  • dup2: save TOS (top stack word) to new SP, write
    it
  • note: can't write it in dup1, because both SP and
    MDR must be updated thru data path, and not both
    at once
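
The two examples can be traced with a toy Python model. This is a sketch under assumptions: the class and its fields are invented for illustration, memory latency is ignored (a read delivers its word immediately), the stack is assumed to grow toward higher addresses, and the micro-operations in the comments are a reconstruction from the step descriptions above rather than a copy of the actual microprogram.

```python
class Mic1Stack:
    """Toy model of the registers touched by iadd and dup."""

    def __init__(self, stack_words):
        # Word-addressed memory holding only the stack, bottom at address 0.
        self.mem = dict(enumerate(stack_words))
        self.sp = len(stack_words) - 1       # address of the top word
        self.tos = stack_words[-1]           # cached copy of the top word
        self.mar = self.mdr = self.h = 0

    def iadd(self):
        # iadd1: MAR = SP = SP - 1; rd   (read next-to-top word)
        self.sp -= 1
        self.mar = self.sp
        self.mdr = self.mem[self.mar]
        # iadd2: H = TOS
        self.h = self.tos
        # iadd3: MDR = TOS = MDR + H; wr (32-bit wraparound)
        self.tos = self.mdr = (self.mdr + self.h) & 0xFFFFFFFF
        self.mem[self.mar] = self.mdr

    def dup(self):
        # dup1: MAR = SP = SP + 1
        self.sp += 1
        self.mar = self.sp
        # dup2: MDR = TOS; wr
        self.mdr = self.tos
        self.mem[self.mar] = self.mdr

    def stack(self):
        """Return the live stack, bottom first."""
        return [self.mem[a] for a in range(self.sp + 1)]


m = Mic1Stack([7, 3, 4])
m.iadd()
print(m.stack())   # [7, 7]
m.dup()
print(m.stack())   # [7, 7, 7]
```

Note how dup needs two steps in the model as well: SP (and MAR) are written in the first, MDR in the second, matching the remark that both cannot be updated through the data path at once.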

39
Implementation (cont)
  • Example 3: goto offset (unconditional branch)
  • Fig 4.22
  • goto1: save addr of opcode to OPC (old PC)
  • goto2: get the 2nd byte of offset (1st byte
    already in MBR)
  • goto3: shift 1st byte left 8 bits
  • goto4: OR low byte into high byte
  • goto5: add 16-bit offset to (old) PC; get next
    opcode
  • goto6: goto Main1
  • Note: pause needed in goto6 (must wait 2 extra
    cycles)
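
The offset arithmetic of goto3 to goto5 can be sketched in Python. The sketch assumes the first offset byte is treated as signed (so backward branches are possible) and the second as unsigned; the function names are made up for the example.

```python
def sign_extend_byte(b: int) -> int:
    """Interpret an 8-bit value as a signed number in the range -128..127."""
    return b - 0x100 if b & 0x80 else b


def goto_target(opcode_addr: int, byte1: int, byte2: int) -> int:
    """Return the new PC for 'goto offset'.

    opcode_addr: address of the goto opcode itself (saved in OPC by goto1)
    byte1, byte2: high and low bytes of the 16-bit offset
    """
    h = sign_extend_byte(byte1) << 8     # goto3: shift 1st byte left 8 bits
    h |= byte2 & 0xFF                    # goto4: OR low byte into high byte
    return opcode_addr + h               # goto5: add offset to old PC


print(goto_target(100, 0x00, 0x0A))   # forward by 10 -> 110
print(goto_target(100, 0xFF, 0xFC))   # backward by 4 -> 96
```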

40
(No Transcript)
41
Improving performance
  • 1. Faster clock, transistors, electrical circuits
  • 2. simpler organization yields shorter clock
    cycles
  • eg. get rid of (B bus) decoder
  • 3. Merge interpreter loop with microcode (pt 2)
  • 4.23, 4.24
  • saves extra cycles if done in all instns
  • significant speedup!
  • 4. Three-busses
  • 4.25, 4.26
  • reduces need for separate instns to load H reg

42
(No Transcript)
43
2 Bus v.s. 3 Bus
44
Improving performance
  • 5. Instruction fetch unit 4.27
  • in Mic-1, ALU is used to increment PC and fetch
    instns
  • this uses up instn. cycles
  • IFU can be used
  • 1. pre-fetches all instns outside of main data
    path
  • 2. pre-fetches operands: if they are required,
    they are there (else garbage, but ignored anyway)

45
Fetch Unit
46
Improving performance
  • Instruction fetch unit (cont)
  • shift register always loaded with next bytes
    from memory
  • MBR1 (1 byte, as before) and new MBR2 (2 bytes)
  • values from shift reg dumped into both MBR1, MBR2
    after every instn read; if needed, they are
    quickly put onto data path as reqd
  • need some fetching logic to know when to read
    more bytes into shift register, when to refresh
    MBR1, MBR2
  • IMAR: separate memory addr reg (separate from
    MAR)
  • own dedicated incrementer (no need for ALU)
  • IFU must keep PC incremented properly, depending
    on instn length (if MBR1, MBR2 used)
  • branches may reset PC as well (from C)

47
Improving performance
  • Mic-2
  • A, B buses
  • IFU
  • new IJVM 4.30, See overheads
  • smaller, faster
  • MBR1 always has next opcode (due to IFU)

48
Mic-2
49
Improving performance 6. Pipelining
  • divide instn. execution into modular steps and
    carry out different steps for seql. instns
    simultaneously
  • instruction-level parallelism
  • superscalar: single pipeline with parallel
    functional units
  • most instns take more than 1 cycle to complete
  • with pipelining n instns in n cycles
  • To implement it 4.31
  • add latch to A, B, C buses
  • they keep values stable during sub-cycles; can
    use values in 3 sections of the data path
  • (i) loading before ALU (A, B)
  • (ii) doing ALU, shift, and loading C latch
  • (iii) storing C back into registers

50
Mic-3
51
Improving performance 6. Pipelining
  • need 3 cycles now to complete 1 instn
  • but maximum delay between all components is
    shorter (1/3) so can speed up clock
  • advantage: throughput; 3 instns can be
    processed simult.
  • all parts of data path are busy... none are idle
    (usually)
  • best analogy: car factory assembly line
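
A rough worked calculation of the throughput claim, in Python. It assumes an ideal split into 3 equal stages with no stalls or latch overhead; the 30 ns cycle time and the function names are invented for the example.

```python
def unpipelined_time(n_instrs: int, cycle_ns: float) -> float:
    """Each instruction takes one long cycle, one after the other."""
    return n_instrs * cycle_ns


def pipelined_time(n_instrs: int, cycle_ns: float, stages: int = 3) -> float:
    """Ideal pipeline: the clock is 'stages' times faster, and after the
    first instruction fills the pipe, one instruction finishes per cycle."""
    stage_ns = cycle_ns / stages
    return (stages + n_instrs - 1) * stage_ns


n = 100
print(unpipelined_time(n, 30))   # 3000 ns
print(pipelined_time(n, 30))     # (3 + 100 - 1) * 10 ns = 1020.0 ns
```

A single instruction still takes the full 30 ns (3 short cycles), so latency does not improve; the gain is that for large n the total time approaches one third of the unpipelined time.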

52
Pipelining (cont)
  • 4.32, 4.33, 4.44
  • interpreting instns in pipelined processor
    (Mic-4)
  • new sub-cycles: microsteps
  • takes 3 cycles to process instn (steps i, ii, iii
    from earlier)
  • call latches A, B, C (like registers)
  • advantage (Fig. 4.33) is that different stages can
    work independently of one another now
  • more stages in pipeline means higher efficiency

53
(No Transcript)
54
(No Transcript)
55
Pipelining (cont)
  • One complication: memory reads
  • takes 2 cycles to get word from memory
  • hence a m.i. that uses a word in MDR must wait
    until it's available
  • called a true or RAW (read after write)
    dependence
  • pipeline must stall until it is ready
  • ideally, put other m.i. instns in wait states
  • Another complication: conditional branches
  • cannot predict which instn to fetch/put into
    pipeline
  • have to squash or flush pipeline when a jump
    ruins sequence of instns

56
Pipelines and branch prediction
  • unconditional branches
  • fetch unit needs to know in advance where to
    access instns
  • a jump instn. isn't decoded right away, and so
    F.U. won't know branch location until later;
    this gap is called the delay slot
  • soln: compiler places other executable instns in
    the delay slot, that it knows can be executed
  • conditional branches
  • dynamic prediction: carried out during run time
  • keep a running table of branch instn addresses,
    along with a branch/no branch bit
  • if branch in table, and branch bit set, then
    predict it will be taken -> fetch it
  • can use 2 prediction bits: the prediction changes
    only after it has been wrong twice in a row
    (extra logic)
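
A minimal Python sketch of a table-based dynamic predictor with 2 prediction bits per branch. It uses the common saturating-counter scheme; the class name, the starting state (weakly not taken) and the sample branch pattern are assumptions for illustration, not details from the slides.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter: states 0, 1 predict not taken;
    states 2, 3 predict taken."""

    def __init__(self):
        self.counters = {}   # branch instruction address -> state 0..3

    def predict(self, addr: int) -> bool:
        return self.counters.get(addr, 1) >= 2

    def update(self, addr: int, taken: bool) -> None:
        state = self.counters.get(addr, 1)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
        self.counters[addr] = state


def count_mispredictions(outcomes, addr=0x40):
    """Run one branch through the predictor and count wrong guesses."""
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict(addr) != taken:
            misses += 1
        predictor.update(addr, taken)
    return misses


# A loop branch: taken 9 times, then falls through, executed 3 times over.
loop = ([True] * 9 + [False]) * 3
print(count_mispredictions(loop))
```

The single not-taken outcome at the end of each loop pass costs one misprediction but does not flip the prediction, so the next pass starts out predicted correctly.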

57
Pipelines and branch prediction
  • static branch prediction: carried out during
    compile time
  • if a loop nearly always done, then have a field
    in the instn. which tells CPU that branch should
    be fetched (eg. UltraSPARC)
  • can do simulations to determine how cond.
    branches executed

58
Improving performance: out-of-order exec, reg
renaming
  • instruction ops can take varying clock cycles
  • superscalar systems mean those functional units
    need more time to process their instns
  • problem: can't exec one instn that requires
    results of another
  • means the pipeline stalls until register values
    are computed when subsequent instns require them.
  • soln: move instruction order, so that no idle
    waiting
  • overall exec must be identical to linear order
  • dependencies
  • RAW (read after write): try to read reg before
    another instn has written it.
  • WAR (write after read): try to write before
    another has read it
  • WAW (write after write): try to write before an
    earlier instn has written the same reg

59
In-order exec, in-order completion
  • decode in cyc n, exec n+1, writeback n+2 (except
    multiply in n+3)
  • 2 instns decoded simult.
  • uses scoreboard: 1 counter per reg keeping track
    of instns using it as a source or destination
  • keeps track of max regs that can be processed
    concurrently

60
Out-of-order exec, reg renaming (cont)
  • idea: execute instns so long as resources are
    available, and no conflicts
  • move order of instns to permit this
  • registers are renamed automatically to reduce
    conflicts: secret regs
  • eg. if a register is in conflict, rename it so
    conflict is removed.
  • copy values to original named reg later if
    required.
  • result: huge performance gain (we're trying to
    make pipeline maximally useful!)
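
A simplified Python sketch of the renaming idea. It assumes an unlimited supply of secret registers (named S0, S1, ... here) and gives every write a fresh one; real hardware has a finite pool and must free registers again. The function name and the sample program are made up for the example.

```python
def rename_registers(program):
    """Rename every destination to a fresh secret register.

    program: list of (dest, src1, src2, ...) tuples in program order.
    Sources are rewritten to the most recent name of the register they read.
    """
    mapping = {}     # architectural register -> current secret register
    renamed = []
    for i, (dest, *sources) in enumerate(program):
        new_sources = tuple(mapping.get(s, s) for s in sources)
        new_dest = f"S{i}"
        mapping[dest] = new_dest
        renamed.append((new_dest, *new_sources))
    return renamed


program = [
    ("R3", "R0", "R1"),   # R3 = R0 op R1
    ("R0", "R4", "R5"),   # writes R0: WAR conflict with the first instn
    ("R3", "R0", "R2"),   # writes R3 again: WAW conflict with the first instn
]
for instr in rename_registers(program):
    print(instr)
```

After renaming, the second instruction no longer touches a register the first one reads, and the two writes to R3 go to different secret registers, so only the true dependence (the third instruction reading the second's result) remains.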

61
Improving performance: speculative exec
  • block: a section of sequential code (Fig. 4.45)
  • Can increase throughput by moving instructions
    beyond their blocks
  • hoisting: moving an instruction over a branch
  • speculative execution: executing an instruction
    before it is known whether it will be needed
  • OK to do it so long as there is no side effect
    (eg. write to memory, trap/interrupt)
  • may sometimes cause slowdown if spec. exec
    fetches an instn from memory that isn't needed
  • otherwise, idea is to move slower instructions up
    the queue so that their processing can occur in
    the interim
  • some solns
  • speculative instns only fetch/exec instructions
    that are in the cache
  • poison bits: don't set traps automatically; wait
    until that instn is actually executed, and if a
    poison bit is set, then set the trap

62
Speculative exec
63
Example 1: Pentium II
  • 1. Fetch/decode 4.46
  • fetches instns and breaks them into m.i.s
  • 2. dispatch/exec
  • takes m.i.s and execs them
  • 3. retirement unit
  • completes exec, stores reg values (speculative
    exec)
  • 1, 2, 3 above act as high-level pipeline
  • ROB (reorder buffer): table of m.i.s to execute
  • Fetch/decode 4.47
  • 7-stage pipeline
  • multiple formats, sizes means instn decoding is
    involved
  • analyzes instns to determine size,
    branch-prediction
  • usually between 1 and 4 m.i.s per ISA instn.
  • uses reg renaming
  • both static, dynamic branch prediction used
  • Dispatch/exec 4.48
  • 5 m.i.s can be exec'd at once

64
P2-micro architecture
65
(No Transcript)
66
Example 2: UltraSPARC II
  • 4.49
  • RISC: all instns are 3-register microinstns
    already
  • branch prediction: (i) cache flags, (ii) 2-bit
    prediction, (iii) compiler directions in instns
  • tries to exec 4 instns in parallel all the time
  • instns may be executed out of order
  • 9-stage pipeline 4.50
  • split integer, float pipelines
  • int adds 2 stages (N1, N2) to keep it same as fp

67
UltraSPARC
68
UltraSPARC Pipeline
69
Example 3: picoJava II
  • 4.51
  • instn, data caches are optional
  • register file (64 entries)
  • contains top 64 words of stack
  • dribbling: reg file read/written to memory when
    it gets too empty/full
  • free access, w/o accessing caches (which may
    not be used)

70
(No Transcript)
71
  • 6-stage pipeline 4.52
  • CISC instns
  • not superscalar: instns fetched, retired in order
    (unlike Pentium II)
  • no branch prediction alg (economy)

72
Folding
  • Folding 4.53, 4.54, 4.55
  • replace a set of m.i.s with one m.i.
  • looks up patterns in a table (Fig. 4.55), and replaces
    with equivalent m.i.
  • only possible if operands are high in stack, in
    register file
  • huge gain in speed, like RISC performance

73
(No Transcript)
74
(No Transcript)
75
Comparing these examples
  • common features
  • all m.i.s contain opcode, 2 source regs, dest
    reg
  • 1 m.i. per cycle
  • deep pipelines
  • split instn and data caches
  • Pentium II: complexity is in deconstructing its
    CISC instns into micro-operations
  • JVM: complexity is in folding sets of m.i.s into
    single operations
  • UltraSparc: most straightforward to implement,
    because instns require minimal decoding (all RISC
    instructions are micro-operations already!)

76
The end