Instruction Set Principles

Provided by: rfox

1
Instruction Set Principles
  • The instruction set is the portion of the
    architecture that is visible to the programmer
    (or compiler writer). Here we look at various
    issues involved in designing an instruction set,
    and then look at the MIPS architecture
  • Register Issues
  • compilers will reserve registers for temporary
    variables, parameter passing, and commonly used
    variables
  • are instructions 2-operand or 3-operand?
  • how many of the operands may be memory addresses,
    or is it a load-store instruction set?
  • Memory issues
  • how are memory addresses interpreted (big or
    little endian)?
  • this is an issue only if accesses can be made to
    sizes smaller than a word, but since this is
    typical in many computers, it must be
    addressed
  • how are memory addresses specified?
  • which addressing modes are supported?
  • how many addressing modes are there, and will
    complex ones be allowed?
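
The big/little endian question above can be made concrete with a small sketch. The 32-bit value and the 4-byte word size are assumptions chosen for illustration; the helper name byte_at is hypothetical.

```python
# Illustrative only: the value and the 4-byte word size are assumptions.
value = 0x0A0B0C0D

# Big endian: most significant byte sits at the lowest byte address.
big = value.to_bytes(4, byteorder="big")
# Little endian: least significant byte sits at the lowest byte address.
little = value.to_bytes(4, byteorder="little")

def byte_at(word_bytes, offset):
    """Return the byte stored at the given byte offset within the word."""
    return word_bytes[offset]

# A byte-sized access to offset 0 sees a different value on each machine,
# which is why endianness matters once sub-word accesses are allowed.
first_big = byte_at(big, 0)
first_little = byte_at(little, 0)
```

A full-word access returns the same value either way; only accesses narrower than a word expose the byte order.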

2
Comparison of Register Set Types
3
Addressing Modes
  • Instruction operands will reference
  • a constant (immediate datum)
  • a register
  • a location in memory
  • or some combination of these
  • Numerous addressing modes can significantly
    reduce the instruction count of a program
  • however, many of these addressing modes will add
    to the CPI of the instruction due to the time it
    takes to compute the effective address
  • Types of Modes
  • PC-relative addressing to specify an
    instruction's location, as used in branches
  • Data addressing modes (as shown on the next
    slide)
  • Design issues
  • Displacement
  • how big is the displacement?
  • most of the SPEC benchmarks use displacements of
    no more than 15 bits
  • Immediate data
  • How often are these modes used, and how big should
    the displacement or datum be?
  • see figures B.7, B.8, B.9 and B.10 for answers to
    these questions (pages B-11 to B-13)
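
As a rough sketch of the operand kinds listed above, the following shows how each one yields a value or an effective address, and how to check whether a displacement fits a given field width. The register contents and all function names are hypothetical, chosen only for illustration.

```python
# Hypothetical register file contents, for illustration only.
registers = {"R0": 0, "R1": 1000, "R2": 2000}

def immediate_operand(constant):
    """Immediate: the operand is the constant itself; no address is formed."""
    return constant

def register_operand(reg):
    """Register: the operand is the register's contents."""
    return registers[reg]

def displacement_address(base_reg, displacement):
    """Displacement: effective address = base register + constant offset."""
    return registers[base_reg] + displacement

def fits_signed(value, bits):
    """True if value can be encoded as a two's-complement field of this width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1
```

The extra addition in displacement_address is the kind of work that can add to an instruction's CPI, and fits_signed(d, 15) is the check behind the observation that most displacements need no more than 15 bits.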

4
(No Transcript)
5
Types of Instructions
  • Types of instructions
  • arithmetic (integer) and logical operations
  • floating point (+, -, *, /)
  • notice that we are separating int and FP
    operations; this is because they will be handled
    in different hardware and use different
    register sets, even though their format will
    largely be the same
  • convert operations will also go into these two
    categories, to convert from int to FP and FP to
    int
  • data transfer (load, store)
  • control (branch, jump, proc call/return, trap)
  • OS related (OS system call, v.memory, others)
  • strings (move, compare, search)
  • graphics (pixel operation, compress/decompr)
  • we won't concern ourselves with these last three
    categories in this course
  • See figure B.13 (page B-16) to see the most
    common types of instructions as used in the 80x86
    architecture

6
Branch Instructions
  • We focus on branch instructions because there are
    some design issues to consider
  • these will usually be PC-relative branching
  • adding an offset to the PC
  • two advantages of PC-relative branching over
    absolute branching
  • displacement is usually small allowing for
    smaller instructions
  • displacement does not necessarily need to be
    known at compile time allowing for easy use of
    run-time loaded libraries
  • Types of branches
  • procedure calls and returns
  • the return address is usually stored on the
    run-time stack or in a register
  • parameters may also be passed on the run-time stack
    or through register windows
  • conditional branches
  • jumps (unconditional)
  • Conditional branches are the most common
  • see figure B.14 on page B-17, where 75-82% of all
    branches are conditional

7
Branching Questions
  • What is the form of condition?
  • complex conditions can be time consuming
  • some architectures limit the condition to be a
    simple equality/inequality test
  • often, just a test of a register value against 0
  • in other cases, the condition might involve
    condition codes
  • When is the comparison performed?
  • with the branch statement or prior to the branch
    statement
  • that is, is the instruction a compare and branch,
    or are they separated?
  • What is the distance of the branch?
  • Figure B.15 shows that most branch distances are
    < 10 bits in length
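
A minimal sketch of PC-relative branching and of how many bits a given branch distance needs. It assumes the offset is a signed two's-complement value added directly to the PC; real instruction sets often scale the offset or add it to the incremented PC, which this sketch ignores. Function names are hypothetical.

```python
def branch_target(pc, offset):
    """PC-relative branching: the target is the PC plus a signed offset."""
    return pc + offset

def signed_bits_needed(distance):
    """Smallest two's-complement field width that can hold this distance."""
    if distance >= 0:
        return distance.bit_length() + 1  # one extra bit for the sign
    return (~distance).bit_length() + 1

# A 10-bit field covers distances from -512 to +511.
backward = branch_target(0x1000, -16)
```

Because most distances fit in a small field, the displacement can share the instruction word with the opcode and register specifiers.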

8
Encoding an Instruction Set
  • Design issues
  • variable-length vs fixed-length (or hybrid) (see
    Figure B.18 on page B-22)
  • deriving op codes for each instruction
  • although this is trivial, there are some concerns:
    should we have several ADD instructions, each
    with a different type of operand expected, or
    specify the operand type in the instruction (for
    instance, AddInteger vs AddFP vs AddDouble, etc.)?
  • how are operand addresses specified?
  • will there be separate bytes of the instruction
    to specify this or will it be part of the
    operand?
  • how many operands can an instruction reference?
  • ideally, we want enough bits to specify 2 or 3
    operands, by memory location, register,
    displacement
  • Concerns
  • want as many registers as possible, BUT the more
    registers, the more bits needed in the
    instruction to select among them
  • many addressing modes are seldom used
  • should they be omitted?
  • instruction sizes should be based on bytes (e.g.,
    8, 16, 24, 32 bits)
  • as instruction sizes increase, so does the size of
    the programs!
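
The register-count trade-off above can be sketched as follows. The function names and the 32-bit instruction width used in the example are assumptions for illustration.

```python
def register_field_bits(num_registers):
    """Bits needed in the instruction to name one of num_registers registers."""
    return (num_registers - 1).bit_length()

def operand_bits(num_registers, num_operands):
    """Total instruction bits spent on register specifiers."""
    return register_field_bits(num_registers) * num_operands

# With 32 registers, a 3-operand instruction spends 15 of its bits on
# register specifiers; doubling to 64 registers raises that to 18.
bits_32 = operand_bits(32, 3)
bits_64 = operand_bits(64, 3)
remaining_in_32_bit_word = 32 - bits_32
```

Every extra specifier bit is taken from the opcode, immediate, or displacement fields of a fixed-size instruction.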

9
Compiler Optimizations
  • To support increasingly complex
    hardware, we need compiler support in the form of
    machine code optimizations. Here are some
    examples
  • High-level optimizations on source code
  • example: procedure inlining, loop transformation
  • Local optimizations on straight-line code (a basic block)
  • example: change the order of references in a
    block or expression
  • Global optimizations extend local optimizations across branches
  • example: loop unrolling
  • Register allocation to optimize the storage of
    variables in registers and minimize memory
    fetches
  • Machine-dependent optimizations
  • take advantage of the specific architecture
  • see Figure B.19, page B-25

10
Optimizations Continued
  • Two example optimizations
  • Sub-expression Elimination
  • take a sub-expression that is used more than
    once, store the result as computed the first time
    in a register and reuse it rather than recompute
    the expression again, or store it in memory only
    to be retrieved again
  • Graph coloring
  • an algorithmic technique to determine how values
    can be distributed into registers (a heuristic or
    approximate version is used since graph coloring
    is NP-complete)
  • One problem with optimizations performed in
    segments is phase-ordering
  • a transformation at one level may directly affect
    possible optimizations at another level
  • for example, expanding a procedure at the
    high-level without knowing the size of the
    procedure
  • another example, register allocation is performed
    near the end of the optimization techniques but
    sub-expression elimination requires the
    allocation of registers
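
A minimal sketch of a greedy graph-coloring heuristic for register allocation. This is one simple heuristic, not necessarily the one any particular compiler uses. The interference graph (variables that are live at the same time share an edge) and all names are hypothetical.

```python
def greedy_color(interference, num_registers):
    """Assign a register index to each variable, or None if it must be spilled."""
    assignment = {}
    # Color the most constrained (highest-degree) variables first.
    order = sorted(interference, key=lambda v: len(interference[v]), reverse=True)
    for var in order:
        # Registers already taken by neighbors that were given a register.
        used = {assignment[n] for n in interference[var]
                if assignment.get(n) is not None}
        free = [r for r in range(num_registers) if r not in used]
        assignment[var] = free[0] if free else None  # None means spill to memory
    return assignment

# Hypothetical interference graph: a, b, c are all live together; d overlaps c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
with_three = greedy_color(graph, 3)
with_two = greedy_color(graph, 2)
```

With three registers every variable gets one; with two, the a-b-c triangle forces one variable to be spilled, which costs extra loads and stores.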

11
Introduction to MIPS
  • MIPS is a RISC architecture derived from previous
    RISC architectures
  • designed for pipeline efficiency and efficiency
    as a compiler target
  • General-purpose register set and load-store
    architecture
  • 32 64-bit general purpose (integer) registers
  • labeled R0, ..., R31, where R0 is always 0
  • values are sign extended when loaded into a
    register if they are not of the right size
  • 32 64-bit floating point registers
  • labeled F0, ..., F31, where only half the register
    is used for floats
  • Addressing modes
  • support displacement and immediate addressing
    only
  • direct (absolute) addressing can be accomplished by
    using R0 as the base register
  • register indirect can be accomplished by using a
    displacement of 0
  • displacements of 12-16 bits and immediate data of
    8-16 bits
  • memory is byte addressable and 64-bit addresses
    are used

12
More on MIPS
  • Instructions
  • fewer than 100 operations (the op code requires 7
    bits; however, we will reduce this to 6 bits by
    using one op code for all integer ALU operations)
  • 32-bit instructions
  • 3 instruction formats used (shown on the next
    slide)
  • I-type is used for loads, stores, conditional
    branches and ALU operations that use immediate
    data
  • R-type is used for all other ALU operations and
    FP operations
  • J-type is only used for jump, jump and link
    (procedure call), trap, return
  • Data types are 8, 16, 32 and 64 bit integer and
    32 and 64 bit floating point
  • no character or string types (characters treated
    as ints, strings as arrays of ints)
  • Immediate data and displacements are limited to
    16 bits except for Jump instructions in which
    case displacements are limited to 26 bits

13
MIPS continued
  • 3-operand instructions are available as long as
    all operands are in registers (R-type) or are 2
    registers and an immediate datum (I-type)
  • the immediate datum (which is also used for
    displacement offsets) is limited to 16 bits
    (2's complement) but extended to 32 bits
  • funct is the specific type of ALU or FP function
14
MIPS Architecture in 5 Stages
15
Step-by-Step Description
  • IF Stage
  • PC sent to Instruction memory (cache) to fetch
    next instruction
  • PC incremented by 4
  • MUX is controlled by the previous instruction's
    result (if it was a branch), and the PC is
    replaced by either PC + 4 or the new branch location
  • instruction moved into the IR
  • ID Stage
  • Instruction examined as follows
  • bits 6..10 denote one source register (for I-type
    and R-type)
  • bits 11..15 denote the other source register (for
    R-type)
  • bits 16..31 store an immediate datum or
    displacement; this value is sign extended to 32
    bits
  • 1st register value placed in A
  • 2nd register value placed in B
  • sign-extended value placed in Imm
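
The field extraction in the ID stage can be sketched as below. It assumes the slide numbers bits from the most significant end (bit 0 is the leftmost bit of the 32-bit word), that bits 0..5 hold the op code, and that the immediate occupies the low 16 bits. The function names and the example field values are hypothetical.

```python
def encode_i_type(opcode, rs, rt, imm):
    """Pack I-type fields into a 32-bit word (helper for building examples)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def decode_i_type(word):
    """Split a 32-bit I-type word into its fields, sign extending the immediate."""
    opcode = (word >> 26) & 0x3F  # bits 0..5
    rs = (word >> 21) & 0x1F      # bits 6..10: first source register
    rt = (word >> 16) & 0x1F      # bits 11..15: second register
    imm = word & 0xFFFF           # low 16 bits: immediate or displacement
    if imm & 0x8000:              # sign bit set, so the value is negative
        imm -= 0x10000
    return {"opcode": opcode, "rs": rs, "rt": rt, "imm": imm}

# Example word with a negative displacement of -4 (field values are illustrative).
word = encode_i_type(0x23, 2, 1, -4)
fields = decode_i_type(word)
```

The sign extension step is what lets a 16-bit displacement reach addresses below the base register as well as above it.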
16
Architecture continued
  • EX Stage
  • if ALU (or FP) operation
  • the two register values are sent to the ALU and
    the proper circuit is selected
  • the result is passed on to the next stage
  • if branch
  • compare register to 0
  • in ALU, use the adder to compute new PC (PC +
    displacement)
  • if load or store
  • use the adder in the ALU to compute the datum's
    address (base + displacement) by adding the given
    register's value to the displacement value, and send
    this address to the next stage

17
Architecture continued
  • MEM stage
  • if load or store, perform the data access given
    the memory location as computed in the EX stage
  • on a load, the new datum is stored temporarily in
    LMD
  • if branch, finish the condition using the MUX by
    determining whether to replace the PC with PC + 4
    or the new address
  • WB stage
  • if load or ALU operation, the result is in LMD
    (load) or ALUOutput (ALU operation); use
    the MUX to write the result to the proper
    register in the register file (see the ID stage)

18
Comments on the MIPS Architecture
  • The simplified nature of MIPS means that many
    tasks will require more than a single operation
    (compared to more complex instruction sets that
    could accomplish the task with 1 operation)
  • load registers before performing an ALU operation
    on the values
  • two instructions to perform indirect memory
    access
  • load the pointer from memory into a register,
    then load the datum
  • two or more instructions to perform scaled or
    indexed modes
  • The CPI of MIPS operations is less than in other
    instruction sets, making up for this
  • all operations have a CPI of 4 except loads and
    ALU operations, which have a CPI of 5 (because
    they must write their results to registers in the
    WB stage)
  • The static size of all MIPS operations makes it
    easier to deal with pre-fetching and pipelining

19
More Comments
  • The architecture requires the following hardware
    elements to implement
  • the ALU should have all integer operations
    (arithmetic, logic)
  • we address floating point operations later in the
    semester
  • an additional adder for the IF stage
  • several temporary registers
  • IR, A, B, Imm, NPC, ALUOutput, LMD
  • multiplexors to select
  • what to do after a condition is evaluated
  • whether a computed value is to be used later in
    temporary registers A or B
  • whether to use a register value or the immediate
    datum
  • multiplexors in the ALU to select the output
    based on the specific ALU operation (not shown in
    the figure)
  • multiplexors in the register file to select which
    register to send on to the A or B temporary
    registers, and a demultiplexor to pass along the
    LMD value into one of the registers (not shown in
    the figure)

20
Sample Problem 1
  • For each of the following compiler optimizations
    below, explain whether the optimization will
    provide a CPU performance increase because it
    will reduce overall program CPI, reduce IC, or
    reduce both
  • Code motion
  • Register optimization with graph coloring
  • Procedure integration
  • Global common subexpression elimination
  • Copy propagation
  • Solution
  • code motion: in removing redundant instructions
    from a loop, we are lowering the IC of the program

21
Sample Problem 1 Solution continued
  • Register optimization with graph coloring
  • by better matching variables to registers, we can
    reduce the number of loads and stores required,
    so we are lowering IC
  • Procedure integration
  • we replace procedure calls/returns with the code
    itself reducing IC
  • we may also reduce run-time stack communication
    so that accessing parameters is actually
    accessing values in registers, thus reducing IC
    even more as well as the overall CPI since there
    will now be fewer memory operations
  • Global common subexpression elimination
  • reusing a previously computed value removes later
    computation, so this lowers IC
  • Copy propagation
  • if the copy propagation is the replacement of a
    variable with an immediate datum (as in changing
    x = y to x = 5), this will lower CPI because we
    are replacing memory accesses with register
    accesses
  • however, if the copy propagation is the
    replacement of an expression's evaluation with an
    immediate datum (as in changing x = y + z to
    x = 5), then we are lowering IC

22
Sample Problem 2
  • Using the MIPS instruction set, write a set of
    code to compute the average of the elements in an
    int array
  • assume the array starts at memory location 50000
  • assume the variable storing number of items of
    the array is at memory location 10000
  • Store the resulting float value at location 10004

23
Sample Problem 3
  • Using MIPS, write a set of code that will find
    the largest and smallest items in an array
  • Array starts at a location pointed to by register
    R5
  • Array contains 500 elements
  • Store the min in R1 and the max in R2

24
Sample Problem 4
  • Consider adding a register-memory ALU operation
    to MIPS and replace the two instruction sequence
  • LW R1, 0(R2)
  • ADD R3, R3, R1
  • with
  • ADD R3, 0(R2)
  • Assume this will cause an increase to the clock
    cycle of 5%. Using the gcc benchmark (see
    figure B.27), what percentage of loads must be
    eliminated for the new machine to achieve the
    same performance?
  • CPI remains the same, so to achieve the same
    performance, the old CPU time must equal the new
    CPU time, or
  • $IC_{old} \times CPI_{old} \times CCT_{old} = IC_{new} \times CPI_{old} \times CCT_{old} \times 1.05$,
    where CCT is the clock cycle time
  • Solving for $IC_{new}$ gives $IC_{new} = IC_{old} / 1.05 = 0.952 \times IC_{old}$,
    so we must remove 4.8% of the instructions
  • We must remove enough loads to reduce the overall
    IC by 4.8%
  • Since loads make up 25.1% of gcc's instruction
    mix, we must remove 4.8 / 25.1 = 19.1% of the
    loads
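
The arithmetic for Sample Problem 4 can be checked with a short sketch. Variable names are hypothetical; the 5% clock increase and the 25.1% load fraction come from the problem. Without intermediate rounding the result is about 19.0%; the 19.1% figure comes from rounding the IC reduction to 4.8% first.

```python
clock_increase = 0.05   # clock cycle time grows by 5%
load_fraction = 0.251   # loads as a fraction of gcc's instruction mix

# Equal CPU time with unchanged CPI: IC_new * 1.05 = IC_old.
ic_ratio = 1 / (1 + clock_increase)          # IC_new / IC_old
ic_reduction = 1 - ic_ratio                  # fraction of instructions to remove
loads_to_remove = ic_reduction / load_fraction

print(f"IC_new / IC_old = {ic_ratio:.3f}")
print(f"instructions to remove = {ic_reduction:.1%}")
print(f"loads to remove = {loads_to_remove:.1%}")
```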

25
Sample Problem 5
  • Compute the MIPS effective CPI for the gap and gcc
    benchmarks (average their instruction mixes)
  • assume 60% of conditional branches are taken and
    all miscellaneous instructions are ALU
    instructions
  • use the following CPI values
  • ALU instructions: 1.0
  • Load-stores: 1.4
  • Conditional branches taken: 2.0
  • Conditional branches not taken: 1.5
  • Jumps: 1.2
  • Solution
  • Average loads = (26.5% + 25.1%) / 2 = 25.8%
  • Average stores = (10.3% + 13.2%) / 2 = 11.75%
  • Average conditional branches = (9.3% + 12.1%) / 2
    = 10.7%
  • Average jumps (includes returns and calls) =
    2.95%
  • Average ALU = (50.1% + 47.2%) / 2 = 48.65%
  • $CPI = (25.8\% + 11.75\%) \times 1.4 + 10.7\% \times 60\% \times 2.0 + 10.7\% \times 40\% \times 1.5 + 2.95\% \times 1.2 + 48.65\% \times 1.0 = 1.2402$
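
The effective CPI for Sample Problem 5 can be recomputed with a short sketch. The dictionary layout and names are hypothetical; the percentages and per-class CPI values are the ones given in the problem, and the jump average is taken directly as 2.95% since its two inputs are not listed.

```python
# Instruction mix percentages for the two benchmarks, as given in the problem.
mix_pairs = {
    "load": (26.5, 25.1),
    "store": (10.3, 13.2),
    "cond_branch": (9.3, 12.1),
    "alu": (50.1, 47.2),
}
# Average the two benchmarks and convert percentages to fractions.
avg = {name: sum(pair) / 2 / 100 for name, pair in mix_pairs.items()}
avg["jump"] = 2.95 / 100  # average given directly in the solution

taken = 0.60  # fraction of conditional branches that are taken

cpi = (
    (avg["load"] + avg["store"]) * 1.4
    + avg["cond_branch"] * taken * 2.0
    + avg["cond_branch"] * (1 - taken) * 1.5
    + avg["jump"] * 1.2
    + avg["alu"] * 1.0
)
print(f"effective CPI = {cpi:.4f}")
```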

26
Sample Problem 6
  • Consider the following two changes to the MIPS
    architecture
  • 1. Move the MUX in the MEM stage into the EX
    stage to complete branches there, reducing branch
    CPI to 3
  • 2. For ALU operations, perform the write back to
    register in stage 4 instead of stage 5 by adding
    another MUX
  • Assuming that these changes require an increase
    in clock cycle time from 1 ns to 1.1 ns, is this
    worth doing? Use the average integer benchmark
    values in figure B.27
  • Solution
  • Recall that CPI was 5 for loads and ALU and 4 for
    stores and branches; now CPI is 4 for ALU and 3
    for branches
  • $CPU\ Time_{old} = (5 \times (0.26 + 0.47) + 4 \times (0.10 + 0.16)) \times CCT_{old} = 4.69 \times CCT_{old}$
  • $CPU\ Time_{new} = (5 \times 0.26 + 4 \times (0.47 + 0.10) + 3 \times 0.16) \times CCT_{old} \times 1.1 = 4.466 \times CCT_{old}$
  • So the changes provide a speedup of 4.69 / 4.466 =
    1.05, or a 5% speedup
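
The comparison in Sample Problem 6 can be checked with a short sketch. Names are hypothetical; the instruction fractions (26% loads, 47% ALU, 10% stores, 16% branches) and cycle counts are the ones used in the solution. Times are expressed per instruction, in units of the old clock cycle time.

```python
mix = {"load": 0.26, "alu": 0.47, "store": 0.10, "branch": 0.16}

cpi_old = {"load": 5, "alu": 5, "store": 4, "branch": 4}
cpi_new = {"load": 5, "alu": 4, "store": 4, "branch": 3}

clock_scale_new = 1.1  # clock cycle time goes from 1 ns to 1.1 ns

time_old = sum(mix[k] * cpi_old[k] for k in mix)
time_new = sum(mix[k] * cpi_new[k] for k in mix) * clock_scale_new
speedup = time_old / time_new

print(f"old = {time_old:.3f}, new = {time_new:.3f}, speedup = {speedup:.3f}")
```

Since the speedup is greater than 1, the lower cycle counts outweigh the slower clock for this instruction mix.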

27
Sample Problem 7
  • Architects are considering whether to add an
    autoincrement/decrement addressing mode to MIPS
  • the consequences of such a change are that most
    programs would require fewer instructions (IC
    would be lowered) because the new mode would
    combine the memory access and the change to the
    offset in one operation
  • but would require a longer EX stage and thus a
    longer clock cycle time
  • How should the architects decide whether to
    include this mode?
  • Consider figure B.7 on page B-11, which shows that
    benchmarks often use displacement for data
    access (32-55%)
  • each displacement access will probably be
    followed by altering the offset (an add or
    subtract)
  • assume the fraction of loads/stores is 40%,
    that 40% of these use displacement, and that 80%
    of displacement addressing operations require a
    change to the offset

28
Solution
  • Without autoincrement/decrement, 80% of the
    displacement loads and stores are followed by an
    ALU operation
  • 40% of instructions are loads and stores
  • of the loads and stores, 40% use displacement
  • of these, 80% increment or decrement the
    displacement immediately after the load or store,
    so we have
  • $0.4 \times 0.4 \times 0.8 = 0.128$ (12.8%) of the operations
    are followed by an add or subtract
  • these ALU operations can now be removed, so IC
    decreases by 12.8%
  • Recall $CPU\ Time = IC \times CPI \times Clock\ Cycle\ Time$
  • For the new mode to be worthwhile, the
    decrease in IC must outweigh the
    increase in Clock Cycle Time
  • if so, then CPU Time decreases; otherwise CPU
    Time increases and the new addressing mode is not
    worthwhile
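
The decision rule above can be turned into a break-even calculation. This sketch assumes CPI is unchanged by the new mode, which the slide does not state explicitly; the function and variable names are hypothetical.

```python
load_store_fraction = 0.40    # fraction of instructions that are loads/stores
displacement_fraction = 0.40  # fraction of those using displacement addressing
update_fraction = 0.80        # fraction of those followed by an add/subtract

ic_removed = load_store_fraction * displacement_fraction * update_fraction
ic_ratio = 1 - ic_removed     # IC_new / IC_old

# CPU time ratio = ic_ratio * (1 + clock_increase); break even when it equals 1.
break_even_clock_increase = 1 / ic_ratio - 1

def is_worthwhile(clock_increase):
    """True if CPU time decreases for the given relative clock cycle increase."""
    return ic_ratio * (1 + clock_increase) < 1

print(f"IC removed = {ic_removed:.1%}")
print(f"break-even clock increase = {break_even_clock_increase:.1%}")
```

Under these assumptions the new mode pays off only if the clock cycle time grows by less than roughly 14.7%.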