Title: Instruction Set Principles
1Instruction Set Principles
- The instruction set is the portion of the architecture that is visible to the programmer (or compiler writer). Here we look at various issues involved in designing an instruction set, and then look at the MIPS architecture
- Register issues
  - compilers will reserve registers for temporary variables, parameter passing, and commonly used variables
  - are instructions 2-operand or 3-operand?
  - how many of the operands may be memory addresses, or is it a load-store instruction set?
- Memory issues
  - how are memory addresses interpreted (big or little endian)?
    - this is an issue only if accesses can be made to sizes smaller than a word, but since that is typical in many computers, it must be addressed
  - how are memory addresses specified?
    - which addressing modes are supported?
    - how many addressing modes are there, and will complex ones be allowed?
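To make the endianness question concrete, here is a minimal Python sketch. The 32-bit example value and the helper name `bytes_in_memory` are illustrative assumptions, not taken from the slides. It lays the same word out in memory under both byte orders.

```python
def bytes_in_memory(word, byteorder):
    """Return the 4 bytes of a 32-bit word in increasing address order."""
    return list(word.to_bytes(4, byteorder=byteorder))

WORD = 0x0A0B0C0D  # hypothetical example value

# Big endian: most significant byte sits at the lowest address.
big = bytes_in_memory(WORD, "big")
# Little endian: least significant byte sits at the lowest address.
little = bytes_in_memory(WORD, "little")

# A byte-sized load from the word's base address returns a different value
# under each convention, which is why sub-word accesses expose endianness.
print(hex(big[0]), hex(little[0]))
```

If every access were a full word, both layouts would behave identically from the program's point of view, so the difference only shows up with byte or half-word accesses.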
2Comparison of Register Set Types
3Addressing Modes
- Instruction operands will reference
  - a constant (immediate datum)
  - a register
  - a location in memory
  - or some combination of these
- Numerous addressing modes can significantly reduce the instruction count of a program
  - however, many of these addressing modes will add to the CPI of the instruction due to the time it takes to compute the effective address
- Types of modes
  - PC-relative addressing, which specifies an instruction's location, as used in branches
  - Data addressing modes (as shown on the next slide)
- Design issues
  - Displacement
    - how big is the displacement?
    - most of the SPEC benchmarks use displacements of no more than 15 bits
  - Immediate data
  - How often are these modes used, and how big should the displacement or datum be?
    - see Figures B.7, B.8, B.9 and B.10 for answers to these questions (pages B-11 to B-13)
5Types of Instructions
- Types of instructions
  - arithmetic (integer) and logical operations
  - floating point (+, -, *, /)
    - notice that we are separating int and FP operations; this is because they will be handled in different hardware and use different register sets, even though their formats will largely be the same
    - convert operations will also go into these two categories, to convert from int to FP and FP to int
  - data transfer (load, store)
  - control (branch, jump, procedure call/return, trap)
  - OS related (OS system call, virtual memory, others)
  - strings (move, compare, search)
  - graphics (pixel operation, compress/decompress)
    - we won't concern ourselves with these last three categories in this course
- See Figure B.13 (page B-16) for the most common types of instructions as used in the 80x86 architecture
6Branch Instructions
- We focus on branch instructions because there are some design issues to consider
  - these will usually use PC-relative branching
    - adding an offset to the PC
  - two advantages of PC-relative branching over absolute branching
    - the displacement is usually small, allowing for smaller instructions
    - the absolute target address does not need to be known at compile time, allowing for easy use of run-time loaded libraries
- Types of branches
  - procedure calls and returns
    - the return address is usually stored on the run-time stack or in a register
    - parameters may also be passed on the run-time stack or through register windows
  - conditional branches
  - jumps (unconditional)
- Conditional branches are the most common
  - see Figure B.14 on page B-17, where 75-82% of all branches are conditional
7Branching Questions
- What is the form of the condition?
  - complex conditions can be time consuming
  - some architectures limit the condition to a simple equality/inequality test
    - often, just a test of a register value against 0
  - in other cases, the condition might involve condition codes
- When is the comparison performed?
  - with the branch instruction or prior to the branch instruction
    - that is, is the instruction a compare-and-branch, or are they separated?
- What is the distance of the branch?
  - Figure B.15 shows that most branch distances need fewer than 10 bits
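The link between branch distance and the size of the displacement field can be sketched in a few lines of Python. The addresses and function names below are hypothetical. The sketch adds a raw byte offset directly to the PC. Real encodings often measure the offset from the following instruction and in units of words, which is ignored here.

```python
def signed_bits_needed(displacement):
    """Smallest two's-complement field width that can hold `displacement`."""
    bits = 1
    while not (-(1 << (bits - 1)) <= displacement <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits

def branch_target(pc, displacement):
    """PC-relative target: the offset is simply added to the PC."""
    return pc + displacement

pc = 0x1000       # hypothetical address of the branch instruction
target = 0x0FE0   # hypothetical loop head, 32 bytes earlier
displacement = target - pc

print(displacement, signed_bits_needed(displacement))
```

Because the field stores only the distance, the same encoded branch works wherever the code is loaded, as long as the branch and its target move together.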
8Encoding an Instruction Set
- Design issues
  - variable-length vs fixed-length (or hybrid) (see Figure B.18 on page B-22)
  - deriving op codes for each instruction
    - although this is trivial, there are some concerns: should we have several ADD instructions, each expecting a different type of operand (for instance, AddInteger vs AddFP vs AddDouble, etc.), or specify the operand type in the instruction?
  - how are operand addresses specified?
    - will there be separate bytes of the instruction to specify this, or will it be part of the operand?
  - how many operands can an instruction reference?
    - ideally, we want enough bits to specify 2 or 3 operands, by memory location, register, or displacement
- Concerns
  - we want as many registers as possible, BUT the more registers, the more bits needed in the instruction to select among them
  - many addressing modes are seldom used
    - should they be omitted?
  - instruction sizes should be multiples of bytes (e.g., 8, 16, 24, 32 bits)
  - as instruction sizes increase, so does the size of the programs!
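The tension between register count and instruction bits can be illustrated with a small Python sketch. The function names are hypothetical, and the 32-bit instruction with a 6-bit op code is assumed only as a convenient example.

```python
def register_field_bits(num_registers):
    """Bits needed to name one of `num_registers` registers."""
    return (num_registers - 1).bit_length()

def bits_left_after_registers(instruction_bits, opcode_bits, num_registers, operands):
    """Bits remaining for immediates, displacements, or function codes."""
    used = opcode_bits + operands * register_field_bits(num_registers)
    return instruction_bits - used

# Three register operands in a 32-bit instruction with a 6-bit op code.
with_32_regs = bits_left_after_registers(32, 6, 32, 3)
with_64_regs = bits_left_after_registers(32, 6, 64, 3)
print(with_32_regs, with_64_regs)
```

Doubling the register file costs one extra bit per operand, so a 3-operand format loses three bits that could otherwise hold a larger immediate or displacement.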
9Compiler Optimizations
- In order to support the increasingly complex hardware, we need compiler support in the form of machine code optimizations. Here are some examples
  - High-level optimizations on source code
    - example: procedure in-lining, loop transformation
  - Local optimizations on straight-line code (within a basic block)
    - example: change the order of references in a block or expression
  - Global optimizations extend local optimizations across branches
    - example: loop unrolling
  - Register allocation to optimize the storage of variables in registers and minimize memory fetches
  - Machine-dependent optimizations
    - take advantage of the specific architecture
- see Figure B.19, page B-25
10Optimizations Continued
- Two example optimizations
  - Sub-expression elimination
    - take a sub-expression that is used more than once, store the result as computed the first time in a register, and reuse it rather than recomputing the expression or storing it in memory only to retrieve it again
  - Graph coloring
    - an algorithmic technique to determine how values can be distributed into registers (a heuristic or approximate version is used since graph coloring is NP-complete)
- One problem with optimizations performed in segments (phases) is phase ordering
  - a transformation at one level may directly affect possible optimizations at another level
    - for example, expanding a procedure at the high level without knowing the size of the procedure
    - another example: register allocation is performed near the end of the optimization phases, but sub-expression elimination requires the allocation of registers
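As a rough illustration of the heuristic flavor of graph coloring, the Python sketch below greedily colors a small interference graph. The graph, the variable names, and the degree-based ordering are assumptions made for the example. Production register allocators use more refined heuristics and spill-cost estimates.

```python
def color_interference_graph(interference, num_registers):
    """Greedy heuristic: give each variable the lowest-numbered register not
    used by an already-colored neighbor; spill the variable if none is free."""
    assignment = {}
    spilled = []
    # Visit the most constrained variables (highest degree) first.
    order = sorted(interference, key=lambda v: (-len(interference[v]), v))
    for var in order:
        taken = {assignment[n] for n in interference[var] if n in assignment}
        free = [r for r in range(num_registers) if r not in taken]
        if free:
            assignment[var] = free[0]
        else:
            spilled.append(var)  # would live in memory instead of a register
    return assignment, spilled

# Hypothetical interference graph: an edge means two values are live at once.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
assignment, spilled = color_interference_graph(graph, 3)
tight_assignment, tight_spilled = color_interference_graph(graph, 2)
print(assignment, spilled, tight_spilled)
```

With three registers every value fits. With two, one value in the triangle a-b-c has to be spilled, which is the case where extra loads and stores appear.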
11Introduction to MIPS
- MIPS is a RISC architecture derived from previous RISC architectures
  - designed for pipeline efficiency and efficiency as a compiler target
- General-purpose register set and load-store architecture
  - 32 64-bit general-purpose (integer) registers
    - labeled R0, ..., R31, where R0 is always 0
    - values are sign extended when loaded into a register if they are narrower than the register
  - 32 64-bit floating point registers
    - labeled F0, ..., F31, where only half the register is used for single-precision floats
- Addressing modes
  - support displacement and immediate addressing only
    - direct (absolute) addressing can be accomplished by using R0 as the base register
    - register indirect can be accomplished by using a displacement of 0
  - displacements of 12-16 bits and immediate data of 8-16 bits
- memory is byte addressable and 64-bit addresses are used
12More on MIPS
- Instructions
  - fewer than 100 operations (the op code would require 7 bits; however, we will reduce this to 6 bits by using one op code for all integer ALU operations)
  - 32-bit instructions
  - 3 instruction formats used (shown on the next slide)
    - I-type is used for loads, stores, conditional branches and ALU operations that use immediate data
    - R-type is used for all other ALU operations and FP operations
    - J-type is only used for jump, jump and link (procedure call), trap, and return
- Data types are 8-, 16-, 32- and 64-bit integers and 32- and 64-bit floating point
  - no character or string types (characters treated as ints, strings as arrays of ints)
- Immediate data and displacements are limited to 16 bits, except for jump instructions, in which case displacements are limited to 26 bits
13MIPS continued
- 3-operand instructions are available as long as all operands are in registers (R-type) or the operands are 2 registers and an immediate datum (I-type)
- the immediate datum (which is also used for displacement offsets) is limited to 16 bits (2's complement) but is sign extended to 32 bits
- funct is the specific type of ALU or FP function
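The I-type layout described here (a 6-bit op code, two 5-bit register fields, and a 16-bit two's-complement immediate) can be sketched in Python as follows. The helper names are hypothetical, and the load-word op code value 0x23 is taken from the classic 32-bit MIPS encoding rather than from these slides.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value

def encode_i_type(opcode, rs, rt, imm):
    """Pack the four I-type fields into one 32-bit instruction word."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def decode_i_type(word):
    """Unpack an I-type word, sign extending the 16-bit immediate."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "imm": sign_extend(word & 0xFFFF, 16),
    }

LW_OPCODE = 0x23  # assumption: load-word op code in classic MIPS

# Encode a load of R1 from address R2 - 4, then decode it again.
word = encode_i_type(LW_OPCODE, rs=2, rt=1, imm=-4)
fields = decode_i_type(word)
print(hex(word), fields)
```

The same 16-bit field serves as the displacement for loads and stores and as the immediate operand for ALU instructions, which is why a single sign-extension step covers both uses.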
14MIPS Architecture in 5 Stages
15Step-by-Step Description
- IF Stage
  - PC sent to instruction memory (cache) to fetch the next instruction
  - PC incremented by 4
  - MUX is controlled by the previous instruction's result (if it was a branch), and the PC is replaced by either PC + 4 or the new branch location
  - instruction moved into the IR
- ID Stage
  - Instruction examined as follows
    - bits 6..10 denote one source register (for I-type and R-type)
    - bits 11..15 denote the other source register (for R-type)
    - bits 16..31 store an immediate datum or displacement; this value is sign extended to 32 bits
  - 1st register value placed in A
  - 2nd register value placed in B
  - sign-extended value placed in Imm
16Architecture continued
- EX Stage
  - if ALU (or FP) operation
    - the two register values are sent to the ALU and the proper circuit is selected
    - the result is passed on to the next stage
  - if branch
    - compare register to 0
    - in ALU, use the adder to compute the new PC (PC + displacement)
  - if load or store
    - use the adder in the ALU to compute the datum's address (base + displacement) by adding the given register's value to the displacement value, and send this address to the next stage
17Architecture continued
- MEM stage
  - if load or store, perform the data access at the memory location computed in the EX stage
    - on a load, the new datum is stored temporarily in LMD
  - if branch, finish the condition using the MUX by determining whether to replace the PC with PC + 4 or the new address
- WB stage
  - if load or ALU operation, the result is in LMD (load) or ALUOutput (ALU operation); use the MUX to write the result to the proper register in the register file (see the ID stage)
18Comments on the MIPS Architecture
- The simplified nature of MIPS means that many tasks will require more than a single operation (compared to more complex instruction sets that could accomplish the task with 1 operation)
  - load registers before performing an ALU operation on the values
  - two instructions to perform indirect memory access
    - load the pointer from memory into a register, then load the datum
  - two or more instructions to perform scaled or indexed modes
- The CPI of MIPS operations is lower than in other instruction sets, making up for this
  - all operations have a CPI of 4 except loads and ALU operations, which have a CPI of 5 (because they must write their results to registers in the WB stage)
- The fixed size of all MIPS instructions makes it easier to deal with pre-fetching and pipelining
19More Comments
- The architecture requires the following hardware elements to implement
  - the ALU should have all integer operations (arithmetic, logic)
    - we address floating point operations later in the semester
  - an additional adder for the IF stage
  - several temporary registers
    - IR, A, B, Imm, NPC, ALUOutput, LMD
  - multiplexors to select
    - what to do after a condition is evaluated
    - whether a computed value is to be used later in temporary registers A or B
    - whether to use a register value or the immediate datum
  - multiplexors in the ALU to select the output based on the specific ALU operation (not shown in the figure)
  - multiplexors in the register file to select which register to send on to the A or B temporary registers, and a demultiplexor to pass along the LMD value into one of the registers (not shown in the figure)
20Sample Problem 1
- For each of the following compiler optimizations, explain whether the optimization will provide a CPU performance increase because it will reduce overall program CPI, reduce IC, or reduce both
  - Code motion
  - Register optimization with graph coloring
  - Procedure integration
  - Global common subexpression elimination
  - Copy propagation
- Solution
  - Code motion: in removing redundant instructions from a loop, we are lowering the IC of the program
21Sample Problem 1 Solution continued
- Register optimization with graph coloring
  - by better matching variables to registers, we can reduce the number of loads and stores required, so we are lowering IC
- Procedure integration
  - we replace procedure calls/returns with the code itself, reducing IC
  - we may also reduce run-time stack communication so that accessing parameters is actually accessing values in registers, thus reducing IC even more as well as the overall CPI, since there will now be fewer memory operations
- Global common subexpression elimination
  - reusing a previously computed value removes later computation, so this lowers IC
- Copy propagation
  - if the copy propagation is the replacement of a variable with an immediate datum (as in changing x = y to x = 5), this will lower CPI because we are replacing memory accesses with register accesses
  - however, if the copy propagation is the replacement of an expression's evaluation with an immediate datum (as in changing x = y + z to x = 5), then we are lowering IC
22Sample Problem 2
- Using the MIPS instruction set, write code to compute the average of the elements in an int array
  - assume the array starts at memory location 50000
  - assume the variable storing the number of items in the array is at memory location 10000
  - store the resulting float value at location 10004
23Sample Problem 3
- Using MIPS, write code that will find the largest and smallest items in an array
  - the array starts at a location pointed to by register R5
  - the array contains 500 elements
  - store the min in R1 and the max in R2
24Sample Problem 4
- Consider adding a register-memory ALU operation to MIPS to replace the two-instruction sequence
  - LW R1, 0(R2)
  - ADD R3, R3, R1
- with
  - ADD R3, 0(R2)
- Assume this will cause a 5% increase in the clock cycle time. Using the gcc benchmark (see Figure B.27), what percentage of loads must be eliminated for the new machine to achieve the same performance?
- Solution
  - CPI remains the same, so to achieve the same performance, CPU time_old must equal CPU time_new, or
    - IC_old × CPI_old × CCT_old = IC_new × CPI_old × (1.05 × CCT_old), where CCT is the clock cycle time
  - Solving for IC_new gives IC_new / IC_old = 1 / 1.05 = 0.952, so we must remove 4.8% of the instructions
  - We must remove enough loads to reduce the overall IC by 4.8%
  - Since loads make up 25.1% of gcc's instruction mix, we must remove 4.8% / 25.1% = 19.1% of the loads
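The arithmetic for this problem can be checked with a short Python sketch. The constant and variable names are illustrative. The script keeps full precision, so its final figure comes out marginally below the slide's 19.1%, which divides the already-rounded 4.8% by 25.1%.

```python
CLOCK_STRETCH = 1.05   # new clock cycle time relative to the old one
LOAD_FRACTION = 0.251  # loads as a fraction of gcc's instruction mix

# Equal CPU time with unchanged CPI requires IC_new / IC_old = 1 / 1.05.
ic_ratio = 1 / CLOCK_STRETCH
instructions_to_remove = 1 - ic_ratio

# Every eliminated instruction is a load, so express the cut as a share of loads.
loads_to_remove = instructions_to_remove / LOAD_FRACTION

print(f"IC ratio: {ic_ratio:.3f}")
print(f"instructions removed: {instructions_to_remove:.1%}")
print(f"loads removed: {loads_to_remove:.1%}")
```

Either way, roughly one load in five would have to be folded into a register-memory ADD just to break even.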
25Sample Problem 5
- Compute the effective CPI of MIPS for the gap and gcc benchmarks (average their instruction mixes)
  - assume 60% of conditional branches are taken and all miscellaneous instructions are ALU instructions
  - use the following CPIs
    - ALU instructions: 1.0
    - Load-stores: 1.4
    - Conditional branches taken: 2.0
    - Conditional branches not taken: 1.5
    - Jumps: 1.2
- Solution
  - Average loads = (26.5% + 25.1%) / 2 = 25.8%
  - Average stores = (10.3% + 13.2%) / 2 = 11.75%
  - Average conditional branches = (9.3% + 12.1%) / 2 = 10.7%
  - Average jumps (includes returns and calls) = 2.95%
  - Average ALU = (50.1% + 47.2%) / 2 = 48.65%
  - CPI = (25.8% + 11.75%) × 1.4 + 10.7% × 60% × 2.0 + 10.7% × 40% × 1.5 + 2.95% × 1.2 + 48.65% × 1.0 = 1.2402
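The effective-CPI calculation for this problem is a weighted sum, which the following Python sketch reproduces from the averaged instruction mix and per-class CPI values given in the problem. The dictionary layout and names are illustrative choices.

```python
# Averaged gap/gcc instruction mix, in percent of all instructions.
mix_percent = {
    "load": 25.8,
    "store": 11.75,
    "cond_branch": 10.7,
    "jump": 2.95,
    "alu": 48.65,
}

TAKEN_FRACTION = 0.60  # share of conditional branches that are taken

# Sum each class's share of the mix times its CPI.
weighted_cycles = (
    (mix_percent["load"] + mix_percent["store"]) * 1.4
    + mix_percent["cond_branch"] * TAKEN_FRACTION * 2.0
    + mix_percent["cond_branch"] * (1 - TAKEN_FRACTION) * 1.5
    + mix_percent["jump"] * 1.2
    + mix_percent["alu"] * 1.0
)
effective_cpi = weighted_cycles / 100

print(round(effective_cpi, 4))
```

Loads and stores contribute the most above the baseline of 1.0 because they combine a large share of the mix with a CPI of 1.4.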
26Sample Problem 6
- Consider the following two changes to the MIPS architecture
  - 1. Move the MUX in the MEM stage into the EX stage to complete branches there, reducing branch CPI to 3
  - 2. For ALU operations, perform the write back to the register in stage 4 instead of stage 5 by adding another MUX
- Assuming that these changes require an increase in clock cycle time from 1 ns to 1.1 ns, is this worth doing? Use the average integer benchmark values in Figure B.27
- Solution
  - Recall that CPI was 5 for loads and ALU operations and 4 for stores and branches; now CPI is 4 for ALU operations and 3 for branches
  - CPU Time_old (per instruction) = [5 × (.26 + .47) + 4 × (.10 + .16)] × CCT_old = 4.69 × CCT_old
  - CPU Time_new (per instruction) = [5 × .26 + 4 × (.47 + .10) + 3 × .16] × CCT_old × 1.1 = 4.466 × CCT_old
  - So the changes provide a speedup of 4.69 / 4.466 = 1.05, or a 5% speedup
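The comparison can be recomputed with a small Python sketch using the instruction mix and CPI values from the solution. The dictionary and function names are illustrative. Note that the rounded mix sums to 0.99 rather than 1.0, as it does in the slide's own calculation.

```python
# Rounded average integer instruction mix used in the solution.
MIX = {"load": 0.26, "alu": 0.47, "store": 0.10, "branch": 0.16}

OLD_CPI = {"load": 5, "alu": 5, "store": 4, "branch": 4}
NEW_CPI = {"load": 5, "alu": 4, "store": 4, "branch": 3}

def time_per_instruction(mix, cpi, cycle_ns):
    """Average time per instruction: weighted CPI times the clock cycle time."""
    return sum(mix[kind] * cpi[kind] for kind in mix) * cycle_ns

old_time = time_per_instruction(MIX, OLD_CPI, 1.0)
new_time = time_per_instruction(MIX, NEW_CPI, 1.1)
speedup = old_time / new_time

print(round(old_time, 3), round(new_time, 3), round(speedup, 3))
```

The lower CPI for ALU operations and branches more than offsets the 10% longer clock cycle, but only by about 5%.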
27Sample Problem 7
- Architects are considering whether to add an autoincrement/decrement addressing mode to MIPS
  - the consequences of such a change are that most programs would require fewer instructions (IC would be lowered) because the new mode would combine the memory access and the change to the offset in one operation
  - but it would require a longer EX stage and thus a longer clock cycle time
- How should the architects decide whether to include this mode?
  - Consider Figure B.7 on page B-11, which shows that benchmarks often use displacement for data access (32-55%)
  - each displacement access will probably be followed by altering the offset (an add or subtract)
  - assume the fraction of loads/stores is 40%, that 40% of these use displacement, and that 80% of displacement addressing operations require a change to the offset
28Solution
- Without autoincrement/decrement, 80% of the displacement loads and stores are followed by an ALU operation
  - 40% of the instructions are loads and stores
  - of the loads and stores, 40% use displacement
  - of these, 80% increment or decrement the displacement immediately after the load or store, so we have
    - 0.4 × 0.4 × 0.8 = 0.128 (12.8%) of the operations are followed by an add or subtract
  - these ALU operations can now be removed, so IC decreases by 12.8%
- Recall CPU Time = IC × CPI × Clock Cycle Time
- In order for the new mode to be worthwhile, the relative decrease in IC must outweigh the relative increase in Clock Cycle Time (assuming CPI is unchanged)
  - if so, then CPU Time decreases; otherwise CPU Time increases and the new addressing mode is not worthwhile
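The break-even point implied by CPU Time = IC × CPI × Clock Cycle Time can be worked out numerically. The Python sketch below assumes CPI is unchanged by the new mode and uses the fractions stated in the problem. The function name and the two trial clock increases are hypothetical.

```python
LOAD_STORE_FRACTION = 0.40    # loads and stores as a share of all instructions
DISPLACEMENT_FRACTION = 0.40  # share of loads/stores using displacement
UPDATE_FRACTION = 0.80        # share of those followed by an add or subtract

# Fraction of all instructions that the new mode makes unnecessary.
ic_reduction = LOAD_STORE_FRACTION * DISPLACEMENT_FRACTION * UPDATE_FRACTION

def relative_cpu_time(ic_reduction, cycle_growth):
    """New CPU time divided by old CPU time, with CPI held constant."""
    return (1 - ic_reduction) * (1 + cycle_growth)

# Largest clock cycle increase for which the new mode still breaks even.
max_cycle_growth = 1 / (1 - ic_reduction) - 1

print(round(ic_reduction, 3), round(max_cycle_growth, 4))
print(relative_cpu_time(ic_reduction, 0.10) < 1)  # 10% slower clock
print(relative_cpu_time(ic_reduction, 0.20) < 1)  # 20% slower clock
```

Under these assumptions the clock cycle could grow by a little under 15% before the 12.8% drop in IC is cancelled out.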