Instruction Set Principles

Slides: 29

Provided by: rfox

1
Instruction Set Principles

The instruction set is the portion of the
architecture that is visible to the programmer
(or compiler writer). Here we look at various
issues involved in designing an instruction set,
and then look at the MIPS architecture
Register Issues
compilers will reserve registers for temporary
variables, parameter passing, commonly used
variables
are instructions 2-operand or 3-operand?
how many of the operands may be memory addresses
or is it a load-store instruction set?
Memory issues
how are memory addresses interpreted (big or
little endian)?
this is an issue only if accesses can be made to
sizes smaller than a word, but since this is
typical in many computers, it must be
addressed
how are memory addresses specified?
which addressing modes are supported?
how many addressing modes, and will complex ones be
allowed?
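
The byte-ordering question raised under memory issues can be illustrated with a minimal sketch. The word value 0x0A0B0C0D and the helper byte_at are hypothetical, chosen only for illustration; the sketch shows which byte a byte-sized access at the word's own address would return under each convention.

```python
# Hypothetical 32-bit example word; any value would do.
word = 0x0A0B0C0D

# Lay the word out in memory under each byte-ordering convention.
big = word.to_bytes(4, byteorder="big")        # most significant byte at lowest address
little = word.to_bytes(4, byteorder="little")  # least significant byte at lowest address


def byte_at(memory: bytes, offset: int) -> int:
    """Return the byte stored at the given offset from the word's address."""
    return memory[offset]


# A byte access at the word's address sees different data on each machine.
print(hex(byte_at(big, 0)))     # most significant byte of the word
print(hex(byte_at(little, 0)))  # least significant byte of the word
```

If only whole-word accesses were allowed, both layouts would return the same value, which is why the issue arises only for sub-word accesses.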

2
Comparison of Register Set Types
3
Addressing Modes

Instruction operands will reference
a constant (immediate datum)
a register
a location in memory
or some combination of these
Numerous addressing modes can significantly
reduce the instruction count of a program
however, many of these addressing modes will add
to the CPI of the instruction due to the time it
takes to compute the effective address
Types of Modes
PC-relative addressing to specify an
instruction's location, as used in branches
Data addressing modes (as shown on the next
slide)
Design issues
Displacement
how big is the displacement?
most of the SPEC benchmarks use displacements of
no more than 15 bits
Immediate data
How often are these modes used and how big should
the displacement or datum be?
see figures B.7, B.8, B.9 and B.10 for answers to
these questions (pages B-11 to B-13)

4
(No Transcript)
5
Types of Instructions

Types of instructions
arithmetic (integer) and logical operations
floating point (+, -, *, /)
notice that we are separating int and FP
operations; this is because they will be handled
in different hardware and use different
register sets even though their format will
largely be the same
convert operations will also go into these two
categories to convert from int to FP and FP to
int
data transfer (load, store)
control (branch, jump, proc call/return, trap)
OS related (OS system call, v.memory, others)
strings (move, compare, search)
graphics (pixel operation, compress/decompr)
we won't concern ourselves with these last three
categories in this course
See figure B.13 (page B-16) to see the most
common types of instructions as used in the 80x86
architecture

6
Branch Instructions

We focus on branch instructions because there are
some design issues to consider
these will usually be PC-relative branching
adding an offset to the PC
two advantages of PC-relative branching over
absolute branching
displacement is usually small allowing for
smaller instructions
displacement does not necessarily need to be
known at compile time allowing for easy use of
run-time loaded libraries
Types of branches
procedure calls and returns
the return address is usually stored on the
run-time stack or in a register
parameters may also be passed on the run-time stack
or through register windows
conditional branches
jumps (unconditional)
Conditional branches are the most common
see figure B.14 on page B-17 where 75-82% of all
branches are conditional

7
Branching Questions

What is the form of condition?
complex conditions can be time consuming
some architectures limit the condition to be a
simple equality/inequality test
often, just a test of a register value against 0
in other cases, the condition might involve
condition codes
When is the comparison performed?
with the branch statement or prior to the branch
statement
that is, is the instruction a compare and branch,
or are they separated?
What is the distance of the branch?
Figure B.15 shows that most branch distances are
less than 10 bits in length

8
Encoding an Instruction Set

Design issues
variable-length vs fixed-length (or hybrid) (see
Figure B.18 on page B-22)
deriving op codes for each instruction
although this is trivial, there are some concerns
should we have several ADD instructions, each
with a different type of operand expected or
specify operand type in the instruction (for
instance, AddInteger vs AddFP vs AddDouble, etc)
how are operand addresses specified?
will there be separate bytes of the instruction
to specify this or will it be part of the
operand?
how many operands can an instruction reference?
ideally, we want enough bits to specify 2 or 3
operands, by memory location, register,
displacement
Concerns
want as many registers as possible BUT the more
registers, the more bits needed in the
instruction to address between them
many addressing modes are seldom used
should they be omitted?
instruction sizes should be based on bytes (e.g.,
8, 16, 24, 32 bits)
as instruction sizes increase, so does the size of
the programs!

9
Compiler Optimizations

In order to support the increasingly complex
hardware, we need compiler support in the form of
machine code optimizations. Here are some
examples
High-level optimizations on source code
example: procedure in-lining, loop transformation
Local optimizations on single lines of code
example: change the order of references in a
block or expression
Global optimizations extend local across branches
example: loop unrolling
Register allocation to optimize the storage of
variables in registers and minimize memory
fetches
Machine-dependent optimizations
take advantage of the specific architecture
see Figure B.19, page B-25

10
Optimizations Continued

Two example optimizations
Sub-expression Elimination
take a sub-expression that is used more than
once, store the result as computed the first time
in a register and reuse it rather than recompute
the expression again, or store it in memory only
to be retrieved again
Graph coloring
an algorithmic technique to determine how values
can be distributed into registers (a heuristic or
approximate version is used since graph coloring
is NP-complete)
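
The heuristic flavor of graph coloring mentioned above can be sketched as a greedy pass over an interference graph. This is a minimal illustration, not the allocator of any particular compiler: the function name color_interference_graph, the highest-degree-first ordering, and the example graph are all assumptions made for the sketch. Variables that are live at the same time are neighbors and must receive different registers; a variable that cannot be colored is spilled to memory.

```python
def color_interference_graph(graph, num_registers):
    """Greedy (heuristic) coloring of an interference graph.

    graph maps each variable to the set of variables it interferes with.
    Returns (assignment, spilled): a register number per colored variable,
    and the list of variables that did not fit in num_registers.
    """
    assignment = {}
    spilled = []
    # Heuristic order: most-constrained (highest degree) variables first,
    # with the name as a tie-breaker so the result is deterministic.
    for var in sorted(graph, key=lambda v: (-len(graph[v]), v)):
        used = {assignment[n] for n in graph[var] if n in assignment}
        free = [r for r in range(num_registers) if r not in used]
        if free:
            assignment[var] = free[0]
        else:
            spilled.append(var)  # no register left: keep this value in memory
    return assignment, spilled


# Hypothetical interference graph: a, b, c are mutually live; d overlaps only c.
interference = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

regs, spills = color_interference_graph(interference, 3)
print(regs, spills)
```

With three registers every variable in this example gets a register; with two, one of the three mutually interfering variables has to be spilled.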

One problem with optimizations performed in
segments is phase-ordering
a transformation at one level may directly affect
possible optimizations at another level
for example, expanding a procedure at the
high-level without knowing the size of the
procedure
another example, register allocation is performed
near the end of the optimization techniques but
sub-expression elimination requires the
allocation of registers

11
Introduction to MIPS

MIPS is a RISC architecture derived from previous
RISC architectures
designed for pipeline efficiency and efficiency
as a compiler target
General-purpose register set and load-store
architecture
32 64-bit general purpose (integer) registers
labeled R0, ..., R31, where R0 is always 0
values are sign extended when loaded into a
register if they are not of the right size
32 64-bit floating point registers
labeled F0, ..., F31, where only half the register
is used for floats
Addressing modes
support displacement and immediate addressing
only
direct addressing can be accomplished by using R0
as the base register
register indirect can be accomplished by using a
displacement of 0
displacements of 12-16 bits and immediate data of
8-16 bits
memory is byte addressable and 64-bit addresses
are used

12
More on MIPS

Instructions
fewer than 100 operations (op code requires 7
bits, however we will reduce this to 6 bits by
using one op code for all integer ALU operations)
32-bit instructions
3 instruction formats used (shown on the next
slide)
I-type is used for loads, stores, conditional
branches and ALU operations that use immediate
data
R-type are used for all other ALU operations and
FP operations
J-type are only used for jump, jump and link
(procedure call), trap, return
Data types are 8, 16, 32 and 64 bit integer and
32 and 64 bit floating point
no character or string types (characters treated
as ints, strings as arrays of ints)
Immediate data and displacements are limited to
16 bits except for Jump instructions in which
case displacements are limited to 26 bits

13
MIPS continued
3 operand instructions are available as long as
all operands are in registers (R-type) or 2
registers and an immediate datum (I-type)
the immediate datum (which is also used for
displacement offsets) is limited to 16 bits (2's
complement) but extended to 32 bits
funct is the specific type of ALU or FP function
14
MIPS Architecture in 5 Stages
15
Step-by-Step Description

IF Stage
PC sent to Instruction memory (cache) to fetch
next instruction
PC incremented by 4
MUX is controlled by the previous instruction's
result (if it was a branch) and the PC is either
replaced by PC + 4 or the new branch location
instruction moved into the IR
ID Stage
Instruction examined as follows
bits 6..10 denote one source register (for I-type
and R-type)
bits 11..15 denote the other source register (for
R-type)
bits 16..31 store an immediate datum or
displacement; this value is sign extended to 32
bits

1st register value placed in A
2nd register value placed in B
Sign extended value placed in Imm
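
The field extraction done in the ID stage can be sketched as follows. This is a minimal model under stated assumptions: bits are numbered from the most significant end (bit 0 is the leftmost bit of the 32-bit instruction), the op code occupies bits 0..5, and the helper names field, sign_extend and decode_i_type are hypothetical. The example word is built by hand purely for illustration.

```python
def field(instr, first, last):
    """Extract bits first..last of a 32-bit word, bit 0 being most significant."""
    width = last - first + 1
    shift = 31 - last
    return (instr >> shift) & ((1 << width) - 1)


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a 2's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_i_type(instr):
    """Split an I-type instruction into op code, two registers and immediate."""
    return {
        "opcode": field(instr, 0, 5),    # assumed op code position
        "rs": field(instr, 6, 10),       # first source register -> A
        "rt": field(instr, 11, 15),      # second register -> B
        "imm": sign_extend(field(instr, 16, 31), 16),  # -> Imm
    }


# Hypothetical instruction word: op code 35, registers 2 and 1, immediate -4.
example = (35 << 26) | (2 << 21) | (1 << 16) | 0xFFFC
decoded = decode_i_type(example)
print(decoded)
```

Python integers are unbounded, so sign_extend returns the signed value directly rather than a 32-bit pattern.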
16
Architecture continued

EX Stage
if ALU (or FP) operation
the two register values are sent to the ALU and
the proper circuit is selected
the result is passed on to the next stage
if branch
compare register to 0
in ALU, use the adder to compute new PC (PC +
displacement)
if load or store
use the adder in the ALU to compute the datum's
address (base + displacement) by adding the given
register's value to the displacement value, send
this address to the next stage

17
Architecture continued

MEM stage
if load or store, perform the data access given
the memory location as computed in the EX stage
on a load, the new datum is stored temporarily in
LMD
if branch, finish the condition using the MUX by
determining whether to replace the PC with PC + 4
or the new address
WB stage
if load or ALU operation, result is in LMD (load)
or ALUOutput (ALU operation), use
the MUX to write the result to the proper
register in the register file (see the ID stage)

18
Comments on the MIPS Architecture

The simplified nature of MIPS means that many
tasks will require more than a single operation
(compared to more complex instruction sets that
could accomplish the task with 1 operation)
load registers before performing an ALU operation
on the values
two instructions to perform indirect memory
access
load the pointer from memory into a register,
then load the datum
two or more instructions to perform scaled or
indexed modes
The CPI of MIPS operations is lower than in other
instruction sets, which makes up for this
all operations have a CPI of 4 except Loads and
ALU operations which have a CPI of 5 (because
they must write their results to registers in the
WB stage)
The static size of all MIPS operations makes it
easier to deal with pre-fetching and pipelining

19
More Comments

The architecture requires the following hardware
elements to implement
the ALU should have all integer operations
(arithmetic, logic)
we address floating point operations later in the
semester
an additional adder for the IF stage
several temporary registers
IR, A, B, Imm, NPC, ALUOutput, LMD
multiplexors to select
what to do after a condition is evaluated
whether a computed value is to be used later in
temporary registers A or B
whether to use a register value or the immediate
datum
multiplexors in the ALU to select the output
based on the specific ALU operation (not shown in
the figure)
multiplexors in the register file to select which
register to send on to the A or B temporary
registers, and a demultiplexor to pass along the
LMD value into one of the registers (not shown in
the figure)

20
Sample Problem 1

For each of the following compiler optimizations,
explain whether the optimization will
provide a CPU performance increase because it
will reduce overall program CPI, reduce IC, or
reduce both
Code motion
Register optimization with graph coloring
Procedure integration
Global common subexpression elimination
Copy propagation
Solution
code motion: in removing redundant instructions
from a loop, we are lowering the IC of the program

21
Sample Problem 1 Solution continued

Register optimization with graph coloring
by better matching variables to registers, we can
reduce the number of loads and stores required,
so we are lowering IC
Procedure integration
we replace procedure calls/returns with the code
itself reducing IC
we may also reduce run-time stack communication
so that accessing parameters is actually
accessing values in registers, thus reducing IC
even more as well as the overall CPI since there
will now be fewer memory operations
Global common subexpression elimination
reusing a previously computed value removes later
computation, so this lowers IC
Copy propagation
if the copy propagation is the replacement of a
variable with an immediate datum (as in changing
x = y to x = 5), this will lower CPI because we
are replacing memory accesses with register
accesses
however, if the copy propagation is the
replacement of an expression's evaluation with an
immediate datum (as in changing x = y + z to x =
5), then we are lowering IC

22
Sample Problem 2

Using the MIPS instruction set, write a set of
code to compute the average of the elements in an
int array
assume the array starts at memory location 50000
assume the variable storing number of items of
the array is at memory location 10000
Store the resulting float value at location 10004

23
Sample Problem 3

Using MIPS, write a set of code that will find
the largest and smallest items in an array
Array starts at a location pointed to by register
R5
Array contains 500 elements
Store the min in R1 and the max in R2

24
Sample Problem 4

Consider adding a register-memory ALU operation
to MIPS and replace the two instruction sequence
LW R1, 0(R2)
ADD R3, R3, R1
with
ADD R3, 0(R2)
Assume this will cause an increase to the clock
cycle of 5% and using the gcc benchmark (see
figure B.27), what percentage of loads must be
eliminated for the new machine to achieve the
same performance?

CPI remains the same, so to achieve the same
performance, CPU time old must equal CPU time new,
or
IC_old * CPI_old * CCT_old = IC_new * CPI_old * (1.05 * CCT_old)
where CCT is the clock cycle time
Solving for IC_new gives IC_new = IC_old / 1.05 =
0.952 * IC_old, or we must
remove 4.8% of the instructions
We must remove enough Loads to reduce the overall
IC by 4.8%
Since Loads make up 25.1% of gcc's instruction
mix, we must remove 4.8% / 25.1% = 19.1% of the
Loads
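
The arithmetic for this problem can be checked with a short script. The variable names are hypothetical; the inputs (a 5% longer clock cycle, loads at 25.1% of the gcc mix) come from the problem statement. Without intermediate rounding the result is about 19.0% of the loads; the 19.1% figure appears when the IC reduction is first rounded to 4.8%.

```python
clock_increase = 0.05   # clock cycle time grows by 5%
load_fraction = 0.251   # loads as a fraction of gcc's instruction mix

# Equal CPU time with unchanged CPI: IC_new * 1.05 = IC_old
ic_ratio = 1 / (1 + clock_increase)        # IC_new / IC_old
ic_reduction = 1 - ic_ratio                # fraction of instructions to remove
loads_removed = ic_reduction / load_fraction  # fraction of the loads to remove

print(f"IC must shrink by {ic_reduction:.1%}")
print(f"fraction of loads to eliminate: {loads_removed:.1%}")
```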

25
Sample Problem 5

Compute MIPS effective CPI for the gap and gcc
benchmarks (average their instruction mix)
assume 60% of conditional branches are taken and
all miscellaneous instructions are ALU
instructions
use the following CPI
ALU instructions 1.0
Load-stores 1.4
Conditional Branches taken 2.0
Conditional Branches not taken 1.5
Jumps 1.2

Solution
Average loads = (26.5% + 25.1%) / 2 = 25.8%
Average stores = (10.3% + 13.2%) / 2 = 11.75%
Average conditional branches = (9.3% + 12.1%) / 2 = 10.7%
Average jumps (includes returns and calls) = 2.95%
Average ALU = (50.1% + 47.2%) / 2 = 48.65%
CPI = (25.8% + 11.75%) * 1.4 + 10.7% * 60% * 2.0
+ 10.7% * 40% * 1.5 + 2.95% * 1.2 + 48.65% * 1.0
= 1.2402
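
The effective-CPI calculation for this problem can be reproduced with a short script. The dictionary layout and names are hypothetical; the percentages and per-class CPI values are the ones given in this problem, and the jump frequency is used as the already-averaged 2.95%.

```python
# Instruction-mix percentages for the two benchmarks, as given in the solution.
pairs = {
    "load": (26.5, 25.1),
    "store": (10.3, 13.2),
    "cond_branch": (9.3, 12.1),
    "alu": (50.1, 47.2),
}

# Average the two benchmarks and convert percentages to fractions.
avg = {name: sum(values) / 2 / 100 for name, values in pairs.items()}
avg["jump"] = 2.95 / 100  # given directly as an average

taken = 0.60  # fraction of conditional branches that are taken

cpi = (
    (avg["load"] + avg["store"]) * 1.4
    + avg["cond_branch"] * (taken * 2.0 + (1 - taken) * 1.5)
    + avg["jump"] * 1.2
    + avg["alu"] * 1.0
)
print(round(cpi, 4))
```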

26
Sample Problem 6

Consider the following two changes to the MIPS
architecture
1. Move the MUX in the MEM stage into the EX
stage to complete branches there, reducing branch
CPI to 3
2. For ALU operations, perform the write back to
register in stage 4 instead of stage 5 by adding
another MUX
Assuming that these changes require an increase
in clock cycle time from 1 ns to 1.1 ns, is this
worth doing? Use the average integer benchmark
values in figure B.27
Solution
Recall that CPI was 5 for loads and ALU and 4 for
stores and branches; now CPI is 4 for ALU and 3
for branches
CPU Time old = (5 * (.26 + .47) + 4 * (.10 +
.16)) * CCT old = 4.69 * CCT old
CPU Time new = (5 * .26 + 4 * (.47 + .10) + 3 *
.16) * CCT old * 1.1 = 4.466 * CCT old
So the changes provide a 4.69 / 4.466 = 1.05
speedup or a 5% speedup
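
The comparison for this problem can be checked with a small model of CPU time per instruction. The function and dictionary names are hypothetical; the instruction mix (.26 loads, .47 ALU, .10 stores, .16 branches) and the per-class cycle counts are the ones used in the solution.

```python
mix = {"load": 0.26, "alu": 0.47, "store": 0.10, "branch": 0.16}

old_cpi = {"load": 5, "alu": 5, "store": 4, "branch": 4}
new_cpi = {"load": 5, "alu": 4, "store": 4, "branch": 3}


def cpu_time_per_instruction(mix, cpi, cycle_time_ns):
    """Weighted cycles per instruction multiplied by the clock cycle time."""
    return sum(mix[kind] * cpi[kind] for kind in mix) * cycle_time_ns


old_time = cpu_time_per_instruction(mix, old_cpi, 1.0)  # 1 ns clock
new_time = cpu_time_per_instruction(mix, new_cpi, 1.1)  # 1.1 ns clock
speedup = old_time / new_time

print(round(old_time, 3), round(new_time, 3), round(speedup, 3))
```

The mix used in the solution sums to 0.99 rather than 1, so both figures carry the same small rounding gap and the ratio is unaffected.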

27
Sample Problem 7

Architects are considering whether to add an
autoincrement/decrement addressing mode to MIPS
the consequences of such a change are that most
programs would require fewer instructions (IC
would be lowered) because the new mode would
combine the memory access and the change to the
offset in one operation
but would require a longer EX stage and thus a
longer clock cycle time
How should the architects decide whether to
include this mode?
Consider figure B.7 on page B-11 which shows that
benchmarks use displacement often for data
access (32-55%)
each displacement access will probably be
followed by altering the offset (an add or
subtract)
assume the fraction of loads/stores is 40% and
that 40% of these use displacement, and that 80%
of displacement addressing operations require a
change to the offset

28
Solution

Without autoincrement/decrement, 80% of the
displacement loads and stores are followed by an
ALU operation
there are 40% loads and stores
of the loads and stores, 40% use displacement
of these, 80% increment or decrement the
displacement immediately after the load or store,
so we have
0.4 * 0.4 * 0.8 = 0.128 (12.8%) of the operations
are followed by an add or subtract
these ALU operations can now be removed, so IC
decreases by 12.8%
Recall CPU Time = IC * CPI * Clock Cycle Time
In order for the new mode to be worthwhile, the
decrease in IC must outweigh the increase in
Clock Cycle Time (assuming CPI is unchanged)
if so, then CPU Time decreases, otherwise CPU
Time increases and the new addressing mode is not
worthwhile
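
The break-even point for this decision can be sketched numerically. This is a rough model that assumes CPI is unchanged by the new mode; the names ic_saved, break_even and worthwhile are hypothetical, and the 40% / 40% / 80% figures are the ones assumed in the problem.

```python
load_store_fraction = 0.40    # loads and stores in the instruction mix
displacement_fraction = 0.40  # of those, the share using displacement
update_fraction = 0.80        # of those, the share followed by an add/subtract

# Fraction of all instructions that the new mode removes.
ic_saved = load_store_fraction * displacement_fraction * update_fraction

# CPU Time = IC * CPI * Clock Cycle Time, with CPI assumed unchanged:
# the largest clock cycle increase that still leaves CPU time no worse.
break_even = 1 / (1 - ic_saved) - 1


def worthwhile(cycle_time_increase):
    """True if the mode lowers CPU time for the given clock cycle increase."""
    return (1 - ic_saved) * (1 + cycle_time_increase) < 1


print(f"IC saved: {ic_saved:.1%}, break-even clock increase: {break_even:.1%}")
print(worthwhile(0.10), worthwhile(0.20))
```

Under these assumptions the mode pays off only if the clock cycle grows by less than roughly 14.7%.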