Lectures for 2nd Edition

Slides: 78

Provided by: TodA1

Transcript and Presenter's Notes

1
Chapter 2
2
Why study instruction sets?

Interface of hardware and software
Efficient mapping
Software in high-level language → software in
assembly language (instruction set) (Chapter 2)
Impacts SW cost/performance
Instruction set → hardware implementation
(Chapter 4)
Impacts HW cost/performance

3
Electronic System Design Laboratory

GOAL
Train students who are able to master
hardware/software co-design, co-simulation, and
co-verification.

4
What is Computer Architecture?
Levels of abstraction (diagram, top to bottom):
Application
Operating System
Compiler / Firmware
Instruction Set Architecture
Instr. Set Proc. / I/O system
Datapath & Control
Digital Design
Circuit Design
Layout

Coordination of many levels of abstraction
Under a rapidly changing set of forces
Design, Measurement, and Evaluation

5
Instructions

Language of the Machine
We'll be working with the MIPS instruction set
architecture
similar to other architectures developed since
the 1980s
Almost 100 million MIPS processors manufactured
in 2002
used by NEC, Nintendo, Cisco, Silicon Graphics,
Sony, ...

6
MIPS arithmetic

All instructions have 3 operands
Operand order is fixed (destination first)
Example:
  C code:    a = b + c
  MIPS code: add a, b, c
(we'll talk about registers in a bit)
The natural number of operands for an operation like addition is
three. Requiring every instruction to have exactly
three operands, no more and no less, conforms to
the philosophy of keeping the hardware simple.

7
MIPS arithmetic

Design Principle: simplicity favors regularity.
Of course this complicates some things...
  C code:    a = b + c + d
  MIPS code: add a, b, c
             add a, a, d
Operands must be registers, only 32 registers
provided
Each register contains 32 bits
Design Principle: smaller is faster. Why?

8
Registers vs. Memory
Registers

Operands of arithmetic instructions must be
registers; only 32 registers are provided
Compiler associates variables with registers
What about programs with lots of variables?

9
Memory Organization

Viewed as a large, single-dimension array, with
an address.
A memory address is an index into the array
"Byte addressing" means that the index points to
a byte of memory.

Address  Contents
0        8 bits of data
1        8 bits of data
2        8 bits of data
3        8 bits of data
4        8 bits of data
5        8 bits of data
6        8 bits of data
...
10
Memory Organization

Bytes are nice, but most data items use larger
"words"
For MIPS, a word is 32 bits or 4 bytes.
2^32 bytes with byte addresses from 0 to 2^32-1
2^30 words with byte addresses 0, 4, 8, ... 2^32-4
Words are aligned, i.e., what are the least 2
significant bits of a word address?

Address  Contents
0        32 bits of data
4        32 bits of data
8        32 bits of data
12       32 bits of data
...
Registers hold 32 bits of data
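
The alignment question above can be checked numerically. This is a minimal Python sketch, assuming 4-byte words as stated on the slide; the helper names `is_word_aligned` and `word_index` are illustrative and not part of MIPS.

```python
WORD_BYTES = 4  # a MIPS word is 32 bits = 4 bytes (from the slide)


def is_word_aligned(byte_address: int) -> bool:
    """A word address is aligned when its two least significant bits are 0."""
    return byte_address & 0b11 == 0


def word_index(byte_address: int) -> int:
    """Map an aligned byte address to its word number (0, 1, 2, ...)."""
    if not is_word_aligned(byte_address):
        raise ValueError(f"address {byte_address} is not word aligned")
    return byte_address >> 2  # shifting right by 2 divides by 4


if __name__ == "__main__":
    # Print each address, its low four bits, and whether it is aligned.
    for addr in (0, 4, 8, 12, 13):
        print(addr, format(addr, "04b"), is_word_aligned(addr))
```

Every multiple of 4 ends in the bits 00, which is why the last word starts at byte address 2^32-4 and there are 2^30 words in total.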
11
MIPS arithmetic (with registers)

All instructions have 3 operands
Operand order is fixed (destination first)
Example:
  C code:    A = B + C
  MIPS code: add $s0, $s1, $s2
(registers are associated with variables by the compiler)

12
MIPS arithmetic (with registers)

Design Principle: simplicity favors regularity.
Why?
Of course this complicates some things...
  C code:    A = B + C + D
             E = F - A
  MIPS code: add $t0, $s1, $s2
             add $s0, $t0, $s3
             sub $s4, $s5, $s0
Which variables go to which registers?
Operands must be registers, only 32 registers
provided
Design Principle: smaller is faster. Why?
Note:
Additional register usage: $t0 (allocated by the
compiler)

13
Operand in Memory

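Base address and offset
  C code:    g = h + A[8]
  MIPS code: lw  $t0, 8($s3)     # assume $s3 holds the start address of array A; 8 is the offset
             add $s1, $s2, $t0
(With byte addressing the offset for A[8] is really 32, as the next slide shows.)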

14
Instructions

Load and store instructions
Example:
  C code:    A[8] = h + A[8]
  MIPS code: lw  $t0, 32($s3)    # $s3 = A, 32 = 8 × 4
             add $t0, $s2, $t0
             sw  $t0, 32($s3)
Store word has destination last
Remember:
Operands of arithmetic/logic instructions are
registers, not memory!
Load/store instructions have one memory operand.
Note:
Temporary register: $t0
Array name: a register, $s3
Displacement: 32, not 8!
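
The lw / add / sw sequence for A[8] = h + A[8] can be traced with a small model of byte-addressed memory. This is a sketch under stated assumptions: the 64-byte `memory`, the helper names, and the big-endian byte order are illustrative choices, not something the slides specify.

```python
import struct

memory = bytearray(64)  # hypothetical tiny byte-addressed memory, all zeros


def load_word(base: int, offset: int) -> int:
    """lw: read the signed 32-bit word at byte address base + offset."""
    return struct.unpack_from(">i", memory, base + offset)[0]


def store_word(value: int, base: int, offset: int) -> None:
    """sw: write a signed 32-bit word at byte address base + offset."""
    struct.pack_into(">i", memory, base + offset, value)


def add_h_to_a8(base_of_a: int, h: int) -> None:
    """A[8] = h + A[8], one Python line per MIPS instruction."""
    offset = 8 * 4                       # index 8 times 4 bytes per word = 32
    t0 = load_word(base_of_a, offset)    # lw  $t0, 32($s3)
    t0 = h + t0                          # add $t0, $s2, $t0
    store_word(t0, base_of_a, offset)    # sw  $t0, 32($s3)


if __name__ == "__main__":
    store_word(10, 0, 32)     # pretend A starts at address 0 and A[8] is 10
    add_h_to_a8(0, 5)
    print(load_word(0, 32))   # A[8] after the update
```

Using offset 8 instead of 32 would touch A[2], which is the mistake the "32, not 8" note warns about.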

15
Our First Example

Can we figure out the code?

swap(int v[], int k)
{ int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}

swap: muli $2, $5, 4
      add  $2, $4, $2      # $2 = addr. of v[k]
      lw   $15, 0($2)
      lw   $16, 4($2)
      sw   $16, 0($2)
      sw   $15, 4($2)
      jr   $31             # return addr. is saved in $31
16
So far we've learned

MIPS: loading words but addressing bytes;
arithmetic on registers only
Instruction            Meaning (Register Transfer Language, RTL)
add $s1, $s2, $s3      $s1 = $s2 + $s3
sub $s1, $s2, $s3      $s1 = $s2 - $s3
lw  $s1, 100($s2)      $s1 = Memory[$s2+100]
sw  $s1, 100($s2)      Memory[$s2+100] = $s1

17
Machine Language add/sub (arithmetic)

Instructions, like registers and words of data,
are also 32 bits long
Example: add $t0, $s1, $s2
registers have numbers: $t0=8, $s1=17, $s2=18
Instruction Format:
  000000  10001  10010  01000  00000  100000
  op      rs     rt     rd     shamt  funct
Can you guess what the field names stand for?
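
The six fields can be packed into one 32-bit word with shifts. This is a minimal Python sketch; `encode_r_type` is a hypothetical helper name, and the field widths (6, 5, 5, 5, 5, 6 bits) are read off the bit groups shown on the slide.

```python
def encode_r_type(op: int, rs: int, rt: int, rd: int, shamt: int, funct: int) -> int:
    """Pack the R-type fields into a 32-bit instruction word."""
    fields = (("op", op, 6), ("rs", rs, 5), ("rt", rt, 5),
              ("rd", rd, 5), ("shamt", shamt, 5), ("funct", funct, 6))
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


if __name__ == "__main__":
    # add $t0, $s1, $s2 with $t0=8, $s1=17, $s2=18, op=0, funct=32
    word = encode_r_type(0, 17, 18, 8, 0, 32)
    print(format(word, "032b"))
    print(hex(word))
```

Note that the destination register $t0 comes first in assembly but sits in the rd field, third among the register fields of the machine word.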

18
Machine Language load/store

Now include the load/store instructions into the
same instruction format (regularity principle)
Example: lw $s1, 32($s2)
registers have numbers: $s1=17, $s2=18
Using the same instruction format as arithmetic
operations:
  100011  10010  xxxxx  10001  00000   100000
  op      rs     rt     rd     shamtH  shamtL
(the offset 32 has to fit in the 11 bits after rd, labeled shamtH and shamtL here)
Can you see any problem?

19
Machine Language load/store instructions

Consider the load-word and store-word
instructions:
What would the regularity principle have us do?
New principle: Good design demands a compromise
Introduce a new type of instruction format
  I-type for data transfer instructions
  the other format was R-type, for register instructions
Example: lw $t0, 32($s2)
  35   18   8    32
  op   rs   rt   16 bit number
Where's the compromise?
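
The I-type layout can be sketched the same way as the R-type one. This is a minimal Python sketch, assuming the standard 6/5/5/16-bit field widths and register number 8 for $t0 (as listed on the add/sub slide); `encode_i_type` is an illustrative name.

```python
def encode_i_type(op: int, rs: int, rt: int, immediate: int) -> int:
    """Pack op, rs, rt and a signed 16-bit immediate into a 32-bit word."""
    for name, value, width in (("op", op, 6), ("rs", rs, 5), ("rt", rt, 5)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
    if not -(1 << 15) <= immediate < (1 << 15):
        raise ValueError(f"immediate {immediate} does not fit in 16 signed bits")
    # Masking with 0xFFFF stores a negative immediate in two's complement.
    return (op << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)


if __name__ == "__main__":
    # lw $t0, 32($s2): op=35, rs=18 ($s2), rt=8 ($t0), offset 32
    print(hex(encode_i_type(35, 18, 8, 32)))
```

The compromise is visible in the range check: the offset gets 16 bits instead of 32, in exchange for keeping every instruction the same length and keeping the op, rs and rt fields where they are in the R-type format.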

20
Machine Language

PROBLEM: How to access an array element with
displacement > 2^16?
Example: X = A[100000].
$t1 is a temporary 32-bit register.
m1024 is the memory location which holds the
large displacement value.
Its address is given by 0($s2).
$t3 is a register containing the
base address of array A.
$t4 is a temporary 32-bit
register.

lw  $t1, 0($s2)      # load the large displacement from m1024 into $t1
add $t4, $t3, $t1    # compute the element address: base + displacement
lw  $t5, 0($t4)      # load A[100000] into $t5

[Figure: m1024 is loaded into $t1; $t3 plus $t1 (displacement > 2^16) gives the address of A[100000], whose value is loaded into $t5]
21
Stored Program Concept

Instructions are bits
Programs are stored in memory to be read or
written just like data
Fetch and Execute Cycle
Instructions are fetched and put into a special
register (the instruction register)
Bits in the register "control" the subsequent
actions
Fetch the next instruction and continue

memory for data, programs, compilers, editors,
etc.
22
Control

Decision making instructions
alter the control flow,
i.e., change the "next" instruction to be
executed
Sequential execution is implied by default!
MIPS conditional branch instructions:
  bne $t0, $t1, Label
  beq $t0, $t1, Label
Example: if (i==j) h = i + j;
         bne $s0, $s1, Label
         add $s3, $s0, $s1
  Label: ....

23
Control

MIPS unconditional branch instructions: j label
Example:
  if (i==j)               bne $s4, $s5, Lab1
      h = i + j;          add $s3, $s4, $s5
  else                    j   Lab2
      h = i - j;    Lab1: sub $s3, $s4, $s5
                    Lab2: ...
Can you build a simple for loop?

24
Control (II)

MIPS unconditional branch instructions: j label
Example:
  if (i!=j)               beq $s4, $s5, Lab1
      h = i - j;          sub $s3, $s4, $s5
  else                    j   Lab2
      h = i + j;    Lab1: add $s3, $s4, $s5
                    Lab2: ...
Can you build a simple for loop?
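
One way to answer the for-loop question is to run a loop built from beq and j through a toy fetch-and-execute cycle. This is a sketch under loud assumptions: instructions are Python tuples rather than 32-bit words, branch targets are instruction indices rather than addresses, and `run` and `sum_to` are hypothetical names. Only add, addi, beq, bne and j are modelled.

```python
def run(program, registers, max_steps=10_000):
    """Fetch and execute instructions until the PC leaves the program."""
    pc = 0                                    # index of the next instruction
    for _ in range(max_steps):
        if not 0 <= pc < len(program):
            return registers                  # ran off the end: stop
        op, *args = program[pc]               # fetch into the "instruction register"
        pc += 1                               # sequential execution by default
        if op == "add":
            rd, rs, rt = args
            registers[rd] = registers[rs] + registers[rt]
        elif op == "addi":
            rt, rs, imm = args
            registers[rt] = registers[rs] + imm
        elif op == "beq":
            rs, rt, target = args
            if registers[rs] == registers[rt]:
                pc = target                   # branch taken: change the "next" instruction
        elif op == "bne":
            rs, rt, target = args
            if registers[rs] != registers[rt]:
                pc = target
        elif op == "j":
            (target,) = args
            pc = target                       # unconditional jump
        else:
            raise ValueError(f"unknown instruction {op!r}")
    raise RuntimeError("step limit reached")


def sum_to(n):
    """for (i = n; i != 0; i -= 1) sum += i;  assumes n >= 0."""
    program = [
        ("addi", "s0", "zero", 0),    # 0: sum = 0
        ("addi", "t0", "zero", n),    # 1: i = n
        ("beq", "t0", "zero", 6),     # 2: Loop: exit when i == 0
        ("add", "s0", "s0", "t0"),    # 3: sum += i
        ("addi", "t0", "t0", -1),     # 4: i -= 1
        ("j", 2),                     # 5: back to the loop test
    ]
    return run(program, {"zero": 0, "s0": 0, "t0": 0})["s0"]


if __name__ == "__main__":
    print(sum_to(5))
```

The loop needs one conditional branch for the exit test and one unconditional jump to close the loop, the same pattern the if/else examples use.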

25
Is one enough?

MIPS conditional branch instructions:
  bne $t0, $t1, Label    # branch if not equal
  beq $t0, $t1, Label    # branch if equal
Since bne and beq are complements, can we
implement only one of them in software and
hardware?

26
So far

Instruction            Meaning (Register Transfer Language, RTL)
add $s1,$s2,$s3        $s1 = $s2 + $s3
sub $s1,$s2,$s3        $s1 = $s2 - $s3
lw  $s1,100($s2)       $s1 = Memory[$s2+100]
sw  $s1,100($s2)       Memory[$s2+100] = $s1
bne $s4,$s5,L          Next instr. is at Label if $s4 != $s5
beq $s4,$s5,L          Next instr. is at Label if $s4 == $s5
j   Label              Next instr. is at Label
Formats: R, I, J
27
Control Flow

We have beq, bne; what about branch-if-less-than?
New instruction:
  slt $t0, $s1, $s2      # if $s1 < $s2 then $t0 = 1 else $t0 = 0
Can use this instruction to build "blt $s1, $s2,
Label"; can now build general control
structures
Note that the assembler needs two registers to do
this; there are policy-of-use conventions for
registers
Pseudo instruction: "blt $s1, $s2, Label"
Mapped to:
  slt $t0, $s1, $s2
  beq $t0, $t1, Label    # with $t1 = 1
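
The mapping of the blt pseudo instruction onto slt plus a branch can be written out as a small Python sketch. The function names are illustrative; the second register holding 1 follows the slide. A common alternative, not shown on the slide, is to branch with bne against $zero, which needs only one scratch register.

```python
def slt(rs_value: int, rt_value: int) -> int:
    """slt: result is 1 if rs < rt, otherwise 0."""
    return 1 if rs_value < rt_value else 0


def blt_taken(s1: int, s2: int) -> bool:
    """Decide whether 'blt $s1, $s2, Label' branches, using slt and beq."""
    t0 = slt(s1, s2)    # slt $t0, $s1, $s2
    t1 = 1              # the slide assumes $t1 already holds 1
    return t0 == t1     # beq $t0, $t1, Label


if __name__ == "__main__":
    for a, b in ((1, 2), (2, 2), (3, 2)):
        print(a, b, blt_taken(a, b))
```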

28
Compiling Loops in C

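Use shift left logical (sll) to multiply by 4
  C code:    while (save[i] == k) i += 1;
  MIPS code:
    Loop: sll  $t1, $s3, 2       # $t1 = i * 4
          add  $t1, $t1, $s6     # $t1 = address of save[i]
          lw   $t0, 0($t1)       # $t0 = save[i]
          bne  $t0, $s5, Exit    # leave the loop if save[i] != k
          addi $s3, $s3, 1       # i = i + 1
          j    Loop
    Exit: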
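
The loop can be mirrored in Python one instruction at a time to see why the shift by 2 is needed. This is a sketch with assumptions: `save` is modelled as a dictionary keyed by byte address, its base address (the role of $s6) is taken to be 0 by default, and `find_exit_index` is a hypothetical name. As in the C code, the caller must guarantee that some element differs from k.

```python
def find_exit_index(save, i, k, base=0):
    """Mirror of the MIPS loop for: while (save[i] == k) i += 1."""
    # Word n of the array lives at byte address base + 4*n.
    memory = {base + 4 * n: value for n, value in enumerate(save)}
    while True:
        t1 = i << 2            # sll  $t1, $s3, 2    (i * 4)
        t1 = t1 + base         # add  $t1, $t1, $s6  (address of save[i])
        t0 = memory[t1]        # lw   $t0, 0($t1)
        if t0 != k:            # bne  $t0, $s5, Exit
            break
        i = i + 1              # addi $s3, $s3, 1
    return i                   # value of i at Exit


if __name__ == "__main__":
    print(find_exit_index([7, 7, 7, 3], 0, 7))
```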

29
Procedure Call basic concept

Caller
The program that instigates a procedure and
provides the necessary parameter values.
Callee
A procedure that executes a series of stored
instructions based on parameters provided by the
caller and then returns control to the caller.
Return Address
A link to the calling site that allows a
procedure to return to the proper address; in
MIPS it is stored in register $ra
Stack
A data structure for spilling registers, organized
as a last-in-first-out queue.
Stack Pointer
A value denoting the most recently allocated
address in a stack, showing where registers
should be spilled or where old register values
can be found.

30
Allocate New Data on the Stack

Frame pointer ($fp)
Complements $sp by holding the first address of the
callee procedure's frame (a stable base register within a
procedure for local memory references; $sp might
change during the procedure).

31
Saving registers

Both leaf and non-leaf procedures need to save
Saved registers
Non-leaf procedures need to save additionally
Argument registers
Temporary registers
Return register

32
C Pure Procedure

Stack pointer ($sp) and return address ($ra)
C code:
  int leaf_example(int g, int h, int i, int j)
  { int f;
    f = (g + h) - (i + j);
    return f;
  }
MIPS code:
  leaf_example:
    addi $sp, $sp, -12      # back up the values
    sw   $t1, 8($sp)        # of registers which
    sw   $t0, 4($sp)        # will be used in this
    sw   $s0, 0($sp)        # procedure
    add  $t0, $a0, $a1
    add  $t1, $a2, $a3
    sub  $s0, $t0, $t1
    add  $v0, $s0, $zero
    lw   $s0, 0($sp)        # restore the values
    lw   $t0, 4($sp)        # saved on the stack
    lw   $t1, 8($sp)        # previously
    addi $sp, $sp, 12
    jr   $ra                # return address

33
Recursive Procedure

Stack pointer ($sp) and return address ($ra)
C code:
  int fact(int n)
  { if (n < 1) return (1);
    else return (n * fact(n-1));
  }
MIPS code:
  fact:
    addi $sp, $sp, -8
    sw   $ra, 4($sp)        # back up the return address
    sw   $a0, 0($sp)        # back up argument n
    slti $t0, $a0, 1
    beq  $t0, $zero, L1
    addi $v0, $zero, 1
    addi $sp, $sp, 8
    jr   $ra
  L1:
    addi $a0, $a0, -1
    jal  fact
    lw   $a0, 0($sp)        # restore argument n
    lw   $ra, 4($sp)        # restore the return address
    addi $sp, $sp, 8
    mul  $v0, $a0, $v0
    jr   $ra                # return to the caller
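
The stack growth of the recursive procedure can be followed with a short Python sketch. It assumes an 8-byte frame per call (the `addi sp, sp, -8` in the code above) and a starting stack pointer of 7fff fffc hex, the value shown in the memory layout figure of this deck; the `trace` parameter is an illustrative addition. Python's own call frames stand in for the saved $ra and $a0.

```python
FRAME_BYTES = 8  # each call reserves room for $ra and $a0


def fact(n, sp=0x7FFFFFFC, trace=None):
    """Recursive factorial that records the stack pointer of every call."""
    sp -= FRAME_BYTES                 # addi $sp, $sp, -8
    if trace is not None:
        trace.append(sp)              # remember how deep the stack went
    if n < 1:                         # slti / beq test
        return 1                      # base case: return 1
    # n survives the recursive call because it was saved in this frame.
    return n * fact(n - 1, sp, trace)


if __name__ == "__main__":
    depths = []
    print(fact(5, trace=depths))
    print([hex(sp) for sp in depths])
```

Each level of recursion lowers the stack pointer by another 8 bytes, so fact(n) for n >= 0 needs 8 * (n + 1) bytes of stack at its deepest point.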

34
(No Transcript)
35
Policy of Use Conventions
Register 1 ($at) is reserved for the assembler, registers 26-27
($k0, $k1) for the operating system
36
Allocate New Data on the Heap

Heap vs. stack
The heap is used to hold static variables and dynamic
data structures

Memory layout (top of memory first, addresses in hex):
Region          Address        Register
Stack           7fff fffc      $sp
Dynamic data
Static data     1000 0000      $gp = 1000 8000
Text            0040 0000      pc
Reserved        0
37
String Copy Procedure

Stack pointer ($sp) and return address ($ra)
C code:
  void strcpy(char x[], char y[])
  { int i;
    i = 0;
    while ((x[i] = y[i]) != '\0')
      i += 1;
  }
MIPS code:
  strcpy:
    addi $sp, $sp, -4
    sw   $s0, 0($sp)         # back up $s0
    add  $s0, $zero, $zero   # initialize i to 0
  L1:
    add  $t1, $s0, $a1
    lb   $t2, 0($t1)
    add  $t3, $s0, $a0
    sb   $t2, 0($t3)
    beq  $t2, $zero, L2
    addi $s0, $s0, 1
    j    L1
  L2:
    lw   $s0, 0($sp)         # end of string
    addi $sp, $sp, 4
    jr   $ra                 # return to the caller

38
Constants

Small constants are used quite frequently (50% of
operands)
  e.g., A = A + 5;  B = B + 1;  C = C - 18;
Solutions? Why not?
  put 'typical constants' in memory and load them.
  create hard-wired registers (like $zero) for
  constants like one.
  take the constant from an instruction field
MIPS Instructions:
  addi $29, $29, 4
  slti $8, $18, 10
  andi $29, $29, 6
  ori  $29, $29, 4
Design Principle: Make the common case fast.
Which format?

39
How about larger constants?

We'd like to be able to load a 32 bit constant
into a register
Must use two instructions; new "load upper
immediate" instruction:
  lui $t0, 1010101010101010
  $t0 = 1010101010101010 0000000000000000   (lower 16 bits filled with zeros)
Then must get the lower order bits right,
i.e., ori $t0, $t0, 1010101010101010

       1010101010101010 0000000000000000
  ori  0000000000000000 1010101010101010
       ---------------------------------
       1010101010101010 1010101010101010
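
The two-instruction sequence amounts to a shift and a bitwise OR. This is a minimal Python sketch; `lui`, `ori` and `load_constant` are modelled as plain functions on integers, which is an illustration and not how the hardware is built.

```python
def lui(immediate16: int) -> int:
    """Load upper immediate: place 16 bits in the upper half, zeros below."""
    return (immediate16 & 0xFFFF) << 16


def ori(value: int, immediate16: int) -> int:
    """OR immediate: the 16-bit immediate is zero-extended before the OR."""
    return value | (immediate16 & 0xFFFF)


def load_constant(constant32: int) -> int:
    """Build any 32-bit constant from its two 16-bit halves."""
    upper = (constant32 >> 16) & 0xFFFF
    lower = constant32 & 0xFFFF
    t0 = lui(upper)          # lui $t0, upper
    return ori(t0, lower)    # ori $t0, $t0, lower


if __name__ == "__main__":
    pattern = 0b1010101010101010
    print(format(ori(lui(pattern), pattern), "032b"))
```

ori works for the lower half because it zero-extends its immediate; a sign-extending addi would corrupt the upper half whenever bit 15 of the lower half is set.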
40
Assembly Language vs. Machine Language

Assembly provides convenient symbolic
representation
much easier than writing down numbers
e.g., destination first
Machine language is the underlying reality
e.g., destination is no longer first
Assembly can provide 'pseudoinstructions'
e.g., move $t0, $t1 exists only in Assembly
would be implemented using add $t0, $t1, $zero
When considering performance you should count
real instructions

41
Other Issues

Things we are not going to cover in lecture:
  support for procedures
  linkers, loaders, memory layout
  stacks, frames, recursion
  manipulating strings and pointers
  interrupts and exceptions
  system calls and conventions
Some of these we'll talk about later
We've focused on architectural issues
  basics of MIPS assembly language and machine code
  we'll build a processor to execute these
  instructions.

42
Overview of MIPS

simple instructions all 32 bits wide
very structured, no unnecessary baggage
only three instruction formats
rely on compiler to achieve performance: what
are the compiler's goals?
help compiler where we can

R   op | rs | rt | rd | shamt | funct
I   op | rs | rt | 16 bit address
J   op | 26 bit address
43
Addresses in Branches

Instructions:
  bne $t4,$t5,Label    Next instruction is at Label if $t4 != $t5
  beq $t4,$t5,Label    Next instruction is at Label if $t4 == $t5
Formats:
  I   op | rs | rt | 16 bit address
Could specify a register (like lw and sw) and add
it to address
  use Instruction Address Register (PC = program
  counter)
  most branches are local (principle of locality)
Jump instructions just use high order bits of PC
  address boundaries of 256 MB
44
Addresses in Branches and Jumps

Instructions:
  bne $t4,$t5,Label    Next instruction is at Label if $t4 != $t5
  beq $t4,$t5,Label    Next instruction is at Label if $t4 == $t5
  j Label              Next instruction is at Label
Formats:
  I   op | rs | rt | 16 bit address
  J   op | 26 bit address
Addresses are not 32 bits. How do we handle
this with load and store instructions?
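
How a 16-bit or 26-bit field becomes a 32-bit target address can be sketched in Python. The slides only say that branches use the PC and that jumps keep the high-order bits of the PC; the exact rules used below (offset counted in words from PC + 4, and the 26-bit field shifted left by 2 under the top 4 bits of PC + 4) are the standard MIPS definitions and are stated here as background, not taken from the slides. The function names are illustrative.

```python
def branch_target(pc: int, offset16: int) -> int:
    """PC-relative addressing for beq/bne: (PC + 4) + sign-extended offset * 4."""
    offset16 &= 0xFFFF
    if offset16 & 0x8000:             # sign-extend the 16-bit field
        offset16 -= 0x10000
    return (pc + 4 + (offset16 << 2)) & 0xFFFFFFFF


def jump_target(pc: int, address26: int) -> int:
    """Pseudo-direct addressing for j: top 4 bits of PC + 4, then field * 4."""
    return ((pc + 4) & 0xF0000000) | ((address26 & 0x03FFFFFF) << 2)


if __name__ == "__main__":
    print(hex(branch_target(0x00400000, 3)))
    print(hex(jump_target(0x00400000, 0x00100000)))
```

The 26-bit field plus the two implied zero bits gives 28 bits of address, which is where the 256 MB boundary for jumps comes from.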
45
To summarize
46
displacement
47
Program Translation

Translation hierarchy (Unix file names, Windows file
names)

C program (.c, .C)
  → Compiler
Assembly language program (.s, .ASM)
  → Assembler
Object: machine language module (.o, .OBJ)
Object: library routine (machine code) (library .a, .LIB; dynamically linked library .so, .DLL)
  → Linker
Executable: machine language program (a.out, .EXE)
  → Loader
Memory
48
Linking Object Files

Relocate the addresses in the text segment and data
segment
Link procedures A and B

49
Reallocated Executable Image

Text segment starts at 0040 0000 hex
Data segment starts at 1000 0000 hex ($gp = 1000 8000 hex)

Memory layout (top of memory first, addresses in hex):
Region          Address        Register
Stack
Dynamic data
Static data     1000 0000
Text            0040 0000      pc
Reserved        0
50
Loader

The operating system reads the executable file into memory
and starts it

Read the header to determine the size of the text and data
segments
Create space for the text and data segments
Copy instructions and data into memory
Copy parameters to the main program
Initialize registers and the stack pointer
Jump to the start-up routine
Terminate with the exit system call
51
Dynamic Linked Library

Disadvantages of statically linked library routines
Updates: the library code in the program becomes outdated when
a new version is released
Size: the library routines become part of the code
Lazy procedure linkage
Overhead only on the first call
Pay nothing extra on return from the library

52
Lazy Procedure Linkage
[Figure: lazy procedure linkage]
First call: Text (jal ... lw, jr) → Data (indirect jump) → Text (li ID, j) → Text (Dynamic Linker/Loader, j) → Text (DLL Routine, jr)
Subsequent call: Text (jal ... lw, jr) → Data (indirect jump) → Text (DLL Routine, jr)
53
Java Program

Java feature
Run on any computer
Slow execution time
Compiles to Java bytecode instructions that are easy
to interpret

Java program
  → Compiler
  → Class files (Java bytecode) + Java library routines (machine language)
  → Java Virtual Machine (software interpreter)
  → Just In Time compiler
  → Compiled Java methods (machine language)
54
Passes or Phases in Optimizing Compiler

High-level optimization
Procedure inlining
Reduce loop overhead
Loop unrolling
Improve memory behavior
Interchange nested loop
Blocking loop

[Figure: compiler passes and their dependencies]
Front end per language      (language dependent, machine independent)
  → Intermediate representation
High-level optimization
Global optimizer
Code generator              (language independent, machine dependent)
55
Local Optimizations

Common subexpression elimination: to compute
x[i] = x[i] + 4

Unoptimized                      Optimized
# x[i] + 4                       # x[i] + 4
li   R100, x                     li   R100, x
lw   R101, i                     lw   R101, i
mult R102, R101, 4               mult R102, R101, 4
add  R103, R100, R102            add  R103, R100, R102
lw   R104, 0(R103)               lw   R104, 0(R103)
# x[i] is in R104                # x[i] is in R104
add  R105, R104, 4               add  R105, R104, 4
# x[i] =                         # x[i] =
li   R106, x                     sw   R105, 0(R103)
lw   R107, i
mult R108, R107, 4
add  R109, R106, R108
sw   R105, 0(R109)

Strength reduction: replace mult by shift left
Constant propagation: collapse constants whenever
possible
Copy propagation: eliminate the need to reload
values
Dead store elimination: eliminate stores of values
not used again
Dead code elimination: eliminate code which does
not affect the final result

56
Global Optimization

The same as local optimization, and more
Code motion
move loop-invariant code out of the loop
Induction variable elimination
reduce the overhead of array indexing by turning it into pointer
accesses

57
Optimization in gcc Level
58
Compiler Optimization for Bubble Sort

Performance, instruction count, and CPI
comparison
Pentium 4, clock rate 3.06GHz, 533MHz system bus,
with 2 GB of PC2100 DDR SDRAM memory
Linux version 2.4.20

59
Performance of C and Java

Use two sort algorithms
C optimizing compiler
Java interpreter

60
C Swap Example

Swap two locations in memory
C code:
  void swap(int v[], int k)
  { int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
  }
MIPS code:
  swap: sll $t1, $a1, 2      # $t1 = k * 4
        add $t1, $a0, $t1    # $t1 = v + (k * 4)
        lw  $t0, 0($t1)      # temp = v[k]
        lw  $t2, 4($t1)      # $t2 = v[k+1]
        sw  $t2, 0($t1)      # v[k] = $t2
        sw  $t0, 4($t1)      # v[k+1] = temp
        jr  $ra

61
C Sort Example

Sort function calls swap
C code:
  void sort(int v[], int n)
  { int i, j;
    for (i = 0; i < n; i += 1) {
      for (j = i - 1; j >= 0 && v[j] > v[j+1]; j -= 1) {
        swap(v, j);
      }
    }
  }

62
MIPS Code Translation (I)

Saving registers:
  sort:    addi $sp, $sp, -20
           sw   $ra, 16($sp)
           sw   $s3, 12($sp)
           sw   $s2, 8($sp)
           sw   $s1, 4($sp)
           sw   $s0, 0($sp)
Move parameters:
           move $s2, $a0
           move $s3, $a1
Outer loop:
           move $s0, $zero           # i = 0
  for1tst: slt  $t0, $s0, $s3        # $t0 = 0 if i >= n
           beq  $t0, $zero, exit1    # then go to exit1
           (inner loop)
           (pass parameters and call)
  exit2:   addi $s0, $s0, 1
           j    for1tst

63
MIPS Code Translation (II)

Inner loop:
           addi $s1, $s0, -1         # j = i - 1
  for2tst: slti $t0, $s1, 0          # if (j < 0)
           bne  $t0, $zero, exit2    # then go to exit2
           sll  $t1, $s1, 2
           add  $t2, $s2, $t1
           lw   $t3, 0($t2)
           lw   $t4, 4($t2)
           slt  $t0, $t4, $t3
           beq  $t0, $zero, exit2
           (pass parameters and call)
           addi $s1, $s1, -1
           j    for2tst
Pass parameters and call:
           move $a0, $s2
           move $a1, $s1
           jal  swap
Restoring registers:
  exit1:   lw   $s0, 0($sp)
           lw   $s1, 4($sp)
           lw   $s2, 8($sp)
           lw   $s3, 12($sp)
           lw   $ra, 16($sp)
           addi $sp, $sp, 20
Procedure return:
           jr   $ra

64
Alternative Architectures

Design alternative
provide more powerful operations
goal is to reduce number of instructions executed
danger is a slower cycle time and/or a higher
CPI
Let's look (briefly) at IA-32

The path toward operation complexity is thus
fraught with peril. To avoid these problems,
designers have moved toward simpler instructions

65
IA - 32

1978: The Intel 8086 is announced (16 bit
architecture)
1980: The 8087 floating point coprocessor is
added
1982: The 80286 increases address space to 24
bits, adds instructions
1985: The 80386 extends to 32 bits, new
addressing modes
1989-1995: The 80486, Pentium, Pentium Pro add a
few instructions (mostly designed for higher
performance)
1997: 57 new MMX instructions are added,
Pentium II
1999: The Pentium III added another 70
instructions (SSE)
2001: Another 144 instructions (SSE2)
2003: AMD extends the architecture to increase
address space to 64 bits, widens all registers
to 64 bits and other changes (AMD64)
2004: Intel capitulates and embraces AMD64
(calls it EM64T) and adds more media extensions
This history illustrates the impact of the
"golden handcuffs" of compatibility:
"adding new features as someone might add clothing to a
packed bag"
"an architecture that is difficult
to explain and impossible to love"

66
IA-32 Overview

Complexity
Instructions from 1 to 17 bytes long
one operand must act as both a source and
destination
one operand can come from memory
complex addressing modes e.g., base or scaled
index with 8 or 32 bit displacement
Saving grace
the most frequently used instructions are not too
difficult to build
compilers avoid the portions of the architecture
that are slow
what the 80x86 lacks in style is made up in
quantity, making it beautiful from the right
perspective

67
IA-32 Registers and Data Addressing

Registers in the 32-bit subset that originated
with 80386

68
IA-32 Register Restrictions

Registers are not general purpose note the
restrictions below

69
IA-32 Typical Instructions

Four major types of integer instructions
Data movement including move, push, pop
Arithmetic and logical (destination register or
memory)
Control flow (use of condition codes / flags )
String instructions, including string move and
string compare

70
IA-32 instruction Formats

Typical formats (notice the different lengths)

71
Optimization in gcc Level
72
Compiler Optimization for Bubble Sort

Performance, instruction count, and CPI
comparison
Pentium 4, clock rate 3.06GHz, 533MHz system bus,
with 2 GB of PC2100 DDR SDRAM memory
Linux version 2.4.20

73
Performance of C and Java

Use two sort algorithms
C optimizing compiler
Java interpreter

74
Summary

Instruction complexity is only one variable
lower instruction count vs. higher CPI / lower
clock rate
Design Principles
simplicity favors regularity
smaller is faster
good design demands compromise
make the common case fast
Instruction set architecture
a very important abstraction indeed!

75
A dominant architecture 80x86

See your textbook for a more detailed description
Complexity
Instructions from 1 to 17 bytes long
one operand must act as both a source and
destination
one operand can come from memory
complex addressing modes e.g., base or scaled
index with 8 or 32 bit displacement
Saving grace
the most frequently used instructions are not too
difficult to build
compilers avoid the portions of the architecture
that are slow
what the 80x86 lacks in style is made up in
quantity, making it beautiful from the right
perspective

76
PowerPC

Indexed addressing
  example: lw $t1, $a0+$s3      # $t1 = Memory[$a0+$s3]
  What do we have to do in MIPS?
Update addressing
  update a register as part of load (for marching
  through arrays)
  example: lwu $t0, 4($s3)      # $t0 = Memory[$s3+4]; $s3 = $s3+4
  What do we have to do in MIPS?
Others
  load multiple/store multiple
  a special counter register: "bc Loop"
  decrement counter, if not 0 goto loop

77
Alternative Architectures

Design alternative
provide more powerful operations
goal is to reduce number of instructions executed
danger is a slower cycle time and/or a higher CPI
Sometimes referred to as RISC vs. CISC
virtually all new instruction sets since 1982
have been RISC
VAX: minimize code size, make assembly language
easy; instructions from 1 to 54 bytes long!
We'll look at PowerPC and 80x86