Chapter 2: Instruction Set Architecture

Slides: 44

Provided by: sari158

1
Chapter 2 Instruction Set Architecture

Principles underlying modern ISA (Sections 2.1 to 2.10)
Compilers (Section 2.11)
Examples (Sections 2.12 and 2.13)
CISC vs. RISC (Section 2.1)
Recent advances

2
ISA Principles

Application area
Operands in CPU
ALU operands
Data storage: endianness and alignment
Addressing modes
Operand type and size
Operations
Control instructions
Encoding

3
Dependence on Application Area

Desktop
Performance
Integer and floating point programs
Servers
Performance
Integer and character strings
Embedded systems
Code size
Real-time performance on continuous data streams
Hand-optimized kernels

4
Operand Storage in CPU

Why in CPU?
Accumulator: one implicit register (before 1960)
Minimum hardware resources
High memory traffic
Stack: LIFO storage (1960s to 1970s, and the Java Virtual Machine!)
Instructions implicitly access top of stack
Good code density
Stack can become a bottleneck, especially with pipelining
Registers: 8 to 256 words (1960s to ???)
Flexible temporaries and variables
Registers must be named
Most general-purpose systems now use registers; our focus too

5
Operand Storage in CPU

Why in CPU?
Faster access, Shorter address
Accumulator: one implicit register (before 1960)
Minimum hardware resources
High memory traffic
Stack: LIFO storage (1960s to 1970s, and the Java Virtual Machine!)
Instructions implicitly access top of stack
Good code density
Stack can become a bottleneck, especially with pipelining, but the stack can be cached
Registers: 8 to 256 words (1960s to ???)
Flexible temporaries and variables
Registers must be named
Most general-purpose systems now use registers; our focus too
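
To make the three storage styles concrete, here is a minimal Python sketch that evaluates C = A + B on a toy accumulator machine, a toy stack machine, and a toy load-store register machine, and counts instructions and data memory accesses. The instruction names, the register names, and the dictionary used as memory are all assumptions made for this illustration, not any real ISA.

```python
def run_accumulator(mem):
    """Accumulator machine: one implicit register, every operand comes from memory."""
    program = [("load", "A"), ("add", "B"), ("store", "C")]
    acc, accesses = 0, 0
    for op, addr in program:
        accesses += 1  # every instruction here names one memory operand
        if op == "load":
            acc = mem[addr]
        elif op == "add":
            acc += mem[addr]
        elif op == "store":
            mem[addr] = acc
    return len(program), accesses


def run_stack(mem):
    """Stack machine: ALU operands are implicit at the top of the stack."""
    program = [("push", "A"), ("push", "B"), ("add", None), ("pop", "C")]
    stack, accesses = [], 0
    for op, addr in program:
        if op == "push":
            stack.append(mem[addr])
            accesses += 1
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "pop":
            mem[addr] = stack.pop()
            accesses += 1
    return len(program), accesses


def run_load_store(mem):
    """Load-store machine: ALU instructions name registers only."""
    program = [
        ("load", "r1", "A"),
        ("load", "r2", "B"),
        ("add", "r3", "r1", "r2"),
        ("store", "r3", "C"),
    ]
    regs, accesses = {}, 0
    for instr in program:
        if instr[0] == "load":
            regs[instr[1]] = mem[instr[2]]
            accesses += 1
        elif instr[0] == "add":
            regs[instr[1]] = regs[instr[2]] + regs[instr[3]]
        elif instr[0] == "store":
            mem[instr[2]] = regs[instr[1]]
            accesses += 1
    return len(program), accesses
```

For a single addition all three machines touch memory three times. The register machine pulls ahead only when a value is reused, because it can stay in a named register instead of being reloaded.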

6
Registers vs. Caches

7
Registers vs. Caches [Ditzel and McLellan, 1982]

Register Advantages
Faster (no addressing modes, no tags)
Deterministic (no misses), so code can be scheduled for the pipeline
Small, so they can be duplicated for two ports
Short identifier (3 to 8 bits)
Register Disadvantages
Save/restore on procedure calls
Can't take the address of a register
Fixed size (FP, strings, structures)
Compiler must control (an advantage?)

8
How Many Registers?

More registers mean:

9
How Many Registers?

More registers mean:
Hold operands longer (decreases memory traffic, execution time)
Longer register specifier
Slower registers
More state means slower context switches
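
The "longer register specifier" cost can be quantified with a short Python sketch. It assumes a plain binary encoding of register numbers; the function names are hypothetical.

```python
def specifier_bits(num_registers):
    """Bits needed to name one register out of num_registers (binary encoding)."""
    return (num_registers - 1).bit_length()


def operand_field_bits(num_registers, operands):
    """Total bits an instruction spends on register specifiers."""
    return specifier_bits(num_registers) * operands
```

Under this assumption, 8 registers need 3 bits per specifier and 256 registers need 8 bits, so a three-operand instruction with 32 registers spends 15 bits on register names.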

10
ALU Operands

Number of explicit operands
Two (destination equals one source)
Small instruction
Three
Few instructions, orthogonal
Number of operands for memory
Any (memory-memory): VAX
At least one register (register-memory): IBM 360
Zero (load-store): MIPS, Alpha, SPARC, Cray
Fixed-size instructions
Simple code generation model: all similar ALU instructions take the same time
Facilitates pipelining: no page faults, simple decoding
Needs loads/stores, higher instruction count

11
Endianness

Order of bytes in words
Big endian: MSB at address xxxxx00
Little endian: LSB at address xxxxx00
Big Endian (IBM, Motorola)
Word Address   MSB            LSB
0              0    1    2    3
4              4    5    6    7
Little Endian (DEC, Intel)
Word Address   MSB            LSB
0              3    2    1    0
4              7    6    5    4
Does not matter
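
The two byte orders can be seen directly with Python's standard `struct` module. This is only an illustration; the value 0x0A0B0C0D is an arbitrary example word.

```python
import struct

value = 0x0A0B0C0D

# ">I" packs a 32-bit unsigned integer big endian: the MSB goes to the lowest address.
big = struct.pack(">I", value)

# "<I" packs the same integer little endian: the LSB goes to the lowest address.
little = struct.pack("<I", value)

# Reading big-endian bytes as if they were little endian scrambles the value.
misread = int.from_bytes(big, "little")
```

Here `big` holds the bytes 0A 0B 0C 0D in increasing address order and `little` holds 0D 0C 0B 0A. The order becomes visible only when bytes are exchanged between machines or a word is accessed byte by byte.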

12
Alignment

What is alignment?
Address mod Size = 0
Example: aligned word (4 bytes)
Example: unaligned word (4 bytes)

[Figure: byte layouts of the aligned and the unaligned word]

13
Alignment (Cont.)

No restrictions on alignment:
Software is simple
Hardware must detect misalignment and (typically) make two memory accesses
Expensive logic
Usually slows down all references
Restricted alignment:
Software must guarantee alignment
Hardware only detects misalignment and traps
Middle ground (VAX 8800):
Misaligned data ok, but slow
Traps on misaligned access, 10-cycle penalty
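
A minimal Python sketch of the alignment test (Address mod Size = 0) and of why a misaligned access typically costs two memory accesses. The 4-byte memory word width and the function names are assumptions for the example.

```python
def is_aligned(address, size):
    """An access of `size` bytes is aligned when address mod size is 0."""
    return address % size == 0


def memory_accesses(address, size, word_size=4):
    """Count the word-wide memory accesses needed to read `size` bytes at `address`."""
    first_word = address // word_size
    last_word = (address + size - 1) // word_size
    return last_word - first_word + 1
```

With these definitions a 4-byte word at address 8 is aligned and needs one access, while the same word at address 6 straddles two memory words and needs two.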

14
Addressing Modes

Possibilities
1. Register
2. Displacement
3. Immediate
4. Register deferred
5. Indexed
6. Absolute
7. Memory deferred
8. Autoincrement
9. Autodecrement
10. Scaled
Which modes to support and why?
Modes 1 to 4 account for 93% of all operands on the VAX!
Displacement and immediate modes are most common
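
The ten modes listed above can be sketched as one Python function that fetches an operand. The mode names follow the list; the argument names, the dictionary-based register file and memory, and the 4-byte default element size are assumptions for the illustration.

```python
def operand(mode, regs, mem, *, reg=None, reg2=None, const=0, size=4):
    """Return the operand value selected by an addressing mode (toy model)."""
    if mode == "register":            # value is in a register
        return regs[reg]
    if mode == "displacement":        # Mem[const + Regs[reg]]
        return mem[const + regs[reg]]
    if mode == "immediate":           # value is the constant itself
        return const
    if mode == "register deferred":   # Mem[Regs[reg]]
        return mem[regs[reg]]
    if mode == "indexed":             # Mem[Regs[reg] + Regs[reg2]]
        return mem[regs[reg] + regs[reg2]]
    if mode == "absolute":            # Mem[const]
        return mem[const]
    if mode == "memory deferred":     # Mem[Mem[Regs[reg]]]
        return mem[mem[regs[reg]]]
    if mode == "autoincrement":       # use Mem[Regs[reg]], then advance the register
        value = mem[regs[reg]]
        regs[reg] += size
        return value
    if mode == "autodecrement":       # step the register back, then use Mem[Regs[reg]]
        regs[reg] -= size
        return mem[regs[reg]]
    if mode == "scaled":              # Mem[const + Regs[reg] + Regs[reg2] * size]
        return mem[const + regs[reg] + regs[reg2] * size]
    raise ValueError(f"unknown addressing mode: {mode}")
```

Only the autoincrement and autodecrement modes modify a register as a side effect, which is part of what makes them harder to pipeline than the four most common modes.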

15
Addressing Modes (Cont.)

What length of displacements to support?
Figure 2.8
What length of immediates to support?
Figure 2.10

16
DSP Addressing Mode Examples

Modulo or circular addressing
Handles circular buffers for infinite continuous streams
Bit-reverse addressing
Handles shuffles in FFT
Compilers find it difficult to generate the above modes
But lots of DSP applications use assembly code
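
Both DSP modes reduce to small address computations that the hardware performs for free. The Python sketch below shows the arithmetic only; the function names and the buffer length in the example are assumptions.

```python
def circular_next(index, step, length):
    """Modulo (circular) addressing: advance an index and wrap at the buffer end."""
    return (index + step) % length


def bit_reverse(index, bits):
    """Bit-reverse addressing: reverse the low `bits` bits of an index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result
```

For an 8-point FFT, `[bit_reverse(i, 3) for i in range(8)]` gives the access order 0, 4, 2, 6, 1, 5, 3, 7, and `circular_next(7, 1, 8)` wraps back to 0.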

17
Type and Size of Operands

Type usually encoded in opcode
Desktops and servers: type also gives size
Character: 1 byte
Half word: 16 bits
Word: 32 bits
Single-precision floating point: 1 word
Double-precision floating point: 2 words
Decimal: less common
Packed data types for multimedia: see 4 slides later
Graphics
2D: pixels with x, y, z coordinates (z says which images are visible)
3D: add a coordinate for color and hidden surfaces
Each coordinate is 8, 16, or 32 bits

18
Type and Size of Operands (Cont.)

DSP processors
Fixed point: a cheap floating point
Fraction between -1 and +1
Exponent is separate
Programmer must ensure alignment of result with exponent
Wide internal registers to avoid round-off errors
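
A minimal Python sketch of fixed-point fractions, assuming the common 16-bit Q15 format (1 sign bit, 15 fraction bits). The slide does not name a format, so Q15 and the function names are assumptions.

```python
Q = 15  # number of fraction bits (assumed Q15 format)


def to_q15(x):
    """Convert a fraction in [-1, +1) to a 16-bit fixed-point integer, clamping at the ends."""
    return max(-32768, min(32767, round(x * (1 << Q))))


def from_q15(n):
    """Convert a Q15 integer back to a Python float."""
    return n / (1 << Q)


def mul_q15(a, b):
    """Multiply two Q15 values; the wide product is kept before shifting back."""
    wide = a * b          # up to 30 fraction bits: this is why wide registers help
    return wide >> Q
```

For example, `mul_q15(to_q15(0.5), to_q15(0.25))` yields the Q15 encoding of 0.125. The implied exponent is never stored, so the programmer has to keep track of it.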

19
Operations

Arithmetic and logical
Memory
Control
System
Floating point
Decimal
String
Graphics, multimedia, DSP
First four categories supported by all systems

20
Multimedia Instructions

Recent general-purpose processors include
multimedia instructions
Multimedia data derived from sampling analog
input
Correctness dictated by human perception
Smaller data types - 8-bit, 16-bit
Compare with 32 and 64 bit processor data paths
Significant levels of data parallelism
Large collection of small data elements
Identical processing of similar elements
e.g., image addition:
for I = 1 to 1024
  for J = 1 to 1024
    dest[I,J] = src1[I,J] + src2[I,J]

21
Multimedia - Packed Data Types

[Figure: Operand 1, Operand 2, and Result, each 16 bits wide in a 64-bit register]
48 bits are wasted! Can we use them in any way?
4 operations in 1 cycle: SPEEDUP 4X?? Called SIMD (single-instruction, multiple-data) parallelism
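
The packed-data idea can be sketched in Python by treating one integer as a 64-bit register holding four 16-bit lanes. This models the semantics only (real hardware adds all lanes in one cycle); the helper names are hypothetical.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack four 16-bit values into one 64-bit word, lane 0 in the low bits."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(a, b):
    """Lane-wise add with wraparound: a carry never crosses into the next lane."""
    result = 0
    for i in range(LANES):
        shift = i * LANE_BITS
        lane_sum = ((a >> shift) & LANE_MASK) + ((b >> shift) & LANE_MASK)
        result |= (lane_sum & LANE_MASK) << shift
    return result
```

Adding the packed vectors [1, 2, 3, 0xFFFF] and [10, 20, 30, 1] gives [11, 22, 33, 0]: the last lane wraps around instead of carrying into its neighbor.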
22
Other Multimedia Extensions

Saturation arithmetic
Example: image addition
Saturation ensures clamping of values

for I = 1 to 1024
  for J = 1 to 1024
    dest[I,J] = src1[I,J] + src2[I,J]
    if (dest > 255) dest = 255
    if (dest < 0) dest = 0
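
The clamping loop above can be written as a short Python sketch for 8-bit pixels. The function names and the list-of-rows image representation are assumptions for the example.

```python
def saturating_add_u8(a, b):
    """Add two 8-bit pixel values and clamp the result to the range 0..255."""
    return max(0, min(255, a + b))


def add_images(src1, src2):
    """Pixel-wise saturating addition of two equally sized images (lists of rows)."""
    return [
        [saturating_add_u8(p, q) for p, q in zip(row1, row2)]
        for row1, row2 in zip(src1, src2)
    ]
```

With ordinary wraparound arithmetic 200 + 100 would become 44 in 8 bits, turning a bright pixel dark; saturation returns 255 instead.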
23
Other Multimedia Extensions (Cont.)

Sub-word Rearrangement
How do we go from unpacked data types to packed
data types?
Provide ISA support for pack, unpack, expand, align, ...
Support for other types of sub-word rearrangement
Shift, rotate, permute, ...
E.g., for FFT butterfly algorithm
Many others
Conditional execution, memory instructions, special-purpose instructions, ...

24
Example: Intel MMX ISA Extensions

57 new instructions
Use FP registers, 64-bit data path, SIMD, saturation, ...
More information available from
MMX Technology Overview, Intel web site. http://developer.intel.com/drg/mmx/manuals/overview/
25
Example: Intel SSE ISA Extensions

70 instructions
Separate register state, 128-bit data path,
alignment support, cache hints, SIMD,...
More information available from
The Internet Streaming SIMD Extensions, Shreekanth Thakkar and Tom Huff, Intel Technology Journal Q2, 1999. http://developer.intel.com/technology/itj/q21999/articles/art_1.htm

26
Control Instructions

Examples: conditional branches, unconditional jumps, procedure calls/returns, O.S. calls/returns
Key aspects
Taken or not taken?
Where is the target?

27
Taken or Not Taken

Compare and branch instruction
No extra compare instruction
No state is passed between instructions
Requires ALU operation
Condition codes (Z,N,V,C)
Condition in general purpose register
No special state to save and implement
Uses up a register
DSPs: repeat instruction repeats a loop a specified number of times

28
Taken or Not Taken

Compare and branch instruction
No extra compare instruction
No state is passed between instructions
Requires ALU operation
Condition codes (Z,N,V,C)
Can be set for free
Constrains code reordering
Extra state to save and implement
Condition in general purpose register
No special state to save and implement
Uses up a register
DSPs: repeat instruction repeats a loop a specified number of times
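
To show how the condition codes (Z, N, V, C) are "set for free" as a side effect of an ALU operation, here is a minimal Python sketch of an add that also produces the four flags. The 32-bit default width and the function name are assumptions; the flag definitions are the usual two's-complement ones.

```python
def add_with_flags(a, b, bits=32):
    """Add two `bits`-wide values and return (result, flags) like a condition-code ALU."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a &= mask
    b &= mask
    full = a + b
    result = full & mask
    flags = {
        "Z": result == 0,                 # zero
        "N": bool(result & sign),         # negative (sign bit set)
        "C": full > mask,                 # carry out of the top bit
        # overflow: inputs share a sign, but the result's sign differs
        "V": bool(~(a ^ b) & (a ^ result) & sign),
    }
    return result, flags
```

A later branch then tests the saved flags rather than comparing again, which is exactly the extra state that constrains code reordering.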

29
Taken or Not Taken (Cont.)

Some data for compare-and-branch
Figure 2.22

30
Where is the Target?

Could use arbitrary specifier
Powerful
More bits to specify
More time to decode
PC-relative with immediate
Position independence (helps linking)
Short immediate sufficient
H&P: most instructions use less than 8 bits (Figure 2.20)
Target must be known statically
Can't jump arbitrarily far; other techniques are required for returns and distant jumps
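
A minimal Python sketch of PC-relative target computation. It assumes a MIPS-style convention (the signed immediate counts instructions relative to the instruction after the branch, with 4-byte instructions); other ISAs define the offset differently, and the function names are hypothetical.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def branch_target(pc, imm, imm_bits=16, instr_bytes=4):
    """Target = address of the next instruction + signed offset in instructions."""
    return pc + instr_bytes + sign_extend(imm, imm_bits) * instr_bytes
```

Because only the offset is encoded, the same code works wherever it is loaded, but an 8-bit immediate under this convention reaches only 128 instructions back and 127 forward.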

31
Where is the Target (Cont.)

Register
Short specification
Can jump anywhere
Dynamic target ok
Extra instruction to load register
(Vectored) Trap
Critical for O.S. calls
Common compromise
(Conditional) branches: PC-relative with short immediates
(Unconditional) jumps: PC-relative, register
Procedure calls: PC-relative, register
Procedure returns and indirect jumps: register
O.S. calls: trap
O.S. returns: register

32
Encoding the Instruction Set

Encoding affects size of program and
implementation
Depends on many aspects of the ISA
Variable length (opcode only tells number of
operands, not the type)
Minimize code size
Hard to pipeline
Fixed length (opcode tells number of operands and
address mode)
Easy to decode and pipeline
Increases code size
Hybrid approach
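
Why fixed-length encodings are easy to decode can be seen in a short Python sketch. The 32-bit layout used here (6-bit opcode, two 5-bit register fields, 16-bit immediate) is an assumption modeled on a MIPS-like immediate format; the function names are hypothetical.

```python
def encode(opcode, rs, rt, imm):
    """Pack the four fields into one 32-bit instruction word."""
    return (
        ((opcode & 0x3F) << 26)
        | ((rs & 0x1F) << 21)
        | ((rt & 0x1F) << 16)
        | (imm & 0xFFFF)
    )


def decode(word):
    """Extract (opcode, rs, rt, imm); every field sits at a fixed bit position."""
    return (
        (word >> 26) & 0x3F,
        (word >> 21) & 0x1F,
        (word >> 16) & 0x1F,
        word & 0xFFFF,
    )
```

Each field can be extracted in parallel without first examining the opcode. A variable-length format must decode the opcode and each specifier in sequence before it even knows where the next instruction starts.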

33
Compilers

Compilers form a GIANT case analysis
Too many choices make it hard
Provide orthogonal instruction sets
Operation
Addressing mode
Data type

34
Compilers (Cont.)

One solution or all possible solutions
2 branch conditions (EQ, LT)
Or all 6 (EQ, LT, GT, NE, LE, GE)
Not 3 or 4
Primitives, NOT solutions
Semantic clash: "... by giving too much semantic content to the instruction, the machine designer made it possible to use the instruction only in limited contexts. In many of these cases, the high-level instructions are synthesized from more primitive operations which, if the compiler writer could access them, could be recomposed to more closely model the features actually needed."

35
Example ISA 1: MIPS64 (Section 2.8)

RISC architecture
32-bit byte addresses (aligned)
Load/store; only immediate and displacement addressing, with 16 bits
Registers
32 64-bit general-purpose registers, R0 to R31 (R0 is always 0)
32 64-bit floating-point registers, usable as single or double precision
Control
Conditional branch: = 0 and ≠ 0
Jump: PC-relative and register
Others for FP, linking, trap
Three fixed-length instruction formats, 32 bits each
Special operations
Paired single: two 32-bit FP values in one 64-bit datum, for graphics
Multiply-add for DSP

36
Example ISA 2: Trimedia CPU64 (Section 2.9)

Media processor
For multimedia workloads
Focused on parallelism
128 64-bit registers for integer or floating point
SIMD instructions, saturation arithmetic
Very Long Instruction Word (VLIW)
Multiple independent operations encoded in a
single instruction
Five operations for Trimedia CPU64
NOPs in instructions if five operations not
available
More on VLIW later
Compacts instructions in memory, decoded in
I-cache
25 functional units
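
The NOP padding in a five-slot VLIW instruction can be sketched in Python. This is a toy model: the operation names are placeholder strings, and the input is assumed to already list which independent operations were scheduled for each cycle.

```python
SLOTS = 5      # operations per instruction, as stated for the Trimedia CPU64
NOP = "nop"


def bundle(ops_per_cycle):
    """Pad each cycle's independent operations to one full instruction word."""
    words = []
    for ops in ops_per_cycle:
        if len(ops) > SLOTS:
            raise ValueError("more operations than slots in one instruction")
        words.append(list(ops) + [NOP] * (SLOTS - len(ops)))
    return words


def utilization(words):
    """Fraction of slots that hold real operations rather than NOPs."""
    used = sum(op != NOP for word in words for op in word)
    return used / (len(words) * SLOTS)
```

For example, cycles with two and one useful operations produce two instructions with seven NOPs between them, a utilization of 0.3, which is why compacting instructions in memory matters.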

37
Example ISA 3 - VAX

CISC architecture
Introduced by DEC in 1977
16 GPRs (r15 is PC, r14 is SP)
Extremely orthogonal, memory-memory
Decode as byte stream
Opcode: operation, number of operands
Variable-length address specifiers
Virtually all addressing modes
Includes complex instructions: CRC, INSQUE

38
MIPS vs. VAX

VAX has too many modes and formats
Serial semantics can limit parallel
interpretation
The big deal with RISC is not REDUCED numbers of instructions; it is few modes and formats to facilitate pipelining

39
CISC vs. RISC

Why CISC (60s and 70s)

Why RISC (70s and 80s)
40
CISC vs. RISC

Why CISC (60s and 70s)
Assembly programming
Small memory (dense encoding)
Microprogrammed control, so complex instructions are ok

Why RISC (70s and 80s)
Advances in compilers
Large memory, caches
VLSI (single-chip processor), pipelining, hardwired control, all favoring simple instructions
41
Outcome of RISC vs. CISC?
42
Outcome of RISC vs. CISC?

Millions of transistors per chip
Made sophisticated decoders for CISC possible
Focus on instruction-level parallelism
Decoding a single instruction is a small part of hardware and performance
Caches dominate chip
Internally, CISC processors decode into RISC-like
instructions and execute on microarchitecture
similar to RISCs
Above factors narrowed CISC vs. RISC gap
Non-technical issues played a large role

43
Recent Developments in Instruction Sets