Title: Chapter 2: Instruction Set Architecture
1Chapter 2 Instruction Set Architecture
- Principles underlying modern ISAs (Sections 2.1-2.10)
- Compilers (Section 2.11)
- Examples (Sections 2.12 and 2.13)
- CISC vs. RISC (Section 2.1)
- Recent advances
2ISA Principles
- Application area
- Operands in CPU
- ALU operands
- Data storage: endianness and alignment
- Addressing modes
- Operand type and size
- Operations
- Control instructions
- Encoding
3Dependence on Application Area
- Desktop
- Performance
- Integer and floating point programs
- Servers
- Performance
- Integer and character strings
- Embedded systems
- Code size
- Real-time performance on continuous data streams
- Hand-optimized kernels
5Operand Storage in CPU
- Why in CPU?
- Faster access, shorter address
- Accumulator: one implicit register (< 1960)
- Minimum hardware resources
- High memory traffic
- Stack: LIFO storage (1960s-1970s, and the Java Virtual Machine!)
- Instructions implicitly access top of stack
- Good code density
- Stack can become a bottleneck, especially with pipelining
- But the stack can be cached
- Registers: 8 to 256 words (1960s-???)
- Flexible temporaries and variables
- Registers must be named
- Most general-purpose systems now use registers; our focus too
7Registers vs. Caches [Ditzel and McLellan, 1982]
- Register Advantages
- Faster (no addressing modes, no tags)
- Deterministic (no misses) → can schedule for pipeline
- Small → can duplicate for two ports
- Short identifier (3-8 bits)
- Register Disadvantages
- Save/restore on procedure calls
- Can't take the address of a register
- Fixed size (FP, strings, structures)
- Compiler must control (an advantage?)
9How Many Registers?
- More registers →
- Hold operands longer (decreases memory traffic, execution time)
- Longer register specifier
- Slower registers
- More state means slower context switches
10ALU Operands
- Number of explicit operands
- Two (destination equals one source)
- Small instruction
- Three
- Few instructions, Orthogonal
- Number of operands for memory
- Any (memory-memory): VAX
- At least one register (register-memory): IBM 360
- Zero (load-store): MIPS, Alpha, SPARC, Cray
- Fixed-size instructions
- Simple code generation model: all similar ALU instructions take the same time
- Facilitates pipelining: no page faults, simple decoding
- Needs loads/stores, higher instruction count
11Endianness
- Order of bytes in words
- Big endian: MSB at address xxxxx00
- Little endian: LSB at address xxxxx00
- Big endian (IBM, Motorola)
- Word address: byte addresses from MSB to LSB
- 0: 0 1 2 3
- 4: 4 5 6 7
- Little endian (DEC, Intel)
- Word address: byte addresses from MSB to LSB
- 0: 3 2 1 0
- 4: 7 6 5 4
- Does not matter
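The two byte orders can be checked with a minimal sketch using Python's standard `struct` module. The value `0x00010203` and the helper `byte_at` are illustrative assumptions, chosen so that each byte's value shows its significance (0 = MSB, 3 = LSB).

```python
import struct

value = 0x00010203  # example 32-bit word: MSB is 0x00, LSB is 0x03

# Big endian: the most significant byte is stored at the lowest address.
big = struct.pack(">I", value)
# Little endian: the least significant byte is stored at the lowest address.
little = struct.pack("<I", value)


def byte_at(memory: bytes, offset: int) -> int:
    """Return the byte stored at the given offset from the word address."""
    return memory[offset]


print("big endian   :", big.hex())
print("little endian:", little.hex())
```

Reading offset 0 of each result gives the MSB for big endian and the LSB for little endian, matching the definitions on this slide.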
12Alignment
- What is alignment?
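- Address mod Size = 0
- Example: aligned word (4 bytes)
- [Figure: bytes 3 2 1 0 all lie within one word]
- Example: unaligned word (4 bytes)
- [Figure: bytes 2 1 0 lie in one word and byte 3 spills into the adjacent word]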
13Alignment (Cont.)
- No restrictions on alignment →
- Software is simple
- Hardware must detect misalignment and (typically) make two memory accesses
- Expensive logic
- Usually slows down all references
- Restricted alignment
- Software must guarantee alignment
- Hardware only detects misalignment and traps
- Middle ground (VAX 8800)
- Misaligned data ok, but slow
- Traps on misaligned access, 10 cycles penalty
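As a rough illustration of the alignment rule (address mod size = 0) and of why a misaligned word typically costs two memory accesses, here is a small sketch. The function names and the 4-byte memory word width are assumptions made for the example, not part of the slides.

```python
def is_aligned(address: int, size: int) -> bool:
    """An access is aligned when its address is a multiple of its size."""
    return address % size == 0


def word_accesses(address: int, size: int = 4, word_bytes: int = 4) -> int:
    """Count the aligned memory words touched by an access of `size` bytes."""
    first_word = address // word_bytes
    last_word = (address + size - 1) // word_bytes
    return last_word - first_word + 1


for addr in (8, 6):
    print(addr, is_aligned(addr, 4), word_accesses(addr))
```

With these assumptions, a 4-byte access at address 8 touches one word, while the same access at address 6 straddles two words.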
14Addressing Modes
- Possibilities
- 1. Register
- 2. Displacement
- 3. Immediate
- 4. Register deferred
- 5. Indexed
- 6. Absolute
- 7. Memory deferred
- 8. Autoincrement
- 9. Autodecrement
- 10. Scaled
- Which modes to support and why?
- Modes 1-4 account for 93% of all operands on the VAX!
- Displacement and immediate modes are most common
15Addressing Modes (Cont.)
- What length of displacements to support?
- Figure 2.8
- What length of immediates to support?
- Figure 2.10
16DSP Addressing Mode Examples
- Modulo or circular addressing
- Handles circular buffers for infinite continuous streams
- Bit reverse addressing
- Handles shuffles in FFT
- Compilers find it difficult to generate the above modes
- But lots of DSP applications use assembly code
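The two DSP addressing modes can be sketched in software as follows. This is only an illustration of the address arithmetic that the hardware performs implicitly; the function names and parameters are hypothetical.

```python
def modulo_next(index: int, step: int, length: int) -> int:
    """Modulo (circular) addressing: advance and wrap within the buffer."""
    return (index + step) % length


def bit_reverse(index: int, bits: int) -> int:
    """Bit-reverse addressing: reverse the low `bits` bits of the index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


# Walking an 8-entry circular buffer past its end wraps back to entry 0.
print(modulo_next(7, 1, 8))
# Access order for the shuffle in an 8-point FFT.
print([bit_reverse(i, 3) for i in range(8)])
```

In a DSP these computations come for free as part of the address generation, which is why hand-written assembly can exploit them more easily than a compiler.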
17Type and Size of Operands
- Type usually encoded in opcode
- Desktops and servers: type also gives size
- Character: 1 byte
- Half word: 16 bits
- Word: 32 bits
- Single-precision floating point: 1 word
- Double-precision floating point: 2 words
- Decimal: less common
- Packed data types for multimedia: see 4 slides later
- Graphics
- 2D pixels: x, y, z coordinates (z says which images are visible)
- 3D: add a coordinate for color and hidden surfaces
- Each coordinate is 8, 16, or 32 bits
18Type and Size of Operands (Cont.)
- DSP processors
- Fixed point: cheap floating point
- Fraction between -1 and +1
- Exponent is separate
- Programmer must ensure alignment of result with exponent
- Wide internal registers to avoid roundoff errors
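A minimal sketch of fixed-point fractions and a wide accumulator, assuming a 16-bit Q15 format (1 sign bit, 15 fraction bits). The format and the function names are assumptions for illustration; the separate exponent mentioned above is not modeled.

```python
Q = 15  # assumed format: Q15, values represent fractions in [-1, 1)


def to_q15(x: float) -> int:
    """Convert a fraction to a Q15 integer, clamping to the 16-bit range."""
    scaled = int(round(x * (1 << Q)))
    return max(-(1 << Q), min((1 << Q) - 1, scaled))


def from_q15(v: int) -> float:
    """Convert a Q15 integer back to a fraction."""
    return v / (1 << Q)


def dot_q15(a, b) -> int:
    """Dot product using a wide accumulator, rounded to Q15 only once."""
    acc = 0  # Python ints are unbounded; stands in for a wide DSP register
    for x, y in zip(a, b):
        acc += x * y  # each product carries 30 fraction bits
    return (acc + (1 << (Q - 1))) >> Q  # single rounding step at the end


half = to_q15(0.5)
print(from_q15(dot_q15([half], [half])))
```

Keeping the full-width products in the accumulator and rounding once, rather than after every multiply, is the point of the wide internal registers.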
19Operations
- Arithmetic and logical
- Memory
- Control
- System
- Floating point
- Decimal
- String
- Graphics, multimedia, DSP
- First four categories supported by all systems
20Multimedia Instructions
- Recent general-purpose processors include multimedia instructions
- Multimedia data derived from sampling analog input
- Correctness dictated by human perception
- Smaller data types: 8-bit, 16-bit
- Compare with 32- and 64-bit processor data paths
- Significant levels of data parallelism
- Large collection of small data elements
- Identical processing of similar elements
- E.g., image addition
- for I = 1 to 1024
- for J = 1 to 1024
- dest[I,J] = src1[I,J] + src2[I,J]
21Multimedia - Packed Data Types
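- [Figure: Operand 1, Operand 2, and Result, each a 64-bit register holding 16-bit data]
- With a single 16-bit value per 64-bit register, 48 bits are wasted! Can we use them in any way?
- Pack four 16-bit elements into each 64-bit operand
- 4 operations in 1 cycle: speedup of 4X?
- Called SIMD (single-instruction, multiple-data) parallelism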
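The packed-data idea can be emulated in software as a rough sketch: four 16-bit lanes share one 64-bit word, and an add is applied to every lane without carries crossing lane boundaries. The helper names are hypothetical, and the loop stands in for what the hardware does in a single operation.

```python
LANE_BITS = 16
LANES = 4
LANE_MASK = (1 << LANE_BITS) - 1


def pack(lanes) -> int:
    """Pack four 16-bit values into one 64-bit word, lane 0 in the low bits."""
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word: int):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(a: int, b: int) -> int:
    """Add lane by lane; each lane wraps around independently."""
    return pack([x + y for x, y in zip(unpack(a), unpack(b))])


result = packed_add(pack([1, 2, 3, 4]), pack([10, 20, 30, 40]))
print(unpack(result))
```

Note that an overflow in one lane (for example 0xFFFF + 1) wraps to 0 within that lane and leaves its neighbor untouched, which is what distinguishes a packed add from an ordinary 64-bit add.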
22Other Multimedia Extensions
- Saturation arithmetic
- Example: image addition
- Saturation ensures clamping of values
for I = 1 to 1024
  for J = 1 to 1024
    dest[I,J] = src1[I,J] + src2[I,J]
    if (dest[I,J] > 255) dest[I,J] = 255
    if (dest[I,J] < 0) dest[I,J] = 0
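The clamping loop can be written as a short runnable sketch, assuming 8-bit unsigned pixels. The function names are illustrative, and a wrapping add is included only for contrast.

```python
def saturating_add_u8(a: int, b: int) -> int:
    """Add two pixel values and clamp the result to the range 0..255."""
    return max(0, min(255, a + b))


def wrapping_add_u8(a: int, b: int) -> int:
    """Ordinary 8-bit add: the result wraps around on overflow."""
    return (a + b) & 0xFF


def add_images(src1, src2):
    """Element-wise saturating addition of two equally sized images."""
    return [
        [saturating_add_u8(p, q) for p, q in zip(row1, row2)]
        for row1, row2 in zip(src1, src2)
    ]


print(saturating_add_u8(200, 100), wrapping_add_u8(200, 100))
```

With wraparound, two bright pixels would sum to a dark one; saturation keeps the result at the brightest representable value, which is the perceptually sensible outcome.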
23Other Multimedia Extensions (Cont.)
- Sub-word rearrangement
- How do we go from unpacked data types to packed data types?
- Provide ISA support for pack, unpack, expand, align, ...
- Support for other types of sub-word rearrangement
- Shift, rotate, permute, ...
- E.g., for FFT butterfly algorithm
- Many others
- Conditional execution, memory instructions, special-purpose instructions, ...
24Example Intel MMX ISA Extensions
- 57 new instructions
- Use FP registers, 64-bit data path, SIMD, saturation, ...
- More information available from
- MMX Technology Overview, Intel web site. http://developer.intel.com/drg/mmx/manuals/overview/
25Example Intel SSE ISA Extensions
- 70 instructions
- Separate register state, 128-bit data path, alignment support, cache hints, SIMD, ...
- More information available from
- The Internet Streaming SIMD Extensions, Shreekanth Thakkar and Tom Huff, Intel Technology Journal Q2, 1999. http://developer.intel.com/technology/itj/q21999/articles/art_1.htm
26Control Instructions
- Examples: conditional branches, unconditional jumps, procedure calls/returns, O.S. calls/returns
- Key aspects
- Taken or not taken?
- Where is the target?
28Taken or Not Taken
- Compare and branch instruction
- No extra compare instruction
- No state is passed between instructions
- Requires ALU operation
- Condition codes (Z,N,V,C)
- Can be set for free
- Constrains code reordering
- Extra state to save and implement
- Condition in general purpose register
- No special state to save and implement
- Uses up a register
- DSPs: a repeat instruction repeats a loop a specified number of times
29Taken or Not Taken (Cont.)
- Some data for compare-and-branch
- Figure 2.22
30Where is the Target?
- Could use arbitrary specifier
- Powerful
- More bits to specify
- More time to decode
- PC-relative with immediate
- Position independence (helps linking)
- Short immediate sufficient
- HP: most instructions use less than 8 bits (Figure 2.20)
- Target must be known statically
- Can't jump arbitrarily far; other techniques are required for returns and distant jumps
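A PC-relative target computation might look like the sketch below. The details are assumptions in the style of a MIPS-like encoding (a 16-bit signed immediate counted in 4-byte instructions, relative to the instruction after the branch); other ISAs differ.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def branch_target(pc: int, imm_field: int, imm_bits: int = 16,
                  instr_bytes: int = 4) -> int:
    """Target = address of next instruction + signed offset in instructions."""
    offset = sign_extend(imm_field, imm_bits)
    return pc + instr_bytes + offset * instr_bytes


print(hex(branch_target(0x1000, 0x0003)))  # forward branch
print(hex(branch_target(0x1000, 0xFFFF)))  # backward branch (offset -1)
```

Because the target depends only on the distance from the branch, the code can be loaded at any address unchanged. The limited immediate width is also why the reachable range is bounded.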
31Where is the Target (Cont.)
- Register
- Short specification
- Can jump anywhere
- Dynamic target ok
- Extra instruction to load register
- (Vectored) Trap
- Critical for O.S. calls
- Common compromise
- (Conditional) branches: PC-relative with short immediates
- (Unconditional) jumps: PC-relative, register
- Procedure calls: PC-relative, register
- Procedure returns and indirect jumps: register
- O.S. calls: trap
- O.S. returns: register
32Encoding the Instruction Set
- Encoding affects size of program and implementation
- Depends on many aspects of the ISA
- Variable length (opcode only tells number of operands, not the type)
- Minimizes code size
- Hard to pipeline
- Fixed length (opcode tells number of operands and address mode)
- Easy to decode and pipeline
- Increases code size
- Hybrid approach
33Compilers
- Compilers form a GIANT case analysis
- Too many choices make it hard
- Provide orthogonal instruction sets
- Operation
- Addressing mode
- Data type
34Compilers (Cont.)
- One solution or All possible solutions
- 2 Branch Conditions (EQ,LT)
- Or all 6 (EQ,LT,GT,NE,LE,GE)
- Not 3 or 4
- Primitives, NOT solutions
- Semantic clash: "... by giving too much semantic content to the instruction, the machine designer made it possible to use the instruction only in limited contexts."
- "In many of these cases, the high-level instructions are synthesized from more primitive operations which, if the compiler writer could access them, could be recomposed to more closely model the features actually needed."
35Example ISA 1 MIPS64 (Section 2.8)
- RISC architecture
- 64-bit byte addresses (aligned)
- Load/store, only immediate and displacement addressing with 16 bits
- Registers
- 32 64-bit general-purpose registers R0 to R31 (R0 always 0)
- 32 64-bit floating-point registers; can use as single or double precision
- Control
- Conditional branch: = 0 and ≠ 0
- Jump: PC-relative and register
- Others for FP, linking, trap
- Three fixed-length instruction formats: 32 bits
- Special operations
- Paired single: two 32-bit FP values in one 64-bit datum, for graphics
- Multiply-add for DSP
36Example ISA 2 Trimedia CPU64 (Section 2.9)
- Media processor
- For multimedia workloads
- Focused on parallelism
- 128 64-bit registers for integer or floating point
- SIMD instructions, saturation arithmetic
- Very Long Instruction Word (VLIW)
- Multiple independent operations encoded in a single instruction
- Five operations for Trimedia CPU64
- NOPs in instructions if five operations not available
- More on VLIW later
- Compacts instructions in memory, decoded in I-cache
- 25 functional units
37Example ISA 3 - VAX
- CISC architecture
- Introduced by DEC in 1977
- 16 GPRs (r15 is PC, r14 is SP)
- Extremely orthogonal, memory-memory
- Decode as byte stream
- Opcode: operation, number of operands
- Variable-length address specifiers
- Virtually all addressing modes
- Includes complex instructions: CRC, INSQUE
38MIPS vs. VAX
- VAX has too many modes and formats
- Serial semantics can limit parallel interpretation
- The big deal with RISC is not REDUCED numbers of instructions; it is few modes and formats to facilitate pipelining
40CISC vs. RISC
- Why CISC (60s and 70s)
- Assembly programming
- Small memory (dense encoding)
- Microprogrammed control → complex instructions ok
- Why RISC (70s and 80s)
- Advances in compilers
- Large memory, caches
- VLSI (single-chip processor), pipelining, hardwired control → simple instructions
42Outcome of RISC vs. CISC?
- Millions of transistors per chip
- Made sophisticated decoders for CISC possible
- Focus on instruction-level parallelism
- Decoding a single instruction is a small part of hardware and performance
- Caches dominate chip
- Internally, CISC processors decode into RISC-like instructions and execute on a microarchitecture similar to RISCs
- Above factors narrowed CISC vs. RISC gap
- Non-technical issues played a large role
43Recent Developments in Instruction Sets
- Branches important → Predication
- Memory latency important → Speculative loads, prefetching
- Multimedia applications → Multimedia ISA extensions
- Embedded processors → Variable-length instructions