Title: Advanced Computer Architecture Unit 01: Some Basics
1. Advanced Computer Architecture Unit 01: Some Basics
- Hsin-Chou Chi
- Dept. of Computer Science and Information Engineering, National Dong Hwa University
2. Building Hardware that Computes
3. Finite State Machines
- System state is explicit in the representation
- Transitions between states are represented as arrows, with inputs on the arcs
- Output may be either part of the state or on the arcs
[Figure: "Mod 3" machine, a 3-state FSM that tracks the input value mod 3; example input bits, MSB first: 1 1 1 0]
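A minimal software sketch of the same machine (illustrative only; the state encoding mirrors the three possible remainders, and the example input 1 1 1 0 is the one from the figure):

```c
#include <stdio.h>

/* Sketch of the mod-3 FSM: the state is the running remainder (0, 1, or 2).
 * Feeding bits MSB first, the new remainder is (2*old + bit) mod 3,
 * so the whole machine is a 3-state transition table. */
static const int next_state[3][2] = {
    /* bit=0  bit=1 */
    {  0,     1  },   /* state 0 */
    {  2,     0  },   /* state 1 */
    {  1,     2  },   /* state 2 */
};

int mod3(const char *bits) {
    int state = 0;
    for (; *bits; bits++)
        state = next_state[state][*bits - '0'];
    return state;                    /* value of the input, mod 3 */
}

int main(void) {
    printf("%d\n", mod3("1110"));    /* 14 mod 3 = 2 */
    return 0;
}
```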
4. Implementation as Comb. Logic + Latch
5. Microprogrammed Controllers
- State machine in which part of the state is a micro-PC
- Explicit circuitry for incrementing or changing the PC
- Includes a ROM with microinstructions
- Control logic implements at least branches and jumps
6. Fundamental Execution Cycle
- Obtain instruction from program storage
- Determine required actions and instruction size
- Locate and obtain operand data
- Compute result value or status
- Deposit results in storage for later use
- Determine successor instruction
[Figure: processor (regs, F.U.s) exchanging instructions and data with memory; the processor-memory path is the von Neumann bottleneck]
7. What's a Clock Cycle?
[Figure: latch or register followed by combinational logic]
- Old days: 10 levels of gates
- Today: determined by numerous time-of-flight issues and gate delays - clock propagation, wire lengths, drivers
8. Pipelined Instruction Execution
[Figure: pipeline diagram - time in clock cycles (Cycle 1 through Cycle 7) on the horizontal axis, instruction order on the vertical axis, with successive instructions overlapping in execution]
9. Limits to pipelining
- Maintain the von Neumann illusion of one-instruction-at-a-time execution
- Hazards prevent the next instruction from executing during its designated clock cycle (a minimal illustration of each kind follows this list)
- Structural hazards: attempt to use the same hardware to do two different things at once
- Data hazards: instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
- Power: too many things happening at once → melt your chip!
- Must disable parts of the system that are not being used
- Clock gating, asynchronous design, low voltage swings, ...
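A minimal illustration of the three hazard kinds, written as ordinary C with the corresponding pipeline concern noted in comments (illustrative code, not from the slides):

```c
/* Illustrative only: each statement pair maps to a pipeline hazard. */
long r1, r2, r3;

void hazards(long *mem, int cond) {
    r1 = mem[0] + r2;   /* data hazard: the next statement needs r1 before  */
    r3 = r1 * 2;        /*   the pipeline has written it back (RAW dep.)    */

    if (cond)           /* control hazard: which instructions to fetch next */
        r3 = r3 + 1;    /*   is unknown until the branch is resolved        */

    /* structural hazard: with a single memory port, fetching the next
     * instruction and loading mem[0] above would contend for it. */
}
```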
10. Progression of ILP
- 1st generation RISC - pipelined
- Full 32-bit processor fit on a chip → issue almost 1 IPC
- Need to access memory 1+x times per cycle
- Floating-point unit on another chip
- Cache controller a third, off-chip cache
- 1 board per processor → multiprocessor systems
- 2nd generation - superscalar
- Processor and floating-point unit on chip (and some cache)
- Issuing only one instruction per cycle uses at most half
- Fetch multiple instructions, issue a couple
- Grows from 2 to 4 to 8
- How to manage dependencies among all these instructions?
- Where does the parallelism come from?
- VLIW
- Expose some of the ILP to the compiler, allow it to schedule instructions to reduce dependences
11. Modern ILP
- Dynamically scheduled, out-of-order execution
- Current microprocessors fetch 10s of instructions per cycle
- Pipelines are 10s of cycles deep → many 10s of instructions in execution at once
- What happens:
- Grab a bunch of instructions, determine all their dependences, eliminate deps wherever possible, throw them all into the execution unit, let each one move forward as its dependences are resolved
- Appears as if executed sequentially
- On a trap or interrupt, capture the state of the machine between instructions perfectly
- Huge complexity
- Complexity of many components scales as n² (issue width)
- Power consumption is a big problem
12. When all else fails - guess
- Programs make decisions as they go
- Conditionals, loops, calls
- Translate into branches and jumps (1 of 5 instructions)
- How do you determine what instructions to fetch when the ones before it haven't executed?
- Branch prediction
- Lots of clever machine structures to predict the future based on history (a sketch of the simplest follows this list)
- Machinery to back out of mis-predictions
- Execute all the possible branches
- Likely to hit additional branches, perform stores
- Speculative threads
- What can hardware do to make programming (with performance) easier?
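One of the simplest of those "clever machine structures" is a table of 2-bit saturating counters indexed by branch address. A sketch, with illustrative table size and indexing (not any particular machine's design):

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of a 2-bit saturating-counter branch predictor.
 * Counter values: 0,1 = predict not-taken; 2,3 = predict taken. */
#define PRED_ENTRIES 4096
static uint8_t counters[PRED_ENTRIES];       /* all start at 0: weakly not-taken */

static unsigned index_of(uint64_t pc) {
    return (pc >> 2) & (PRED_ENTRIES - 1);   /* hash: low bits of the branch PC */
}

bool predict(uint64_t pc) {
    return counters[index_of(pc)] >= 2;      /* high bit of the counter */
}

void train(uint64_t pc, bool taken) {
    uint8_t *c = &counters[index_of(pc)];
    if (taken  && *c < 3) (*c)++;            /* saturate at 3 (strongly taken)     */
    if (!taken && *c > 0) (*c)--;            /* saturate at 0 (strongly not-taken) */
}
```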
13. Have we reached the end of ILP?
- Multiple processors easily fit on a chip
- Every major microprocessor vendor has gone to multithreading
- Thread: loci of control, execution context
- Fetch instructions from multiple threads at once, throw them all into the execution unit
- Intel hyperthreading, Sun ...
- Concept has existed in high-performance computing for 20 years (or is it 40? CDC 6600)
- Vector processing
- Each instruction processes many distinct data
- Ex: MMX
- Raise the level of architecture: many processors per chip
[Figure: Tensilica configurable processor]
14. The Memory Abstraction
- Association of <name, value> pairs
- typically named as byte addresses
- often values aligned on multiples of size
- Sequence of Reads and Writes
- Write binds a value to an address
- Read of an address returns the most recently written value bound to that address
- Interface signals: command (R/W), address (name), data (W), data (R), done
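The abstraction can be captured by a toy reference model (illustrative only, not any real memory system): a write binds a value to an address, and a read returns the value most recently bound to it.

```c
#include <stdint.h>

/* Toy reference model of the memory abstraction: a flat array of bytes,
 * where mem_write() binds a value to an address and mem_read() returns the
 * value most recently bound to that address. */
#define MEM_SIZE (1u << 16)
static uint8_t mem[MEM_SIZE];

void    mem_write(uint32_t addr, uint8_t value) { mem[addr % MEM_SIZE] = value; }
uint8_t mem_read(uint32_t addr)                 { return mem[addr % MEM_SIZE]; }
```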
15. Processor-DRAM Memory Gap (latency)
[Figure: relative performance (log scale, 1 to 1000) vs. time, 1980-2000. CPU (µProc) performance improves ~60%/yr. (2X/1.5 yr); DRAM latency improves ~9%/yr. (2X/10 yrs); the processor-memory performance gap grows ~50%/year]
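The "grows ~50%/year" figure in the plot follows from the two growth rates (a one-line check):

$$\frac{1.60}{1.09} \approx 1.47 \;\Rightarrow\; \text{the processor-DRAM gap widens by roughly } 50\% \text{ per year.}$$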
16. Levels of the Memory Hierarchy (circa 1995 numbers)

| Level         | Capacity        | Access time           | Cost          | Staging / transfer unit                      |
|---------------|-----------------|-----------------------|---------------|----------------------------------------------|
| CPU Registers | 100s bytes      | << 1 ns               |               | instr. operands, 1-8 bytes (prog./compiler)  |
| Cache         | 10s-100s KBytes | ~1 ns                 | $1s/MByte     | blocks, 8-128 bytes (cache cntl)             |
| Main Memory   | MBytes          | 100-300 ns            | < $1/MByte    | pages, 512-4K bytes (OS)                     |
| Disk          | 10s GBytes      | 10 ms (10,000,000 ns) | $0.001/MByte  | files, MBytes (user/operator)                |
| Tape          | infinite        | sec-min               | $0.0014/MByte |                                              |

Upper levels are faster; lower levels are larger.
17. The Principle of Locality
- The Principle of Locality:
- Programs access a relatively small portion of the address space at any instant of time
- Two different types of locality:
- Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
- Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
- For the last 30 years, HW has relied on locality for speed
[Figure: processor P connected to memory MEM]
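A sketch in ordinary C of where both kinds of locality come from (illustrative code, not from the slides):

```c
/* Illustration of the two kinds of locality. */
double sum_array(const double *a, int n) {
    double sum = 0.0;        /* sum and i are touched every iteration -> temporal locality */
    for (int i = 0; i < n; i++)
        sum += a[i];         /* a[i], a[i+1], ... are adjacent -> spatial locality;        */
    return sum;              /* the loop body itself is straight-line code fetched         */
}                            /* repeatedly from the I-cache -> temporal locality           */
```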
18. The Cache Design Space
- Several interacting dimensions
- cache size
- block size
- associativity
- replacement policy
- write-through vs write-back
- The optimal choice is a compromise
- depends on access characteristics
- workload
- use (I-cache, D-cache, TLB)
- depends on technology / cost
- Simplicity often wins
[Figures: the design space spanned by cache size, associativity, and block size; and a generic trade-off curve of good/bad vs. less/more of a design factor, with different shapes for Factor A and Factor B]
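How the first three dimensions interact can be made concrete with the usual address breakdown into tag / index / offset; the sizes below are illustrative assumptions, not values from the slide:

```c
#include <stdio.h>

/* Sketch: how cache size, block size, and associativity fix the address split. */
static int log2i(unsigned x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void) {
    unsigned cache_bytes = 32 * 1024;   /* 32 KB cache (illustrative)      */
    unsigned block_bytes = 64;          /* 64-byte blocks                  */
    unsigned ways        = 4;           /* 4-way set associative           */

    unsigned sets   = cache_bytes / (block_bytes * ways);
    int offset_bits = log2i(block_bytes);
    int index_bits  = log2i(sets);
    int tag_bits    = 32 - index_bits - offset_bits;   /* assuming 32-bit addresses */

    printf("sets=%u offset=%d index=%d tag=%d\n", sets, offset_bits, index_bits, tag_bits);
    return 0;   /* prints: sets=128 offset=6 index=7 tag=19 */
}
```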
19. Is it all about memory system design?
- Modern microprocessors are almost all cache
20. Memory Abstraction and Parallelism
- Maintaining the illusion of sequential access to memory across a distributed system
- What happens when multiple processors access the same memory at once?
- Do they see a consistent picture?
- Processing and processors embedded in the memory?
21. Is it all about communication?
[Figure: Pentium IV chipset]
22. Breaking the HW/Software Boundary
- Moore's law (more and more transistors) is all about volume and regularity
- What if you could pour nano-acres of unspecific digital logic stuff onto silicon?
- Do anything with it. Very regular, large volume
- Field Programmable Gate Arrays
- Chip is covered with logic blocks with FFs, RAM blocks, and interconnect
- All three are programmable by setting configuration bits
- These are huge
- Can each program have its own instruction set?
- Do we compile the program entirely into hardware?
23. Bell's Law - new class per decade
[Figure: log(people per computer) vs. year; newest class: streaming information to/from the physical world]
- Enabled by technological opportunities
- Smaller, more numerous, and more intimately connected
- Brings in a new kind of application
- Used in many ways not previously imagined
24. It's not just about bigger and faster!
- Complete computing systems can be tiny and cheap
- System on a chip
- Resource efficiency
- Real estate, power, pins, ...
25. Crossroads: Conventional Wisdom in Comp. Arch
- Old Conventional Wisdom: Power is free, transistors expensive
- New Conventional Wisdom: "Power wall" - power expensive, transistors free (can put more on chip than can afford to turn on)
- Old CW: Sufficiently increasing Instruction Level Parallelism via compilers, innovation (out-of-order, speculation, VLIW, ...)
- New CW: "ILP wall" - law of diminishing returns on more HW for ILP
- Old CW: Multiplies are slow, memory access is fast
- New CW: "Memory wall" - memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for multiply)
- Old CW: Uniprocessor performance 2X / 1.5 yrs
- New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall - uniprocessor performance now 2X / 5(?) yrs
- → Sea change in chip design: multiple "cores" (2X processors per chip / 2 years)
- More, simpler processors are more power efficient
26. Crossroads: Uniprocessor Performance
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: ??%/year, 2002 to present
27. Sea Change in Chip Design
- Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm² chip
- RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm² chip
- A 125 mm² chip in 0.065 micron CMOS = 2312 RISC II + FPU + Icache + Dcache
- RISC II shrinks to 0.02 mm² at 65 nm
- Caches via DRAM or 1-transistor SRAM (www.t-ram.com)?
- Proximity Communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley)
- Processor is the new transistor?
28. Problems with Sea Change
- Algorithms, programming languages, compilers, operating systems, architectures, libraries, ... not ready to supply Thread Level Parallelism or Data Level Parallelism for 1000 CPUs / chip
- Architectures not ready for 1000 CPUs / chip
- Unlike Instruction Level Parallelism, cannot be solved just by computer architects and compiler writers alone, but also cannot be solved without participation of computer architects
29. Quantifying the Design Process
30. Focus on the Common Case
- Common sense guides computer design
- Since it's engineering, common sense is valuable
- In making a design trade-off, favor the frequent case over the infrequent case
- E.g., instruction fetch and decode unit is used more frequently than the multiplier, so optimize it 1st
- E.g., if a database server has 50 disks / processor, storage dependability dominates system dependability, so optimize it 1st
- Frequent case is often simpler and can be done faster than the infrequent case
- E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing the more common case of no overflow
- May slow down overflow, but overall performance is improved by optimizing for the normal case
- What is the frequent case, and how much is performance improved by making that case faster? → Amdahl's Law
31. Processor performance equation

$$\text{CPU time} = \text{Inst Count} \times \text{CPI} \times \text{Cycle time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

What each factor is affected by:

|              | Inst Count | CPI | Clock Rate |
|--------------|------------|-----|------------|
| Program      | X          |     |            |
| Compiler     | X          | (X) |            |
| Inst. Set    | X          | X   |            |
| Organization |            | X   | X          |
| Technology   |            |     | X          |
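A worked instance of the equation (the numbers here are illustrative, not from the slides):

$$\text{CPU time} = 10^{9}\ \text{instructions} \times 1.5\ \tfrac{\text{cycles}}{\text{instruction}} \times \tfrac{1}{2\ \text{GHz}} = \frac{1.5 \times 10^{9}\ \text{cycles}}{2 \times 10^{9}\ \text{cycles/s}} = 0.75\ \text{s}$$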
32. Amdahl's Law
Best you could ever hope to do
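Stated as a formula (standard form of Amdahl's Law), with the "best you could ever hope to do" being the limit where the enhanced part becomes infinitely fast:

$$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}, \qquad \text{Speedup}_{\max} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}}$$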
33. Amdahl's Law example
- New CPU 10X faster
- I/O-bound server, so 60% of time waiting for I/O
- Apparently, it's human nature to be attracted by 10X faster, vs. keeping in perspective that it's just 1.6X faster
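Plugging into Amdahl's Law, with 60% of time unaffected (I/O) and the remaining 40% sped up 10X:

$$\text{Speedup} = \frac{1}{0.6 + \dfrac{0.4}{10}} = \frac{1}{0.64} \approx 1.56 \approx 1.6\times$$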
34. The Process of Design
- Architecture is an iterative process
- Searching the space of possible designs
- At all levels of computer systems
[Figure: iterative design loop - creativity generates ideas, cost / performance analysis sorts them into good, mediocre, and bad ideas]
35. And in conclusion ...
- Computer Architecture skill sets are different
- Quantitative approach to design
- Solid interfaces that really work
- Technology tracking and anticipation
- Computer Science is at the crossroads from sequential to parallel computing
- Salvation requires innovation in many fields, including computer architecture