Advanced Computer Architecture Unit 01: Some Basics (Transcript)
1
Advanced Computer Architecture Unit 01: Some Basics
  • Hsin-Chou Chi
  • Dept. of Computer Science and Information
    Engineering
  • National Dong Hwa University

2
Building Hardware that Computes
3
Finite State Machines
  • System state is explicit in representation
  • Transitions between states represented as arrows
    with inputs on arcs.
  • Output may be either part of state or on arcs

(Figure: Mod 3 machine state diagram; input bits arrive MSB first, e.g., 1 1 1 0)
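As an illustration (not from the slides), here is a minimal sketch of the Mod 3 machine in Python: the state is the remainder of the bits seen so far, and each MSB-first input bit drives one transition.

```python
# Minimal sketch of the Mod 3 finite state machine.
# State = value of the bits consumed so far, modulo 3 (so states 0, 1, 2).
# Consuming a bit shifts the value left and adds the bit,
# hence next_state = (2 * state + bit) mod 3.

def mod3_fsm(bits):
    state = 0
    for bit in bits:                  # bits arrive MSB first
        state = (2 * state + bit) % 3
    return state

# The slide's example input: 1 1 1 0 (binary 1110 = 14, and 14 mod 3 = 2)
assert mod3_fsm([1, 1, 1, 0]) == 2
```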
4
Implementation as Combinational Logic + Latch
5
Microprogrammed Controllers
  • State machine in which part of the state is a
    micro-PC
  • Explicit circuitry for incrementing or changing
    the PC
  • Includes a ROM with microinstructions
  • Control logic implements at least branches and
    jumps (see the sketch below)
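A toy sketch of the idea in Python; the microinstruction format and ROM contents here are invented for illustration, not taken from the slides.

```python
# Toy microprogrammed controller: part of the machine state is a micro-PC,
# a ROM holds microinstructions, and explicit logic either increments the
# micro-PC or branches/jumps to a new micro-address.

ROM = [
    # (control_signals, branch_target) -- hypothetical format
    ({"fetch": 1},   None),   # 0: drive fetch signals, fall through
    ({"decode": 1},  None),   # 1: drive decode signals, fall through
    ({"execute": 1},    0),   # 2: drive execute signals, jump back to 0
]

def controller_step(micro_pc):
    signals, target = ROM[micro_pc]
    next_pc = micro_pc + 1 if target is None else target   # increment or jump
    return signals, next_pc

# One pass around the micro-program: 0 -> 1 -> 2 -> back to 0
pc = 0
for _ in range(3):
    signals, pc = controller_step(pc)
assert pc == 0
```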

6
Fundamental Execution Cycle
(Figure: processor (regs, data, functional units) connected to memory; the
processor-memory link is the von Neumann bottleneck)
  • Instruction Fetch: obtain instruction from program storage
  • Instruction Decode: determine required actions and instruction size
  • Operand Fetch: locate and obtain operand data
  • Execute: compute result value or status
  • Result Store: deposit results in storage for later use
  • Next Instruction: determine successor instruction
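To make the cycle concrete, here is a toy fetch-decode-execute loop in Python for a hypothetical accumulator machine; the instruction set, addresses, and data are invented purely for illustration.

```python
# Toy fetch-decode-execute loop for a hypothetical accumulator machine.

memory = {0: ("LOAD", 100), 1: ("ADD", 101), 2: ("STORE", 102), 3: ("HALT", None),
          100: 7, 101: 35, 102: 0}

pc, acc = 0, 0
while True:
    op, addr = memory[pc]          # obtain instruction from program storage
    pc += 1                        # determine successor instruction (default)
    if op == "LOAD":               # determine required actions
        acc = memory[addr]         # locate and obtain operand data
    elif op == "ADD":
        acc = acc + memory[addr]   # compute result value
    elif op == "STORE":
        memory[addr] = acc         # deposit results in storage for later use
    elif op == "HALT":
        break

assert memory[102] == 42           # 7 + 35 was computed and stored
```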
7
What's a Clock Cycle?
(Figure: latch or register feeding combinational logic, feeding the next latch
or register)
  • Old days: 10 levels of gates
  • Today: determined by numerous time-of-flight
    issues + gate delays
  • clock propagation, wire lengths, drivers
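In standard timing terms (not on the slide), the clock period must cover the register's clock-to-Q delay, the worst-case combinational path, the setup time of the receiving register, plus margin for clock skew:

```latex
T_{\mathrm{cycle}} \;\ge\; t_{\mathrm{clk \to Q}} + t_{\mathrm{comb,\,max}} + t_{\mathrm{setup}} + t_{\mathrm{skew}}
```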

8
Pipelined Instruction Execution
(Figure: pipelined execution diagram; instructions in program order overlap
across clock cycles 1-7)
9
Limits to pipelining
  • Maintain the von Neumann illusion of
    one-instruction-at-a-time execution
  • Hazards prevent the next instruction from executing
    during its designated clock cycle (the cost is
    quantified in the relation below)
  • Structural hazards: attempt to use the same
    hardware to do two different things at once
  • Data hazards: instruction depends on the result of a
    prior instruction still in the pipeline
  • Control hazards: caused by the delay between the
    fetching of instructions and decisions about
    changes in control flow (branches and jumps)
  • Power: too many things happening at once → melt
    your chip!
  • Must disable parts of the system that are not
    being used
  • Clock gating, asynchronous design, low voltage
    swings, ...
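The usual way to quantify what hazards cost (standard pipelining relations, not from this slide): every stall cycle adds directly to the CPI, and stalls eat into the ideal speedup of a deep pipeline.

```latex
\mathrm{CPI}_{\mathrm{pipelined}}
  = \mathrm{CPI}_{\mathrm{ideal}} + \text{stall cycles per instruction}
\qquad
\mathrm{Speedup} \;\approx\;
  \frac{\text{pipeline depth}}{1 + \text{stall cycles per instruction}}
```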

10
Progression of ILP
  • 1st generation RISC: pipelined
  • Full 32-bit processor fit on a chip → issue
    almost 1 IPC
  • Need to access memory 1+x times per cycle
  • Floating-point unit on another chip
  • Cache controller on a third chip; cache off-chip
  • 1 board per processor → multiprocessor systems
  • 2nd generation: superscalar
  • Processor and floating-point unit on chip (and
    some cache)
  • Issuing only one instruction per cycle uses at
    most half
  • Fetch multiple instructions, issue a couple
  • Grows from 2 to 4 to 8
  • How to manage dependencies among all these
    instructions?
  • Where does the parallelism come from?
  • VLIW
  • Expose some of the ILP to the compiler, allow it to
    schedule instructions to reduce dependences

11
Modern ILP
  • Dynamically scheduled, out-of-order execution
  • Current microprocessors fetch 10s of instructions
    per cycle
  • Pipelines are 10s of cycles deep → many 10s of
    instructions in execution at once
  • What happens:
  • Grab a bunch of instructions, determine all their
    dependences, eliminate deps wherever possible,
    throw them all into the execution unit, let each
    one move forward as its dependences are resolved
    (sketched in the code below)
  • Appears as if executed sequentially
  • On a trap or interrupt, capture the state of the
    machine between instructions perfectly
  • Huge complexity
  • Complexity of many components scales as n² (n =
    issue width)
  • Power consumption is a big problem
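A toy Python sketch of that flow, with an invented three-instruction example: an instruction issues as soon as its source registers are ready, regardless of program order, and its destination then becomes available to later dependents.

```python
# Toy dependence-driven (out-of-order) issue. Latencies, register renaming,
# and precise-interrupt machinery are all omitted; this only shows
# instructions moving forward as their dependences resolve.

instrs = [
    ("mul", "r1", ("r2", "r3")),   # r1 = r2 * r3
    ("add", "r4", ("r1", "r5")),   # depends on the mul result in r1
    ("sub", "r6", ("r7", "r8")),   # independent: free to go out of order
]

ready = {"r2", "r3", "r5", "r7", "r8"}   # register values already available
pending = list(instrs)
cycle = 0
while pending:
    cycle += 1
    # Select every pending instruction whose sources are all ready.
    issued = [ins for ins in pending if all(src in ready for src in ins[2])]
    for ins in issued:
        op, dest, _srcs = ins
        print(f"cycle {cycle}: issue {op} -> {dest}")
        ready.add(dest)            # result visible to dependents next cycle
        pending.remove(ins)

# Prints: mul and sub issue in cycle 1; add issues in cycle 2.
```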

12
When all else fails - guess
  • Programs make decisions as they go
  • Conditionals, loops, calls
  • Translate into branches and jumps (1 of every 5
    instructions)
  • How do you determine which instructions to fetch
    when the ones before them haven't executed?
  • Branch prediction
  • Lots of clever machine structures to predict the
    future based on history (see the sketch below)
  • Machinery to back out of mispredictions
  • Execute all the possible branches
  • Likely to hit additional branches, perform stores
  • Speculative threads
  • What can hardware do to make programming (with
    performance) easier?
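One of the classic "clever machine structures" is a table of 2-bit saturating counters indexed by the branch PC; this Python sketch (table size and example PC are arbitrary) shows how history drives the prediction.

```python
# Sketch of a classic 2-bit saturating-counter branch predictor
# (a well-known structure; not specific to these slides).
# Each entry counts 0..3: 0-1 predict not-taken, 2-3 predict taken.

TABLE_SIZE = 1024
counters = [1] * TABLE_SIZE                    # start weakly not-taken

def predict(pc):
    return counters[pc % TABLE_SIZE] >= 2      # True = predict taken

def update(pc, taken):
    i = pc % TABLE_SIZE
    if taken:
        counters[i] = min(counters[i] + 1, 3)  # saturate at strongly taken
    else:
        counters[i] = max(counters[i] - 1, 0)  # saturate at strongly not-taken

# Example: a loop branch that is taken many times is quickly learned.
for _ in range(10):
    update(0x400, taken=True)
assert predict(0x400) is True
```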

13
Have we reached the end of ILP?
  • Multiple processors easily fit on a chip
  • Every major microprocessor vendor has gone to
    multithreading
  • Thread: locus of control + execution context
  • Fetch instructions from multiple threads at once,
    throw them all into the execution unit
  • Intel hyperthreading, Sun
  • Concept has existed in high-performance computing
    for 20 years (or is it 40? CDC6600)
  • Vector processing
  • Each instruction processes many distinct data elements
  • e.g., MMX
  • Raise the level of architecture: many processors
    per chip

(Figure: Tensilica configurable processor)
14
The Memory Abstraction
  • Association of <name, value> pairs
  • typically named as byte addresses
  • often values aligned on multiples of size
  • Sequence of Reads and Writes
  • Write binds a value to an address
  • Read of an address returns the most recently written
    value bound to that address

(Interface signals: command (R/W), address (name), data (W), data (R), done)
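A minimal model of this abstraction in Python, ignoring the handshake signals above: memory is just a set of <name, value> bindings where a read returns the most recently written value.

```python
# Minimal model of the memory abstraction: writes bind values to addresses,
# reads return the most recently written value bound to that address.

class Memory:
    def __init__(self):
        self._bindings = {}                 # address (name) -> value

    def write(self, address, value):
        self._bindings[address] = value     # write binds value to address

    def read(self, address):
        return self._bindings[address]      # most recent binding wins

mem = Memory()
mem.write(0x1000, 42)
mem.write(0x1000, 7)                        # a later write overrides the earlier one
assert mem.read(0x1000) == 7
```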
15
Processor-DRAM Memory Gap (latency)
(Figure: performance vs. year, 1980-2000, log scale. CPU (µProc) improves
60%/yr (2X/1.5 yr); DRAM improves 9%/yr (2X/10 yrs); the processor-memory
performance gap grows about 50% per year.)
16
Levels of the Memory Hierarchy
(Levels from upper (smaller, faster) to lower (larger, slower); for each level:
capacity, access time, cost, who manages the staging, and the transfer unit.
Circa 1995 numbers.)
  • CPU Registers: 100s of bytes, << 1s ns; staged by program/compiler;
    transfer unit 1-8 bytes (instruction operands)
  • Cache: 10s-100s of KBytes, ~1 ns, ~$1s/MByte; staged by cache controller;
    transfer unit 8-128 bytes (blocks)
  • Main Memory: MBytes, 100-300 ns, < $1/MByte; staged by the OS;
    transfer unit 512-4K bytes (pages)
  • Disk: 10s of GBytes, 10 ms (10,000,000 ns), $0.001/MByte; staged by
    user/operator; transfer unit MBytes (files)
  • Tape: infinite capacity, sec-min access time, $0.0014/MByte
17
The Principle of Locality
  • The Principle of Locality:
  • Programs access a relatively small portion of the
    address space at any instant of time
  • Two different types of locality:
  • Temporal Locality (locality in time): if an item
    is referenced, it will tend to be referenced
    again soon (e.g., loops, reuse)
  • Spatial Locality (locality in space): if an item
    is referenced, items whose addresses are close by
    tend to be referenced soon (e.g., straight-line
    code, array access)
  • For the last 30 years, HW has relied on locality for speed
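A small illustration (example code, not from the slides) of both kinds of locality in one loop:

```python
# Summing an array exhibits both kinds of locality.

data = list(range(1_000_000))

total = 0
for i in range(len(data)):
    total += data[i]
    # Spatial locality: data[i] and data[i+1] live at adjacent addresses,
    # so fetching one block brings in the next few elements as well.
    # Temporal locality: `total` and `i` are touched on every iteration,
    # so they stay in registers / the top of the memory hierarchy.
```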

(Figure: processor P with a cache ($) next to memory MEM)
18
The Cache Design Space
  • Several interacting dimensions
  • cache size
  • block size
  • associativity
  • replacement policy
  • write-through vs write-back
  • The optimal choice is a compromise
  • depends on access characteristics
  • workload
  • use (I-cache, D-cache, TLB)
  • depends on technology / cost
  • Simplicity often wins

(Figures: the cache design space with axes cache size, associativity, and
block size; and a generic trade-off curve of good vs. bad as factors A and B
vary from less to more)
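To show how two of these dimensions interact, here is a sketch (with arbitrary example parameters) of how cache size and block size determine the tag / index / offset split of an address in a direct-mapped cache (associativity = 1).

```python
# Sketch of how cache size and block size carve an address into
# tag / index / offset fields for a direct-mapped cache (associativity = 1).
# The 16 KB / 64 B parameters are arbitrary examples.

CACHE_SIZE = 16 * 1024                      # bytes
BLOCK_SIZE = 64                             # bytes per block
NUM_SETS   = CACHE_SIZE // BLOCK_SIZE       # 256 sets when direct-mapped

OFFSET_BITS = BLOCK_SIZE.bit_length() - 1   # 6 bits select a byte in the block
INDEX_BITS  = NUM_SETS.bit_length() - 1     # 8 bits select the set

def split(address):
    offset = address & (BLOCK_SIZE - 1)
    index  = (address >> OFFSET_BITS) & (NUM_SETS - 1)
    tag    = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# Two addresses in the same 64-byte block share tag and index,
# which is how block size exploits spatial locality.
assert split(0x12345)[:2] == split(0x12347)[:2]
```

Doubling the associativity at the same cache size halves the number of sets, shifting one bit from the index into the tag; trade-offs like this are what the figure's curves summarize.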
19
Is it all about memory system design?
  • Modern microprocessors are almost all cache

20
Memory Abstraction and Parallelism
  • Maintaining the illusion of sequential access to
    memory across a distributed system
  • What happens when multiple processors access the
    same memory at once?
  • Do they see a consistent picture?
  • Processing and processors embedded in the memory?

21
Is it all about communication?
(Figure: Pentium IV chipset)
22
Breaking the HW/Software Boundary
  • Moore's law (more and more transistors) is all about
    volume and regularity
  • What if you could pour nano-acres of unspecific
    digital logic stuff onto silicon?
  • Do anything with it. Very regular, large volume
  • Field Programmable Gate Arrays
  • Chip is covered with logic blocks w/ FFs, RAM
    blocks, and interconnect
  • All three are programmable by setting
    configuration bits
  • These are huge
  • Can each program have its own instruction set?
  • Do we compile the program entirely into hardware?

23
Bell's Law: new class per decade
(Figure: log(people per computer) vs. year; the newest class streams
information to/from the physical world)
  • Enabled by technological opportunities
  • Smaller, more numerous and more intimately
    connected
  • Brings in a new kind of application
  • Used in many ways not previously imagined

24
It's not just about bigger and faster!
  • Complete computing systems can be tiny and cheap
  • System on a chip
  • Resource efficiency
  • Real estate, power, pins, ...

25
Crossroads: Conventional Wisdom in Comp. Arch
  • Old Conventional Wisdom: Power is free,
    transistors expensive
  • New Conventional Wisdom: "Power wall" - power
    expensive, transistors free (can put more on a chip
    than we can afford to turn on)
  • Old CW: Sufficient increases in Instruction Level
    Parallelism via compilers and innovation
    (out-of-order, speculation, VLIW, ...)
  • New CW: "ILP wall" - law of diminishing returns on
    more HW for ILP
  • Old CW: Multiplies are slow, memory access is
    fast
  • New CW: "Memory wall" - memory slow, multiplies
    fast (200 clock cycles to DRAM, 4 clocks for a
    multiply)
  • Old CW: Uniprocessor performance 2X / 1.5 yrs
  • New CW: Power Wall + ILP Wall + Memory Wall =
    Brick Wall
  • Uniprocessor performance now 2X / 5(?) yrs
  • → Sea change in chip design: multiple cores
    (2X processors per chip / 2 years)
  • More, simpler processors are more power efficient

26
Crossroads: Uniprocessor Performance
(Figure from Hennessy and Patterson, Computer Architecture: A Quantitative
Approach, 4th edition, October 2006)
  • VAX: 25%/year, 1978 to 1986
  • RISC + x86: 52%/year, 1986 to 2002
  • RISC + x86: ??%/year, 2002 to present

27
Sea Change in Chip Design
  • Intel 4004 (1971): 4-bit processor, 2312
    transistors, 0.4 MHz, 10-micron PMOS, 11 mm²
    chip
  • RISC II (1983): 32-bit, 5-stage pipeline, 40,760
    transistors, 3 MHz, 3-micron NMOS, 60 mm² chip
  • 125 mm² chip, 0.065-micron CMOS = 2312 RISC
    II + FPU + Icache + Dcache
  • RISC II shrinks to 0.02 mm² at 65 nm
  • Caches via DRAM or 1-transistor SRAM
    (www.t-ram.com)?
  • Proximity Communication via capacitive coupling
    at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley)
  • Processor is the new transistor?

28
Problems with Sea Change
  • Algorithms, programming languages, compilers,
    operating systems, architectures, libraries, ...
    not ready to supply Thread Level Parallelism or
    Data Level Parallelism for 1000 CPUs / chip
  • Architectures not ready for 1000 CPUs / chip
  • Unlike Instruction Level Parallelism, this cannot be
    solved just by computer architects and compiler
    writers alone, but it also cannot be solved without
    the participation of computer architects

29
Quantifying the Design Process
30
Focus on the Common Case
  • Common sense guides computer design
  • Since it's engineering, common sense is valuable
  • In making a design trade-off, favor the frequent
    case over the infrequent case
  • E.g., the instruction fetch and decode unit is used
    more frequently than the multiplier, so optimize it
    first
  • E.g., if a database server has 50 disks per
    processor, storage dependability dominates system
    dependability, so optimize it first
  • The frequent case is often simpler and can be done
    faster than the infrequent case
  • E.g., overflow is rare when adding 2 numbers, so
    improve performance by optimizing the more common
    case of no overflow
  • This may slow down overflow, but overall performance
    is improved by optimizing for the normal case
  • What is the frequent case, and how much is
    performance improved by making that case faster?
    → Amdahl's Law

31
Processor performance equation
CPU time = Instruction Count × CPI × Cycle Time
                 Inst Count   CPI    Clock Rate
  Program            X
  Compiler           X        (X)
  Inst. Set          X         X
  Organization                 X         X
  Technology                             X
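Written out with units (the standard form of this equation):

```latex
\text{CPU time}
  = \frac{\text{Seconds}}{\text{Program}}
  = \frac{\text{Instructions}}{\text{Program}}
    \times \frac{\text{Clock cycles}}{\text{Instruction}}
    \times \frac{\text{Seconds}}{\text{Clock cycle}}
```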

32
Amdahl's Law
Best you could ever hope to do
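The equation on this slide was an image; the standard statement of Amdahl's Law, with the limit that the "best you could ever hope to do" annotation refers to, is:

```latex
\mathrm{Speedup}_{\mathrm{overall}}
  = \frac{T_{\mathrm{old}}}{T_{\mathrm{new}}}
  = \frac{1}{(1 - \mathrm{Fraction}_{\mathrm{enhanced}})
      + \dfrac{\mathrm{Fraction}_{\mathrm{enhanced}}}{\mathrm{Speedup}_{\mathrm{enhanced}}}}
\qquad
\lim_{\mathrm{Speedup}_{\mathrm{enhanced}} \to \infty}
  \mathrm{Speedup}_{\mathrm{overall}}
  = \frac{1}{1 - \mathrm{Fraction}_{\mathrm{enhanced}}}
```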
33
Amdahl's Law example
  • New CPU is 10X faster
  • I/O-bound server, so 60% of time is spent waiting
    for I/O
  • Apparently, it's human nature to be attracted by
    "10X faster", vs. keeping in perspective that it's
    just 1.6X faster
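Working the numbers behind the 1.6X (only the 40% of time that is not I/O benefits from the 10X-faster CPU):

```latex
\mathrm{Speedup}_{\mathrm{overall}}
  = \frac{1}{(1 - 0.4) + \dfrac{0.4}{10}}
  = \frac{1}{0.6 + 0.04}
  = \frac{1}{0.64}
  \approx 1.56
```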

34
The Process of Design
  • Architecture is an iterative process
  • Searching the space of possible designs
  • At all levels of computer systems

(Figure: the design loop; creativity generates ideas, and cost / performance
analysis sorts them into good, mediocre, and bad ideas)
35
And in conclusion
  • Computer Architecture skill sets are different:
  • Quantitative approach to design
  • Solid interfaces that really work
  • Technology tracking and anticipation
  • Computer Science at the crossroads from
    sequential to parallel computing
  • Salvation requires innovation in many fields,
    including computer architecture