Transcript and Presenter's Notes

Title: CSCI 620


1
  • CSCI 620
  • Computer Architecture
  • Introduction
  • Seung Bae Im
  • Spring 2007
  • Reading Assignments
  • Chapter 1

2
Text & Web pages
  • Text: Computer Architecture: A Quantitative Approach, Fourth Edition, Hennessy & Patterson
  • Web page: www.ecst.csuchico.edu/sim

3
Prerequisite Knowledge (CSCI 320)
  • Assembly language programming
  • Fundamentals of logic design
  • Combinational and sequential logic (e.g., gates, multiplexers, decoders, ALUs, ROMs, flip-flops, registers, RAMs)
  • Processor Design
  • Instruction cycle, pipelining, branch prediction,
    exceptions
  • Memory Hierarchy
  • Caches (direct-mapped, fully-associative, n-way
    set associative), spatial locality, temporal
    locality, virtual memory, translation lookaside
    buffer (TLB)
  • Disk systems
  • Input and Output
  • Polling, interrupts
  • Multiprocessors

4
Course Requirements
  • Homeworks: specifications on the web page
  • 1 Midterm Exam + Final Exam
  • Course project: specifications on the web page
  • (A report plus a group presentation at the end of the semester)

Course Evaluation: Homeworks 15%, Midterm exam 30%, Final exam 30%, Project 25%
5
What is Computer Architecture?
  • Two viewpoints
  • Hardware designer's viewpoint: CPUs, caches, buses, pipelines, memory, ILP, etc.
  • Programmer's viewpoint: instruction set, opcodes, addressing modes, registers, virtual memory, etc. = ISA (Instruction Set Architecture)
  • → The study of architecture covers both

6
Computer Architecture Is ...
  • "The attributes of a computing system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation."
  • Amdahl, Blaauw, and Brooks, 1964

7
Why learn computer architecture?
  • How does the hardware work?
  • Where is the hardware heading?
  • How do advances in hardware affect system software and applications?
  • - Understand OSes / compilers / programming languages
  • - Helps us write better code
  • - Use computers more efficiently
  • Future of computing technology?

8
Computer Architecture's Changing Definition
  • 1950s to 1960s
  • Computer Architecture Course = computer arithmetic and logic design
  • 1970s to 1980s
  • Computer Architecture Course = instruction set design, especially ISAs appropriate for compilers; CISC vs. RISC
  • 1990s and beyond
  • Computer Architecture Course = design of the CPU, memory system, I/O system, and multiprocessors

9
5 Generations of Electronic Computers (60 years)
1st Generation (1945-54), Beginning (vacuum tubes). Technology/architecture: vacuum tubes, relay memories, CPU using PC, accumulator, fixed-point arithmetic. Software/applications: machine/assembly languages, single user, batch processing, no subroutine linkages, programmed I/O (controlled by CPU). Systems: ENIAC, IBM 701.
2nd Generation (1955-64), Mainframes (transistors). Technology/architecture: discrete transistors, core memories, floating-point arithmetic, I/O processors, multiplexed memory access. Software/applications: high-level languages using compilers, subroutines, libraries, batch-processing terminals. Systems: IBM 7030, CDC 1604, Univac LARC (mainframes).
3rd Generation (1965-74), Mainframe, Miniframe (integrated circuits). Technology/architecture: integrated circuits (SSI/MSI), pipelining, cache, lookahead. Software/applications: multiprogramming and time-sharing OS, multiuser applications. Systems: IBM 360/370, CDC 6600, TI-ASC, PDP-8 (mainframe, miniframe).
4th Generation (1975-90), Microprocessors (VLSI). Technology/architecture: LSI/VLSI, semiconductor memory, multiprocessors, vector supercomputers, parallel processors. Software/applications: multiprocessor OS, higher-level languages, advanced compilers, parallel processing, RISC ISA. Systems: VAX/9000, CRAY X/MP, IBM 3090, Intel, Motorola, MIPS.
5th Generation (1991-present), Dominance of microprocessors. Technology/architecture: ULSI/VHSIC, high-density memories, scalable architectures. Software/applications: massively parallel processing, instruction-level parallelism, large/complex applications, ubiquitous computing. Systems: IBM/MPP, CRAY/MPP, TMC/CM-5, Intel Paragon.
10
Moore's Law
  • In 1965, Intel co-founder Gordon Moore saw the
    future. His prediction, now popularly known as
    Moore's Law, states that the number of
    transistors on a chip doubles about every two
    years. This observation about silicon
    integration, made a reality by Intel, the world's
    largest silicon supplier, has fueled the
    worldwide technology revolution.
  • Every 18-24 months
  • Feature sizes shrink by 0.7x
  • Number of transistors per die increases by 2x
  • Speed of transistors increases by 1.4x
  • But we are starting to hit some roadblocks
  • Also, what to do with all of these
    transistors???

11
Moore's Law (transistor density), Intel only
Moore's Law Means More Performance. Processing
power, measured in millions of instructions per
second (MIPS), has steadily risen because of
increased transistor counts.
12
What the transistors have been used for
  • More functionality on one chip
  • Early 1980s: 32-bit microprocessors
  • Late 1980s: on-chip Level 1 caches
  • Early/mid 1990s: 64-bit microprocessors, superscalar (ILP)
  • Late 1990s: on-chip Level 2 caches
  • Early 2000s: chip-level multiprocessors, on-chip Level 3 caches
  • What next?
  • How much cache on a chip? (Itanium 2: up to Level 3 on-chip)
  • How many cores on a chip? (Sun Niagara: 8 CPUs)
  • What else can we put on chips?
  • "The Trend to Many-Core" (from the Intel site): Rattner introduced the "many-core" concept, explaining that Intel researchers and scientists are experimenting with "many tens of cores, potentially even hundreds of cores per die, per single processor die. And those cores will be supporting tens, hundreds, maybe even thousands of simultaneous execution threads."

13
Crossroads: Conventional Wisdom in Computer Architecture
  • Old Conventional Wisdom: Power is free, transistors expensive
  • New Conventional Wisdom: Power wall. Power expensive, transistors free (can put more on a chip than we can afford to turn on)
  • Old CW: Sufficiently increase Instruction Level Parallelism via compilers and innovation (out-of-order, speculation, VLIW, ...)
  • New CW: ILP wall. Law of diminishing returns on more HW for ILP
  • Old CW: Multiplies are slow, memory access is fast
  • New CW: Memory wall. Memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for a multiply)
  • Old CW: Uniprocessor performance 2X / 1.5 yrs
  • New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
  • Uniprocessor performance now 2X / 5(?) yrs
  • → Sea change in chip design: multiple cores (2X processors per chip / 2 years)
  • More, simpler processors are more power efficient

14
Power wall
15
Crossroads: Uniprocessor Performance
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006
  • VAX: 25%/year, 1978 to 1986
  • RISC + x86: 52%/year, 1986 to 2002
  • RISC + x86: ??%/year, 2002 to present (about 20%/year)

16
Performance of Microprocessors (previous page)
  • Up to the late 70s (before the birth of microprocessors), mainframe or miniframe computers dominated. Their performance improved constantly.
  • The improvement was due to technological advances in electronics and innovation in computer architecture; their growth rate was 25-30% per year
  • From the late 70s, microprocessors (which benefited more from IC technologies than main/miniframes) showed a 35% growth rate during the early 80s
  • During the 80s, a new ISA (Instruction Set Architecture), the RISC architecture, evolved, which was possible for two reasons:
  • - Less need for object-code compatibility due to the use of HLLs (fewer assembly language programs)
  • - Emergence of platform-independent OSes (Unix, Linux)
  • The effect of the above was performance growth (of microprocessors) at the rate of 1.52 times/year
  • Results of this rapid growth:
  • - Enhanced capability to users: large, complex programs and data easily accessible
  • - Dominance of microprocessor-based computers across the entire range of computers: workstations, servers, embedded systems, multiprocessor supercomputers

17
Changes in Chip Design
  • Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm2 chip
  • RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm2 chip
  • A 125 mm2 chip in 0.065 micron CMOS = 2312 RISC IIs + FPU + Icache + Dcache
  • RISC II shrinks to 0.02 mm2 at 65 nm
  • Caches via DRAM or 1-transistor SRAM (www.t-ram.com)?
  • Proximity communication via capacitive coupling at > 1 TB/s?
  • Wafer stacking (3D die)
  • Is the processor the new transistor?

18
New Conventional Wisdom
  • 2002 to date: only 20%/year due to the power limit for air-cooled chips, no more room for ILP improvement, and no improvement in memory latency
  • Focus changed from high-performance uniprocessors → multiprocessors on a chip
  • From ILP → TLP (Thread Level Parallelism) + DLP (Data Level Parallelism)

19
Problems with the Change
  • Algorithms, programming languages, compilers, operating systems, architectures, libraries, ... are not ready to supply Thread Level Parallelism or Data Level Parallelism for 1000 CPUs / chip
  • Architectures not ready for 1000 CPUs / chip
  • Unlike Instruction Level Parallelism, this cannot be solved by computer architects and compiler writers alone, but it also cannot be solved without the participation of computer architects
  • The 4th edition of the textbook (Computer Architecture: A Quantitative Approach) explores the shift from Instruction Level Parallelism to Thread Level Parallelism / Data Level Parallelism

20
ILP → TLP + DLP
  • ILP: compilers and hardware exploit ILP, so programmers do not have to worry about it
  • TLP + DLP: require programmers to write parallel programs to exploit the parallelism

21
Instruction Set Architecture: Critical Interface
software
instruction set
hardware
  • Properties of a good abstraction
  • Lasts through many generations (portability)
  • Used in many different ways (generality)
  • Provides convenient functionality to higher
    levels
  • Permits an efficient implementation at lower
    levels

22
Example: MIPS
Programmable storage: 2^32 x bytes; 31 x 32-bit GPRs (R0 = 0); 32 x 32-bit FP regs (paired DP); HI, LO, PC
Data types? Format? Addressing modes?
Arithmetic & logical: Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI, SLL, SRL, SRA, SLLV, SRLV, SRAV
Memory access: LB, LBU, LH, LHU, LW, LWL, LWR, SB, SH, SW, SWL, SWR
Control: J, JAL, JR, JALR, BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
32-bit instructions on word boundary
23
Instruction Set Architecture
  • "... the attributes of a computing system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation."
  • Amdahl, Blaauw, and Brooks, 1964

-- Organization of programmable storage
-- Data types & data structures: encodings & representations
-- Instruction formats
-- Instruction (or operation code) set
-- Modes of addressing and accessing data items and instructions
-- Exceptional conditions: interrupts, faults
24
ISA vs. Computer Architecture
  • Old definition of computer architecture: instruction set design
  • Other aspects of computer design were called implementation
  • Insinuates that implementation is uninteresting or less challenging
  • Our view: Computer architecture >> ISA
  • The architect's job is much more than instruction set design; the technical hurdles today are more challenging than those in instruction set design
  • Since instruction set design is not where the action is, some conclude that computer architecture (using the old definition) is not where the action is
  • We disagree with that conclusion
  • We agree that the ISA is not where the action is (the ISA is in the CAAQA 4/e appendix)

25
Comp. Arch. is an Integrated Approach
  • What really matters is the functioning of the
    complete system
  • hardware, runtime system, compiler, operating
    system, and application
  • In networking, this is called end-to-end communication
  • Computer architecture is not just about
    transistors, individual instructions, or
    particular implementations
  • E.g., the original RISC projects replaced complex instructions with a compiler plus simple instructions

26
Classes of computers and application areas
  • Desktop computing: general-purpose desktop/laptop
  • - Increased productivity, interactive graphics, video, audio
  • - Web-centric, interactive applications demand more processing power
  • - Optimized for price-performance; too much focus on clock rate
  • - Examples: Intel Pentium, AMD Athlon XP
  • Servers: powerful, large-scale file and computing services
  • - The Internet opened the client-server computing paradigm
  • - Web-based services require high computing power
  • - Large databases, transaction processing, search engines
  • - Scientific applications: weather, defense, gene research (IBM BlueGene)
  • - Performance, Availability (24/7), Scalability (growth)
  • - Server downtime costs more than $6M/hour for a brokerage company (Fig 1.2)
  • - Examples: Sun Fire E20K, IBM eServer p5, Google Cluster
  • Embedded computers: computers embedded in other devices
  • - PDAs, cell phones, sensors, in cars → price and energy efficiency are the main concerns
  • - Game machines, network µPs (routers, switches) → price-performance ratio
  • - Examples: ARM926EJ-S, Sony Emotion Engine, IBM 750FX

27
What Computer Architecture Brings to the Table
  • Other fields often borrow ideas from architecture
  • Quantitative Principles of Design
  • Take Advantage of Parallelism
  • Principle of Locality
  • Focus on the Common Case
  • Amdahl's Law
  • The Processor Performance Equation
  • Careful, quantitative comparisons
  • Define, quantify, and summarize relative
    performance
  • Define and quantify relative cost
  • Define and quantify dependability
  • Define and quantify power
  • Culture of anticipating and exploiting advances
    in technology
  • Culture of well-defined interfaces that are
    carefully implemented and thoroughly checked

28
1) Taking Advantage of Parallelism
  • Increasing throughput of a computer via multiple processors or multiple disks
  • Detailed HW design
  • Carry-lookahead adders use parallelism to speed up computing sums from linear to logarithmic in the number of bits per operand
  • Multiple memory banks searched in parallel in set-associative caches
  • Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence (see the sketch after this list)
  • Not every instruction depends on its immediate predecessor → executing instructions completely/partially in parallel is possible
  • Classic 5-stage pipeline: 1) Instruction Fetch (Ifetch), 2) Register Read (Reg), 3) Execute (ALU), 4) Data Memory Access (Dmem), 5) Register Write (Reg)

29
Pipelined Instruction Execution
30
Limits to pipelining
  • Hazards prevent the next instruction from executing during its designated clock cycle
  • Structural hazards: attempt to use the same hardware to do two different things at once
  • Data hazards: instruction depends on the result of a prior instruction still in the pipeline
  • Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

(Diagram: instruction order vs. time in clock cycles)
31
2) The Principle of Locality
  • The Principle of Locality:
  • Programs access a relatively small portion of the address space at any instant of time
  • Two different types of locality:
  • Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  • Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
  • For the last 30 years, HW has relied on locality for memory performance (see the sketch after this list)
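As an illustration (a sketch, not from the slides), the same sum over an N x N array stored row-major shows both kinds of locality: the unit-stride loop has good spatial locality, and the reuse of the accumulator is temporal locality.

    # Locality sketch: a row-major N x N array stored in a flat Python list.
    # Element (i, j) lives at index i*N + j.
    N = 2000
    a = [1.0] * (N * N)

    def sum_unit_stride():
        total = 0.0                      # 'total' reused every iteration: temporal locality
        for i in range(N):
            for j in range(N):
                total += a[i * N + j]    # consecutive addresses: spatial locality
        return total

    def sum_strided():
        total = 0.0
        for j in range(N):
            for i in range(N):
                total += a[i * N + j]    # jumps N elements each step: poor spatial locality
        return total

    print(sum_unit_stride() == sum_strided())   # same answer, different memory behavior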

32
Levels of the Memory Hierarchy
Capacity / access time / cost per level, with the staging transfer unit that moves data between levels:
  • CPU registers: 100s of bytes, 300-500 ps (0.3-0.5 ns); staging unit: instruction operands, 1-8 bytes, managed by the program/compiler
  • L1 and L2 caches: 10s-100s of KBytes, 1 ns - 10 ns, $1000s/GByte; staging unit: blocks, 32-64 bytes (L1) and 64-128 bytes (L2), managed by the cache controller
  • Main memory: GBytes, 80 ns - 200 ns, $100/GByte; staging unit: pages, 4K-8K bytes, managed by the OS
  • Disk: 10s of TBytes, 10 ms (10,000,000 ns), $1/GByte; staging unit: files, MBytes, managed by the user/operator
  • Tape: infinite capacity, sec-min access time, $1/GByte
Upper levels are faster; lower levels are larger.
33
3) Focus on the Common Case
  • Common sense guides computer design
  • Since it's engineering, common sense is valuable
  • In making a design trade-off, favor the frequent case over the infrequent case
  • E.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first
  • E.g., if a database server has 50 disks / processor, storage dependability dominates system dependability, so optimize it first
  • The frequent case is often simpler and can be made faster than the infrequent case
  • E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing the more common case of no overflow
  • May slow down overflow, but overall performance is improved by optimizing for the normal case
  • What is the frequent case, and how much is performance improved by making that case faster? → Amdahl's Law

34
4) Amdahl's Law
Best you could ever hope to do
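The equations on this slide are images in the original; the standard form of Amdahl's Law they refer to is:

Speedup_overall = Time_old / Time_new = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

and the best you could ever hope to do, as Speedup_enhanced grows without bound, is Speedup_max = 1 / (1 - Fraction_enhanced).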
35
Amdahl's Law example
  • New CPU is 10X faster
  • I/O-bound server, so 60% of time is spent waiting for I/O
  • Apparently, it is human nature to be attracted by "10X faster", vs. keeping in perspective that it's just 1.6X faster (worked out below)
36
(No Transcript)
37
Another Example
Suppose we can enhance the performance of a CPU by adding a vectorization mode that can be used under certain conditions and that computes 10x faster than normal mode. What percent of the run time must be in vectorization mode to achieve an overall speedup of 2? (Worked out below.)
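The solution is not in the transcript; solving Amdahl's Law for the fraction F gives:

2 = 1 / ((1 - F) + F/10)  →  (1 - F) + F/10 = 0.5  →  1 - 0.9 F = 0.5  →  F = 0.5 / 0.9 ≈ 0.556

so roughly 56% of the original run time must be spent in vectorization mode.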
38
Metrics of Performance
  • Application: answers per day/month, operations per second
  • Programming language / Compiler: millions of instructions per second (MIPS), millions of FP operations per second (MFLOPS)
  • ISA
  • Datapath / Control: megabytes per second
  • Function units: cycles per second (clock rate)
  • Transistors, wires, pins
(Each level of the system stack has its own natural performance metric.)
39
Marketing Metrics
  • How about machines with different instruction sets?
  • Programs with different instruction mixes?
  • Dynamic frequency of instructions
  • It is not closely related to performance
  • Often not where time is spent; on most programs, there are not many FP instructions executed
  • Therefore → next slide

40
5) Processor performance equation
CPU time = Seconds / Program = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
i.e., CPU time = Instruction count × CPI × Cycle time

Which factors affect which term:
  • Program: instruction count
  • Compiler: instruction count, (CPI)
  • Instruction set: instruction count, CPI
  • Organization: CPI, clock rate
  • Technology: clock rate
41
CPI (Cycles Per Instruction)
  • Average cycles per instruction (for the program in question)
  • CPI = (CPU Time × Clock Rate) / Instruction Count = CPU Clock Cycles / Instruction Count
  • CPU time = Cycle Time × Σ over instruction classes i of (CPI_i × IC_i) (see the sketch below)
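A minimal sketch of the equation in use (the instruction mix and CPI values below are made-up illustrations, not from the slides or the text):

    # CPU time = Cycle time * sum over instruction classes of (CPI_i * IC_i)
    clock_rate = 2e9                       # assume a 2 GHz clock -> 0.5 ns cycle time
    cycle_time = 1.0 / clock_rate

    mix = {                                # hypothetical (instruction count, CPI) per class
        "ALU":    (50e6, 1.0),
        "load":   (20e6, 2.0),
        "store":  (10e6, 2.0),
        "branch": (20e6, 1.5),
    }

    total_cycles = sum(ic * cpi for ic, cpi in mix.values())
    total_insts  = sum(ic for ic, _ in mix.values())

    avg_cpi  = total_cycles / total_insts
    cpu_time = total_cycles * cycle_time
    print(f"Average CPI: {avg_cpi:.2f}, CPU time: {cpu_time * 1e3:.2f} ms")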

42
And in conclusion
  • Computer Architecture >> instruction sets
  • Computer architecture skill sets are different:
  • 5 quantitative principles of design
  • Quantitative approach to design
  • Solid interfaces that really work
  • Technology tracking and anticipation
  • CSCI 620: to learn new skills and transition to research
  • Computer science is at a crossroads from sequential to parallel computing
  • Requires innovation in many fields, including computer architecture

43
CSCI 620 focuses on
Understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of computers in the 21st century.
(Diagram: Computer Architecture sits at the hardware/software boundary, spanning interface design (ISA) and organization, and interacting with parallelism, technology, programming languages, applications, compilers, operating systems, measurement & evaluation, and history.)
44
Moore's Law: 2X transistors / year
  • "Cramming More Components onto Integrated Circuits"
  • Gordon Moore, Electronics, 1965
  • # of transistors per cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)

45
Trends in Technology (1.4)
  • Drill down into 4 technologies:
  • Disks
  • Memory
  • Networks
  • Processors (integrated circuits)
  • Compare 1980 Archaic (Nostalgic) vs. 2000 Modern (Newfangled)
  • Performance milestones in each technology
  • Compare bandwidth vs. latency improvements in performance over time
  • Bandwidth: total amount of work done in a given time
  • E.g., Mbits/second over a network, MBytes/second from disk
  • Latency: elapsed time for a single event from start to finish
  • E.g., one-way network delay in microseconds, average disk access time in milliseconds

46
Disks: Archaic (Nostalgic) vs. Modern (Newfangled)
  • Seagate 373453, 2003
  • 15000 RPM (4X)
  • 73.4 GBytes (2500X)
  • Tracks/inch: 64,000 (80X)
  • Bits/inch: 533,000 (60X)
  • Four 2.5" platters (in 3.5" form factor)
  • Bandwidth: 86 MBytes/sec (140X)
  • Latency: 5.7 ms (8X)
  • Cache: 8 MBytes
  • CDC Wren I, 1983
  • 3600 RPM
  • 0.03 GBytes capacity
  • Tracks/inch: 800
  • Bits/inch: 9,550
  • Three 5.25" platters
  • Bandwidth: 0.6 MBytes/sec
  • Latency: 48.3 ms
  • Cache: none

47
Latency Lags Bandwidth (for last 20 years)
  • Performance milestones
  • 30%/year up to 1990
  • 100%/year from 1996 up to 2004
  • 30%/year since 2004
  • Disk still 50-100 times cheaper than DRAM
  • Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

(latency = simple operation w/o contention; BW = best-case)
48
Memory: Archaic (Nostalgic) vs. Modern (Newfangled)
  • 2000: Double Data Rate Synchronous (clocked) DRAM
  • 256.00 Mbits/chip (4000X)
  • 256,000,000 xtors, 204 mm2
  • 64-bit data bus per DIMM, 66 pins/chip (4X)
  • 1600 MBytes/sec (120X)
  • Latency: 52 ns (4X)
  • Block transfers (page mode)
  • 1980: DRAM (asynchronous)
  • 0.06 Mbits/chip
  • 64,000 xtors, 35 mm2
  • 16-bit data bus per module, 16 pins/chip
  • 13 MBytes/sec
  • Latency: 225 ns
  • (no block transfer)

49
Latency Lags Bandwidth (last 20 years)
  • Performance milestones
  • Capacity increases 40%/year (2 times / 2 years)
  • Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  • Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

(latency = simple operation w/o contention; BW = best-case)
50
LANs: Archaic (Nostalgic) vs. Modern (Newfangled)
  • Ethernet 802.3
  • Year of standard: 1978
  • 10 Mbits/s link speed
  • Latency: 3000 µsec
  • Shared media
  • Coaxial cable
  • Ethernet 802.3ae
  • Year of standard: 2003
  • 10,000 Mbits/s link speed (1000X)
  • Latency: 190 µsec (15X)
  • Switched media
  • Category 5 copper wire

(Diagram: coaxial cable cross-section: plastic covering, braided outer conductor, insulator, copper core)
51
Latency Lags Bandwidth (last 20 years)
  • Performance milestones
  • Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
  • Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  • Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

(latency = simple operation w/o contention; BW = best-case)
52
CPUs: Archaic (Nostalgic) vs. Modern (Newfangled)
  • 2001: Intel Pentium 4
  • 1500 MHz (120X)
  • 4500 MIPS (peak) (2250X)
  • Latency: 15 ns (20X)
  • 42,000,000 xtors, 217 mm2
  • 64-bit data bus, 423 pins
  • 3-way superscalar, dynamic translation to RISC, superpipelined (22 stages), out-of-order execution
  • On-chip 8KB data cache, 96KB instruction trace cache, 256KB L2 cache
  • 1982: Intel 80286
  • 12.5 MHz
  • 2 MIPS (peak)
  • Latency: 320 ns
  • 134,000 xtors, 47 mm2
  • 16-bit data bus, 68 pins
  • Microcode interpreter, separate FPU chip
  • (no caches)

53
Latency Lags Bandwidth (last 20 years)
  • Performance milestones
  • Processor: 286, 386, 486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
  • Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16x, 1000x)
  • Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
  • Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)

54
Rule of Thumb for Latency Lagging BW
  • In the time that bandwidth doubles, latency
    improves by no more than a factor of 1.2 to 1.4
  • (and capacity improves faster than bandwidth)
  • Stated alternatively: bandwidth improves by more than the square of the improvement in latency

55
6 Reasons Latency Lags Bandwidth
  • 1. Moore's Law helps BW more than latency
  • Faster transistors, more transistors, and more pins help bandwidth
  • MPU transistors: 0.130 M vs. 42 M xtors (300X)
  • DRAM transistors: 0.064 M vs. 256 M xtors (4000X)
  • MPU pins: 68 vs. 423 pins (6X)
  • DRAM pins: 16 vs. 66 pins (4X)
  • Smaller, faster transistors, but communicating over (relatively) longer lines limits latency
  • Feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X)
  • MPU die size: 35 vs. 204 mm2 (sqrt of ratio ≈ 2X)
  • DRAM die size: 47 vs. 217 mm2 (sqrt of ratio ≈ 2X)

56
6 Reasons Latency Lags Bandwidth (cont'd)
  • 2. Distance limits latency
  • Size of the DRAM block → long bit and word lines → most of DRAM access time
  • 3. Bandwidth is easier to sell (bigger = better)
  • E.g., 10 Gbits/s Ethernet ("10 Gig") vs. 10 µsec latency Ethernet
  • 4400 MB/s DIMM (PC4400) vs. 50 ns latency
  • Even if just marketing, customers are now trained
  • Since bandwidth sells, more resources are thrown at bandwidth, which further tips the balance

57
6 Reasons Latency Lags Bandwidth (cont'd)
  • 4. Latency helps BW, but not vice versa
  • Spinning the disk faster improves both bandwidth and rotational latency
  • 3600 RPM → 15000 RPM = 4.2X
  • Average rotational latency: 8.3 ms → 2.0 ms
  • Other things being equal, this also helps BW by 4.2X
  • Lower DRAM latency → more accesses/second (higher bandwidth)
  • Higher linear density helps disk BW (and capacity), but not disk latency
  • 9,550 BPI → 533,000 BPI → about 60X in BW

58
6 Reasons Latency Lags Bandwidth (cont'd)
  • 5. Bandwidth hurts latency
  • Adding chips to widen a memory module increases bandwidth, but higher fan-out on address lines may increase latency
  • 6. Operating system overhead hurts latency more than bandwidth
  • Long messages lessen the effect of overhead; overhead is a bigger part of short messages

59
Summary of Technology Trends
  • For disks, LANs, memory, and microprocessors, bandwidth improves by roughly the square of the latency improvement
  • In the time that bandwidth doubles, latency
    improves by no more than 1.2X to 1.4X
  • Lag probably even larger in real systems, as bandwidth gains are multiplied by replicated components
  • Multiple processors in a cluster or even in a
    chip
  • Multiple disks in a disk array
  • Multiple memory modules in a large memory
  • Simultaneous communication in switched LAN
  • HW and SW developers should innovate assuming
    Latency Lags Bandwidth
  • If everything improves at the same rate, then
    nothing really changes
  • When rates vary, real innovation is required

60
Scaling in ICs
  • Feature size: minimum size of a single transistor or a wire on a chip die
  • 1971: 10 microns
  • 2001: 0.18 microns
  • 2003: 0.06 microns
  • 2006: 5 nanometers (0.005 microns), a 2000:1 ratio from 1971
  • Complex relationships between performance and feature size
  • Transistor density increases quadratically with a decrease in feature size
  • Reduction in feature size requires voltage reduction to maintain correct operation and reasonable reliability
  • Scaling of IC wiring:
  • Signal delay increases with the product of resistance and capacitance
  • Shorter wires are not necessarily faster, due to increased resistance and capacitance

61
Power Consumption of ICs (1.5)
  • Energy requirements per transistor are proportional to the load capacitance, the frequency of switching, and the square of the voltage (see the formula below)
  • Switching frequency and transistor density increase faster than capacitance and voltage decrease, leading to increased power consumption and generated heat
  • http://www.phys.ncku.edu.tw/htsu/humor/fry_egg.html
  • A Pentium 4 consumes 100 Watts of power, while the 8086 through i386 did not even feature a heat sink
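The relation described in words above is the standard CMOS dynamic power equation (not spelled out in this transcript; slide 82 cites the same "Capacitance × Voltage² × frequency" form):

Power_dynamic ∝ Capacitive load × Voltage² × Frequency switched, and Energy_dynamic per 0→1→0 transition ∝ Capacitive load × Voltage².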

62
Cost and Price
  • Cost of manufacturing decreases over time: the learning curve
  • The learning curve is measured as an increase in yield
  • Volume doubling leads to a 10% reduction in cost
  • Commodity products tend to decrease in cost because of:
  • Volume
  • Competition
  • Efficiency

63
Cost of Pentium--Figure 1.10 of text
64
IC Manufacturing Process
(Process flow: silicon cylinder (ingot) → blank silicon wafers → patterned silicon wafers → wafer test → unpackaged dies → packaged dies → final test)
65
Wafers and Dies
  • Chips are produced on round silicon disks (wafers)
  • Dies are the actual chips, cut out from the wafer
  • Testing occurs before cutting and after packaging

66
Yield and Cost of chips
  • However:
  • Wafers do not contain only chip dies; usually a large area, including several die sites, is dedicated to test-equipment hook-up
  • Actual yield in mass-production chip fabs varies from 98% for DRAMs down to 1% for new processors

67
Yield and Cost
  • Switch from 200mm to 300mm wafers
  • Although 300mm wafers have lower yield than 200mm wafers, the overhead processing costs per wafer are high enough to make 300mm wafers more cost-effective
  • The next candidate size is 450mm (2012?)
  • Redundancy in dies
  • Single transistors do fail during production, causing memory cells, pipeline stages, or control-logic sections to fail
  • Redundancy is built into each die by introducing backup units
  • After testing, backup units are enabled and failed units can be disabled by laser
  • This decreases the chance of small flaws failing an entire die
  • No company has yet released its redundant-circuitry numbers

68
Difference between Cost and Price of IC
69
Measuring Performance(1.5)
  • Key measure is time
  • Response time (execution time): time between the start and completion of a task. Includes everything: CPU, I/O, timesharing, etc.
  • - CPU time: user CPU time, system CPU time
  • Throughput: total amount of work completed in a given time
  • System performance: for an unloaded system
  • CPU performance for an unloaded system: the main focus in this chapter
  • The formula for user (program) CPU time is:
70
Comparing Processors (measured by running programs)
X is n times faster than Y means
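The defining equation is an image in the original; the usual form is:

n = Execution time_Y / Execution time_X = Performance_X / Performance_Y.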
71
Performance: What to measure (in decreasing order of accuracy)
  • Real programs (what customers care about), e.g., compilers, databases, ...
  • - Portability problem: dependent upon the OS and other hardware
  • Modified or scripted real programs (modified for portability), e.g., compression algorithms
  • Kernels: small, key pieces from real programs, e.g., Livermore Loops, Linpack. Better for testing individual features
  • Toy benchmarks: typically 10 to 100 lines of code, useful primarily for intro programming assignments; small, run on any computer; e.g., quicksort, prime numbers, encryption
  • Synthetic benchmarks: try to match the average frequency of operations and operands of a set of programs; not realistic programs; e.g., Whetstone, Dhrystone

72
Benchmark Suites
  • Collections of benchmark applications, called benchmark suites, are popular
  • SPECCPU: popular desktop benchmark suite
  • CPU only, split between integer and floating-point programs
  • SPECint2006 has 12 integer programs, SPECfp2006 has 17 floating-point programs
  • SPECCPU2006 V1.0 released August 2006
  • SPECSFS (NFS file server) and SPECWeb (web server) added as server benchmarks; for other benchmarks, see here
  • Transaction Processing Council measures server performance and cost-performance for databases
  • TPC-C: complex queries for online transaction processing
  • TPC-H: models ad hoc decision support
  • TPC-W: a transactional web benchmark
  • TPC-App: application server and web services benchmark

73
SPECint2006 and SPECfp2006 benchmark programs (Figure 1.13)
74
How to Summarize Suite Performance (1/5)
  • Arithmetic average of the execution times of all programs?
  • But they vary by 4X in speed, so some would be more important than others in an arithmetic average
  • Could add a weight per program, but how to pick the weights?
  • Different companies want different weights for their products
  • SPECRatio: normalize execution times to a reference computer, yielding a ratio proportional to performance:
  • SPECRatio = time on the reference computer / time on the computer being rated
  • In CPU2006, the Sun Microsystems Ultra Enterprise 2 is used as the reference computer

75
How to Summarize Suite Performance (2/5)
  • If a program's SPECRatio (in CPU2006 it is called "Base") on Computer A is 1.25 times bigger than on Computer B, then:
  • Note that when comparing 2 computers as a ratio, the execution times on the reference computer drop out, so the choice of reference computer is irrelevant
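The equation referred to above is an image in the original; the standard cancellation is:

1.25 = SPECRatio_A / SPECRatio_B = (ExecTime_ref / ExecTime_A) / (ExecTime_ref / ExecTime_B) = ExecTime_B / ExecTime_A = Performance_A / Performance_B.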

76
How to Summarize Suite Performance (3/5)
  • Since these are ratios, the proper mean is the geometric mean (SPECRatio is unitless, so the arithmetic mean is meaningless)
  • The geometric mean of the ratios is the same as the ratio of the geometric means
  • Ratio of geometric means = geometric mean of performance ratios → the choice of reference computer is irrelevant!
  • These two points make the geometric mean of ratios attractive for summarizing performance
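The formula, an image in the original, is the usual one:

Geometric mean = (SPECRatio_1 × SPECRatio_2 × ... × SPECRatio_n)^(1/n).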

77
How to Summarize Suite Performance (4/5)
  • Does a single mean summarize the performance of the programs in a benchmark suite well?
  • Can decide if the mean is a good predictor by characterizing the variability of the distribution using the standard deviation
  • Like the geometric mean, the geometric standard deviation is multiplicative rather than arithmetic
  • Can simply take the logarithms of the SPECRatios, compute the standard mean and standard deviation, and then take the exponent to convert back (see the sketch after this list)
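A minimal sketch of that procedure in Python (the SPECRatio values are made up for illustration):

    import math

    spec_ratios = [1500, 1200, 2900, 800, 2100, 1750]      # hypothetical SPECRatios

    logs = [math.log(r) for r in spec_ratios]
    mean_log = sum(logs) / len(logs)
    std_log = math.sqrt(sum((x - mean_log) ** 2 for x in logs) / len(logs))

    geometric_mean = math.exp(mean_log)    # GM = exp(mean of ln(ratios))
    mult_stdev = math.exp(std_log)         # multiplicative (geometric) standard deviation

    print(f"Geometric mean: {geometric_mean:.0f}")
    print(f"Multiplicative std dev: {mult_stdev:.2f}")
    # For a lognormal distribution, about 68% of the ratios fall in
    # [geometric_mean / mult_stdev, geometric_mean * mult_stdev].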

78
How to Summarize Suite Performance (5/5)
  • Standard deviation is more informative if we know the distribution has a standard form
  • bell-shaped normal distribution, whose data are symmetric around the mean
  • lognormal distribution, where the logarithms of the data (not the data itself) are normally distributed (symmetric) on a logarithmic scale
  • For a lognormal distribution, we expect that
  • 68% of samples fall in the range [GM / gstdev, GM × gstdev]
  • 95% of samples fall in the range [GM / gstdev², GM × gstdev²]
  • Note: Excel provides the functions EXP(), LN(), and STDEV() that make calculating the geometric mean and the multiplicative standard deviation easy

79
Example Standard Deviation (1/2)
  • GM and multiplicative StDev of SPECfp2000 for
    Itanium 2

80
Example Standard Deviation (2/2)
  • GM and multiplicative StDev of SPECfp2000 for AMD
    Athlon

81
Comments on Itanium 2 and Athlon
  • The standard deviation of 1.98 for Itanium 2 is much higher (vs. 1.40 for the Athlon), so results will differ more widely from the mean, and are therefore likely less predictable
  • Falling within one standard deviation:
  • 10 of 14 benchmarks (71%) for Itanium 2
  • 11 of 14 benchmarks (78%) for Athlon
  • Thus, the results are quite compatible with a lognormal distribution (expect 68%)

82
And in conclusion
  • Tracking and extrapolating technology is part of the architect's responsibility
  • Expect bandwidth in disks, DRAM, networks, and processors to improve by at least as much as the square of the improvement in latency
  • Quantify dynamic and static power
  • Capacitance × Voltage² × frequency; energy vs. power
  • Quantify dependability
  • Reliability (MTTF, FIT), Availability (99.9%)
  • Quantify and summarize performance
  • Ratios, geometric mean, multiplicative standard deviation
  • Read Appendix A, record bugs online!