Title: CSCI 620
1 - CSCI 620
- Computer Architecture
- Introduction
- Seung Bae Im
- Spring 2007
- Reading Assignments
- Chapter 1
2 - Text & Web Page
- Text: Computer Architecture: A Quantitative Approach, Fourth Edition, Hennessy & Patterson
- Web Page: www.ecst.csuchico.edu/~sim
3 - Prerequisite Knowledge (CSCI 320)
- Assembly language programming
- Fundamentals of logic design
- Combinational and sequential logic (e.g., gates, multiplexers, decoders, ALU, ROMs, flip-flops, registers, RAMs)
- Processor Design
- Instruction cycle, pipelining, branch prediction, exceptions
- Memory Hierarchy
- Caches (direct-mapped, fully-associative, n-way set-associative), spatial locality, temporal locality, virtual memory, translation lookaside buffer (TLB)
- Disk systems
- Input and Output
- Polling, interrupts
- Multiprocessors
4 - Course Requirements
- Homeworks: specifications on web page
- 1 Midterm Exam + Final Exam
- Course project: specifications on web page
- (A report plus a group presentation at the end of the semester)
- Course Evaluation: Homeworks 15%, Midterm exam 30%, Final exam 30%, Project 25%
5 - What is Computer Architecture?
- Two viewpoints
- Hardware designer's viewpoint: CPUs, caches, buses, pipelines, memory, ILP, etc.
- Programmer's viewpoint: instruction set opcodes, addressing modes, registers, virtual memory, etc. = ISA (Instruction Set Architecture)
- → Study of architecture covers both
6 - Computer Architecture Is ...
- The attributes of a computing system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.
- Amdahl, Blaauw, and Brooks, 1964
7 - Why learn computer architecture?
- How does the hardware work?
- Where is the hardware heading?
- How do advances in hardware affect system software and applications?
- Understand OS / compilers / programming languages
- Helps us write better code
- Use computers more efficiently
- Future of computing technology?
8 - Computer Architecture's Changing Definition
- 1950s to 1960s
- Computer Architecture Course = Computer Arithmetic, Logic Design
- 1970s to 1980s
- Computer Architecture Course = Instruction Set Design, especially an ISA appropriate for compilers; CISC vs. RISC
- 1990s and beyond
- Computer Architecture Course = Design of CPU, memory system, I/O system, multiprocessors
9 - 5 Generations of Electronic Computers (60 years)
- 1st Generation (1945-54), Beginning (vacuum tubes): vacuum tubes, relay memories, CPU using PC, accumulator, fixed-point arithmetic; machine/assembly languages, single user, batch processing, no subroutine linkages, programmed I/O (controlled by CPU). Examples: ENIAC, IBM 701
- 2nd Generation (1955-64), Mainframes (transistors): discrete transistors, core memories, floating-point arithmetic, I/O processors, multiplexed memory access; high-level languages using compilers, subroutines, libraries, batch processing terminals. Examples: IBM 7030, CDC 1604, Univac LARC (mainframes)
- 3rd Generation (1965-74), Mainframe, Miniframe (integrated circuits): integrated circuits (SSI/MSI), pipelining, cache, lookahead; multiprogramming and time-sharing OS, multiuser applications. Examples: IBM 360/370, CDC 6600, TI-ASC, PDP-8 (mainframe, miniframe)
- 4th Generation (1975-90), Microprocessors (VLSI): LSI/VLSI, semiconductor memory, multiprocessors, vector supercomputers, parallel processors; multiprocessor OS, higher-level languages, advanced compilers, parallel processing, RISC ISA. Examples: VAX/9000, CRAY X/MP, IBM 3090, Intel, Motorola, MIPS
- 5th Generation (1991-present), Dominance of microprocessors: ULSI/VHSIC, high-density memories, scalable architectures; massively parallel processing, instruction-level parallelism, large/complex applications, ubiquitous computing. Examples: IBM/MPP, CRAY/MPP, TMC/CM-5, Intel Paragon
10 - Moore's Law
- In 1965, Intel co-founder Gordon Moore saw the future. His prediction, now popularly known as Moore's Law, states that the number of transistors on a chip doubles about every two years. This observation about silicon integration, made a reality by Intel, the world's largest silicon supplier, has fueled the worldwide technology revolution.
- Every 18-24 months:
- Feature sizes shrink by 0.7x
- Number of transistors per die increases by 2x
- Speed of transistors increases by 1.4x
- But we are starting to hit some roadblocks
- Also, what to do with all of these transistors?
11 - Moore's Law (transistor density), Intel only
Moore's Law means more performance: processing power, measured in millions of instructions per second (MIPS), has steadily risen because of increased transistor counts.
12 - What the transistors have been used for
- More functionality on one chip:
- Early 1980s: 32-bit microprocessors
- Late 1980s: on-chip Level 1 caches
- Early/mid 1990s: 64-bit microprocessors, superscalar (ILP)
- Late 1990s: on-chip Level 2 caches
- Early 2000s: chip-level multiprocessors, on-chip Level 3 caches
- What next?
- How much cache on a chip? (Itanium 2: up to Level 3 on-chip)
- How many cores on a chip? (Sun Niagara: 8 CPUs)
- What else can we put on chips?
- The trend to many-core, from the Intel site: Rattner introduced the "many-core" concept, explaining that Intel researchers and scientists are experimenting with "many tens of cores, potentially even hundreds of cores, per single processor die. And those cores will be supporting tens, hundreds, maybe even thousands of simultaneous execution threads."
13 - Crossroads: Conventional Wisdom in Comp. Arch
- Old Conventional Wisdom: Power is free, transistors expensive
- New Conventional Wisdom: "Power wall" - power expensive, transistors free (can put more on a chip than you can afford to turn on)
- Old CW: Sufficiently increase Instruction Level Parallelism via compilers and innovation (out-of-order, speculation, VLIW, ...)
- New CW: "ILP wall" - law of diminishing returns on more HW for ILP
- Old CW: Multiplies are slow, memory access is fast
- New CW: "Memory wall" - memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for a multiply)
- Old CW: Uniprocessor performance 2X / 1.5 yrs
- New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
- Uniprocessor performance now 2X / 5(?) yrs
- → Sea change in chip design: multiple cores (2X processors per chip / 2 years)
- More, simpler processors are more power efficient
14 - Power Wall
15 - Crossroads: Uniprocessor Performance
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: ??%/year, 2002 to present (about 20%/year)
16 - Performance of Microprocessors (previous page)
- Up to the late 70s (before the birth of microprocessors), mainframe and miniframe computers dominated, and their performance improved constantly
- The improvement was due to technological advances in electronics and innovation in computer architecture; their growth rate was 25-30% per year
- From the late 70s, microprocessors (which benefited more from IC technologies than main/miniframes did) showed a 35% growth rate during the early 80s
- During the 80s, a new ISA (Instruction Set Architecture), the RISC architecture, evolved, which was possible for two reasons:
- Less object-code compatibility pressure due to the usage of HLLs (fewer assembly language programs)
- Emergence of platform-independent OSs (Unix, Linux)
- The effect of the above was performance growth (of microprocessors) at a rate of 1.5-2 times/year
- Results of this rapid growth:
- Enhanced capability to users: large, complex programs & data easily accessible
- Dominance of microprocessor-based computers across the entire range of computers: workstations, servers, embedded systems, multiprocessor supercomputers
17 - Changes in Chip Design
- Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm2 chip
- RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm2 chip
- A 125 mm2 chip in 0.065 micron CMOS = 2312 RISC II + FPU + Icache + Dcache
- RISC II shrinks to 0.02 mm2 at 65 nm
- Caches via DRAM or 1-transistor SRAM (www.t-ram.com)?
- Proximity Communication via capacitive coupling at > 1 TB/s?
- Wafer stacking (3D die)
- Processor is the new transistor?
18 - New Conventional Wisdom
- 2002 to date: only 20%/year, due to the power limit for air-cooled chips, no more room for ILP improvement, and no improvement in memory latency
- Focus changed from high-performance uniprocessors → multiprocessors on a chip
- From ILP → TLP (Thread Level Parallelism) & DLP (Data Level Parallelism)
19 - Problems with the Change
- Algorithms, programming languages, compilers, operating systems, architectures, libraries, ... not ready to supply Thread Level Parallelism or Data Level Parallelism for 1000 CPUs / chip
- Architectures not ready for 1000 CPUs / chip
- Unlike Instruction Level Parallelism, this cannot be solved by computer architects and compiler writers alone, but it also cannot be solved without the participation of computer architects
- The 4th edition of the textbook (Computer Architecture: A Quantitative Approach) explores the shift from Instruction Level Parallelism to Thread Level Parallelism / Data Level Parallelism
20 - ILP → TLP & DLP
- ILP: compilers and hardware exploit ILP, so programmers do not need to worry about it
- TLP & DLP: require programmers to write parallel programs to utilize the parallelism
21 - Instruction Set Architecture: Critical Interface
(Figure: the instruction set is the interface layer between software and hardware)
- Properties of a good abstraction:
- Lasts through many generations (portability)
- Used in many different ways (generality)
- Provides convenient functionality to higher levels
- Permits an efficient implementation at lower levels
22 - Example: MIPS
Programmable storage: 2^32 bytes of memory; 31 x 32-bit GPRs (R0 = 0); 32 x 32-bit FP regs (paired DP); HI, LO, PC
Data types? Format? Addressing modes?
Arithmetic & logical: Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI, SLL, SRL, SRA, SLLV, SRLV, SRAV
Memory access: LB, LBU, LH, LHU, LW, LWL, LWR, SB, SH, SW, SWL, SWR
Control: J, JAL, JR, JALR, BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
32-bit instructions on word boundaries
23 - Instruction Set Architecture
- "... the attributes of a computing system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." - Amdahl, Blaauw, and Brooks, 1964
-- Organization of programmable storage
-- Data types & data structures: encodings & representations
-- Instruction formats
-- Instruction (or operation code) set
-- Modes of addressing and accessing data items and instructions
-- Exceptional conditions: interrupts, faults
24 - ISA vs. Computer Architecture
- Old definition of computer architecture: instruction set design
- Other aspects of computer design were called implementation
- Insinuates that implementation is uninteresting or less challenging
- Our view: Computer architecture >> ISA
- The architect's job is much more than instruction set design; the technical hurdles today are more challenging than those in instruction set design
- Since instruction set design is not where the action is, some conclude that computer architecture (by the old definition) is not where the action is
- We disagree with that conclusion
- We agree that the ISA is not where the action is (the ISA is in the CAAQA 4/e appendix)
25 - Comp. Arch. is an Integrated Approach
- What really matters is the functioning of the complete system:
- hardware, runtime system, compiler, operating system, and application
- In networking, this is called "End to End" communication
- Computer architecture is not just about transistors, individual instructions, or particular implementations
- E.g., the original RISC projects replaced complex instructions with a compiler + simple instructions
26 - Classes of computers and application areas
- Desktop Computing: general-purpose desktops/laptops
- Increased productivity, interactive graphics, video, audio
- Web-centric, interactive applications demand more processing power
- Optimized for price-performance; too much focus on clock rate
- Examples: Intel Pentium, AMD Athlon XP
- Servers: powerful, large-scale file and computing services
- The Internet opened the client-server computing paradigm
- Web-based services require high computing power
- Large databases, transaction processing, search engines
- Scientific applications: weather, defense, gene research (IBM BlueGene)
- Performance, Availability (24/7), Scalability (growth)
- Server downtime costs more than $6M/hour for a brokerage company (Fig. 1.2)
- Examples: Sun Fire E20K, IBM eServer p5, Google Cluster
- Embedded Computers: computers embedded in other devices
- PDAs, cell phones, sensors, in cars → price & energy efficiency are the main concerns
- Game machines, network uPs (routers, switches) → price-performance ratio
- Examples: ARM926EJ-S, Sony Emotion Engine, IBM 750FX
27 - What Computer Architecture Brings to the Table
- Other fields often borrow ideas from architecture
- Quantitative Principles of Design:
- Take advantage of parallelism
- Principle of locality
- Focus on the common case
- Amdahl's Law
- The processor performance equation
- Careful, quantitative comparisons:
- Define, quantify, and summarize relative performance
- Define and quantify relative cost
- Define and quantify dependability
- Define and quantify power
- Culture of anticipating and exploiting advances in technology
- Culture of well-defined interfaces that are carefully implemented and thoroughly checked
28 - 1) Taking Advantage of Parallelism
- Increasing throughput of a computer via multiple processors or multiple disks
- Detailed HW design:
- Carry-lookahead adders use parallelism to speed up computing sums from linear to logarithmic in the number of bits per operand
- Multiple memory banks are searched in parallel in set-associative caches
- Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence (see the cycle count below)
- Not every instruction depends on its immediate predecessor → executing instructions completely/partially in parallel is possible
- Classic 5-stage pipeline: 1) Instruction Fetch (Ifetch), 2) Register Read (Reg), 3) Execute (ALU), 4) Data Memory Access (Dmem), 5) Register Write (Reg)
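To see why pipelining pays off, a quick back-of-the-envelope cycle count (my arithmetic, not from the slides): with k stages and n instructions, and ideally one instruction completing per cycle once the pipeline is full,

\[ T_{\text{unpipelined}} = n \times k \text{ cycles}, \qquad T_{\text{pipelined}} = k + (n - 1) \text{ cycles} \]

so the ideal speedup nk / (k + n - 1) approaches k for large n; e.g., for k = 5 and n = 100, that is 500 vs. 104 cycles, about 4.8X.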
29 - Pipelined Instruction Execution
30 - Limits to Pipelining
- Hazards prevent the next instruction from executing during its designated clock cycle:
- Structural hazards: attempt to use the same hardware to do two different things at once
- Data hazards: instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
(Figure: pipeline diagram, instruction order vs. time in clock cycles)
31 - 2) The Principle of Locality
- The Principle of Locality:
- Programs access a relatively small portion of the address space at any instant of time
- Two different types of locality:
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
- For the last 30 years, HW has relied on locality for memory performance (see the C sketch below)
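A small C sketch of both kinds of locality (my illustration, not from the slides): the accumulator and loop counters are reused every iteration (temporal locality), the row-major loop walks adjacent addresses (spatial locality), and the column-major loop gives up spatial locality and typically runs much slower on cached machines.

    #include <stdio.h>

    #define N 1024

    static double m[N][N];

    /* Row-major order: stride-1 accesses, cache-friendly. */
    static double sum_row_major(void) {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];    /* consecutive addresses */
        return sum;
    }

    /* Column-major order: each access jumps N*8 bytes, cache-hostile. */
    static double sum_col_major(void) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];    /* large-stride addresses */
        return sum;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i][j] = 1.0;
        printf("%f %f\n", sum_row_major(), sum_col_major());
        return 0;
    }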
32 - Levels of the Memory Hierarchy
(Figure: the memory hierarchy, from the upper level (small, fast, expensive) down to the lower level (large, slow, cheap); each level stages data for the level above it in a characteristic transfer unit)
- CPU Registers: 100s of bytes, 300-500 ps (0.3-0.5 ns); instruction operands of 1-8 bytes, staged by program/compiler
- L1 and L2 Caches: 10s-100s of KBytes, 1-10 ns, $1000s/GByte; blocks of 32-64 bytes (L1) and 64-128 bytes (L2), staged by the cache controller
- Main Memory: GBytes, 80-200 ns, $100/GByte; pages of 4K-8K bytes, staged by the OS
- Disk: 10s of TBytes, 10 ms (10,000,000 ns), $1/GByte; files of MBytes, staged by the user/operator
- Tape: infinite capacity, sec-min access time, $1/GByte
33 - 3) Focus on the Common Case
- Common sense guides computer design
- Since it's engineering, common sense is valuable
- In making a design trade-off, favor the frequent case over the infrequent case
- E.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first
- E.g., if a database server has 50 disks per processor, storage dependability dominates system dependability, so optimize it first
- The frequent case is often simpler and can be made faster than the infrequent case
- E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing the more common case of no overflow
- This may slow down overflow, but overall performance is improved by optimizing for the normal case
- What is the frequent case, and how much can performance be improved by making that case faster? → Amdahl's Law
34 - 4) Amdahl's Law
- The speedup gained from an enhancement is limited by the fraction of time the enhancement can be used; the limiting form below is the best you could ever hope to do
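Amdahl's Law, as given in the text:

\[ \text{Speedup}_{\text{overall}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} \]

and, letting Speedup_enhanced go to infinity, the best you could ever hope to do:

\[ \text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}} \]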
35 - Amdahl's Law example
- New CPU: 10X faster
- I/O-bound server, so 60% of time is spent waiting for I/O
- Apparently, it is human nature to be attracted by "10X faster," vs. keeping in perspective that it's just 1.6X faster
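Plugging into Amdahl's Law (the enhanced fraction is the 40% of time not spent waiting for I/O):

\[ \text{Speedup}_{\text{overall}} = \frac{1}{0.6 + \dfrac{0.4}{10}} = \frac{1}{0.64} \approx 1.56 \]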
37 - Another Example
Suppose we can enhance the performance of a CPU by adding a vectorization mode that can be used under certain conditions and that computes 10x faster than normal mode. What percent of the run time must be in vectorization mode to achieve an overall speedup of 2?
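Solving Amdahl's Law for the vectorized fraction F:

\[ 2 = \frac{1}{(1 - F) + \dfrac{F}{10}} \;\Rightarrow\; (1 - F) + \frac{F}{10} = 0.5 \;\Rightarrow\; F = \frac{0.5}{0.9} \approx 0.556 \]

so about 56% of the (original) run time must be spent in vectorization mode.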
38 - Metrics of Performance
(Figure: each level of the system stack has its own natural performance metric)
- Application: answers per day/month, operations per second
- Programming Language
- Compiler
- ISA: millions of instructions per second (MIPS), millions of FP operations per second (MFLOPS)
- Datapath, Control: megabytes per second
- Function Units: cycles per second (clock rate)
- Transistors, Wires, Pins
39 - Marketing Metrics
- What about machines with different instruction sets?
- Programs with different instruction mixes?
- Dynamic frequency of instructions
- MIPS is not closely related to performance
- MFLOPS is often not where the time is spent: on most programs, there are not many FP instructions executed
- Therefore → next slide
40 - 5) Processor Performance Equation
CPU time, the time taken to run a program, is the product of instruction count, CPI, and cycle time:

\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]

What affects each factor:

                 Inst Count   CPI   Clock Rate
  Program            X
  Compiler           X        (X)
  Inst. Set          X         X
  Organization                 X        X
  Technology                            X
41 - CPI (Cycles Per Instruction)
- Average cycles per instruction (for the program in question):

\[ \text{CPI} = \frac{\text{CPU time} \times \text{Clock rate}}{\text{Instruction count}} = \frac{\text{Cycles}}{\text{Instruction count}} \]

- With per-class CPIs and instruction counts:

\[ \text{CPU time} = \text{Cycle time} \times \sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i \]
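A worked example with hypothetical numbers (the instruction mix below is mine, for illustration): on a 1 GHz machine, suppose a program executes 10^9 instructions, of which 50% are ALU ops (CPI 1), 30% are loads/stores (CPI 2), and 20% are branches (CPI 1.5). Then

\[ \text{CPI} = 0.5(1) + 0.3(2) + 0.2(1.5) = 1.4, \qquad \text{CPU time} = 10^9 \times 1.4 \times 1\ \text{ns} = 1.4\ \text{s} \]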
42 - And in conclusion ...
- Computer Architecture >> instruction sets
- Computer Architecture skill sets are different:
- 5 quantitative principles of design
- Quantitative approach to design
- Solid interfaces that really work
- Technology tracking and anticipation
- CSCI 620: to learn new skills, transition to research
- Computer Science is at a crossroads from sequential to parallel computing
- This requires innovation in many fields, including computer architecture
43 - CSCI 620 focuses on ...
Understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of computers in the 21st century.
(Figure: at the center, Computer Architecture: Interface Design (ISA), Organization, and the Hardware/Software Boundary; surrounding it: Parallelism, Technology, Programming Languages, Applications, Compilers, Operating Systems, Measurement & Evaluation, and History)
44 - Moore's Law: 2X transistors / year
- "Cramming More Components onto Integrated Circuits"
- Gordon Moore, Electronics, 1965
- Number of transistors per cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)
45 - Trends in Technology (1.4)
- Drill down into 4 technologies:
- Disks
- Memory
- Network
- Processors (integrated circuits)
- Compare 1980 "Archaic" (Nostalgic) vs. 2000 "Modern" (Newfangled)
- Performance milestones in each technology
- Compare bandwidth vs. latency improvements in performance over time
- Bandwidth: total amount of work done in a given time
- E.g., Mbits/second over a network, MBytes/second from a disk
- Latency: elapsed time for a single event, from start to finish
- E.g., one-way network delay in microseconds, average disk access time in milliseconds
46 - Disks: Archaic (Nostalgic) v. Modern (Newfangled)
- Seagate 373453, 2003
- 15000 RPM (4X)
- 73.4 GBytes (2500X)
- Tracks/Inch: 64,000 (80X)
- Bits/Inch: 533,000 (60X)
- Four 2.5-inch platters (in a 3.5-inch form factor)
- Bandwidth: 86 MBytes/sec (140X)
- Latency: 5.7 ms (8X)
- Cache: 8 MBytes
- CDC Wren I, 1983
- 3600 RPM
- 0.03 GBytes capacity
- Tracks/Inch: 800
- Bits/Inch: 9,550
- Three 5.25-inch platters
- Bandwidth: 0.6 MBytes/sec
- Latency: 48.3 ms
- Cache: none
47 - Latency Lags Bandwidth (for last 20 years)
- Performance Milestones
- 30%/year up to 1990
- 100%/year from 1996 up to 2004
- 30%/year since 2004
- Disk still 50-100 times cheaper than DRAM
- Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
(latency = simple operation w/o contention; BW = best case)
48 - Memory: Archaic (Nostalgic) v. Modern (Newfangled)
- 2000: Double Data Rate Synchronous (clocked) DRAM
- 256.00 Mbits/chip (4000X)
- 256,000,000 xtors, 204 mm2
- 64-bit data bus per DIMM, 66 pins/chip (4X)
- 1600 MBytes/sec (120X)
- Latency: 52 ns (4X)
- Block transfers (page mode)
- 1980: DRAM (asynchronous)
- 0.06 Mbits/chip
- 64,000 xtors, 35 mm2
- 16-bit data bus per module, 16 pins/chip
- 13 MBytes/sec
- Latency: 225 ns
- (no block transfer)
49 - Latency Lags Bandwidth (last 20 years)
- Performance Milestones
- Capacity increases 40%/year (2 times / 2 years)
- Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
- Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
(latency = simple operation w/o contention; BW = best case)
50 - LANs: Archaic (Nostalgic) v. Modern (Newfangled)
- Ethernet 802.3
- Year of standard: 1978
- 10 Mbits/s link speed
- Latency: 3000 µsec
- Shared media
- Coaxial cable
- Ethernet 802.3ae
- Year of standard: 2003
- 10,000 Mbits/s link speed (1000X)
- Latency: 190 µsec (15X)
- Switched media
- Category 5 copper wire
(Figure: coaxial cable cross-section: plastic covering, braided outer conductor, insulator, copper core)
51 - Latency Lags Bandwidth (last 20 years)
- Performance Milestones
- Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x)
- Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
- Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
(latency = simple operation w/o contention; BW = best case)
52 - CPUs: Archaic (Nostalgic) v. Modern (Newfangled)
- 2001: Intel Pentium 4
- 1500 MHz (120X)
- 4500 MIPS (peak) (2250X)
- Latency: 15 ns (20X)
- 42,000,000 xtors, 217 mm2
- 64-bit data bus, 423 pins
- 3-way superscalar, dynamic translation to RISC, superpipelined (22 stages), out-of-order execution
- On-chip 8KB data cache, 96KB instruction trace cache, 256KB L2 cache
- 1982: Intel 80286
- 12.5 MHz
- 2 MIPS (peak)
- Latency: 320 ns
- 134,000 xtors, 47 mm2
- 16-bit data bus, 68 pins
- Microcode interpreter, separate FPU chip
- (no caches)
53 - Latency Lags Bandwidth (last 20 years)
- Performance Milestones
- Processor: 286, 386, 486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
- Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x)
- Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
- Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
54 - Rule of Thumb for Latency Lagging BW
- In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4
- (and capacity improves faster than bandwidth)
- Stated alternatively: bandwidth improves by more than the square of the improvement in latency
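A quick arithmetic check of that restatement (numbers mine, for illustration): if latency improves by 1.3X while bandwidth doubles, the 2X bandwidth gain indeed exceeds the square of the latency gain, since 1.3^2 = 1.69 < 2.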
55 - 6 Reasons Latency Lags Bandwidth
- 1. Moore's Law helps BW more than latency
- Faster transistors, more transistors, and more pins help bandwidth:
- MPU transistors: 0.130 M vs. 42 M xtors (300X)
- DRAM transistors: 0.064 M vs. 256 M xtors (4000X)
- MPU pins: 68 vs. 423 pins (6X)
- DRAM pins: 16 vs. 66 pins (4X)
- Smaller, faster transistors, but they communicate over (relatively) longer lines, which limits latency:
- Feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X)
- MPU die size: 47 vs. 217 mm2 (ratio sqrt → 2X)
- DRAM die size: 35 vs. 204 mm2 (ratio sqrt → 2X)
56 - 6 Reasons Latency Lags Bandwidth (cont'd)
- 2. Distance limits latency
- Size of the DRAM block → long bit and word lines → most of the DRAM access time
- 3. Bandwidth easier to sell (bigger = better)
- E.g., 10 Gbits/s Ethernet ("10 Gig") vs. "10 µsec latency" Ethernet
- 4400 MB/s DIMM (PC4400) vs. 50 ns latency
- Even if it is just marketing, customers are now trained
- Since bandwidth sells, more resources are thrown at bandwidth, which further tips the balance
57 - 6 Reasons Latency Lags Bandwidth (cont'd)
- 4. Latency helps BW, but not vice versa
- Spinning a disk faster improves both bandwidth and rotational latency:
- 3600 RPM → 15000 RPM = 4.2X
- Average rotational latency: 8.3 ms → 2.0 ms
- Other things being equal, this also helps BW by 4.2X
- Lower DRAM latency → more accesses/second (higher bandwidth)
- Higher linear density helps disk BW (and capacity), but not disk latency:
- 9,550 BPI → 533,000 BPI → 60X in BW
58 - 6 Reasons Latency Lags Bandwidth (cont'd)
- 5. Bandwidth hurts latency
- Adding chips to widen a memory module increases bandwidth, but the higher fan-out on address lines may increase latency
- 6. Operating system overhead hurts latency more than bandwidth
- Long messages amortize the overhead; overhead is a bigger part of short messages
59 - Summary of Technology Trends
- For disk, LAN, memory, and microprocessor, bandwidth improves by roughly the square of the latency improvement
- In the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X
- The lag is probably even larger in real systems, as bandwidth gains are multiplied by replicated components:
- Multiple processors in a cluster, or even in a chip
- Multiple disks in a disk array
- Multiple memory modules in a large memory
- Simultaneous communication in a switched LAN
- HW and SW developers should innovate assuming Latency Lags Bandwidth
- If everything improves at the same rate, then nothing really changes
- When rates vary, real innovation is required
60 - Scaling in ICs
- Feature size: the minimum size of a single transistor or wire on a chip die
- 1971: 10 microns
- 2001: 0.18 microns
- 2003: 0.06 microns
- 2006: 5 nanometers (0.005 microns), a 2000:1 ratio from 1971
- Complex relationships between performance & feature size:
- Transistor density increases quadratically with a decrease in feature size
- Reduction in feature size requires voltage reduction to maintain correct operation and reasonable reliability
- Scaling IC wiring:
- Signal delay increases with the product of resistance and capacitance
- Shorter wires are not necessarily faster, due to increased resistance and capacitance
61 - Power Consumption of ICs (1.5)
- Energy requirements per transistor are proportional to the load capacitance, the frequency of switching, and the square of the voltage
- Switching frequency and transistor density increase faster than capacitance and voltage decrease, leading to increased power consumption & generated heat
- http://www.phys.ncku.edu.tw/~htsu/humor/fry_egg.html
- The Pentium 4 consumes 100 Watts of power, while the 8086-i386 did not even feature a heat sink
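In symbols, the textbook's dynamic-power relation:

\[ \text{Power}_{\text{dynamic}} = \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched} \]

and the corresponding energy per switching event, Capacitive load x Voltage^2, does not depend on frequency.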
62 - Cost and Price
- The cost of manufacturing decreases over time: the learning curve
- The learning curve is measured as an increase in yield
- Volume doubling leads to a 10% reduction in cost
- Commodity products tend to decrease in cost due to:
- Volume
- Competition
- Efficiency
63 - Cost of Pentium (Figure 1.10 of text)
64 - IC Manufacturing Process
(Figure: silicon cylinder → silicon wafer → patterned silicon wafer → wafer test → unpackaged die → packaged die → final test)
65 - Wafers and Dies
- Chips are produced on round silicon disks: wafers
- Dies are the actual chips, cut out of the wafer
- Testing occurs before cutting and after packaging
66 - Yield and Cost of Chips
- However:
- Wafers do not contain only chip dies; usually a large area, including several die sites, is dedicated to test-equipment hook-up
- Actual yield in mass-production chip fabs varies from 98% for DRAMs down to 1% for new processors
67 - Yield and Cost
- Switch from 200mm to 300mm wafers:
- Although 300mm wafers have lower yield than 200mm wafers, the overhead processing costs per wafer are high enough to make 300mm wafers more cost-effective
- The next candidate size is 450mm (2012?)
- Redundancy in dies:
- Single transistors do fail during production, causing memory cells, pipeline stages, or control logic sections to fail
- Redundancy is built into each die by introducing backup units
- After testing, backup units are enabled and failed units can be disabled by laser
- This decreases the chance of small flaws failing an entire die
- No company has yet released its redundant-circuitry numbers
68 - Difference between Cost and Price of IC
69 - Measuring Performance (1.5)
- The key measure is time
- Response time (execution time): the time between the start and completion of a task. Includes everything: CPU, I/O, timesharing, ...
- CPU time: user CPU time + system CPU time
- Throughput: the total amount of work completed in a given time
- System performance: for an unloaded system
- CPU performance for an unloaded system: the main focus in this chapter
- The formula for user (program) CPU time is:
70 - Comparing Processors (measured by running programs)
"X is n times faster than Y" means:
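\[ n = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y} \]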
71 - Performance: What to Measure (in decreasing order of accuracy)
- Real programs (what customers care about): e.g., compilers, databases, ...
- Portability problem: dependent upon the OS & other hardware
- Modified or scripted real programs (modified for portability): e.g., compression algorithms
- Kernels: small, key pieces of real programs, e.g., Livermore Loops, Linpack; better for testing individual features
- Toy benchmarks: typically 10 to 100 lines of code, useful primarily for intro programming assignments; small, run on any computer; e.g., quicksort, prime numbers, encryption
- Synthetic benchmarks: try to match the average frequency of operations and operands of a set of programs; not realistic programs; e.g., Whetstone, Dhrystone
72 - Benchmark Suites
- Collections of benchmark applications, called benchmark suites, are popular
- SPEC CPU: popular desktop benchmark suite
- CPU only, split between integer and floating-point programs
- SPECint2006 has 12 integer programs, SPECfp2006 has 17 floating-point programs
- SPEC CPU2006 V1.0 released August 2006
- SPECSFS (NFS file server) and SPECWeb (web server) added as server benchmarks
- Transaction Processing Council measures server performance and cost-performance for databases:
- TPC-C: complex queries for online transaction processing
- TPC-H: models ad hoc decision support
- TPC-W: a transactional web benchmark
- TPC-App: application server and web services benchmark
73 - SPECint2006 & SPECfp2006
(Figure 1.13: the SPECint2006 and SPECfp2006 benchmark programs)
74 - How to Summarize Suite Performance (1/5)
- Arithmetic average of the execution times of all programs?
- But they vary by 4X in speed, so some would be more important than others in an arithmetic average
- Could add a weight per program, but how to pick the weights?
- Different companies want different weights for their products
- SPECRatio: normalize execution times to a reference computer, yielding a ratio proportional to performance:

SPECRatio = (time on reference computer) / (time on computer being rated)

- In CPU2006, a Sun Microsystems Ultra Enterprise 2 is used as the reference computer
75 - How to Summarize Suite Performance (2/5)
- If the SPECRatio of a program (in CPU2006 it is called "Base") on Computer A is 1.25 times bigger than on Computer B, then:
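Expanding the definition of SPECRatio, the reference machine's time cancels:

\[ 1.25 = \frac{\text{SPECRatio}_A}{\text{SPECRatio}_B} = \frac{\text{ExecutionTime}_{\text{ref}} / \text{ExecutionTime}_A}{\text{ExecutionTime}_{\text{ref}} / \text{ExecutionTime}_B} = \frac{\text{ExecutionTime}_B}{\text{ExecutionTime}_A} = \frac{\text{Performance}_A}{\text{Performance}_B} \]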
- Note that when comparing 2 computers as a ratio, the execution times on the reference computer drop out, so the choice of reference computer is irrelevant
76 - How to Summarize Suite Performance (3/5)
- Since these are ratios, the proper mean is the geometric mean (a SPECRatio is unitless, so an arithmetic mean is meaningless):
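For n programs:

\[ \text{Geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \text{SPECRatio}_i} \]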
- The geometric mean of the ratios is the same as the ratio of the geometric means
- Ratio of geometric means = geometric mean of performance ratios → the choice of reference computer is irrelevant!
- These two points make the geometric mean of ratios attractive for summarizing performance
77 - How to Summarize Suite Performance (4/5)
- Does a single mean summarize the performance of the programs in the benchmark suite well?
- Can decide whether the mean is a good predictor by characterizing the variability of the distribution using the standard deviation
- Like the geometric mean, the geometric standard deviation is multiplicative rather than arithmetic
- Can simply take the logarithms of the SPECRatios, compute the standard mean and standard deviation, and then take the exponent to convert back
78 - How to Summarize Suite Performance (5/5)
- Standard deviation is more informative if we know the distribution has a standard form:
- bell-shaped normal distribution, whose data are symmetric around the mean
- lognormal distribution, where the logarithms of the data (not the data itself) are normally distributed (symmetric) on a logarithmic scale
- For a lognormal distribution, we expect that:
- 68% of samples fall in the range [GeometricMean / GeometricStDev, GeometricMean x GeometricStDev]
- 95% of samples fall in the range [GeometricMean / GeometricStDev^2, GeometricMean x GeometricStDev^2]
- Note: Excel provides the functions EXP(), LN(), and STDEV(), which make calculating the geometric mean and multiplicative standard deviation easy
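The same log/exp recipe as a small C sketch (my illustration; the SPECRatio values are made up):

    #include <math.h>
    #include <stdio.h>

    /* Geometric mean and multiplicative standard deviation:
       take logs, compute the ordinary mean and (sample) standard
       deviation, then exponentiate to convert back. */
    static void geo_stats(const double r[], int n, double *gm, double *gsd) {
        double mean = 0.0, var = 0.0;
        for (int i = 0; i < n; i++) mean += log(r[i]);
        mean /= n;
        for (int i = 0; i < n; i++) {
            double d = log(r[i]) - mean;
            var += d * d;
        }
        var /= (n - 1);            /* sample variance of the logs */
        *gm  = exp(mean);          /* geometric mean */
        *gsd = exp(sqrt(var));     /* multiplicative std deviation */
    }

    int main(void) {
        double ratios[] = {8.0, 10.0, 12.5, 20.0, 16.0};  /* hypothetical */
        double gm, gsd;
        geo_stats(ratios, 5, &gm, &gsd);
        /* if lognormal, ~68% of samples fall in [gm/gsd, gm*gsd] */
        printf("GM = %.2f, mult. StDev = %.2f\n", gm, gsd);
        return 0;
    }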
79 - Example: Standard Deviation (1/2)
- GM and multiplicative StDev of SPECfp2000 for Itanium 2
80 - Example: Standard Deviation (2/2)
- GM and multiplicative StDev of SPECfp2000 for AMD Athlon
81 - Comments on Itanium 2 and Athlon
- The standard deviation of 1.98 for Itanium 2 is much higher (vs. 1.40 for Athlon), so its results differ more widely from the mean and are therefore likely less predictable
- Falling within one standard deviation:
- 10 of 14 benchmarks (71%) for Itanium 2
- 11 of 14 benchmarks (78%) for Athlon
- Thus, the results are quite compatible with a lognormal distribution (expect 68%)
82 - And in conclusion ...
- Tracking and extrapolating technology is part of the architect's responsibility
- Expect bandwidth in disks, DRAM, networks, and processors to improve by at least as much as the square of the improvement in latency
- Quantify dynamic and static power
- Capacitance x Voltage^2 x frequency; energy vs. power
- Quantify dependability
- Reliability (MTTF, FIT), Availability (99.9%)
- Quantify and summarize performance
- Ratios, geometric mean, multiplicative standard deviation
- Read Appendix A, record bugs online!