Title: CSCI 620
1 - CSCI 620
- Computer Architecture
- Introduction
- Seung Bae Im
- Spring 2007
- Reading Assignments
- Chapter 1
2 - Text & Web Page
- Text: Computer Architecture: A Quantitative Approach, Fourth Edition, Hennessy & Patterson
- Web Page: www.ecst.csuchico.edu/~sim
3 - Prerequisite Knowledge (CSCI 320)
- Assembly language programming
- Fundamentals of logic design
- Combinational and sequential logic (e.g., gates, multiplexers, decoders, ALU, ROMs, flip-flops, registers, RAMs)
- Processor Design
- Instruction cycle, pipelining, branch prediction, exceptions
- Memory Hierarchy
- Caches (direct-mapped, fully-associative, n-way set-associative), spatial locality, temporal locality, virtual memory, translation lookaside buffer (TLB)
- Disk systems
- Input and Output
- Polling, interrupts
- Multiprocessors
4 - Course Requirements
- Homeworks: specifications on web page
- 1 Midterm Exam + Final Exam
- Course project: specifications on web page
- (A report plus a group presentation at the end of the semester)
- Course Evaluation: Homeworks 15%, Midterm exam 30%, Final exam 30%, Project 25%
5 - What is Computer Architecture?
- Two viewpoints
- Hardware designer's viewpoint: CPUs, caches, buses, pipelines, memory, ILP, etc.
- Programmer's viewpoint: instruction set opcodes, addressing modes, registers, virtual memory, etc. = ISA (Instruction Set Architecture)
- → Study of architecture covers both
6 - Computer Architecture Is ...
- The attributes of a computing system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.
- Amdahl, Blaauw, and Brooks, 1964
7 - Why learn computer architecture?
- How does the hardware work?
- Where is the hardware heading?
- How do advances in hardware affect system software and applications?
- Understand OS / compilers / programming languages
- Helps us write better code
- Use computers more efficiently
- Future of computing technology?
8 - Computer Architecture's Changing Definition
- 1950s to 1960s
- Computer Architecture Course = Computer Arithmetic, Logic Design
- 1970s to 1980s
- Computer Architecture Course = Instruction Set Design, especially an ISA appropriate for compilers; CISC vs. RISC
- 1990s and beyond
- Computer Architecture Course = Design of CPU, memory system, I/O system, multiprocessors
9 - 5 Generations of Electronic Computers (60 years)
- 1st Generation (1945-54), Beginning (vacuum tubes): vacuum tubes, relay memories, CPU using PC, accumulator, fixed-point arithmetic; machine/assembly languages, single user, batch processing, no subroutine linkages, programmed I/O (controlled by CPU). Examples: ENIAC, IBM 701
- 2nd Generation (1955-64), Mainframes (transistors): discrete transistors, core memories, floating-point arithmetic, I/O processors, multiplexed memory access; high-level languages using compilers, subroutines, libraries, batch processing terminals. Examples: IBM 7030, CDC 1604, Univac LARC (mainframes)
- 3rd Generation (1965-74), Mainframe, Miniframe (integrated circuits): integrated circuits (SSI/MSI), pipelining, cache, lookahead; multiprogramming and time-sharing OS, multiuser applications. Examples: IBM 360/370, CDC 6600, TI-ASC, PDP-8 (mainframe, miniframe)
- 4th Generation (1975-90), Microprocessors (VLSI): LSI/VLSI, semiconductor memory, multiprocessors, vector supercomputers, parallel processors; multiprocessor OS, higher-level languages, advanced compilers, parallel processing, RISC ISA. Examples: VAX/9000, CRAY X/MP, IBM 3090, Intel, Motorola, MIPS
- 5th Generation (1991-present), Dominance of microprocessors: ULSI/VHSIC, high-density memories, scalable architectures; massively parallel processing, instruction-level parallelism, large/complex applications, ubiquitous computing. Examples: IBM/MPP, CRAY/MPP, TMC/CM-5, Intel Paragon
10 - Moore's Law
- In 1965, Intel co-founder Gordon Moore saw the future. His prediction, now popularly known as Moore's Law, states that the number of transistors on a chip doubles about every two years. This observation about silicon integration, made a reality by Intel, the world's largest silicon supplier, has fueled the worldwide technology revolution.
- Every 18-24 months:
- Feature sizes shrink by 0.7x
- Number of transistors per die increases by 2x
- Speed of transistors increases by 1.4x
- But we are starting to hit some roadblocks
- Also, what to do with all of these transistors?
11 - Moore's Law (transistor density), Intel only
Moore's Law means more performance: processing power, measured in millions of instructions per second (MIPS), has steadily risen because of increased transistor counts.
12 - What the transistors have been used for
- More functionality on one chip:
- Early 1980s: 32-bit microprocessors
- Late 1980s: on-chip Level 1 caches
- Early/mid 1990s: 64-bit microprocessors, superscalar (ILP)
- Late 1990s: on-chip Level 2 caches
- Early 2000s: chip-level multiprocessors, on-chip Level 3 caches
- What next?
- How much cache on a chip? (Itanium 2: up to Level 3 on-chip)
- How many cores on a chip? (Sun Niagara: 8 CPUs)
- What else can we put on chips?
- The trend to many-core, from the Intel site: Rattner introduced the "many-core" concept, explaining that Intel researchers and scientists are experimenting with "many tens of cores, potentially even hundreds of cores, per single processor die. And those cores will be supporting tens, hundreds, maybe even thousands of simultaneous execution threads."
13 - Crossroads: Conventional Wisdom in Comp. Arch
- Old Conventional Wisdom: Power is free, transistors expensive
- New Conventional Wisdom: "Power wall" - power expensive, transistors free (can put more on a chip than you can afford to turn on)
- Old CW: Sufficiently increase Instruction Level Parallelism via compilers and innovation (out-of-order, speculation, VLIW, ...)
- New CW: "ILP wall" - law of diminishing returns on more HW for ILP
- Old CW: Multiplies are slow, memory access is fast
- New CW: "Memory wall" - memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for a multiply)
- Old CW: Uniprocessor performance 2X / 1.5 yrs
- New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
- Uniprocessor performance now 2X / 5(?) yrs
- → Sea change in chip design: multiple cores (2X processors per chip / 2 years)
- More, simpler processors are more power efficient
14 - Power Wall
15 - Crossroads: Uniprocessor Performance
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: ??%/year, 2002 to present (about 20%/year)
16 - Performance of Microprocessors (previous page)
- Up to the late 70s (before the birth of microprocessors), mainframe and miniframe computers dominated, and their performance improved constantly
- The improvement was due to technological advances in electronics and innovation in computer architecture; their growth rate was 25-30% per year
- From the late 70s, microprocessors (which benefited more from IC technologies than main/miniframes did) showed a 35% growth rate during the early 80s
- During the 80s, a new ISA (Instruction Set Architecture), the RISC architecture, evolved, which was possible for two reasons:
- Less object-code compatibility pressure due to the usage of HLLs (fewer assembly language programs)
- Emergence of platform-independent OSs (Unix, Linux)
- The effect of the above was performance growth (of microprocessors) at a rate of 1.5-2 times/year
- Results of this rapid growth:
- Enhanced capability to users: large, complex programs & data easily accessible
- Dominance of microprocessor-based computers across the entire range of computers: workstations, servers, embedded systems, multiprocessor supercomputers
17 - Changes in Chip Design
- Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm2 chip
- RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm2 chip
- A 125 mm2 chip in 0.065 micron CMOS = 2312 RISC II + FPU + Icache + Dcache
- RISC II shrinks to 0.02 mm2 at 65 nm
- Caches via DRAM or 1-transistor SRAM (www.t-ram.com)?
- Proximity Communication via capacitive coupling at > 1 TB/s?
- Wafer stacking (3D die)
- Processor is the new transistor?
18 - New Conventional Wisdom
- 2002 to date: only 20%/year, due to the power limit for air-cooled chips, no more room for ILP improvement, and no improvement in memory latency
- Focus changed from high-performance uniprocessors → multiprocessors on a chip
- From ILP → TLP (Thread Level Parallelism) & DLP (Data Level Parallelism)
19 - Problems with the Change
- Algorithms, programming languages, compilers, operating systems, architectures, libraries, ... not ready to supply Thread Level Parallelism or Data Level Parallelism for 1000 CPUs / chip
- Architectures not ready for 1000 CPUs / chip
- Unlike Instruction Level Parallelism, this cannot be solved by computer architects and compiler writers alone, but it also cannot be solved without the participation of computer architects
- The 4th edition of the textbook (Computer Architecture: A Quantitative Approach) explores the shift from Instruction Level Parallelism to Thread Level Parallelism / Data Level Parallelism
20 - ILP → TLP & DLP
- ILP: compilers and hardware exploit ILP, so programmers do not need to worry about it
- TLP & DLP: require programmers to write parallel programs to utilize the parallelism
21 - Instruction Set Architecture: Critical Interface
(Figure: the instruction set is the interface layer between software and hardware)
- Properties of a good abstraction:
- Lasts through many generations (portability)
- Used in many different ways (generality)
- Provides convenient functionality to higher levels
- Permits an efficient implementation at lower levels
22 - Example: MIPS
Programmable storage: 2^32 bytes of memory; 31 x 32-bit GPRs (R0 = 0); 32 x 32-bit FP regs (paired DP); HI, LO, PC
Data types? Format? Addressing modes?
Arithmetic & logical: Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI, SLL, SRL, SRA, SLLV, SRLV, SRAV
Memory access: LB, LBU, LH, LHU, LW, LWL, LWR, SB, SH, SW, SWL, SWR
Control: J, JAL, JR, JALR, BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
32-bit instructions on word boundaries
23 - Instruction Set Architecture
- "... the attributes of a computing system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." - Amdahl, Blaauw, and Brooks, 1964
-- Organization of programmable storage
-- Data types & data structures: encodings & representations
-- Instruction formats
-- Instruction (or operation code) set
-- Modes of addressing and accessing data items and instructions
-- Exceptional conditions: interrupts, faults
24 - ISA vs. Computer Architecture
- Old definition of computer architecture: instruction set design
- Other aspects of computer design were called implementation
- Insinuates that implementation is uninteresting or less challenging
- Our view: Computer architecture >> ISA
- The architect's job is much more than instruction set design; the technical hurdles today are more challenging than those in instruction set design
- Since instruction set design is not where the action is, some conclude that computer architecture (by the old definition) is not where the action is
- We disagree with that conclusion
- We agree that the ISA is not where the action is (the ISA is in the CAAQA 4/e appendix)
25 - Comp. Arch. is an Integrated Approach
- What really matters is the functioning of the complete system:
- hardware, runtime system, compiler, operating system, and application
- In networking, this is called "End to End" communication
- Computer architecture is not just about transistors, individual instructions, or particular implementations
- E.g., the original RISC projects replaced complex instructions with a compiler + simple instructions
26 - Classes of computers and application areas
- Desktop Computing: general-purpose desktops/laptops
- Increased productivity, interactive graphics, video, audio
- Web-centric, interactive applications demand more processing power
- Optimized for price-performance; too much focus on clock rate
- Examples: Intel Pentium, AMD Athlon XP
- Servers: powerful, large-scale file and computing services
- The Internet opened the client-server computing paradigm
- Web-based services require high computing power
- Large databases, transaction processing, search engines
- Scientific applications: weather, defense, gene research (IBM BlueGene)
- Performance, Availability (24/7), Scalability (growth)
- Server downtime costs more than $6M/hour for a brokerage company (Fig. 1.2)
- Examples: Sun Fire E20K, IBM eServer p5, Google Cluster
- Embedded Computers: computers embedded in other devices
- PDAs, cell phones, sensors, in cars → price & energy efficiency are the main concerns
- Game machines, network uPs (routers, switches) → price-performance ratio
- Examples: ARM926EJ-S, Sony Emotion Engine, IBM 750FX
27 - What Computer Architecture Brings to the Table
- Other fields often borrow ideas from architecture
- Quantitative Principles of Design:
- Take advantage of parallelism
- Principle of locality
- Focus on the common case
- Amdahl's Law
- The processor performance equation
- Careful, quantitative comparisons:
- Define, quantify, and summarize relative performance
- Define and quantify relative cost
- Define and quantify dependability
- Define and quantify power
- Culture of anticipating and exploiting advances in technology
- Culture of well-defined interfaces that are carefully implemented and thoroughly checked
28 - 1) Taking Advantage of Parallelism
- Increasing throughput of a computer via multiple processors or multiple disks
- Detailed HW design:
- Carry-lookahead adders use parallelism to speed up computing sums from linear to logarithmic in the number of bits per operand
- Multiple memory banks are searched in parallel in set-associative caches
- Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence (see the cycle count below)
- Not every instruction depends on its immediate predecessor → executing instructions completely/partially in parallel is possible
- Classic 5-stage pipeline: 1) Instruction Fetch (Ifetch), 2) Register Read (Reg), 3) Execute (ALU), 4) Data Memory Access (Dmem), 5) Register Write (Reg)
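To see why pipelining pays off, a quick back-of-the-envelope cycle count (my arithmetic, not from the slides): with k stages and n instructions, and ideally one instruction completing per cycle once the pipeline is full,

\[ T_{\text{unpipelined}} = n \times k \text{ cycles}, \qquad T_{\text{pipelined}} = k + (n - 1) \text{ cycles} \]

so the ideal speedup nk / (k + n - 1) approaches k for large n; e.g., for k = 5 and n = 100, that is 500 vs. 104 cycles, about 4.8X.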
29 - Pipelined Instruction Execution
30 - Limits to Pipelining
- Hazards prevent the next instruction from executing during its designated clock cycle:
- Structural hazards: attempt to use the same hardware to do two different things at once
- Data hazards: instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
(Figure: pipeline diagram, instruction order vs. time in clock cycles)
31 - 2) The Principle of Locality
- The Principle of Locality:
- Programs access a relatively small portion of the address space at any instant of time
- Two different types of locality:
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
- For the last 30 years, HW has relied on locality for memory performance (see the C sketch below)
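A small C sketch of both kinds of locality (my illustration, not from the slides): the accumulator and loop counters are reused every iteration (temporal locality), the row-major loop walks adjacent addresses (spatial locality), and the column-major loop gives up spatial locality and typically runs much slower on cached machines.

    #include <stdio.h>

    #define N 1024

    static double m[N][N];

    /* Row-major order: stride-1 accesses, cache-friendly. */
    static double sum_row_major(void) {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];    /* consecutive addresses */
        return sum;
    }

    /* Column-major order: each access jumps N*8 bytes, cache-hostile. */
    static double sum_col_major(void) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];    /* large-stride addresses */
        return sum;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i][j] = 1.0;
        printf("%f %f\n", sum_row_major(), sum_col_major());
        return 0;
    }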
32 - Levels of the Memory Hierarchy
(Figure: the memory hierarchy, from the upper level (small, fast, expensive) down to the lower level (large, slow, cheap); each level stages data for the level above it in a characteristic transfer unit)
- CPU Registers: 100s of bytes, 300-500 ps (0.3-0.5 ns); instruction operands of 1-8 bytes, staged by program/compiler
- L1 and L2 Caches: 10s-100s of KBytes, 1-10 ns, $1000s/GByte; blocks of 32-64 bytes (L1) and 64-128 bytes (L2), staged by the cache controller
- Main Memory: GBytes, 80-200 ns, $100/GByte; pages of 4K-8K bytes, staged by the OS
- Disk: 10s of TBytes, 10 ms (10,000,000 ns), $1/GByte; files of MBytes, staged by the user/operator
- Tape: infinite capacity, sec-min access time, $1/GByte
33 - 3) Focus on the Common Case
- Common sense guides computer design
- Since it's engineering, common sense is valuable
- In making a design trade-off, favor the frequent case over the infrequent case
- E.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first
- E.g., if a database server has 50 disks per processor, storage dependability dominates system dependability, so optimize it first
- The frequent case is often simpler and can be made faster than the infrequent case
- E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing the more common case of no overflow
- This may slow down overflow, but overall performance is improved by optimizing for the normal case
- What is the frequent case, and how much can performance be improved by making that case faster? → Amdahl's Law
34 - 4) Amdahl's Law
- The speedup gained from an enhancement is limited by the fraction of time the enhancement can be used; the limiting form below is the best you could ever hope to do
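Amdahl's Law, as given in the text:

\[ \text{Speedup}_{\text{overall}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} \]

and, letting Speedup_enhanced go to infinity, the best you could ever hope to do:

\[ \text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}} \]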
35 - Amdahl's Law example
- New CPU: 10X faster
- I/O-bound server, so 60% of time is spent waiting for I/O
- Apparently, it is human nature to be attracted by "10X faster," vs. keeping in perspective that it's just 1.6X faster
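Plugging into Amdahl's Law (the enhanced fraction is the 40% of time not spent waiting for I/O):

\[ \text{Speedup}_{\text{overall}} = \frac{1}{0.6 + \dfrac{0.4}{10}} = \frac{1}{0.64} \approx 1.56 \]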
37 - Another Example
Suppose we can enhance the performance of a CPU by adding a vectorization mode that can be used under certain conditions and that computes 10x faster than normal mode. What percent of the run time must be in vectorization mode to achieve an overall speedup of 2?
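Solving Amdahl's Law for the vectorized fraction F:

\[ 2 = \frac{1}{(1 - F) + \dfrac{F}{10}} \;\Rightarrow\; (1 - F) + \frac{F}{10} = 0.5 \;\Rightarrow\; F = \frac{0.5}{0.9} \approx 0.556 \]

so about 56% of the (original) run time must be spent in vectorization mode.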
38 - Metrics of Performance
(Figure: each level of the system stack has its own natural performance metric)
- Application: answers per day/month, operations per second
- Programming Language
- Compiler
- ISA: millions of instructions per second (MIPS), millions of FP operations per second (MFLOPS)
- Datapath, Control: megabytes per second
- Function Units: cycles per second (clock rate)
- Transistors, Wires, Pins
39 - Marketing Metrics
- What about machines with different instruction sets?
- Programs with different instruction mixes?
- Dynamic frequency of instructions
- MIPS is not closely related to performance
- MFLOPS is often not where the time is spent: on most programs, there are not many FP instructions executed
- Therefore → next slide
40 - 5) Processor Performance Equation
CPU time, the time taken to run a program, is the product of instruction count, CPI, and cycle time:

\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]

What affects each factor:

                 Inst Count   CPI   Clock Rate
  Program            X
  Compiler           X        (X)
  Inst. Set          X         X
  Organization                 X        X
  Technology                            X
41 - CPI (Cycles Per Instruction)
- Average cycles per instruction (for the program in question):

\[ \text{CPI} = \frac{\text{CPU time} \times \text{Clock rate}}{\text{Instruction count}} = \frac{\text{Cycles}}{\text{Instruction count}} \]

- With per-class CPIs and instruction counts:

\[ \text{CPU time} = \text{Cycle time} \times \sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i \]
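A worked example with hypothetical numbers (the instruction mix below is mine, for illustration): on a 1 GHz machine, suppose a program executes 10^9 instructions, of which 50% are ALU ops (CPI 1), 30% are loads/stores (CPI 2), and 20% are branches (CPI 1.5). Then

\[ \text{CPI} = 0.5(1) + 0.3(2) + 0.2(1.5) = 1.4, \qquad \text{CPU time} = 10^9 \times 1.4 \times 1\ \text{ns} = 1.4\ \text{s} \]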
42 - And in conclusion ...
- Computer Architecture >> instruction sets
- Computer Architecture skill sets are different:
- 5 quantitative principles of design
- Quantitative approach to design
- Solid interfaces that really work
- Technology tracking and anticipation
- CSCI 620: to learn new skills, transition to research
- Computer Science is at a crossroads from sequential to parallel computing
- This requires innovation in many fields, including computer architecture
43 - CSCI 620 focuses on ...
Understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of computers in the 21st century.
(Figure: at the center, Computer Architecture: Interface Design (ISA), Organization, and the Hardware/Software Boundary; surrounding it: Parallelism, Technology, Programming Languages, Applications, Compilers, Operating Systems, Measurement & Evaluation, and History)
44 - Moore's Law: 2X transistors / year
- "Cramming More Components onto Integrated Circuits"
- Gordon Moore, Electronics, 1965
- Number of transistors per cost-effective integrated circuit doubles every N months (12 ≤ N ≤ 24)
45 - Trends in Technology (1.4)
- Drill down into 4 technologies:
- Disks
- Memory
- Network
- Processors (integrated circuits)
- Compare 1980 "Archaic" (Nostalgic) vs. 2000 "Modern" (Newfangled)
- Performance milestones in each technology
- Compare bandwidth vs. latency improvements in performance over time
- Bandwidth: total amount of work done in a given time
- E.g., Mbits/second over a network, MBytes/second from a disk
- Latency: elapsed time for a single event, from start to finish
- E.g., one-way network delay in microseconds, average disk access time in milliseconds
46 - Disks: Archaic (Nostalgic) v. Modern (Newfangled)
- Seagate 373453, 2003
- 15000 RPM (4X)
- 73.4 GBytes (2500X)
- Tracks/Inch: 64,000 (80X)
- Bits/Inch: 533,000 (60X)
- Four 2.5-inch platters (in a 3.5-inch form factor)
- Bandwidth: 86 MBytes/sec (140X)
- Latency: 5.7 ms (8X)
- Cache: 8 MBytes
- CDC Wren I, 1983
- 3600 RPM
- 0.03 GBytes capacity
- Tracks/Inch: 800
- Bits/Inch: 9,550
- Three 5.25-inch platters
- Bandwidth: 0.6 MBytes/sec
- Latency: 48.3 ms
- Cache: none
47 - Latency Lags Bandwidth (for last 20 years)
- Performance Milestones
- 30%/year up to 1990
- 100%/year from 1996 up to 2004
- 30%/year since 2004
- Disk still 50-100 times cheaper than DRAM
- Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
(latency = simple operation w/o contention; BW = best case)
48 - Memory: Archaic (Nostalgic) v. Modern (Newfangled)
- 2000: Double Data Rate Synchronous (clocked) DRAM
- 256.00 Mbits/chip (4000X)
- 256,000,000 xtors, 204 mm2
- 64-bit data bus per DIMM, 66 pins/chip (4X)
- 1600 MBytes/sec (120X)
- Latency: 52 ns (4X)
- Block transfers (page mode)
- 1980: DRAM (asynchronous)
- 0.06 Mbits/chip
- 64,000 xtors, 35 mm2
- 16-bit data bus per module, 16 pins/chip
- 13 MBytes/sec
- Latency: 225 ns
- (no block transfer)
49 - Latency Lags Bandwidth (last 20 years)
- Performance Milestones
- Capacity increases 40%/year (2 times / 2 years)
- Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
- Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
(latency = simple operation w/o contention; BW = best case)
50 - LANs: Archaic (Nostalgic) v. Modern (Newfangled)
- Ethernet 802.3
- Year of standard: 1978
- 10 Mbits/s link speed
- Latency: 3000 µsec
- Shared media
- Coaxial cable
- Ethernet 802.3ae
- Year of standard: 2003
- 10,000 Mbits/s link speed (1000X)
- Latency: 190 µsec (15X)
- Switched media
- Category 5 copper wire
(Figure: coaxial cable cross-section: plastic covering, braided outer conductor, insulator, copper core)
51 - Latency Lags Bandwidth (last 20 years)
- Performance Milestones
- Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x)
- Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
- Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
(latency = simple operation w/o contention; BW = best case)
52 - CPUs: Archaic (Nostalgic) v. Modern (Newfangled)
- 2001: Intel Pentium 4
- 1500 MHz (120X)
- 4500 MIPS (peak) (2250X)
- Latency: 15 ns (20X)
- 42,000,000 xtors, 217 mm2
- 64-bit data bus, 423 pins
- 3-way superscalar, dynamic translation to RISC, superpipelined (22 stages), out-of-order execution
- On-chip 8KB data cache, 96KB instruction trace cache, 256KB L2 cache
- 1982: Intel 80286
- 12.5 MHz
- 2 MIPS (peak)
- Latency: 320 ns
- 134,000 xtors, 47 mm2
- 16-bit data bus, 68 pins
- Microcode interpreter, separate FPU chip
- (no caches)
53 - Latency Lags Bandwidth (last 20 years)
- Performance Milestones
- Processor: 286, 386, 486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
- Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x)
- Memory Module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
- Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
54 - Rule of Thumb for Latency Lagging BW
- In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4
- (and capacity improves faster than bandwidth)
- Stated alternatively: bandwidth improves by more than the square of the improvement in latency
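A quick arithmetic check of that restatement (numbers mine, for illustration): if latency improves by 1.3X while bandwidth doubles, the 2X bandwidth gain indeed exceeds the square of the latency gain, since 1.3^2 = 1.69 < 2.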
55 - 6 Reasons Latency Lags Bandwidth
- 1. Moore's Law helps BW more than latency
- Faster transistors, more transistors, and more pins help bandwidth:
- MPU transistors: 0.130 M vs. 42 M xtors (300X)
- DRAM transistors: 0.064 M vs. 256 M xtors (4000X)
- MPU pins: 68 vs. 423 pins (6X)
- DRAM pins: 16 vs. 66 pins (4X)
- Smaller, faster transistors, but they communicate over (relatively) longer lines, which limits latency:
- Feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X)
- MPU die size: 47 vs. 217 mm2 (ratio sqrt → 2X)
- DRAM die size: 35 vs. 204 mm2 (ratio sqrt → 2X)
56 - 6 Reasons Latency Lags Bandwidth (cont'd)
- 2. Distance limits latency
- Size of the DRAM block → long bit and word lines → most of the DRAM access time
- 3. Bandwidth easier to sell (bigger = better)
- E.g., 10 Gbits/s Ethernet ("10 Gig") vs. "10 µsec latency" Ethernet
- 4400 MB/s DIMM (PC4400) vs. 50 ns latency
- Even if it is just marketing, customers are now trained
- Since bandwidth sells, more resources are thrown at bandwidth, which further tips the balance
57 - 6 Reasons Latency Lags Bandwidth (cont'd)
- 4. Latency helps BW, but not vice versa
- Spinning a disk faster improves both bandwidth and rotational latency:
- 3600 RPM → 15000 RPM = 4.2X
- Average rotational latency: 8.3 ms → 2.0 ms
- Other things being equal, this also helps BW by 4.2X
- Lower DRAM latency → more accesses/second (higher bandwidth)
- Higher linear density helps disk BW (and capacity), but not disk latency:
- 9,550 BPI → 533,000 BPI → 60X in BW
58 - 6 Reasons Latency Lags Bandwidth (cont'd)
- 5. Bandwidth hurts latency
- Adding chips to widen a memory module increases bandwidth, but the higher fan-out on address lines may increase latency
- 6. Operating system overhead hurts latency more than bandwidth
- Long messages amortize the overhead; overhead is a bigger part of short messages
59 - Summary of Technology Trends
- For disk, LAN, memory, and microprocessor, bandwidth improves by roughly the square of the latency improvement
- In the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X
- The lag is probably even larger in real systems, as bandwidth gains are multiplied by replicated components:
- Multiple processors in a cluster, or even in a chip
- Multiple disks in a disk array
- Multiple memory modules in a large memory
- Simultaneous communication in a switched LAN
- HW and SW developers should innovate assuming Latency Lags Bandwidth
- If everything improves at the same rate, then nothing really changes
- When rates vary, real innovation is required
60 - Scaling in ICs
- Feature size: the minimum size of a single transistor or wire on a chip die
- 1971: 10 microns
- 2001: 0.18 microns
- 2003: 0.06 microns
- 2006: 5 nanometers (0.005 microns), a 2000:1 ratio from 1971
- Complex relationships between performance & feature size:
- Transistor density increases quadratically with a decrease in feature size
- Reduction in feature size requires voltage reduction to maintain correct operation and reasonable reliability
- Scaling IC wiring:
- Signal delay increases with the product of resistance and capacitance
- Shorter wires are not necessarily faster, due to increased resistance and capacitance
61 - Power Consumption of ICs (1.5)
- Energy requirements per transistor are proportional to the load capacitance, the frequency of switching, and the square of the voltage
- Switching frequency and transistor density increase faster than capacitance and voltage decrease, leading to increased power consumption & generated heat
- http://www.phys.ncku.edu.tw/~htsu/humor/fry_egg.html
- The Pentium 4 consumes 100 Watts of power, while the 8086-i386 did not even feature a heat sink
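In symbols, the textbook's dynamic-power relation:

\[ \text{Power}_{\text{dynamic}} = \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched} \]

and the corresponding energy per switching event, Capacitive load x Voltage^2, does not depend on frequency.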
62 - Cost and Price
- The cost of manufacturing decreases over time: the learning curve
- The learning curve is measured as an increase in yield
- Volume doubling leads to a 10% reduction in cost
- Commodity products tend to decrease in cost due to:
- Volume
- Competition
- Efficiency
63 - Cost of Pentium (Figure 1.10 of text)
64 - IC Manufacturing Process
(Figure: silicon cylinder → silicon wafer → patterned silicon wafer → wafer test → unpackaged die → packaged die → final test)
65 - Wafers and Dies
- Chips are produced on round silicon disks: wafers
- Dies are the actual chips, cut out of the wafer
- Testing occurs before cutting and after packaging
66 - Yield and Cost of Chips
- However:
- Wafers do not contain only chip dies; usually a large area, including several die sites, is dedicated to test-equipment hook-up
- Actual yield in mass-production chip fabs varies from 98% for DRAMs down to 1% for new processors
67 - Yield and Cost
- Switch from 200mm to 300mm wafers:
- Although 300mm wafers have lower yield than 200mm wafers, the overhead processing costs per wafer are high enough to make 300mm wafers more cost-effective
- The next candidate size is 450mm (2012?)
- Redundancy in dies:
- Single transistors do fail during production, causing memory cells, pipeline stages, or control logic sections to fail
- Redundancy is built into each die by introducing backup units
- After testing, backup units are enabled and failed units can be disabled by laser
- This decreases the chance of small flaws failing an entire die
- No company has yet released its redundant-circuitry numbers
68 - Difference between Cost and Price of IC
69 - Measuring Performance (1.5)
- The key measure is time
- Response time (execution time): the time between the start and completion of a task. Includes everything: CPU, I/O, timesharing, ...
- CPU time: user CPU time + system CPU time
- Throughput: the total amount of work completed in a given time
- System performance: for an unloaded system
- CPU performance for an unloaded system: the main focus in this chapter
- The formula for user (program) CPU time is:
70 - Comparing Processors (measured by running programs)
"X is n times faster than Y" means:
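\[ n = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y} \]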
71 - Performance: What to Measure (in decreasing order of accuracy)
- Real programs (what customers care about): e.g., compilers, databases, ...
- Portability problem: dependent upon the OS & other hardware
- Modified or scripted real programs (modified for portability): e.g., compression algorithms
- Kernels: small, key pieces of real programs, e.g., Livermore Loops, Linpack; better for testing individual features
- Toy benchmarks: typically 10 to 100 lines of code, useful primarily for intro programming assignments; small, run on any computer; e.g., quicksort, prime numbers, encryption
- Synthetic benchmarks: try to match the average frequency of operations and operands of a set of programs; not realistic programs; e.g., Whetstone, Dhrystone
72 - Benchmark Suites
- Collections of benchmark applications, called benchmark suites, are popular
- SPEC CPU: popular desktop benchmark suite
- CPU only, split between integer and floating-point programs
- SPECint2006 has 12 integer programs, SPECfp2006 has 17 floating-point programs
- SPEC CPU2006 V1.0 released August 2006
- SPECSFS (NFS file server) and SPECWeb (web server) added as server benchmarks
- Transaction Processing Council measures server performance and cost-performance for databases:
- TPC-C: complex queries for online transaction processing
- TPC-H: models ad hoc decision support
- TPC-W: a transactional web benchmark
- TPC-App: application server and web services benchmark
73 - SPECint2006 & SPECfp2006
(Figure 1.13: the SPECint2006 and SPECfp2006 benchmark programs)
74 - How to Summarize Suite Performance (1/5)
- Arithmetic average of the execution times of all programs?
- But they vary by 4X in speed, so some would be more important than others in an arithmetic average
- Could add a weight per program, but how to pick the weights?
- Different companies want different weights for their products
- SPECRatio: normalize execution times to a reference computer, yielding a ratio proportional to performance:

SPECRatio = (time on reference computer) / (time on computer being rated)

- In CPU2006, a Sun Microsystems Ultra Enterprise 2 is used as the reference computer
75 - How to Summarize Suite Performance (2/5)
- If the SPECRatio of a program (in CPU2006 it is called "Base") on Computer A is 1.25 times bigger than on Computer B, then:
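Expanding the definition of SPECRatio, the reference machine's time cancels:

\[ 1.25 = \frac{\text{SPECRatio}_A}{\text{SPECRatio}_B} = \frac{\text{ExecutionTime}_{\text{ref}} / \text{ExecutionTime}_A}{\text{ExecutionTime}_{\text{ref}} / \text{ExecutionTime}_B} = \frac{\text{ExecutionTime}_B}{\text{ExecutionTime}_A} = \frac{\text{Performance}_A}{\text{Performance}_B} \]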
- Note that when comparing 2 computers as a ratio, the execution times on the reference computer drop out, so the choice of reference computer is irrelevant
76 - How to Summarize Suite Performance (3/5)
- Since these are ratios, the proper mean is the geometric mean (a SPECRatio is unitless, so an arithmetic mean is meaningless):
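For n programs:

\[ \text{Geometric mean} = \sqrt[n]{\prod_{i=1}^{n} \text{SPECRatio}_i} \]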
- The geometric mean of the ratios is the same as the ratio of the geometric means
- Ratio of geometric means = geometric mean of performance ratios → the choice of reference computer is irrelevant!
- These two points make the geometric mean of ratios attractive for summarizing performance
77 - How to Summarize Suite Performance (4/5)
- Does a single mean summarize the performance of the programs in the benchmark suite well?
- Can decide whether the mean is a good predictor by characterizing the variability of the distribution using the standard deviation
- Like the geometric mean, the geometric standard deviation is multiplicative rather than arithmetic
- Can simply take the logarithms of the SPECRatios, compute the standard mean and standard deviation, and then take the exponent to convert back
78 - How to Summarize Suite Performance (5/5)
- Standard deviation is more informative if we know the distribution has a standard form:
- bell-shaped normal distribution, whose data are symmetric around the mean
- lognormal distribution, where the logarithms of the data (not the data itself) are normally distributed (symmetric) on a logarithmic scale
- For a lognormal distribution, we expect that:
- 68% of samples fall in the range [GeometricMean / GeometricStDev, GeometricMean x GeometricStDev]
- 95% of samples fall in the range [GeometricMean / GeometricStDev^2, GeometricMean x GeometricStDev^2]
- Note: Excel provides the functions EXP(), LN(), and STDEV(), which make calculating the geometric mean and multiplicative standard deviation easy
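The same log/exp recipe as a small C sketch (my illustration; the SPECRatio values are made up):

    #include <math.h>
    #include <stdio.h>

    /* Geometric mean and multiplicative standard deviation:
       take logs, compute the ordinary mean and (sample) standard
       deviation, then exponentiate to convert back. */
    static void geo_stats(const double r[], int n, double *gm, double *gsd) {
        double mean = 0.0, var = 0.0;
        for (int i = 0; i < n; i++) mean += log(r[i]);
        mean /= n;
        for (int i = 0; i < n; i++) {
            double d = log(r[i]) - mean;
            var += d * d;
        }
        var /= (n - 1);            /* sample variance of the logs */
        *gm  = exp(mean);          /* geometric mean */
        *gsd = exp(sqrt(var));     /* multiplicative std deviation */
    }

    int main(void) {
        double ratios[] = {8.0, 10.0, 12.5, 20.0, 16.0};  /* hypothetical */
        double gm, gsd;
        geo_stats(ratios, 5, &gm, &gsd);
        /* if lognormal, ~68% of samples fall in [gm/gsd, gm*gsd] */
        printf("GM = %.2f, mult. StDev = %.2f\n", gm, gsd);
        return 0;
    }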
79 - Example: Standard Deviation (1/2)
- GM and multiplicative StDev of SPECfp2000 for Itanium 2
80 - Example: Standard Deviation (2/2)
- GM and multiplicative StDev of SPECfp2000 for AMD Athlon
81 - Comments on Itanium 2 and Athlon
- The standard deviation of 1.98 for Itanium 2 is much higher (vs. 1.40 for Athlon), so its results differ more widely from the mean and are therefore likely less predictable
- Falling within one standard deviation:
- 10 of 14 benchmarks (71%) for Itanium 2
- 11 of 14 benchmarks (78%) for Athlon
- Thus, the results are quite compatible with a lognormal distribution (expect 68%)
82 - And in conclusion ...
- Tracking and extrapolating technology is part of the architect's responsibility
- Expect bandwidth in disks, DRAM, networks, and processors to improve by at least as much as the square of the improvement in latency
- Quantify dynamic and static power
- Capacitance x Voltage^2 x frequency; energy vs. power
- Quantify dependability
- Reliability (MTTF, FIT), Availability (99.9%)
- Quantify and summarize performance
- Ratios, geometric mean, multiplicative standard deviation
- Read Appendix A, record bugs online!