Title: OMSE 510: Computing Foundations 2: Disks, Buses, DRAM
1 OMSE 510: Computing Foundations 2: Disks, Buses, DRAM
- Portland State University / OMSE
2 Outline of Comp. Architecture
Outline of the rest of the computer architecture section: start with a description of computer devices and work back towards the CPU.
3 Computer Architecture Is
- "the attributes of a computing system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." - Amdahl, Blaauw, and Brooks, 1964
4Today
- Begin Computer Architecture
- Disk Drives
- The Bus
- Memory
5Computer System (Idealized)
Disk
Memory
CPU
Disk Controller
6 I/O Device Examples
Device            Behavior          Partner    Data Rate (KB/sec)
Keyboard          Input             Human              0.01
Mouse             Input             Human              0.02
Line Printer      Output            Human              1.00
Floppy disk       Storage           Machine           50.00
Laser Printer     Output            Human            100.00
Optical Disk      Storage           Machine          500.00
Magnetic Disk     Storage           Machine        5,000.00
Network-LAN       Input or Output   Machine     20-1,000.00
Graphics Display  Output            Human         30,000.00
7 A Device: The Disk
Disk Drives!
- e.g., your hard disk drive
- Where files are physically stored
- Long-term non-volatile storage device
8Magnetic Drum
9Spiral Format for Compact Disk
10 A Device: The Disk
Magnetic Disks
- Your hard disk drive
- Where files are physically stored
- Long-term non-volatile storage device
11A Magnetic Disk with Three Platters
12 Organization of a Disk Platter with a 1:2 Interleave Factor
13 Disk Physical Characteristics
- Platters
  - 1 to 20 per drive, with diameters from 1.3 to 8 inches (recording on both sides)
- Tracks
  - 2,500 to 5,000 tracks/inch
- Cylinders
  - all tracks in the same position on the platters
- Sectors
  - 128-256 sectors/track, with gaps and sector-related info between them (typical sector: 256-512 bytes)
14 Disk Physical Characteristics
- Trend as of 2005
  - Constant bit density (~10^5 bits/inch)
  - i.e., more info (sectors) on outer tracks
- Strangely enough, history reverses itself
  - Originally, disks used constant bit density (more efficient)
  - Then went to a uniform sectors/track (simpler, allowed easier optimization)
  - Returning now to constant bit density
- Disk capacity follows Moore's law: doubles every 18 months
15 Example: Seagate Barracuda
- Disk for servers
- 10 disks, hence 20 surfaces
- 7,500 cylinders, hence 7,500 x 20 = 150,000 total tracks
- 237 sectors/track (average)
- 512 bytes/sector
- Total capacity
  - 150,000 x 237 x 512 = 18,201,600,000 bytes
  - ~18 GB
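As a sanity check, here is a minimal Python sketch of the capacity arithmetic above (the drive parameters are taken straight from the slide):

  surfaces = 10 * 2              # 10 platters, recorded on both sides
  cylinders = 7500
  sectors_per_track = 237        # average across zones
  bytes_per_sector = 512

  tracks = cylinders * surfaces                       # 150,000 tracks
  capacity = tracks * sectors_per_track * bytes_per_sector
  print(tracks, capacity, capacity / 1e9)             # 150000 18201600000 ~18.2 GB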
16 Things to consider
- Addressing modes
  - Computers always refer to data in blocks (512 bytes is common)
  - How to address blocks?
- Old school: CHS (Cylinder-Head-Sector)
  - Computer has an idea of how the drive is structured
- New school: LBA (Logical Block Addressing)
  - Linear!
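For illustration, the conventional CHS-to-LBA mapping can be sketched in Python as below; the heads-per-cylinder and sectors-per-track values are made-up geometry, not from the slides:

  def chs_to_lba(c, h, s, heads_per_cyl, sectors_per_track):
      # Sectors are numbered from 1 in CHS, hence the (s - 1).
      return (c * heads_per_cyl + h) * sectors_per_track + (s - 1)

  # Hypothetical geometry: 16 heads, 63 sectors/track
  print(chs_to_lba(c=2, h=3, s=4, heads_per_cyl=16, sectors_per_track=63))  # 2208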
17 Disk Performance
- Steps to read from disk
  - CPU tells the drive controller it needs data from a given address
  - Drive decodes the instruction
  - Move the read head over the desired cylinder/track (seek)
  - Wait for the desired sector to rotate under the read head
  - Read the data as it passes under the drive head
18 Disk Performance
- Components of disk performance
  - Seek time (to move the arm onto the right cylinder)
  - Rotation time (time to find the right sector; on average 1/2 a rotation)
  - Transfer time (depends on rotation time)
  - Disk controller time (overhead to perform an access)
19 Disk Performance
- So: Disk Latency = Queuing Time + Controller Time + Seek Time + Rotation Time + Transfer Time
20 Seek Time
- From 0 (if the arm is already positioned) to a maximum of 15-20 ms
- Note: this is not a linear function of distance (speedup, coast, slowdown, settle)
- Even when reading tracks on the same cylinder, there is a minimal seek time (due to tight tolerances for head positioning)
- Barracuda example: average seek time 8 ms, track-to-track seek time 1 ms, full-disk seek 17 ms
21 Rotation Time
- Rotation time
  - Seagate Barracuda: 7200 RPM
  - (Disks these days are 3600, 4800, 5400, 7200, up to 10800 RPM)
  - 7200 RPM = 120 RPS = 8.33 ms per rotation
  - Average rotational latency = 1/2 the worst-case rotational latency = 4.17 ms
22 Transfer Time
- Transfer time depends on rotation time, amount of data to transfer (minimum one sector), recording density, and the disk/memory connection
- These days, transfer rates run around 2 MB/s to 16 MB/s
23 Disk Controller Overhead
- The disk controller contains a microprocessor, buffer memory, and possibly a cache (for disk sectors)
- Overhead to perform an access (on the order of 1 ms)
  - Receiving orders from the CPU and interpreting them
  - Managing the transfer between disk and memory (e.g., managing the DMA)
- The transfer rate between disk and controller is smaller than between controller and memory, hence
  - Need for a buffer in the controller
  - This buffer might take the form of a cache (mostly for read-ahead and write-behind)
24 Disk Time Example
- Disk parameters
  - Transfer size is 8 KB
  - Advertised average seek is 12 ms
  - Disk spins at 7200 RPM
  - Transfer rate is 4 MB/s
  - Controller overhead is 2 ms
  - Assume the disk is idle, so no queuing delay
- What is the average disk time for a sector?
  - avg seek + avg rotational delay + transfer time + controller overhead
  - ____ + ____ + ____ + ____
25 Disk Time Example
- Answer: ~20 ms
  - 12 ms + 4.17 ms + 2 ms + 2 ms
- But! The advertised seek time assumes no locality: real seeks are typically 1/4 to 1/3 of the advertised seek time!
  - 20 ms -> ~12 ms
- Locality is an effect of smart placement of data by the operating system
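A minimal Python sketch of the computation above, using the slide's parameters (the 1/3-of-advertised seek figure is the locality adjustment mentioned on the slide):

  rpm = 7200
  seek_ms = 12.0                         # advertised average seek
  rot_ms = 0.5 * (60_000 / rpm)          # half a rotation = ~4.17 ms
  xfer_ms = (8 / 4000) * 1000            # 8 KB at 4 MB/s = 2 ms
  ctrl_ms = 2.0

  print(seek_ms + rot_ms + xfer_ms + ctrl_ms)          # ~20.2 ms
  print(seek_ms / 3 + rot_ms + xfer_ms + ctrl_ms)      # ~12.2 ms with locality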
26 My Disk
- Hitachi Travelstar 7K100 60GB ATA-6 2.5in
  - 7200 RPM mobile hard drive w/ 8 MB buffer
- Interface
  - ATA-6; Capacity (GB): 60; Sector size (bytes): 512; Data heads: 3; Disks: 2
- Performance
  - Data buffer (MB): 8; Rotational speed (rpm): 7,200; Latency (average, ms): 4.2
  - Media transfer rate (Mbits/sec): 561
  - Max. interface transfer rate (MB/sec): 100 (Ultra DMA mode-5), 16.6 (PIO mode-4)
  - Command overhead: 1 ms
  - Seek time (ms): average 10 R / 11 W; track-to-track 1 R / 1.2 W; full stroke 18 R / 19 W
  - Sectors per track: 414-792; Max. areal density (Gbits/sq. inch): 66
  - Disk-to-buffer data transfer: 267-629 Mb/s
  - Buffer-to-host data transfer: 100 MB/s
27 Some Other Quotes
Hard drives:
- Notebook: Toshiba MK8026GAX, 80 GB, 2.5", 9.5 mm, 5400 RPM, 12 ms seek, 100 MB/s
- Desktop: Seagate, 250 GB, 7200 RPM, SATA II, 9-11 ms seek, buffer to host 300 MB/s, buffer to disk 93 MB/s
- Server: Seagate Raptor, SATA, 10000 RPM, buffer to host 150 MB/s, buffer to disk 72 MB/s
28Next Topic
29 Technology Trends
Disk capacity now doubles every 18 months; before 1990, every 36 months
- Today: processing power doubles every 18 months
- Today: memory size doubles every 18 months (4X/3yr)
- Today: disk capacity doubles every 18 months
- Disk positioning rate (seek + rotate) doubles every ten years!
- Caches in memory and device controllers to close the gap
The I/O GAP
30 Manufacturing Advantages of Disk Arrays
- Conventional disk product families: 4 disk designs (14", 10", 5.25", 3.5") spanning high end to low end
- Disk array: 1 disk design (3.5")
31 Small # of Large Disks? Large # of Small Disks!
                 IBM 3390 (K)    IBM 3.5" 0061    x70 array
Data Capacity    20 GBytes       320 MBytes       23 GBytes
Volume           97 cu. ft.      0.1 cu. ft.      11 cu. ft.
Power            3 KW            11 W             1 KW
Data Rate        15 MB/s         1.5 MB/s         120 MB/s
I/O Rate         600 I/Os/s      55 I/Os/s        3900 I/Os/s
MTTF             250 KHrs        50 KHrs          ??? Hrs
Cost             $250K           $2K              $150K
Disk arrays have potential for large data and I/O rates, high MB per cu. ft., and high MB per KW. Reliability?
32 Array Reliability
- Reliability of N disks = Reliability of 1 disk / N
  - 50,000 hours / 70 disks = ~700 hours
  - Disk system MTTF drops from 6 years to 1 month!
- Arrays (without redundancy) are too unreliable to be useful!
Hot spares support reconstruction in parallel with access: very high media availability can be achieved
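A minimal Python sketch of the reliability arithmetic above (assuming independent failures, so the array MTTF is the single-disk MTTF divided by the number of disks):

  single_disk_mttf_hours = 50_000
  disks = 70

  array_mttf = single_disk_mttf_hours / disks
  print(array_mttf)                              # ~714 hours
  print(single_disk_mttf_hours / (24 * 365))     # single disk: ~5.7 years
  print(array_mttf / (24 * 30))                  # array: ~1 month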
33 Media Bandwidth/Latency Demands
- Bandwidth requirements
  - High quality video
    - Digital data = (30 frames/s) x (640 x 480 pixels) x (24-bit color/pixel) = 221 Mb/s (27.625 MB/s)
  - High quality audio
    - Digital data = (44,100 audio samples/s) x (16-bit audio samples) x (2 audio channels for stereo) = 1.4 Mb/s (0.175 MB/s)
  - Compression reduces the bandwidth requirements considerably
- Latency issues
  - How sensitive is your eye (ear) to variations in video (audio) rates?
  - How can you ensure a constant rate of delivery?
  - How important is synchronizing the audio and video streams?
    - 15 to 20 ms early to 30 to 40 ms late is tolerable
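A minimal Python sketch of the uncompressed-stream bandwidth figures above:

  video_bps = 30 * 640 * 480 * 24          # frames/s * pixels * bits/pixel
  audio_bps = 44_100 * 16 * 2              # samples/s * bits/sample * channels

  print(video_bps / 1e6, video_bps / 8e6)  # ~221 Mb/s, ~27.6 MB/s
  print(audio_bps / 1e6, audio_bps / 8e6)  # ~1.4 Mb/s, ~0.18 MB/s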
34 Dependability, Reliability, Availability
- Reliability: a measure of continuous service accomplishment, measured by the mean time to failure (MTTF). Service interruption is measured by the mean time to repair (MTTR)
- Availability: a measure of service accomplishment
  - Availability = MTTF / (MTTF + MTTR)
- To increase MTTF, either improve the quality of the components or design the system to continue operating in the presence of faulty components
  - Fault avoidance: preventing fault occurrence by construction
  - Fault tolerance: using redundancy to correct or bypass faulty components (hardware)
- Fault detection versus fault correction
- Permanent faults versus transient faults
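A minimal Python sketch of the availability formula above, with made-up MTTF/MTTR numbers for illustration:

  def availability(mttf_hours, mttr_hours):
      return mttf_hours / (mttf_hours + mttr_hours)

  # Hypothetical values: fails once every 50,000 hours, repaired in 8 hours
  print(availability(50_000, 8))    # ~0.99984, i.e. a bit short of "four nines"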
35 RAIDs: Disk Arrays
Redundant Array of Inexpensive Disks
- Arrays of small and inexpensive disks
  - Increase potential throughput by having many disk drives
  - Data is spread over multiple disks
  - Multiple accesses are made to several disks at a time
- Reliability is lower than for a single disk
  - But availability can be improved by adding redundant disks (RAID)
  - Lost information can be reconstructed from redundant information
  - MTTR: mean time to repair is on the order of hours
  - MTTF: mean time to failure of disks is tens of years
36 RAID Level 0 (No Redundancy; Striping)
[Figure: sector 0 striped bit-wise across four disks as S0,b0 / S0,b1 / S0,b2 / S0,b3 (S = sector number, b = bit number)]
- Multiple smaller disks as opposed to one big disk
  - Spreading the data over multiple disks (striping) forces accesses to go to several disks in parallel, increasing the performance
    - Four times the throughput for a 4-disk system
  - Same cost as one big disk, assuming 4 small disks cost the same as one big disk
- No redundancy, so what if one disk fails?
  - Failure of one or more disks is more likely as the number of disks in the system increases
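To make the striping idea concrete, here is a minimal Python sketch (not from the slides) that maps a logical block number to a disk and a block-within-disk for an N-disk RAID-0 set:

  def raid0_map(logical_block, num_disks):
      # Round-robin striping: consecutive logical blocks land on consecutive disks.
      disk = logical_block % num_disks
      block_on_disk = logical_block // num_disks
      return disk, block_on_disk

  for lb in range(8):
      print(lb, raid0_map(lb, num_disks=4))   # blocks 0-3 hit disks 0-3, then wrap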
37 RAID Level 1 (Redundancy via Mirroring)
[Figure: the four striped data disks from RAID 0, plus four more disks holding an identical copy of S0,b0 - S0,b3 as redundant (check) data]
- Uses twice as many disks as RAID 0 (e.g., 8 smaller disks, with the second set of 4 duplicating the first set), so there are always two copies of the data
- Still four times the throughput
- # redundant disks = # data disks, so twice the cost of one big disk
  - writes have to be made to both sets of disks, so writes would be only 1/2 the performance of RAID 0
- What if one disk fails?
  - If a disk fails, the system just goes to the mirror for the data
38 RAID Level 2 (Redundancy via ECC)
[Figure: four data disks (S0,b0 - S0,b3) numbered 3, 5, 6, 7, plus ECC disks numbered 1, 2, 4 holding parity bits.
ECC disks 4 and 2 point to either data disk 6 or 7, but ECC disk 1 says disk 7 is okay, so disk 6 must be in error.]
- ECC disks contain the parity of data on a set of distinct overlapping disks
- Still four times the throughput
- # redundant disks = log(total # of disks), so almost twice the cost of one big disk
  - writes require computing parity to write to the ECC disks
  - reads require reading the ECC disks and confirming parity
- Can tolerate limited disk failure, since the data can be reconstructed
39 RAID Level 3 (Bit-Interleaved Parity)
[Figure: four data disks (S0,b0 - S0,b3) plus one parity disk; when a disk fails, its bits are reconstructed from the remaining disks plus parity]
- Cost of higher availability is reduced to 1/N, where N is the number of disks in a protection group
- Still four times the throughput
- # redundant disks = 1 per protection group
  - writes require writing the new data to the data disk as well as computing the parity, meaning reading the other disks, so that the parity disk can be updated
  - reads (after a failure) require reading all the operational data disks as well as the parity disk to calculate the missing data that was stored on the failed disk
- Can tolerate limited disk failure, since the data can be reconstructed
40 RAID Level 4 (Block-Interleaved Parity)
[Figure: four data disks plus one dedicated parity disk, with parity stored as blocks]
- Cost of higher availability is still only 1/N, but the parity is stored as blocks associated with a set of data blocks
- Still four times the throughput
- # redundant disks = 1 per protection group
- Supports "small reads" and "small writes" (reads and writes that go to just one (or a few) data disks in a protection group)
  - by watching which bits change when writing new information, we need only change the corresponding bits on the parity disk
  - the parity disk must be updated on every write, so it is a bottleneck for back-to-back writes
- Can tolerate limited disk failure, since the data can be reconstructed
41 Block Writes
[Figure: writing new data into a stripe of data disks D0-D3 with parity P]
- One approach: 5 writes involving all the disks
- The alternative: 2 reads and 2 writes involving just two disks
42 RAID Level 5 (Distributed Block-Interleaved Parity)
- Cost of higher availability is still only 1/N, but the parity is spread throughout all the disks, so there is no single bottleneck for writes
- Still four times the throughput
- # redundant disks = 1 per protection group
- Supports "small reads" and "small writes" (reads and writes that go to just one (or a few) data disks in a protection group)
- Allows multiple simultaneous writes as long as the accompanying parity blocks are not located on the same disk
- Can tolerate limited disk failure, since the data can be reconstructed
43 Problems of Disk Arrays: Block Writes
RAID-5 Small Write Algorithm: 1 logical write = 2 physical reads + 2 physical writes
[Figure: stripe D0, D1, D2, D3 with parity P. To write D0': (1) read the old data D0, (2) read the old parity P, XOR the old data with the new data and with the old parity to form the new parity P', then (3) write D0' and (4) write P', giving the new stripe D0', D1, D2, D3, P']
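A minimal Python sketch of that small-write parity update (the key identity is new parity = old parity XOR old data XOR new data, so the untouched disks never need to be read):

  def raid5_small_write(old_data, new_data, old_parity):
      # XOR out the old contribution of this block and XOR in the new one.
      return old_parity ^ old_data ^ new_data

  # Toy "blocks": stripe D0-D3 with parity = XOR of all data blocks
  d = [0b1010, 0b0110, 0b0001, 0b1111]
  p = d[0] ^ d[1] ^ d[2] ^ d[3]

  new_d0 = 0b0011
  p = raid5_small_write(d[0], new_d0, p)
  d[0] = new_d0
  assert p == d[0] ^ d[1] ^ d[2] ^ d[3]    # parity still covers the stripe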
44 Distributing Parity Blocks
RAID 4 (dedicated parity disk)       RAID 5 (parity rotated across disks)
0   1   2   3   P0                   0   1   2   3   P0
4   5   6   7   P1                   4   5   6   P1  7
8   9   10  11  P2                   8   9   P2  10  11
12  13  14  15  P3                   12  P3  13  14  15
- By distributing parity blocks across all disks, some small writes can be performed in parallel
45 Disks Summary
- Four components of disk access time
  - Seek time: advertised to be 3 to 14 ms, but lower in real systems
  - Rotational latency: 5.6 ms at 5400 RPM and 2.0 ms at 15000 RPM
  - Transfer time: 10 to 80 MB/s
  - Controller time: typically less than 0.2 ms
- RAIDs can be used to improve availability
  - RAID 0 and RAID 5 are widely used in servers; one estimate is that 80% of disks in servers are RAIDs
  - RAID 1 (mirroring): EMC, Tandem, IBM
  - RAID 3: Storage Concepts
  - RAID 4: Network Appliance
- RAIDs have enough redundancy to allow continuous operation
46Computer System (Idealized)
Disk
Memory
CPU
Disk Controller
47Next Topic
48 What is a bus?
- A bus is
  - a shared communication link
  - a single set of wires used to connect multiple subsystems
- A bus is also a fundamental tool for composing large, complex systems
  - a systematic means of abstraction
49 Bridge-Based Bus Architecture
- Bridging with dual Pentium II Xeon processors on Slot 2.
- (Source: http://www.intel.com.)
50Buses
51 Advantages of Buses
[Figure: several I/O devices sharing one bus]
- Versatility
  - New devices can be added easily
  - Peripherals can be moved between computer systems that use the same bus standard
- Low cost
  - A single set of wires is shared in multiple ways
52 Disadvantages of Buses
[Figure: several I/O devices sharing one bus]
- It creates a communication bottleneck
  - The bandwidth of the bus can limit the maximum I/O throughput
- The maximum bus speed is largely limited by
  - The length of the bus
  - The number of devices on the bus
  - The need to support a range of devices with
    - Widely varying latencies
    - Widely varying data transfer rates
53 The General Organization of a Bus
- Control lines
  - Signal requests and acknowledgments
  - Indicate what type of information is on the data lines
- Data lines carry information between the source and the destination
  - Data and addresses
  - Complex commands
54 Master versus Slave
[Figure: the bus master issues commands; data can go either way between bus master and bus slave]
- A bus transaction includes two parts
  - Issuing the command (and address): the request
  - Transferring the data: the action
- The master is the one who starts the bus transaction by
  - issuing the command (and address)
- The slave is the one who responds to the address by
  - Sending data to the master if the master asks for data
  - Receiving data from the master if the master wants to send data
55 Types of Buses
- Processor-memory bus (design specific)
  - Short and high speed
  - Only needs to match the memory system
    - Maximize memory-to-processor bandwidth
  - Connects directly to the processor
  - Optimized for cache block transfers
- I/O bus (industry standard)
  - Usually lengthy and slower
  - Needs to match a wide range of I/O devices
  - Connects to the processor-memory bus or backplane bus
- Backplane bus (standard or proprietary)
  - Backplane: an interconnection structure within the chassis
  - Allows processors, memory, and I/O devices to coexist
  - Cost advantage: one bus for all components
56 Example: Pentium System Organization
- Processor/memory bus: design specific
- Backplane bus: PCI (PCI devices, graphics, I/O control)
- I/O busses: IDE, USB, SCSI
57Standard Intel Pentium Read and Write Bus Cycles
58Intel Pentium Burst Read Bus Cycle
59 A Computer System with One Bus: the Backplane Bus
[Figure: processor, memory, and I/O devices all attached to a single backplane bus]
- A single bus (the backplane bus) is used for
  - Processor-to-memory communication
  - Communication between I/O devices and memory
- Advantages: simple and low cost
- Disadvantages: slow, and the bus can become a major bottleneck
- Example: IBM PC-AT
60 A Two-Bus System
- I/O buses tap into the processor-memory bus via bus adaptors to speed-match between bus types
  - Processor-memory bus: mainly for processor-memory traffic
  - I/O buses: provide expansion slots for I/O devices
- Apple Macintosh II
  - NuBus: processor, memory, and a few selected I/O devices
  - SCSI bus: the rest of the I/O devices
61 A Three-Bus System (+ backside cache)
- A small number of backplane buses tap into the processor-memory bus
  - Processor-memory bus is used mainly for traffic to/from memory
  - I/O buses are connected to the backplane bus
- Advantage: loading on the processor bus is greatly reduced, and the busses can run at different speeds
62Main components of Intel Chipset Pentium II/III
- Northbridge
- Handles memory
- Graphics
- Southbridge I/O
- PCI bus
- Disk controllers
- USB controllers
- Audio (AC97)
- Serial I/O
- Interrupt controller
- Timers
63 What defines a bus?
- Transaction protocol
- Timing and signaling specification
- Bunch of wires
- Electrical specification
- Physical / mechanical characteristics: the connectors
64 Synchronous and Asynchronous Buses
- Synchronous bus
  - Includes a clock in the control lines
  - A fixed protocol for communication relative to the clock
  - Advantage: involves very little logic and can run very fast
  - Disadvantages
    - Every device on the bus must run at the same clock rate
    - To avoid clock skew, buses cannot be long if they are fast
- Asynchronous bus
  - Not clocked
  - Can accommodate a wide range of devices
  - Can be lengthened without worrying about clock skew
  - Requires a handshaking protocol
65 Busses so far
[Figure: master and slave connected by control, address, and data lines]
- Bus master: has the ability to control the bus; initiates the transaction
- Bus slave: module activated by the transaction
- Bus communication protocol: specification of the sequence of events and timing requirements in transferring information
- Asynchronous bus transfers: control lines (req, ack) serve to orchestrate sequencing
- Synchronous bus transfers: sequenced relative to a common clock
66 Simplest Bus Paradigm
- All agents operate synchronously
- All can source / sink data at the same rate
- => simple protocol
  - just manage the source and target
67 Simple Synchronous Protocol
[Figure: timing diagram with BReq, BG, R/W + Address (Cmd+Addr), and Data1/Data2 on the data lines]
- Even memory busses are more complex than this
  - memory (slave) may take time to respond
  - it may need to control the data rate
68 Typical Synchronous Protocol
[Figure: the same timing diagram, but with Wait cycles inserted before the data is transferred]
- Slave indicates when it is prepared for the data transfer
- Actual transfer goes at bus rate
69 Asynchronous Handshake: Write Transaction
[Figure: timing diagram with Address, Data, Read, Req, and Ack lines over times t0-t5]
- t0: Master has obtained control and asserts address, direction, and data; waits a specified amount of time for slaves to decode the target
- t1: Master asserts the request line
- t2: Slave asserts ack, indicating data received
- t3: Master releases req
- t4: Slave releases ack
70 Asynchronous Handshake: Read Transaction
[Figure: timing diagram with Address, Data (driven by the slave), Read, Req, and Ack lines over times t0-t5]
- t0: Master has obtained control and asserts address and direction; waits a specified amount of time for slaves to decode the target
- t1: Master asserts the request line
- t2: Slave asserts ack, indicating it is ready to transmit data
- t3: Master releases req, data received
- t4: Slave releases ack
71 What is DMA (Direct Memory Access)?
- Typical I/O devices must transfer large amounts of data to the memory of the processor
  - Disk must transfer a complete block (4K? 16K?)
  - Large packets from the network
  - Regions of the video frame buffer
- DMA gives an external device the ability to write memory directly: much lower overhead than having the processor request one word at a time
  - The processor (or at least the memory system) acts like a slave
- Issue: cache coherence
  - What if an I/O device writes data that is currently in the processor cache?
    - The processor may never see the new data!
  - Solutions
    - Flush the cache on every I/O operation (expensive)
    - Have hardware invalidate the cache lines
72 Bus Transaction
- Arbitration: who gets the bus
- Request: what do we want to do
- Action: what happens in response
73 Arbitration: Obtaining Access to the Bus
[Figure: the master initiates requests via the control lines; data can go either way between bus master and bus slave]
- One of the most important issues in bus design
  - How is the bus reserved by a device that wishes to use it?
- Chaos is avoided by a master-slave arrangement
  - Only the bus master can control access to the bus
    - It initiates and controls all bus requests
  - A slave responds to read and write requests
- The simplest system
  - The processor is the only bus master
  - All bus requests must be controlled by the processor
  - Major drawback: the processor is involved in every transaction
74 Multiple Potential Bus Masters: the Need for Arbitration
- Bus arbitration scheme
  - A bus master wanting to use the bus asserts the bus request
  - A bus master cannot use the bus until its request is granted
  - A bus master must signal the arbiter after it finishes using the bus
- Bus arbitration schemes usually try to balance two factors
  - Bus priority: the highest-priority device should be serviced first
  - Fairness: even the lowest-priority device should never be completely locked out from the bus
- Bus arbitration schemes can be divided into four broad classes
  - Daisy chain arbitration
  - Centralized, parallel arbitration
  - Distributed arbitration by self-selection: each device wanting the bus places a code indicating its identity on the bus
  - Distributed arbitration by collision detection: each device just "goes for it"; problems are found after the fact
75 The Daisy Chain Bus Arbitration Scheme
[Figure: the bus arbiter passes the Grant signal through Device 1 (highest priority) down to Device N (lowest priority); the Request and Release lines are wired-OR]
- Advantage: simple
- Disadvantages
  - Cannot assure fairness: a low-priority device may be locked out indefinitely (see the sketch below)
  - The use of the daisy-chain grant signal also limits the bus speed
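To make the fixed-priority behavior concrete, here is a minimal Python sketch (not from the slides) of a daisy-chain-style arbiter: the grant always goes to the highest-priority requester, which is exactly why a low-priority device can starve:

  def daisy_chain_grant(requests):
      # requests[0] is the highest-priority device, requests[-1] the lowest.
      for device, wants_bus in enumerate(requests):
          if wants_bus:
              return device          # grant propagates no further down the chain
      return None

  print(daisy_chain_grant([False, True, False, True]))  # device 1 wins; device 3 waits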
76 Centralized Parallel Arbitration
[Figure: each device has its own Req and Grant lines to a central bus arbiter]
- Used in essentially all processor-memory busses and in high-speed I/O busses
77 Increasing the Bus Bandwidth
- Separate versus multiplexed address and data lines
  - Address and data can be transmitted in one bus cycle if separate address and data lines are available
  - Cost: (a) more bus lines, (b) increased complexity
- Data bus width
  - By increasing the width of the data bus, transfers of multiple words require fewer bus cycles
  - Example: the SPARCstation 20's memory bus is 128 bits wide
  - Cost: more bus lines
- Block transfers
  - Allow the bus to transfer multiple words in back-to-back bus cycles
  - Only one address needs to be sent at the beginning
  - The bus is not released until the last word is transferred
  - Cost: (a) increased complexity, (b) increased response time for other requests
78 Increasing Transaction Rate on a Multimaster Bus
- Overlapped arbitration
  - perform arbitration for the next transaction during the current transaction
- Bus parking
  - a master can hold onto the bus and perform multiple transactions as long as no other master makes a request
- Overlapped address / data phases (see previous slide)
  - requires one of the above techniques
- Split-phase (or packet-switched) bus
  - completely separate address and data phases
  - arbitrate separately for each
  - the address phase yields a tag which is matched with the data phase
- All of the above are used in most modern buses
79 PCI Read/Write Transactions
- All signals sampled on the rising edge
- Centralized parallel arbitration
  - overlapped with the previous transaction
- All transfers are (unlimited) bursts
- Address phase starts by asserting FRAME
- Next cycle, the initiator asserts cmd and address
- Data transfers happen when
  - IRDY is asserted by the master when ready to transfer data
  - TRDY is asserted by the target when ready to transfer data
  - transfer occurs when both are asserted on a rising edge
- FRAME is deasserted when the master intends to complete only one more data transfer
80 PCI Read Transaction
- Turn-around cycle on any signal driven by more than one agent
81PCI Write Transaction
82 PCI Optimizations
- Push bus efficiency toward 100% under common simple usage
  - like RISC
- Bus parking
  - retain the bus grant for the previous master until another makes a request
  - the granted master can start the next transfer without arbitration
- Arbitrary burst length
  - initiator and target can exert flow control with xRDY
  - target can disconnect the request with STOP (abort or retry)
  - master can disconnect by deasserting FRAME
  - arbiter can disconnect by deasserting GNT
- Delayed (pended, split-phase) transactions
  - free the bus after a request to a slow device
83 Summary
- Buses are an important technique for building large-scale systems
  - Their speed is critically dependent on factors such as length, number of devices, etc.
  - Critically limited by capacitance
- Important terminology
  - Master: the device that can initiate new transactions
  - Slaves: devices that respond to the master
- Two types of bus timing
  - Synchronous: bus includes a clock
  - Asynchronous: no clock, just REQ/ACK strobing
- Direct Memory Access (DMA) allows fast, burst transfers into the processor's memory
  - The processor's memory acts like a slave
  - Probably requires some form of cache coherence so that DMAed memory can be invalidated from the cache
84 The Big Picture: Where are We Now?
- The five classic components of a computer: processor (control + datapath), memory, input, output
- Next topic
  - Locality and the memory hierarchy
  - SRAM memory technology
  - DRAM memory technology
  - Memory organization
85 Technology Trends
            Capacity          Speed (latency)
Logic       2x in 3 years     2x in 3 years
DRAM        4x in 3 years     2x in 10 years
Disk        4x in 3 years     2x in 10 years

DRAM:  Year   Size     Cycle Time
       1980   64 Kb    250 ns
       1983   256 Kb   220 ns
       1986   1 Mb     190 ns
       1989   4 Mb     165 ns
       1992   16 Mb    145 ns
       1995   64 Mb    120 ns
       (capacity: 1000:1!   speed: 2:1!)
86 Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)
[Chart, 1980-2000, log scale: CPU performance ("Moore's Law") grows ~60%/yr (2X/1.5yr) while DRAM latency improves ~9%/yr (2X/10yrs), so the processor-memory performance gap grows ~50%/year]
87 Today's Situation: Microprocessor
- Rely on caches to bridge the gap
- Microprocessor-DRAM performance gap
  - time of a full cache miss in instructions executed
  - 1st Alpha (7000): 340 ns / 5.0 ns = 68 clks x 2, or 136 instructions
  - 2nd Alpha (8400): 266 ns / 3.3 ns = 80 clks x 4, or 320 instructions
  - 3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks x 6, or 648 instructions
  - 1/2X latency x 3X clock rate x 3X instr/clock = ~5X
88 Cache Performance
- CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time
- Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty) + (Writes x Write miss rate x Write miss penalty)
- Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
89 Impact on Performance
- Suppose a processor executes at
  - Clock rate = 200 MHz (5 ns per cycle)
  - Base CPI = 1.1
  - 50% arith/logic, 30% ld/st, 20% control
- Suppose that 10% of memory operations get a 50-cycle miss penalty
- Suppose that 1% of instructions get the same miss penalty
- CPI = Base CPI + average stalls per instruction
      = 1.1 (cycles/ins)
        + 0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)
        + 1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)
      = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
- Memory stalls account for 2.0 of the 3.1 cycles per instruction: the processor is stalled waiting for memory roughly 65% of the time! (a worked check follows below)
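A minimal Python sketch of the CPI arithmetic above:

  base_cpi = 1.1
  data_stalls = 0.30 * 0.10 * 50     # ld/st fraction * miss rate * penalty = 1.5
  inst_stalls = 1.00 * 0.01 * 50     # instruction fetches * miss rate * penalty = 0.5

  cpi = base_cpi + data_stalls + inst_stalls
  print(cpi)                                   # 3.1
  print((data_stalls + inst_stalls) / cpi)     # ~0.645 of cycles are memory stalls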
90 The Goal: the illusion of large, fast, cheap memory
- Fact: large memories are slow; fast memories are small
- How do we create a memory that is large, cheap, and fast (most of the time)?
  - Hierarchy
  - Parallelism
91 Why Hierarchy Works
- The Principle of Locality
  - Programs access a relatively small portion of the address space at any instant of time.
92 Memory Hierarchy: How Does it Work?
- Temporal locality (locality in time)
  - => Keep the most recently accessed data items closer to the processor
- Spatial locality (locality in space)
  - => Move blocks consisting of contiguous words to the upper levels
93 Memory Hierarchy: Terminology
[Figure: the processor exchanges Block X with upper-level memory and Block Y with lower-level memory]
- Hit: the data appears in some block in the upper level of the hierarchy (example: Block X is found in the L1 cache)
  - Hit rate: the fraction of memory accesses found in the upper level
  - Hit time: time to access the upper level, which consists of
    - RAM access time + time to determine hit/miss
- Miss: the data needs to be retrieved from a block in the lower level of the hierarchy (Block Y is not in the L1 cache and must be fetched from main memory)
  - Miss rate = 1 - (Hit rate)
  - Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Hit time << Miss penalty
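To illustrate why Hit Time << Miss Penalty matters, here is a minimal Python sketch computing the average access time from the terms defined above (the hit time, miss penalty, and hit rates are made-up values for illustration):

  def avg_access_time(hit_time, miss_rate, miss_penalty):
      # Every access pays the hit time; misses additionally pay the miss penalty.
      return hit_time + miss_rate * miss_penalty

  # Hypothetical numbers: 1 ns hit time, 50 ns miss penalty
  for hit_rate in (0.90, 0.95, 0.99):
      print(hit_rate, avg_access_time(1.0, 1.0 - hit_rate, 50.0))
  # 0.90 -> 6.0 ns, 0.95 -> 3.5 ns, 0.99 -> 1.5 ns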
94 Memory Hierarchy of a Modern Computer System
- By taking advantage of the principle of locality
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.
[Figure: hierarchy from registers and on-chip cache in the processor, through the second-level cache (SRAM) and main memory (DRAM), down to secondary storage (disk); speeds range from ~1 ns at the top to 10,000,000s of ns (10s of ms) and beyond at the bottom, and sizes from 100s of bytes up through Ks, Ms, Gs, and Ts of bytes]
95 How is the hierarchy managed?
- Registers <-> Memory
  - by the compiler (programmer?)
- Cache <-> Memory
  - by the hardware
- Memory <-> Disks
  - by the hardware and operating system (disk caches, virtual memory)
  - by the programmer (files)
96 Memory Hierarchy: Technology
- Random access
  - "Random" is good: access time is the same for all locations
  - DRAM: Dynamic Random Access Memory
    - High density, low power, cheap, slow
    - Dynamic: needs to be refreshed regularly (1-2% of cycles)
  - SRAM: Static Random Access Memory
    - Low density, high power, expensive, fast
    - Static: content will last "forever" (until power is lost)
- "Not-so-random" access technology
  - Access time varies from location to location and from time to time
  - Examples: disk, CDROM
- Sequential access technology: access time linear in location (e.g., tape)
- We will concentrate on random access technology
  - Main memory: DRAMs; caches: SRAMs
97 Main Memory Background
- Performance of main memory
  - Latency: cache miss penalty
    - Access time: time between the request and the word arriving
    - Cycle time: time between requests
  - Bandwidth: I/O and large block miss penalty (L2)
- Main memory is DRAM: Dynamic Random Access Memory
  - Dynamic, since it needs to be refreshed periodically (every ~8 ms)
  - Addresses divided into 2 halves (memory as a 2D matrix)
    - RAS or Row Access Strobe
    - CAS or Column Access Strobe
- Cache uses SRAM: Static Random Access Memory
  - No refresh (6 transistors/bit vs. 1 transistor)
  - Size: DRAM/SRAM = 4-8; Cost and cycle time: SRAM/DRAM = 8-16
98 Random Access Memory (RAM) Technology
- Why do computer designers need to know about RAM technology?
  - Processor performance is usually limited by memory bandwidth
  - As IC densities increase, lots of memory will fit on the processor chip
    - Tailor on-chip memory to specific needs
      - Instruction cache
      - Data cache
      - Write buffer
- What makes RAM different from a bunch of flip-flops?
  - Density: RAM is much denser
99 Main Memory Deep Background
- "Out-of-Core", "In-Core", "Core Dump"?
- "Core memory"?
  - Non-volatile, magnetic
  - Lost to 4 Kbit DRAM (today using 64 Mbit DRAM)
  - Access time 750 ns, cycle time 1500-3000 ns
100 Static RAM Cell
[Figure: 6-transistor SRAM cell; the word (row select) line connects the cross-coupled inverters to the bit and bit-bar lines; pass transistors may be replaced with pullups to save area]
- Write
  - 1. Drive the bit lines (bit = 1, bit-bar = 0, or vice versa)
  - 2. Select the row
- Read
  - 1. Precharge bit and bit-bar to Vdd or Vdd/2 => make sure they are equal!
  - 2. Select the row
  - 3. The cell pulls one line low
  - 4. The sense amp on the column detects the difference between bit and bit-bar
101 Typical SRAM Organization: 16-word x 4-bit
[Figure: a 16 x 4 array of SRAM cells. The address decoder takes A0-A3 and drives one of 16 word lines (Word 0 ... Word 15); each word line selects four cells, which connect to the Din0-Din3 / Dout0-Dout3 columns through precharge logic and a write enable (WrEn)]
102 Logic Diagram of a Typical SRAM
- Write Enable is usually active low (WE_L)
- Din and Dout are combined (D) to save pins
  - A new control signal, output enable (OE_L), is needed
  - WE_L asserted (low), OE_L deasserted (high)
    - D serves as the data input pin
  - WE_L deasserted (high), OE_L asserted (low)
    - D is the data output pin
  - Both WE_L and OE_L asserted
    - Result is unknown. Don't do that!!!
103 Typical SRAM Timing
[Figure: write and read timing waveforms for A, D, WE_L, and OE_L. A write presents the write address and data-in on D, subject to write setup and hold times around WE_L; a read presents the read address, and after the read access time D drives data out (high-Z otherwise)]
104 1-Transistor Memory Cell (DRAM)
[Figure: one access transistor, gated by the row select line, connects a storage capacitor to the bit line]
- Write
  - 1. Drive the bit line
  - 2. Select the row
- Read
  - 1. Precharge the bit line to Vdd
  - 2. Select the row
  - 3. The cell and the bit line share charge
    - Very small voltage changes on the bit line
  - 4. Sense (fancy sense amp)
    - Can detect changes of ~1 million electrons
  - 5. Write: restore the value
- Refresh
  - 1. Just do a dummy read of every cell.
105 Classical DRAM Organization (square)
[Figure: a square RAM cell array; the row decoder takes the row address and drives the word (row) select lines, while the column selector / I/O circuits take the column address and select among the bit (data) lines]
- Each intersection represents a 1-T DRAM cell
- The row and column address together select 1 bit at a time
106 Logic Diagram of a Typical DRAM
[Figure: a 256K x 8 DRAM with a 9-bit multiplexed address bus A, an 8-bit data bus D, and control signals RAS_L, CAS_L, WE_L, OE_L]
- Control signals (RAS_L, CAS_L, WE_L, OE_L) are all active low
- Din and Dout are combined (D)
  - WE_L asserted (low), OE_L deasserted (high): D serves as the data input pin
  - WE_L deasserted (high), OE_L asserted (low): D is the data output pin
- Row and column addresses share the same pins (A)
  - RAS_L goes low: pins A are latched in as the row address
  - CAS_L goes low: pins A are latched in as the column address
  - RAS/CAS are edge-sensitive
107 DRAM Read Timing
- Every DRAM access begins with the assertion of RAS_L
- 2 ways to read: early or late with respect to CAS
[Figure: DRAM read cycle waveforms for A (row address, then column address), RAS_L, CAS_L, WE_L, OE_L, and D; D is high-Z except when data is driven out after the read access time and output enable delay. Early read cycle: OE_L asserted before CAS_L. Late read cycle: OE_L asserted after CAS_L]
108 DRAM Write Timing
- Every DRAM access begins with the assertion of RAS_L
- 2 ways to write: early or late with respect to CAS
[Figure: DRAM write cycle waveforms for A (row address, then column address), RAS_L, CAS_L, WE_L, OE_L, and D; data-in must be valid around the WR access time. Early write cycle: WE_L asserted before CAS_L. Late write cycle: WE_L asserted after CAS_L]
109 Key DRAM Timing Parameters
- tRAC: minimum time from RAS falling to valid data output
  - Quoted as the "speed" of a DRAM
  - A fast 4 Mb DRAM: tRAC = 60 ns
- tRC: minimum time from the start of one row access to the start of the next
  - tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tCAC: minimum time from CAS falling to valid data output
  - 15 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tPC: minimum time from the start of one column access to the start of the next
  - 35 ns for a 4 Mbit DRAM with a tRAC of 60 ns
110 DRAM Performance
- A 60 ns (tRAC) DRAM can
  - perform a row access only every 110 ns (tRC)
  - perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC)
    - In practice, external address delays and turning around buses make it 40 to 50 ns
- These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead
  - Driving parallel DRAMs, external memory controller, bus turnaround, SIMM module, pins...
  - 180 ns to 250 ns latency from processor to memory is good for a 60 ns (tRAC) DRAM
111 Main Memory Performance
- Simple
  - CPU, cache, bus, memory all the same width (32 bits)
- Wide
  - CPU/Mux: 1 word; Mux/cache, bus, memory: N words (Alpha: 64 bits and 256 bits)
- Interleaved
  - CPU, cache, bus: 1 word; memory: N modules (4 modules in this example); word-interleaved
112 Main Memory Performance
[Figure: access time vs. cycle time]
- DRAM (read/write) cycle time >> DRAM (read/write) access time
  - roughly 2:1; why?
- DRAM (read/write) cycle time
  - How frequently can you initiate an access?
  - Analogy: a little kid can only ask his father for money on Saturday
- DRAM (read/write) access time
  - How quickly will you get what you want once you initiate an access?
  - Analogy: as soon as he asks, his father will give him the money
- DRAM bandwidth limitation analogy
  - What happens if he runs out of money on Wednesday?
113 Increasing Bandwidth: Interleaving
[Figure: access pattern without interleaving: the CPU must wait until D1 is available from the single memory before starting the access for D2. Access pattern with 4-way interleaving: accesses to banks 0-3 are started back-to-back, and bank 0 can be accessed again as soon as its cycle time has elapsed]
114 Main Memory Performance
- Timing model
  - 1 cycle to send the address
  - 4 cycles access time, 10 cycles cycle time, 1 cycle to send data
  - Cache block is 4 words
- Simple M.P.      = 4 x (1 + 10 + 1) = 48 cycles
- Wide M.P.        = 1 + 10 + 1 = 12 cycles
- Interleaved M.P. = 1 + 10 + 4 x 1 = 15 cycles (a worked check follows below)
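A minimal Python sketch of the miss-penalty arithmetic above for a 4-word cache block (1 cycle to send the address, 10 cycles per memory cycle, 1 cycle to return each word):

  addr, cycle, xfer, words = 1, 10, 1, 4

  simple      = words * (addr + cycle + xfer)   # one full access per word = 48
  wide        = addr + cycle + xfer             # all 4 words at once      = 12
  interleaved = addr + cycle + words * xfer     # overlapped banks         = 15
  print(simple, wide, interleaved)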
115 Independent Memory Banks
- How many banks?
  - number of banks >= number of clocks to access a word in a bank
  - For sequential accesses; otherwise we return to the original bank before it has the next word ready
- Increasing DRAM capacity => fewer chips => harder to have enough banks
  - Growth in bits/chip for DRAM: 50-60%/yr
  - Nathan Myhrvold (Microsoft): mature software growth (33%/yr for NT) tracks the growth in MB/$ of DRAM (25-30%/yr)
116 Fewer DRAMs/System over Time
(from Pete MacWilliams, Intel)
[Table: DRAM generation ('86 1 Mb, '89 4 Mb, '92 16 Mb, '96 64 Mb, '99 256 Mb, '02 1 Gb) versus minimum PC memory size (4 MB through 256 MB). Memory per DRAM grows at ~60%/year, while memory per system grows at only 25-30%/year, so the minimum number of DRAMs per system falls (e.g., from 16 chips to 4)]
117 Fast Page Mode Operation
- Regular DRAM organization
  - N rows x N columns x M bits
  - Read and write M bits at a time
  - Each M-bit access requires a RAS / CAS cycle
- Fast Page Mode DRAM
  - An N x M SRAM register saves a row
  - After a row is read into the register
    - Only CAS is needed to access other M-bit blocks on that row
    - RAS_L remains asserted while CAS_L is toggled
[Figure: the row address selects one of N rows of the DRAM array into the N x M SRAM row register; the column address then picks the M-bit output]
118FP Mode DRAM
- Fast page mode DRAM
- In page mode, a row of the DRAM can be kept
"open", so that successive reads or writes within
the row do not suffer the delay of precharge and
accessing the row. This increases the performance
of the system when reading or writing bursts of
data.
119 Key DRAM Timing Parameters
- tRAC: minimum time from RAS falling to valid data output
  - Quoted as the "speed" of a DRAM
  - A fast 4 Mb DRAM: tRAC = 60 ns
- tRC: minimum time from the start of one row access to the start of the next
  - tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tCAC: minimum time from CAS falling to valid data output
  - 15 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tPC: minimum time from the start of one column access to the start of the next
  - 35 ns for a 4 Mbit DRAM with a tRAC of 60 ns
120 SDRAM: Synchronous DRAM
- More complicated, on-chip controller
  - Operations synchronized to a clock
    - So: give the row address one cycle
    - Column address some number of cycles later (say 3)
    - Data comes out later still (say 2 cycles later)
- Burst modes
  - Typical might be 1, 2, 4, 8, or 256-length bursts
  - Thus, RAS and CAS are given only once for all of these accesses
- Multi-bank operation (on-chip interleaving)
  - Lets you overlap the startup latency (5 cycles above) of two banks
- Careful of timing specs!
  - A "10 ns" SDRAM may still require 50 ns to get the first data!
  - A 50 ns DRAM means first data out in 50 ns
121 Other Types of DRAM
- Extended Data Out (EDO) DRAM
  - similar to Fast Page Mode DRAM
  - additional feature: a new access cycle can be started while keeping the data output of the previous cycle active. This allows a certain amount of overlap in operation (pipelining), allowing somewhat improved speed. It was about 5% faster than Fast Page Mode DRAM, which it began to replace in 1993.
122Other Types of DRAM
- Double data rate (DDR) SDRAM
- Double data rate (DDR) SDRAM is a later
development of SDRAM, used in PC memory from 2000
onwards. All types of SDRAM use a clock signal
that is a square wave. - This means that the clock alternates regularly
between one voltage (low) and another (high),
usually millions of times per second. Plain
SDRAM, like most synchronous logic circuits, acts
on the low-to-high transition of the clock and
ignores the opposite transition. DDR SDRAM acts
on both transitions, thereby halving the required
clock rate for a given data transfer rate.
123 Memory Systems: Delay is more than the raw DRAM
[Figure: the address (n bits) goes through a DRAM controller and memory timing controller to a 2^n x 1 DRAM chip, and the data (w bits) comes back through bus drivers]
Tc = Tcycle + Tcontroller + Tdriver
124 DRAMs over Time
(from Kazuhiro Sakashita, Mitsubishi)
DRAM Generation (1st gen. sample)   '84     '87    '90    '93    '96    '99
Memory Size                         1 Mb    4 Mb   16 Mb  64 Mb  256 Mb 1 Gb
Die Size (mm2)                      55      85     130    200    300    450
Memory Area (mm2)                   30      47     72     110    165    250
Memory Cell Area (µm2)              28.84   11.1   4.26   1.64   0.61   0.23
125 Summary
- Two different types of locality
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
- By taking advantage of the principle of locality
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.
- DRAM is slow but cheap and dense
  - Good choice for presenting the user with a BIG memory system
- SRAM is fast but expensive and not very dense
  - Good choice for providing the user FAST access time.