Title: OMSE 510: Computing Foundations 2: Disks, Buses, DRAM
1 OMSE 510: Computing Foundations 2: Disks, Buses, DRAM
- Portland State University / OMSE
2 Outline of Comp. Architecture
Outline of the rest of the computer architecture section: start with a description of computer devices and work back towards the CPU.
3 Computer Architecture Is
- "the attributes of a computing system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." - Amdahl, Blaauw, and Brooks, 1964
4Today
- Begin Computer Architecture
- Disk Drives
- The Bus
- Memory
5Computer System (Idealized)
Disk
Memory
CPU
Disk Controller
6 I/O Device Examples
Device            Behavior          Partner    Data Rate (KB/sec)
Keyboard          Input             Human              0.01
Mouse             Input             Human              0.02
Line Printer      Output            Human              1.00
Floppy disk       Storage           Machine           50.00
Laser Printer     Output            Human            100.00
Optical Disk      Storage           Machine          500.00
Magnetic Disk     Storage           Machine        5,000.00
Network-LAN       Input or Output   Machine     20-1,000.00
Graphics Display  Output            Human         30,000.00
7 A Device: The Disk
Disk Drives!
- e.g., your hard disk drive
- Where files are physically stored
- Long-term non-volatile storage device
8Magnetic Drum
9Spiral Format for Compact Disk
10 A Device: The Disk
Magnetic Disks
- Your hard disk drive
- Where files are physically stored
- Long-term non-volatile storage device
11A Magnetic Disk with Three Platters
12 Organization of a Disk Platter with a 1:2 Interleave Factor
13 Disk Physical Characteristics
- Platters
  - 1 to 20 per drive, with diameters from 1.3 to 8 inches (recording on both sides)
- Tracks
  - 2,500 to 5,000 tracks/inch
- Cylinders
  - all tracks in the same position on the platters
- Sectors
  - 128-256 sectors/track, with gaps and sector-related info between them (typical sector: 256-512 bytes)
14 Disk Physical Characteristics
- Trend as of 2005
  - Constant bit density (~10^5 bits/inch)
  - i.e., more info (sectors) on outer tracks
- Strangely enough, history reverses itself
  - Originally, disks used constant bit density (more efficient)
  - Then went to a uniform sectors/track (simpler, allowed easier optimization)
  - Returning now to constant bit density
- Disk capacity follows Moore's law: doubles every 18 months
15 Example: Seagate Barracuda
- Disk for servers
- 10 disks, hence 20 surfaces
- 7,500 cylinders, hence 7,500 x 20 = 150,000 total tracks
- 237 sectors/track (average)
- 512 bytes/sector
- Total capacity
  - 150,000 x 237 x 512 = 18,201,600,000 bytes
  - ~18 GB
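As a sanity check, here is a minimal Python sketch of the capacity arithmetic above (the drive parameters are taken straight from the slide):

  surfaces = 10 * 2              # 10 platters, recorded on both sides
  cylinders = 7500
  sectors_per_track = 237        # average across zones
  bytes_per_sector = 512

  tracks = cylinders * surfaces                       # 150,000 tracks
  capacity = tracks * sectors_per_track * bytes_per_sector
  print(tracks, capacity, capacity / 1e9)             # 150000 18201600000 ~18.2 GB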
16 Things to consider
- Addressing modes
  - Computers always refer to data in blocks (512 bytes is common)
  - How to address blocks?
- Old school: CHS (Cylinder-Head-Sector)
  - Computer has an idea of how the drive is structured
- New school: LBA (Logical Block Addressing)
  - Linear!
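For illustration, the conventional CHS-to-LBA mapping can be sketched in Python as below; the heads-per-cylinder and sectors-per-track values are made-up geometry, not from the slides:

  def chs_to_lba(c, h, s, heads_per_cyl, sectors_per_track):
      # Sectors are numbered from 1 in CHS, hence the (s - 1).
      return (c * heads_per_cyl + h) * sectors_per_track + (s - 1)

  # Hypothetical geometry: 16 heads, 63 sectors/track
  print(chs_to_lba(c=2, h=3, s=4, heads_per_cyl=16, sectors_per_track=63))  # 2208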
17 Disk Performance
- Steps to read from disk
  - CPU tells the drive controller it needs data from a given address
  - Drive decodes the instruction
  - Move the read head over the desired cylinder/track (seek)
  - Wait for the desired sector to rotate under the read head
  - Read the data as it passes under the drive head
18 Disk Performance
- Components of disk performance
  - Seek time (to move the arm onto the right cylinder)
  - Rotation time (time to find the right sector; on average 1/2 a rotation)
  - Transfer time (depends on rotation time)
  - Disk controller time (overhead to perform an access)
19 Disk Performance
- So: Disk Latency = Queuing Time + Controller Time + Seek Time + Rotation Time + Transfer Time
20 Seek Time
- From 0 (if the arm is already positioned) to a maximum of 15-20 ms
- Note: this is not a linear function of distance (speedup, coast, slowdown, settle)
- Even when reading tracks on the same cylinder, there is a minimal seek time (due to tight tolerances for head positioning)
- Barracuda example: average seek time 8 ms, track-to-track seek time 1 ms, full-disk seek 17 ms
21 Rotation Time
- Rotation time
  - Seagate Barracuda: 7200 RPM
  - (Disks these days are 3600, 4800, 5400, 7200, up to 10800 RPM)
  - 7200 RPM = 120 RPS = 8.33 ms per rotation
  - Average rotational latency = 1/2 the worst-case rotational latency = 4.17 ms
22 Transfer Time
- Transfer time depends on rotation time, amount of data to transfer (minimum one sector), recording density, and the disk/memory connection
- These days, transfer rates run around 2 MB/s to 16 MB/s
23 Disk Controller Overhead
- The disk controller contains a microprocessor, buffer memory, and possibly a cache (for disk sectors)
- Overhead to perform an access (on the order of 1 ms)
  - Receiving orders from the CPU and interpreting them
  - Managing the transfer between disk and memory (e.g., managing the DMA)
- The transfer rate between disk and controller is smaller than between controller and memory, hence
  - Need for a buffer in the controller
  - This buffer might take the form of a cache (mostly for read-ahead and write-behind)
24 Disk Time Example
- Disk parameters
  - Transfer size is 8 KB
  - Advertised average seek is 12 ms
  - Disk spins at 7200 RPM
  - Transfer rate is 4 MB/s
  - Controller overhead is 2 ms
  - Assume the disk is idle, so no queuing delay
- What is the average disk time for a sector?
  - avg seek + avg rotational delay + transfer time + controller overhead
  - ____ + ____ + ____ + ____
25 Disk Time Example
- Answer: ~20 ms
  - 12 ms + 4.17 ms + 2 ms + 2 ms
- But! The advertised seek time assumes no locality: real seeks are typically 1/4 to 1/3 of the advertised seek time!
  - 20 ms -> ~12 ms
- Locality is an effect of smart placement of data by the operating system
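A minimal Python sketch of the computation above, using the slide's parameters (the 1/3-of-advertised seek figure is the locality adjustment mentioned on the slide):

  rpm = 7200
  seek_ms = 12.0                         # advertised average seek
  rot_ms = 0.5 * (60_000 / rpm)          # half a rotation = ~4.17 ms
  xfer_ms = (8 / 4000) * 1000            # 8 KB at 4 MB/s = 2 ms
  ctrl_ms = 2.0

  print(seek_ms + rot_ms + xfer_ms + ctrl_ms)          # ~20.2 ms
  print(seek_ms / 3 + rot_ms + xfer_ms + ctrl_ms)      # ~12.2 ms with locality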
26 My Disk
- Hitachi Travelstar 7K100 60GB ATA-6 2.5in
  - 7200 RPM mobile hard drive w/ 8 MB buffer
- Interface
  - ATA-6; Capacity (GB): 60; Sector size (bytes): 512; Data heads: 3; Disks: 2
- Performance
  - Data buffer (MB): 8; Rotational speed (rpm): 7,200; Latency (average, ms): 4.2
  - Media transfer rate (Mbits/sec): 561
  - Max. interface transfer rate (MB/sec): 100 (Ultra DMA mode-5), 16.6 (PIO mode-4)
  - Command overhead: 1 ms
  - Seek time (ms): average 10 R / 11 W; track-to-track 1 R / 1.2 W; full stroke 18 R / 19 W
  - Sectors per track: 414-792; Max. areal density (Gbits/sq. inch): 66
  - Disk-to-buffer data transfer: 267-629 Mb/s
  - Buffer-to-host data transfer: 100 MB/s
27 Some Other Quotes
Hard drives:
- Notebook: Toshiba MK8026GAX, 80 GB, 2.5", 9.5 mm, 5400 RPM, 12 ms seek, 100 MB/s
- Desktop: Seagate, 250 GB, 7200 RPM, SATA II, 9-11 ms seek, buffer to host 300 MB/s, buffer to disk 93 MB/s
- Server: Seagate Raptor, SATA, 10000 RPM, buffer to host 150 MB/s, buffer to disk 72 MB/s
28Next Topic
29 Technology Trends
Disk capacity now doubles every 18 months; before 1990, every 36 months
- Today: processing power doubles every 18 months
- Today: memory size doubles every 18 months (4X/3yr)
- Today: disk capacity doubles every 18 months
- Disk positioning rate (seek + rotate) doubles every ten years!
- Caches in memory and device controllers to close the gap
The I/O GAP
30 Manufacturing Advantages of Disk Arrays
- Conventional disk product families: 4 disk designs (14", 10", 5.25", 3.5") spanning high end to low end
- Disk array: 1 disk design (3.5")
31 Small # of Large Disks? Large # of Small Disks!
                 IBM 3390 (K)    IBM 3.5" 0061    x70 array
Data Capacity    20 GBytes       320 MBytes       23 GBytes
Volume           97 cu. ft.      0.1 cu. ft.      11 cu. ft.
Power            3 KW            11 W             1 KW
Data Rate        15 MB/s         1.5 MB/s         120 MB/s
I/O Rate         600 I/Os/s      55 I/Os/s        3900 I/Os/s
MTTF             250 KHrs        50 KHrs          ??? Hrs
Cost             $250K           $2K              $150K
Disk arrays have potential for large data and I/O rates, high MB per cu. ft., and high MB per KW. Reliability?
32 Array Reliability
- Reliability of N disks = Reliability of 1 disk / N
  - 50,000 hours / 70 disks = ~700 hours
  - Disk system MTTF drops from 6 years to 1 month!
- Arrays (without redundancy) are too unreliable to be useful!
Hot spares support reconstruction in parallel with access: very high media availability can be achieved
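A minimal Python sketch of the reliability arithmetic above (assuming independent failures, so the array MTTF is the single-disk MTTF divided by the number of disks):

  single_disk_mttf_hours = 50_000
  disks = 70

  array_mttf = single_disk_mttf_hours / disks
  print(array_mttf)                              # ~714 hours
  print(single_disk_mttf_hours / (24 * 365))     # single disk: ~5.7 years
  print(array_mttf / (24 * 30))                  # array: ~1 month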
33 Media Bandwidth/Latency Demands
- Bandwidth requirements
  - High quality video
    - Digital data = (30 frames/s) x (640 x 480 pixels) x (24-bit color/pixel) = 221 Mb/s (27.625 MB/s)
  - High quality audio
    - Digital data = (44,100 audio samples/s) x (16-bit audio samples) x (2 audio channels for stereo) = 1.4 Mb/s (0.175 MB/s)
  - Compression reduces the bandwidth requirements considerably
- Latency issues
  - How sensitive is your eye (ear) to variations in video (audio) rates?
  - How can you ensure a constant rate of delivery?
  - How important is synchronizing the audio and video streams?
    - 15 to 20 ms early to 30 to 40 ms late is tolerable
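A minimal Python sketch of the uncompressed-stream bandwidth figures above:

  video_bps = 30 * 640 * 480 * 24          # frames/s * pixels * bits/pixel
  audio_bps = 44_100 * 16 * 2              # samples/s * bits/sample * channels

  print(video_bps / 1e6, video_bps / 8e6)  # ~221 Mb/s, ~27.6 MB/s
  print(audio_bps / 1e6, audio_bps / 8e6)  # ~1.4 Mb/s, ~0.18 MB/s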
34 Dependability, Reliability, Availability
- Reliability: a measure of continuous service accomplishment, measured by the mean time to failure (MTTF). Service interruption is measured by the mean time to repair (MTTR)
- Availability: a measure of service accomplishment
  - Availability = MTTF / (MTTF + MTTR)
- To increase MTTF, either improve the quality of the components or design the system to continue operating in the presence of faulty components
  - Fault avoidance: preventing fault occurrence by construction
  - Fault tolerance: using redundancy to correct or bypass faulty components (hardware)
- Fault detection versus fault correction
- Permanent faults versus transient faults
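A minimal Python sketch of the availability formula above, with made-up MTTF/MTTR numbers for illustration:

  def availability(mttf_hours, mttr_hours):
      return mttf_hours / (mttf_hours + mttr_hours)

  # Hypothetical values: fails once every 50,000 hours, repaired in 8 hours
  print(availability(50_000, 8))    # ~0.99984, i.e. a bit short of "four nines"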
35 RAIDs: Disk Arrays
Redundant Array of Inexpensive Disks
- Arrays of small and inexpensive disks
  - Increase potential throughput by having many disk drives
  - Data is spread over multiple disks
  - Multiple accesses are made to several disks at a time
- Reliability is lower than for a single disk
  - But availability can be improved by adding redundant disks (RAID)
  - Lost information can be reconstructed from redundant information
  - MTTR: mean time to repair is on the order of hours
  - MTTF: mean time to failure of disks is tens of years
36 RAID Level 0 (No Redundancy; Striping)
[Figure: sector 0 striped bit-wise across four disks as S0,b0 / S0,b1 / S0,b2 / S0,b3 (S = sector number, b = bit number)]
- Multiple smaller disks as opposed to one big disk
  - Spreading the data over multiple disks (striping) forces accesses to go to several disks in parallel, increasing the performance
    - Four times the throughput for a 4-disk system
  - Same cost as one big disk, assuming 4 small disks cost the same as one big disk
- No redundancy, so what if one disk fails?
  - Failure of one or more disks is more likely as the number of disks in the system increases
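To make the striping idea concrete, here is a minimal Python sketch (not from the slides) that maps a logical block number to a disk and a block-within-disk for an N-disk RAID-0 set:

  def raid0_map(logical_block, num_disks):
      # Round-robin striping: consecutive logical blocks land on consecutive disks.
      disk = logical_block % num_disks
      block_on_disk = logical_block // num_disks
      return disk, block_on_disk

  for lb in range(8):
      print(lb, raid0_map(lb, num_disks=4))   # blocks 0-3 hit disks 0-3, then wrap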
37 RAID Level 1 (Redundancy via Mirroring)
[Figure: the four striped data disks from RAID 0, plus four more disks holding an identical copy of S0,b0 - S0,b3 as redundant (check) data]
- Uses twice as many disks as RAID 0 (e.g., 8 smaller disks, with the second set of 4 duplicating the first set), so there are always two copies of the data
- Still four times the throughput
- # redundant disks = # data disks, so twice the cost of one big disk
  - writes have to be made to both sets of disks, so writes would be only 1/2 the performance of RAID 0
- What if one disk fails?
  - If a disk fails, the system just goes to the mirror for the data
38 RAID Level 2 (Redundancy via ECC)
[Figure: four data disks (S0,b0 - S0,b3) numbered 3, 5, 6, 7, plus ECC disks numbered 1, 2, 4 holding parity bits.
ECC disks 4 and 2 point to either data disk 6 or 7, but ECC disk 1 says disk 7 is okay, so disk 6 must be in error.]
- ECC disks contain the parity of data on a set of distinct overlapping disks
- Still four times the throughput
- # redundant disks = log(total # of disks), so almost twice the cost of one big disk
  - writes require computing parity to write to the ECC disks
  - reads require reading the ECC disks and confirming parity
- Can tolerate limited disk failure, since the data can be reconstructed
39 RAID Level 3 (Bit-Interleaved Parity)
[Figure: four data disks (S0,b0 - S0,b3) plus one parity disk; when a disk fails, its bits are reconstructed from the remaining disks plus parity]
- Cost of higher availability is reduced to 1/N, where N is the number of disks in a protection group
- Still four times the throughput
- # redundant disks = 1 per protection group
  - writes require writing the new data to the data disk as well as computing the parity, meaning reading the other disks, so that the parity disk can be updated
  - reads (after a failure) require reading all the operational data disks as well as the parity disk to calculate the missing data that was stored on the failed disk
- Can tolerate limited disk failure, since the data can be reconstructed
40 RAID Level 4 (Block-Interleaved Parity)
[Figure: four data disks plus one dedicated parity disk, with parity stored as blocks]
- Cost of higher availability is still only 1/N, but the parity is stored as blocks associated with a set of data blocks
- Still four times the throughput
- # redundant disks = 1 per protection group
- Supports "small reads" and "small writes" (reads and writes that go to just one (or a few) data disks in a protection group)
  - by watching which bits change when writing new information, we need only change the corresponding bits on the parity disk
  - the parity disk must be updated on every write, so it is a bottleneck for back-to-back writes
- Can tolerate limited disk failure, since the data can be reconstructed
41 Block Writes
[Figure: writing new data into a stripe of data disks D0-D3 with parity P]
- One approach: 5 writes involving all the disks
- The alternative: 2 reads and 2 writes involving just two disks
42 RAID Level 5 (Distributed Block-Interleaved Parity)
- Cost of higher availability is still only 1/N, but the parity is spread throughout all the disks, so there is no single bottleneck for writes
- Still four times the throughput
- # redundant disks = 1 per protection group
- Supports "small reads" and "small writes" (reads and writes that go to just one (or a few) data disks in a protection group)
- Allows multiple simultaneous writes as long as the accompanying parity blocks are not located on the same disk
- Can tolerate limited disk failure, since the data can be reconstructed
43 Problems of Disk Arrays: Block Writes
RAID-5 Small Write Algorithm: 1 logical write = 2 physical reads + 2 physical writes
[Figure: stripe D0, D1, D2, D3 with parity P. To write D0': (1) read the old data D0, (2) read the old parity P, XOR the old data with the new data and with the old parity to form the new parity P', then (3) write D0' and (4) write P', giving the new stripe D0', D1, D2, D3, P']
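A minimal Python sketch of that small-write parity update (the key identity is new parity = old parity XOR old data XOR new data, so the untouched disks never need to be read):

  def raid5_small_write(old_data, new_data, old_parity):
      # XOR out the old contribution of this block and XOR in the new one.
      return old_parity ^ old_data ^ new_data

  # Toy "blocks": stripe D0-D3 with parity = XOR of all data blocks
  d = [0b1010, 0b0110, 0b0001, 0b1111]
  p = d[0] ^ d[1] ^ d[2] ^ d[3]

  new_d0 = 0b0011
  p = raid5_small_write(d[0], new_d0, p)
  d[0] = new_d0
  assert p == d[0] ^ d[1] ^ d[2] ^ d[3]    # parity still covers the stripe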
44 Distributing Parity Blocks
RAID 4 (dedicated parity disk)       RAID 5 (parity rotated across disks)
0   1   2   3   P0                   0   1   2   3   P0
4   5   6   7   P1                   4   5   6   P1  7
8   9   10  11  P2                   8   9   P2  10  11
12  13  14  15  P3                   12  P3  13  14  15
- By distributing parity blocks across all disks, some small writes can be performed in parallel
45 Disks Summary
- Four components of disk access time
  - Seek time: advertised to be 3 to 14 ms, but lower in real systems
  - Rotational latency: 5.6 ms at 5400 RPM and 2.0 ms at 15000 RPM
  - Transfer time: 10 to 80 MB/s
  - Controller time: typically less than 0.2 ms
- RAIDs can be used to improve availability
  - RAID 0 and RAID 5 are widely used in servers; one estimate is that 80% of disks in servers are RAIDs
  - RAID 1 (mirroring): EMC, Tandem, IBM
  - RAID 3: Storage Concepts
  - RAID 4: Network Appliance
- RAIDs have enough redundancy to allow continuous operation
46Computer System (Idealized)
Disk
Memory
CPU
Disk Controller
47Next Topic
48 What is a bus?
- A bus is
  - a shared communication link
  - a single set of wires used to connect multiple subsystems
- A bus is also a fundamental tool for composing large, complex systems
  - a systematic means of abstraction
49 Bridge-Based Bus Architecture
- Bridging with dual Pentium II Xeon processors on Slot 2.
- (Source: http://www.intel.com.)
50Buses
51 Advantages of Buses
[Figure: several I/O devices sharing one bus]
- Versatility
  - New devices can be added easily
  - Peripherals can be moved between computer systems that use the same bus standard
- Low cost
  - A single set of wires is shared in multiple ways
52 Disadvantages of Buses
[Figure: several I/O devices sharing one bus]
- It creates a communication bottleneck
  - The bandwidth of the bus can limit the maximum I/O throughput
- The maximum bus speed is largely limited by
  - The length of the bus
  - The number of devices on the bus
  - The need to support a range of devices with
    - Widely varying latencies
    - Widely varying data transfer rates
53 The General Organization of a Bus
- Control lines
  - Signal requests and acknowledgments
  - Indicate what type of information is on the data lines
- Data lines carry information between the source and the destination
  - Data and addresses
  - Complex commands
54 Master versus Slave
[Figure: the bus master issues commands; data can go either way between bus master and bus slave]
- A bus transaction includes two parts
  - Issuing the command (and address): the request
  - Transferring the data: the action
- The master is the one who starts the bus transaction by
  - issuing the command (and address)
- The slave is the one who responds to the address by
  - Sending data to the master if the master asks for data
  - Receiving data from the master if the master wants to send data
55 Types of Buses
- Processor-memory bus (design specific)
  - Short and high speed
  - Only needs to match the memory system
    - Maximize memory-to-processor bandwidth
  - Connects directly to the processor
  - Optimized for cache block transfers
- I/O bus (industry standard)
  - Usually lengthy and slower
  - Needs to match a wide range of I/O devices
  - Connects to the processor-memory bus or backplane bus
- Backplane bus (standard or proprietary)
  - Backplane: an interconnection structure within the chassis
  - Allows processors, memory, and I/O devices to coexist
  - Cost advantage: one bus for all components
56 Example: Pentium System Organization
- Processor/memory bus: design specific
- Backplane bus: PCI (PCI devices, graphics, I/O control)
- I/O busses: IDE, USB, SCSI
57Standard Intel Pentium Read and Write Bus Cycles
58Intel Pentium Burst Read Bus Cycle
59 A Computer System with One Bus: the Backplane Bus
[Figure: processor, memory, and I/O devices all attached to a single backplane bus]
- A single bus (the backplane bus) is used for
  - Processor-to-memory communication
  - Communication between I/O devices and memory
- Advantages: simple and low cost
- Disadvantages: slow, and the bus can become a major bottleneck
- Example: IBM PC-AT
60 A Two-Bus System
- I/O buses tap into the processor-memory bus via bus adaptors to speed-match between bus types
  - Processor-memory bus: mainly for processor-memory traffic
  - I/O buses: provide expansion slots for I/O devices
- Apple Macintosh II
  - NuBus: processor, memory, and a few selected I/O devices
  - SCSI bus: the rest of the I/O devices
61 A Three-Bus System (+ backside cache)
- A small number of backplane buses tap into the processor-memory bus
  - Processor-memory bus is used mainly for traffic to/from memory
  - I/O buses are connected to the backplane bus
- Advantage: loading on the processor bus is greatly reduced, and the busses can run at different speeds
62Main components of Intel Chipset Pentium II/III
- Northbridge
- Handles memory
- Graphics
- Southbridge I/O
- PCI bus
- Disk controllers
- USB controllers
- Audio (AC97)
- Serial I/O
- Interrupt controller
- Timers
63 What defines a bus?
- Transaction protocol
- Timing and signaling specification
- Bunch of wires
- Electrical specification
- Physical / mechanical characteristics: the connectors
64 Synchronous and Asynchronous Buses
- Synchronous bus
  - Includes a clock in the control lines
  - A fixed protocol for communication relative to the clock
  - Advantage: involves very little logic and can run very fast
  - Disadvantages
    - Every device on the bus must run at the same clock rate
    - To avoid clock skew, buses cannot be long if they are fast
- Asynchronous bus
  - Not clocked
  - Can accommodate a wide range of devices
  - Can be lengthened without worrying about clock skew
  - Requires a handshaking protocol
65 Busses so far
[Figure: master and slave connected by control, address, and data lines]
- Bus master: has the ability to control the bus; initiates the transaction
- Bus slave: module activated by the transaction
- Bus communication protocol: specification of the sequence of events and timing requirements in transferring information
- Asynchronous bus transfers: control lines (req, ack) serve to orchestrate sequencing
- Synchronous bus transfers: sequenced relative to a common clock
66 Simplest Bus Paradigm
- All agents operate synchronously
- All can source / sink data at the same rate
- => simple protocol
  - just manage the source and target
67 Simple Synchronous Protocol
[Figure: timing diagram with BReq, BG, R/W + Address (Cmd+Addr), and Data1/Data2 on the data lines]
- Even memory busses are more complex than this
  - memory (slave) may take time to respond
  - it may need to control the data rate
68 Typical Synchronous Protocol
[Figure: the same timing diagram, but with Wait cycles inserted before the data is transferred]
- Slave indicates when it is prepared for the data transfer
- Actual transfer goes at bus rate
69 Asynchronous Handshake: Write Transaction
[Figure: timing diagram with Address, Data, Read, Req, and Ack lines over times t0-t5]
- t0: Master has obtained control and asserts address, direction, and data; waits a specified amount of time for slaves to decode the target
- t1: Master asserts the request line
- t2: Slave asserts ack, indicating data received
- t3: Master releases req
- t4: Slave releases ack
70 Asynchronous Handshake: Read Transaction
[Figure: timing diagram with Address, Data (driven by the slave), Read, Req, and Ack lines over times t0-t5]
- t0: Master has obtained control and asserts address and direction; waits a specified amount of time for slaves to decode the target
- t1: Master asserts the request line
- t2: Slave asserts ack, indicating it is ready to transmit data
- t3: Master releases req, data received
- t4: Slave releases ack
71 What is DMA (Direct Memory Access)?
- Typical I/O devices must transfer large amounts of data to the memory of the processor
  - Disk must transfer a complete block (4K? 16K?)
  - Large packets from the network
  - Regions of the video frame buffer
- DMA gives an external device the ability to write memory directly: much lower overhead than having the processor request one word at a time
  - The processor (or at least the memory system) acts like a slave
- Issue: cache coherence
  - What if an I/O device writes data that is currently in the processor cache?
    - The processor may never see the new data!
  - Solutions
    - Flush the cache on every I/O operation (expensive)
    - Have hardware invalidate the cache lines
72 Bus Transaction
- Arbitration: who gets the bus
- Request: what do we want to do
- Action: what happens in response
73 Arbitration: Obtaining Access to the Bus
[Figure: the master initiates requests via the control lines; data can go either way between bus master and bus slave]
- One of the most important issues in bus design
  - How is the bus reserved by a device that wishes to use it?
- Chaos is avoided by a master-slave arrangement
  - Only the bus master can control access to the bus
    - It initiates and controls all bus requests
  - A slave responds to read and write requests
- The simplest system
  - The processor is the only bus master
  - All bus requests must be controlled by the processor
  - Major drawback: the processor is involved in every transaction
74 Multiple Potential Bus Masters: the Need for Arbitration
- Bus arbitration scheme
  - A bus master wanting to use the bus asserts the bus request
  - A bus master cannot use the bus until its request is granted
  - A bus master must signal the arbiter after it finishes using the bus
- Bus arbitration schemes usually try to balance two factors
  - Bus priority: the highest-priority device should be serviced first
  - Fairness: even the lowest-priority device should never be completely locked out from the bus
- Bus arbitration schemes can be divided into four broad classes
  - Daisy chain arbitration
  - Centralized, parallel arbitration
  - Distributed arbitration by self-selection: each device wanting the bus places a code indicating its identity on the bus
  - Distributed arbitration by collision detection: each device just "goes for it"; problems are found after the fact
75 The Daisy Chain Bus Arbitration Scheme
[Figure: the bus arbiter passes the Grant signal through Device 1 (highest priority) down to Device N (lowest priority); the Request and Release lines are wired-OR]
- Advantage: simple
- Disadvantages
  - Cannot assure fairness: a low-priority device may be locked out indefinitely (see the sketch below)
  - The use of the daisy-chain grant signal also limits the bus speed
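To make the fixed-priority behavior concrete, here is a minimal Python sketch (not from the slides) of a daisy-chain-style arbiter: the grant always goes to the highest-priority requester, which is exactly why a low-priority device can starve:

  def daisy_chain_grant(requests):
      # requests[0] is the highest-priority device, requests[-1] the lowest.
      for device, wants_bus in enumerate(requests):
          if wants_bus:
              return device          # grant propagates no further down the chain
      return None

  print(daisy_chain_grant([False, True, False, True]))  # device 1 wins; device 3 waits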
76 Centralized Parallel Arbitration
[Figure: each device has its own Req and Grant lines to a central bus arbiter]
- Used in essentially all processor-memory busses and in high-speed I/O busses
77 Increasing the Bus Bandwidth
- Separate versus multiplexed address and data lines
  - Address and data can be transmitted in one bus cycle if separate address and data lines are available
  - Cost: (a) more bus lines, (b) increased complexity
- Data bus width
  - By increasing the width of the data bus, transfers of multiple words require fewer bus cycles
  - Example: the SPARCstation 20's memory bus is 128 bits wide
  - Cost: more bus lines
- Block transfers
  - Allow the bus to transfer multiple words in back-to-back bus cycles
  - Only one address needs to be sent at the beginning
  - The bus is not released until the last word is transferred
  - Cost: (a) increased complexity, (b) increased response time for other requests
78 Increasing Transaction Rate on a Multimaster Bus
- Overlapped arbitration
  - perform arbitration for the next transaction during the current transaction
- Bus parking
  - a master can hold onto the bus and perform multiple transactions as long as no other master makes a request
- Overlapped address / data phases (see previous slide)
  - requires one of the above techniques
- Split-phase (or packet-switched) bus
  - completely separate address and data phases
  - arbitrate separately for each
  - the address phase yields a tag which is matched with the data phase
- All of the above are used in most modern buses
79 PCI Read/Write Transactions
- All signals sampled on the rising edge
- Centralized parallel arbitration
  - overlapped with the previous transaction
- All transfers are (unlimited) bursts
- Address phase starts by asserting FRAME
- Next cycle, the initiator asserts cmd and address
- Data transfers happen when
  - IRDY is asserted by the master when ready to transfer data
  - TRDY is asserted by the target when ready to transfer data
  - transfer occurs when both are asserted on a rising edge
- FRAME is deasserted when the master intends to complete only one more data transfer
80 PCI Read Transaction
- Turn-around cycle on any signal driven by more than one agent
81PCI Write Transaction
82 PCI Optimizations
- Push bus efficiency toward 100% under common simple usage
  - like RISC
- Bus parking
  - retain the bus grant for the previous master until another makes a request
  - the granted master can start the next transfer without arbitration
- Arbitrary burst length
  - initiator and target can exert flow control with xRDY
  - target can disconnect the request with STOP (abort or retry)
  - master can disconnect by deasserting FRAME
  - arbiter can disconnect by deasserting GNT
- Delayed (pended, split-phase) transactions
  - free the bus after a request to a slow device
83 Summary
- Buses are an important technique for building large-scale systems
  - Their speed is critically dependent on factors such as length, number of devices, etc.
  - Critically limited by capacitance
- Important terminology
  - Master: the device that can initiate new transactions
  - Slaves: devices that respond to the master
- Two types of bus timing
  - Synchronous: bus includes a clock
  - Asynchronous: no clock, just REQ/ACK strobing
- Direct Memory Access (DMA) allows fast, burst transfers into the processor's memory
  - The processor's memory acts like a slave
  - Probably requires some form of cache coherence so that DMAed memory can be invalidated from the cache
84 The Big Picture: Where are We Now?
- The five classic components of a computer: processor (control + datapath), memory, input, output
- Next topic
  - Locality and the memory hierarchy
  - SRAM memory technology
  - DRAM memory technology
  - Memory organization
85 Technology Trends
            Capacity          Speed (latency)
Logic       2x in 3 years     2x in 3 years
DRAM        4x in 3 years     2x in 10 years
Disk        4x in 3 years     2x in 10 years

DRAM:  Year   Size     Cycle Time
       1980   64 Kb    250 ns
       1983   256 Kb   220 ns
       1986   1 Mb     190 ns
       1989   4 Mb     165 ns
       1992   16 Mb    145 ns
       1995   64 Mb    120 ns
       (capacity: 1000:1!   speed: 2:1!)
86 Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)
[Chart, 1980-2000, log scale: CPU performance ("Moore's Law") grows ~60%/yr (2X/1.5yr) while DRAM latency improves ~9%/yr (2X/10yrs), so the processor-memory performance gap grows ~50%/year]
87 Today's Situation: Microprocessor
- Rely on caches to bridge the gap
- Microprocessor-DRAM performance gap
  - time of a full cache miss in instructions executed
  - 1st Alpha (7000): 340 ns / 5.0 ns = 68 clks x 2, or 136 instructions
  - 2nd Alpha (8400): 266 ns / 3.3 ns = 80 clks x 4, or 320 instructions
  - 3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks x 6, or 648 instructions
  - 1/2X latency x 3X clock rate x 3X instr/clock = ~5X
88 Cache Performance
- CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time
- Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty) + (Writes x Write miss rate x Write miss penalty)
- Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
89 Impact on Performance
- Suppose a processor executes at
  - Clock rate = 200 MHz (5 ns per cycle)
  - Base CPI = 1.1
  - 50% arith/logic, 30% ld/st, 20% control
- Suppose that 10% of memory operations get a 50-cycle miss penalty
- Suppose that 1% of instructions get the same miss penalty
- CPI = Base CPI + average stalls per instruction
      = 1.1 (cycles/ins)
        + 0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycles/miss)
        + 1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycles/miss)
      = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
- Memory stalls account for 2.0 of the 3.1 cycles per instruction: the processor is stalled waiting for memory roughly 65% of the time! (a worked check follows below)
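A minimal Python sketch of the CPI arithmetic above:

  base_cpi = 1.1
  data_stalls = 0.30 * 0.10 * 50     # ld/st fraction * miss rate * penalty = 1.5
  inst_stalls = 1.00 * 0.01 * 50     # instruction fetches * miss rate * penalty = 0.5

  cpi = base_cpi + data_stalls + inst_stalls
  print(cpi)                                   # 3.1
  print((data_stalls + inst_stalls) / cpi)     # ~0.645 of cycles are memory stalls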
90 The Goal: the illusion of large, fast, cheap memory
- Fact: large memories are slow; fast memories are small
- How do we create a memory that is large, cheap, and fast (most of the time)?
  - Hierarchy
  - Parallelism
91 Why Hierarchy Works
- The Principle of Locality
  - Programs access a relatively small portion of the address space at any instant of time.
92 Memory Hierarchy: How Does it Work?
- Temporal locality (locality in time)
  - => Keep the most recently accessed data items closer to the processor
- Spatial locality (locality in space)
  - => Move blocks consisting of contiguous words to the upper levels
93 Memory Hierarchy: Terminology
[Figure: the processor exchanges Block X with upper-level memory and Block Y with lower-level memory]
- Hit: the data appears in some block in the upper level of the hierarchy (example: Block X is found in the L1 cache)
  - Hit rate: the fraction of memory accesses found in the upper level
  - Hit time: time to access the upper level, which consists of
    - RAM access time + time to determine hit/miss
- Miss: the data needs to be retrieved from a block in the lower level of the hierarchy (Block Y is not in the L1 cache and must be fetched from main memory)
  - Miss rate = 1 - (Hit rate)
  - Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Hit time << Miss penalty
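To illustrate why Hit Time << Miss Penalty matters, here is a minimal Python sketch computing the average access time from the terms defined above (the hit time, miss penalty, and hit rates are made-up values for illustration):

  def avg_access_time(hit_time, miss_rate, miss_penalty):
      # Every access pays the hit time; misses additionally pay the miss penalty.
      return hit_time + miss_rate * miss_penalty

  # Hypothetical numbers: 1 ns hit time, 50 ns miss penalty
  for hit_rate in (0.90, 0.95, 0.99):
      print(hit_rate, avg_access_time(1.0, 1.0 - hit_rate, 50.0))
  # 0.90 -> 6.0 ns, 0.95 -> 3.5 ns, 0.99 -> 1.5 ns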
94 Memory Hierarchy of a Modern Computer System
- By taking advantage of the principle of locality
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.
[Figure: hierarchy from registers and on-chip cache in the processor, through the second-level cache (SRAM) and main memory (DRAM), down to secondary storage (disk); speeds range from ~1 ns at the top to 10,000,000s of ns (10s of ms) and beyond at the bottom, and sizes from 100s of bytes up through Ks, Ms, Gs, and Ts of bytes]
95 How is the hierarchy managed?
- Registers <-> Memory
  - by the compiler (programmer?)
- Cache <-> Memory
  - by the hardware
- Memory <-> Disks
  - by the hardware and operating system (disk caches, virtual memory)
  - by the programmer (files)
96 Memory Hierarchy: Technology
- Random access
  - "Random" is good: access time is the same for all locations
  - DRAM: Dynamic Random Access Memory
    - High density, low power, cheap, slow
    - Dynamic: needs to be refreshed regularly (1-2% of cycles)
  - SRAM: Static Random Access Memory
    - Low density, high power, expensive, fast
    - Static: content will last "forever" (until power is lost)
- "Not-so-random" access technology
  - Access time varies from location to location and from time to time
  - Examples: disk, CDROM
- Sequential access technology: access time linear in location (e.g., tape)
- We will concentrate on random access technology
  - Main memory: DRAMs; caches: SRAMs
97 Main Memory Background
- Performance of main memory
  - Latency: cache miss penalty
    - Access time: time between the request and the word arriving
    - Cycle time: time between requests
  - Bandwidth: I/O and large block miss penalty (L2)
- Main memory is DRAM: Dynamic Random Access Memory
  - Dynamic, since it needs to be refreshed periodically (every ~8 ms)
  - Addresses divided into 2 halves (memory as a 2D matrix)
    - RAS or Row Access Strobe
    - CAS or Column Access Strobe
- Cache uses SRAM: Static Random Access Memory
  - No refresh (6 transistors/bit vs. 1 transistor)
  - Size: DRAM/SRAM = 4-8; Cost and cycle time: SRAM/DRAM = 8-16
98 Random Access Memory (RAM) Technology
- Why do computer designers need to know about RAM technology?
  - Processor performance is usually limited by memory bandwidth
  - As IC densities increase, lots of memory will fit on the processor chip
    - Tailor on-chip memory to specific needs
      - Instruction cache
      - Data cache
      - Write buffer
- What makes RAM different from a bunch of flip-flops?
  - Density: RAM is much denser
99 Main Memory Deep Background
- "Out-of-Core", "In-Core", "Core Dump"?
- "Core memory"?
  - Non-volatile, magnetic
  - Lost to 4 Kbit DRAM (today using 64 Mbit DRAM)
  - Access time 750 ns, cycle time 1500-3000 ns
100 Static RAM Cell
[Figure: 6-transistor SRAM cell; the word (row select) line connects the cross-coupled inverters to the bit and bit-bar lines; pass transistors may be replaced with pullups to save area]
- Write
  - 1. Drive the bit lines (bit = 1, bit-bar = 0, or vice versa)
  - 2. Select the row
- Read
  - 1. Precharge bit and bit-bar to Vdd or Vdd/2 => make sure they are equal!
  - 2. Select the row
  - 3. The cell pulls one line low
  - 4. The sense amp on the column detects the difference between bit and bit-bar
101 Typical SRAM Organization: 16-word x 4-bit
[Figure: a 16 x 4 array of SRAM cells. The address decoder takes A0-A3 and drives one of 16 word lines (Word 0 ... Word 15); each word line selects four cells, which connect to the Din0-Din3 / Dout0-Dout3 columns through precharge logic and a write enable (WrEn)]
102 Logic Diagram of a Typical SRAM
- Write Enable is usually active low (WE_L)
- Din and Dout are combined (D) to save pins
  - A new control signal, output enable (OE_L), is needed
  - WE_L asserted (low), OE_L deasserted (high)
    - D serves as the data input pin
  - WE_L deasserted (high), OE_L asserted (low)
    - D is the data output pin
  - Both WE_L and OE_L asserted
    - Result is unknown. Don't do that!!!
103 Typical SRAM Timing
[Figure: write and read timing waveforms for A, D, WE_L, and OE_L. A write presents the write address and data-in on D, subject to write setup and hold times around WE_L; a read presents the read address, and after the read access time D drives data out (high-Z otherwise)]
104 1-Transistor Memory Cell (DRAM)
[Figure: one access transistor, gated by the row select line, connects a storage capacitor to the bit line]
- Write
  - 1. Drive the bit line
  - 2. Select the row
- Read
  - 1. Precharge the bit line to Vdd
  - 2. Select the row
  - 3. The cell and the bit line share charge
    - Very small voltage changes on the bit line
  - 4. Sense (fancy sense amp)
    - Can detect changes of ~1 million electrons
  - 5. Write: restore the value
- Refresh
  - 1. Just do a dummy read of every cell.
105 Classical DRAM Organization (square)
[Figure: a square RAM cell array; the row decoder takes the row address and drives the word (row) select lines, while the column selector / I/O circuits take the column address and select among the bit (data) lines]
- Each intersection represents a 1-T DRAM cell
- The row and column address together select 1 bit at a time
106 Logic Diagram of a Typical DRAM
[Figure: a 256K x 8 DRAM with a 9-bit multiplexed address bus A, an 8-bit data bus D, and control signals RAS_L, CAS_L, WE_L, OE_L]
- Control signals (RAS_L, CAS_L, WE_L, OE_L) are all active low
- Din and Dout are combined (D)
  - WE_L asserted (low), OE_L deasserted (high): D serves as the data input pin
  - WE_L deasserted (high), OE_L asserted (low): D is the data output pin
- Row and column addresses share the same pins (A)
  - RAS_L goes low: pins A are latched in as the row address
  - CAS_L goes low: pins A are latched in as the column address
  - RAS/CAS are edge-sensitive
107 DRAM Read Timing
- Every DRAM access begins with the assertion of RAS_L
- 2 ways to read: early or late with respect to CAS
[Figure: DRAM read cycle waveforms for A (row address, then column address), RAS_L, CAS_L, WE_L, OE_L, and D; D is high-Z except when data is driven out after the read access time and output enable delay. Early read cycle: OE_L asserted before CAS_L. Late read cycle: OE_L asserted after CAS_L]
108 DRAM Write Timing
- Every DRAM access begins with the assertion of RAS_L
- 2 ways to write: early or late with respect to CAS
[Figure: DRAM write cycle waveforms for A (row address, then column address), RAS_L, CAS_L, WE_L, OE_L, and D; data-in must be valid around the WR access time. Early write cycle: WE_L asserted before CAS_L. Late write cycle: WE_L asserted after CAS_L]
109 Key DRAM Timing Parameters
- tRAC: minimum time from RAS falling to valid data output
  - Quoted as the "speed" of a DRAM
  - A fast 4 Mb DRAM: tRAC = 60 ns
- tRC: minimum time from the start of one row access to the start of the next
  - tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tCAC: minimum time from CAS falling to valid data output
  - 15 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tPC: minimum time from the start of one column access to the start of the next
  - 35 ns for a 4 Mbit DRAM with a tRAC of 60 ns
110 DRAM Performance
- A 60 ns (tRAC) DRAM can
  - perform a row access only every 110 ns (tRC)
  - perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC)
    - In practice, external address delays and turning around buses make it 40 to 50 ns
- These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead
  - Driving parallel DRAMs, external memory controller, bus turnaround, SIMM module, pins...
  - 180 ns to 250 ns latency from processor to memory is good for a 60 ns (tRAC) DRAM
111 Main Memory Performance
- Simple
  - CPU, cache, bus, memory all the same width (32 bits)
- Wide
  - CPU/Mux: 1 word; Mux/cache, bus, memory: N words (Alpha: 64 bits and 256 bits)
- Interleaved
  - CPU, cache, bus: 1 word; memory: N modules (4 modules in this example); word-interleaved
112 Main Memory Performance
[Figure: access time vs. cycle time]
- DRAM (read/write) cycle time >> DRAM (read/write) access time
  - roughly 2:1; why?
- DRAM (read/write) cycle time
  - How frequently can you initiate an access?
  - Analogy: a little kid can only ask his father for money on Saturday
- DRAM (read/write) access time
  - How quickly will you get what you want once you initiate an access?
  - Analogy: as soon as he asks, his father will give him the money
- DRAM bandwidth limitation analogy
  - What happens if he runs out of money on Wednesday?
113 Increasing Bandwidth: Interleaving
[Figure: access pattern without interleaving: the CPU must wait until D1 is available from the single memory before starting the access for D2. Access pattern with 4-way interleaving: accesses to banks 0-3 are started back-to-back, and bank 0 can be accessed again as soon as its cycle time has elapsed]
114 Main Memory Performance
- Timing model
  - 1 cycle to send the address
  - 4 cycles access time, 10 cycles cycle time, 1 cycle to send data
  - Cache block is 4 words
- Simple M.P.      = 4 x (1 + 10 + 1) = 48 cycles
- Wide M.P.        = 1 + 10 + 1 = 12 cycles
- Interleaved M.P. = 1 + 10 + 4 x 1 = 15 cycles (a worked check follows below)
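A minimal Python sketch of the miss-penalty arithmetic above for a 4-word cache block (1 cycle to send the address, 10 cycles per memory cycle, 1 cycle to return each word):

  addr, cycle, xfer, words = 1, 10, 1, 4

  simple      = words * (addr + cycle + xfer)   # one full access per word = 48
  wide        = addr + cycle + xfer             # all 4 words at once      = 12
  interleaved = addr + cycle + words * xfer     # overlapped banks         = 15
  print(simple, wide, interleaved)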
115 Independent Memory Banks
- How many banks?
  - number of banks >= number of clocks to access a word in a bank
  - For sequential accesses; otherwise we return to the original bank before it has the next word ready
- Increasing DRAM capacity => fewer chips => harder to have enough banks
  - Growth in bits/chip for DRAM: 50-60%/yr
  - Nathan Myhrvold (Microsoft): mature software growth (33%/yr for NT) tracks the growth in MB/$ of DRAM (25-30%/yr)
116 Fewer DRAMs/System over Time
(from Pete MacWilliams, Intel)
[Table: DRAM generation ('86 1 Mb, '89 4 Mb, '92 16 Mb, '96 64 Mb, '99 256 Mb, '02 1 Gb) versus minimum PC memory size (4 MB through 256 MB). Memory per DRAM grows at ~60%/year, while memory per system grows at only 25-30%/year, so the minimum number of DRAMs per system falls (e.g., from 16 chips to 4)]
117 Fast Page Mode Operation
- Regular DRAM organization
  - N rows x N columns x M bits
  - Read and write M bits at a time
  - Each M-bit access requires a RAS / CAS cycle
- Fast Page Mode DRAM
  - An N x M SRAM register saves a row
  - After a row is read into the register
    - Only CAS is needed to access other M-bit blocks on that row
    - RAS_L remains asserted while CAS_L is toggled
[Figure: the row address selects one of N rows of the DRAM array into the N x M SRAM row register; the column address then picks the M-bit output]
118FP Mode DRAM
- Fast page mode DRAM
- In page mode, a row of the DRAM can be kept
"open", so that successive reads or writes within
the row do not suffer the delay of precharge and
accessing the row. This increases the performance
of the system when reading or writing bursts of
data.
119 Key DRAM Timing Parameters
- tRAC: minimum time from RAS falling to valid data output
  - Quoted as the "speed" of a DRAM
  - A fast 4 Mb DRAM: tRAC = 60 ns
- tRC: minimum time from the start of one row access to the start of the next
  - tRC = 110 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tCAC: minimum time from CAS falling to valid data output
  - 15 ns for a 4 Mbit DRAM with a tRAC of 60 ns
- tPC: minimum time from the start of one column access to the start of the next
  - 35 ns for a 4 Mbit DRAM with a tRAC of 60 ns
120 SDRAM: Synchronous DRAM
- More complicated, on-chip controller
  - Operations synchronized to a clock
    - So: give the row address one cycle
    - Column address some number of cycles later (say 3)
    - Data comes out later still (say 2 cycles later)
- Burst modes
  - Typical might be 1, 2, 4, 8, or 256-length bursts
  - Thus, RAS and CAS are given only once for all of these accesses
- Multi-bank operation (on-chip interleaving)
  - Lets you overlap the startup latency (5 cycles above) of two banks
- Careful of timing specs!
  - A "10 ns" SDRAM may still require 50 ns to get the first data!
  - A 50 ns DRAM means first data out in 50 ns
121 Other Types of DRAM
- Extended Data Out (EDO) DRAM
  - similar to Fast Page Mode DRAM
  - additional feature: a new access cycle can be started while keeping the data output of the previous cycle active. This allows a certain amount of overlap in operation (pipelining), allowing somewhat improved speed. It was about 5% faster than Fast Page Mode DRAM, which it began to replace in 1993.
122Other Types of DRAM
- Double data rate (DDR) SDRAM
- Double data rate (DDR) SDRAM is a later
development of SDRAM, used in PC memory from 2000
onwards. All types of SDRAM use a clock signal
that is a square wave. - This means that the clock alternates regularly
between one voltage (low) and another (high),
usually millions of times per second. Plain
SDRAM, like most synchronous logic circuits, acts
on the low-to-high transition of the clock and
ignores the opposite transition. DDR SDRAM acts
on both transitions, thereby halving the required
clock rate for a given data transfer rate.
123 Memory Systems: Delay is more than the raw DRAM
[Figure: the address (n bits) goes through a DRAM controller and memory timing controller to a 2^n x 1 DRAM chip, and the data (w bits) comes back through bus drivers]
Tc = Tcycle + Tcontroller + Tdriver
124 DRAMs over Time
(from Kazuhiro Sakashita, Mitsubishi)
DRAM Generation (1st gen. sample)   '84     '87    '90    '93    '96    '99
Memory Size                         1 Mb    4 Mb   16 Mb  64 Mb  256 Mb 1 Gb
Die Size (mm2)                      55      85     130    200    300    450
Memory Area (mm2)                   30      47     72     110    165    250
Memory Cell Area (µm2)              28.84   11.1   4.26   1.64   0.61   0.23
125 Summary
- Two different types of locality
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
- By taking advantage of the principle of locality
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.
- DRAM is slow but cheap and dense
  - Good choice for presenting the user with a BIG memory system
- SRAM is fast but expensive and not very dense
  - Good choice for providing the user FAST access time.