OMSE 510: Computing Foundations 2: Disks, Buses, DRAM - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: OMSE 510: Computing Foundations 2: Disks, Buses, DRAM


1
OMSE 510 Computing Foundations2 Disks, Buses,
DRAM
  • Portland State University/OMSE

2
Outline of Comp. Architecture
Outline of the rest of the computer architecture
section: start with a description of computer
devices and work back towards the CPU.

3
Computer Architecture Is
  • the attributes of a computing system as seen
    by the programmer, i.e., the conceptual structure
    and functional behavior, as distinct from the
    organization of the data flows and controls, the
    logic design, and the physical implementation.
  • Amdahl, Blaauw, and Brooks, 1964

4
Today
  • Begin Computer Architecture
  • Disk Drives
  • The Bus
  • Memory


5
Computer System (Idealized)
(Diagram: CPU, Memory, Disk Controller, Disk)
6
I/O Device Examples
Device            Behavior          Partner    Data Rate (KB/sec)
Keyboard          Input             Human             0.01
Mouse             Input             Human             0.02
Line Printer      Output            Human             1.00
Floppy disk       Storage           Machine          50.00
Laser Printer     Output            Human           100.00
Optical Disk      Storage           Machine         500.00
Magnetic Disk     Storage           Machine       5,000.00
Network-LAN       Input or Output   Machine      20-1,000.00
Graphics Display  Output            Human        30,000.00

7
A Device: The Disk
Disk Drives! - e.g. your hard disk drive - where
files are physically stored - long-term
non-volatile storage device

8
Magnetic Drum
9
Spiral Format for Compact Disk
10
A Device: The Disk
Magnetic Disks - your hard disk drive - where
files are physically stored - long-term
non-volatile storage device

11
A Magnetic Disk with Three Platters
12
Organization of a Disk Platter with a 1:2
Interleave Factor
13
Disk Physical Characteristics
  • Platters
  • 1 to 20 with diameters from 1.3 to 8 inches
    (Recording on both sides)
  • Tracks
  • 2500 to 5000 Tracks/inch
  • Cylinders
  • all tracks in the same position in the platters
  • Sectors
  • 128-256 sectors/track with gaps and info related
    to sectors between them (typical sector, 256-512
    bytes)


14
Disk Physical Characteristics
  • Trend as of 2005
  • Constant bit density (10^5 bits/inch)
  • I.e. more info (sectors) on outer tracks
  • Strangely enough, history reverses itself
  • Originally, disks were constant bit density (more
    efficient)
  • Then, went to uniform sectors/track (simpler,
    allowed easier optimization)
  • Returning now to constant bit density
  • Disk capacity follows Moore's law: doubles every
    18 months


15
Example: Seagate Barracuda
  • Disk for servers
  • 10 disks, hence 20 surfaces
  • 7500 cylinders, hence 7500 x 20 = 150,000 total
    tracks
  • 237 sectors/track (average)
  • 512 bytes/sector
  • Total capacity:
  • 150,000 x 237 x 512 = 18,201,600,000 bytes
  • About 18 GB (checked in the sketch below)
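A quick sanity check of the capacity arithmetic above (a minimal sketch in Python; the parameters are the ones quoted on this slide):

    # Seagate Barracuda capacity estimate (parameters from the slide above)
    surfaces = 20                      # 10 platters, recorded on both sides
    cylinders = 7500
    tracks = surfaces * cylinders      # 150,000 total tracks
    sectors_per_track = 237            # average across the disk
    bytes_per_sector = 512

    capacity = tracks * sectors_per_track * bytes_per_sector
    print(capacity)                    # 18,201,600,000 bytes, i.e. about 18 GB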


16
Things to consider
  • Addressing modes
  • Computers always refer to data in blocks
    (512 bytes common)
  • How to address blocks?
  • Old school: CHS (Cylinder-Head-Sector)
  • Computer has an idea of how the drive is structured
  • New school: LBA (Logical Block Addressing)
  • Linear! (conversion from CHS to LBA sketched below)
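The usual CHS-to-LBA conversion looks like the sketch below (illustrative only; the geometry numbers are made up, not from the slides):

    def chs_to_lba(c, h, s, heads_per_cylinder, sectors_per_track):
        # CHS sectors are 1-based; LBA is a flat, 0-based block number
        return (c * heads_per_cylinder + h) * sectors_per_track + (s - 1)

    # Example: cylinder 2, head 1, sector 5 on a drive reporting 16 heads, 63 sectors/track
    print(chs_to_lba(2, 1, 5, heads_per_cylinder=16, sectors_per_track=63))   # 2083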


17
Disk Performance
  • Steps to read from disk
  • CPU tells drive controller it needs data from this
    address
  • Drive decodes instruction
  • Move read head over desired cylinder/track (seek)
  • Wait for desired sector to rotate under read head
  • Read the data as it goes under drive head


18
Disk Performance
  • Components of disk performance
  • Seek time (to move the arm on the right cylinder)
  • Rotation time (on average ½ rotation) (time to
    find the right sector)
  • Transfer time: depends on rotation time
  • Disk controller time: overhead to perform an
    access


19
Disk Performance
  • So: Disk Latency = Queuing Time + Controller Time
    + Seek Time + Rotation Time + Transfer Time


20
Seek Time
  • From 0 (if arm already positioned) to a maximum
    15-20 ms
  • Note: This is not a linear function of distance
    (speedup + coast + slowdown + settle)
  • Even when reading tracks on the same cylinder,
    there is a minimal seek time (due to tight
    tolerances for head positioning)
  • Barracuda example: average seek time 8 ms,
    track-to-track seek time 1 ms, full-disk seek
    17 ms


21
Rotation time
  • Rotation time
  • Seagate Barracuda: 7200 RPM
  • (Disks these days are 3600, 4800, 5400, 7200 up
    to 10800 RPM)
  • 7200 RPM = 120 RPS = 8.33 ms per rotation
  • Average rotational latency = 1/2 worst-case
    rotational latency = 4.17 ms


22
Transfer time
  • Transfer time depends on rotation time, amount of
    data to transfer (minimum one sector), recording
    density, and the disk/memory connection
  • These days, transfer rates are around 2 MB/s to 16 MB/s


23
Disk Controller Overhead
  • Disk controller contains a microprocessor,
    buffer memory, and possibly a cache (for disk
    sectors)
  • Overhead to perform an access (on the order of
    1 ms)
  • Receiving orders from CPU and interpreting them
  • Managing the transfer between disk and memory
    (e.g. managing the DMA)
  • Transfer rate between disk and controller is
    smaller than between disk and memory, hence
  • Need for a buffer in controller
  • This buffer might take the form of a cache
    (mostly for read-ahead and write-behind)


24
Disk Time Example
  • Disk Parameters
  • Transfer size is 8K bytes
  • Advertised average seek is 12 ms
  • Disk spins at 7200 RPM
  • Transfer rate is 4MB/s
  • Controller overhead is 2ms
  • Assume disk is idle so no queuing delay
  • What is the average disk time for a sector?
  • avg seek + avg rot delay + transfer time +
    controller overhead
  • = ____ + _____ + _____ + _____ (worked below)
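Filling in those blanks (a worked check, not part of the original slide): rotation is half of 60/7200 s, and the 8 KB transfer at 4 MB/s takes 2 ms.

    # Worked disk-time example using the parameters listed above
    seek_ms = 12.0                       # advertised average seek
    rotation_ms = 0.5 * 60_000 / 7200    # half a rotation at 7200 RPM ~= 4.17 ms
    transfer_ms = 8.0 / 4000 * 1000      # 8 KB at 4 MB/s = 2 ms
    controller_ms = 2.0

    total = seek_ms + rotation_ms + transfer_ms + controller_ms
    print(round(total, 2))               # ~20.17 ms, the ~20 ms answer on the next slide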


25
Disk Time Example
  • Answer: about 20 ms
  • But! Advertised seek time assumes no locality:
    typically 1/4 to 1/3 of the advertised seek time!
  • 20 ms -> 12 ms
  • Locality is an effect of smart placement of data
    by the operating system


26
My Disk
  • Hitachi Travelstar 7K100 60GB ATA-6 2.5in
  • 7200RPM Mobile Hard Drive w/8MB Buffer
  • Interface: ATA-6; Capacity: 60 GB; Sector size:
    512 bytes; Data heads: 3; Disks: 2
  • Performance
  • Data buffer: 8 MB; Rotational speed: 7,200 RPM;
    Latency (average): 4.2 ms; Media transfer rate:
    561 Mbits/sec; Max. interface transfer rate:
    100 MB/sec (Ultra DMA mode-5), 16.6 MB/sec
    (PIO mode-4); Command overhead: 1 ms
  • Seek time (ms): Average 10 R / 11 W;
    Track-to-track 1 R / 1.2 W; Full stroke 18 R / 19 W
  • Sectors per track: 414-792; Max. areal density:
    66 Gbits/sq. inch
  • Disk-to-buffer data transfer: 267-629 Mb/s
  • Buffer-to-host data transfer: 100 MB/s


27
Some other quotes
Hard Drives
Notebook: Toshiba MK8026GAX, 80GB, 2.5", 9.5mm,
5400 RPM, 12 ms seek, 100 MB/s
Desktop: Seagate 250GB, 7200 RPM, SATA II, 9-11 ms
seek; buffer to host 300 MB/s; buffer to disk 93 MB/s
Server: Seagate Raptor, 10000 RPM, SATA; buffer to
host 150 MB/s; buffer to disk 72 MB/s

28
Next Topic
  • Disk Arrays
  • RAID!


29
Technology Trends
Disk capacity now doubles every 18
months; before 1990 it doubled every 36 months
  • Today: Processing Power Doubles Every 18 months
  •  Today: Memory Size Doubles Every 18
    months (4X/3yr)
  •  Today: Disk Capacity Doubles Every 18 months
  •  Disk Positioning Rate (Seek + Rotate) Doubles
    Every Ten Years!
  • Caches in Memory and Device Controllers to Close
    the Gap

The I/O GAP
30
Manufacturing Advantages of Disk Arrays
Disk Product Families
(Diagram: conventional product families need 4 disk
designs - 14", 10", 5.25", 3.5" - spanning the low
end to the high end; a disk array needs only 1 disk
design: 3.5")
31
Small # of Large Disks vs. Large # of Small Disks!

                 IBM 3390 (K)   IBM 3.5" 0061   3.5" x 70
Data Capacity    20 GBytes      320 MBytes      23 GBytes
Volume           97 cu. ft.     0.1 cu. ft.     11 cu. ft.
Power            3 KW           11 W            1 KW
Data Rate        15 MB/s        1.5 MB/s        120 MB/s
I/O Rate         600 I/Os/s     55 I/Os/s       3900 I/Os/s
MTTF             250 KHrs       50 KHrs         ??? Hrs
Cost             $250K          $2K             $150K

Disk Arrays have potential for: large data and I/O
rates, high MB per cu. ft., high MB per KW...
reliability?
32
Array Reliability
  • Reliability of N disks = Reliability of 1 Disk / N
  • 50,000 Hours / 70 disks = 700 hours (spelled out below)
  • Disk system MTTF drops from 6 years to 1
    month!
  • Arrays (without redundancy) are too unreliable to
    be useful!

Hot spares support reconstruction in parallel
with access: very high media availability can be
achieved
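The MTTF arithmetic above, spelled out (a small sketch; the numbers are the slide's):

    disk_mttf_hours = 50_000
    n_disks = 70
    print(disk_mttf_hours / n_disks)   # ~714 hours: the slide's rough 700 hours, about one month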
33
Media Bandwidth/Latency Demands
  • Bandwidth requirements
  • High quality video
  • Digital data = (30 frames/s) x (640 x 480 pixels)
    x (24-bit color/pixel) = 221 Mb/s (27.625 MB/s)
  • High quality audio
  • Digital data = (44,100 audio samples/s) x (16-bit
    audio samples) x (2 audio channels for stereo)
    = 1.4 Mb/s (0.175 MB/s) (both checked below)
  • Compression reduces the bandwidth requirements
    considerably
  • Latency issues
  • How sensitive is your eye (ear) to variations in
    video (audio) rates?
  • How can you ensure a constant rate of delivery?
  • How important is synchronizing the audio and
    video streams?
  • 15 to 20 ms early to 30 to 40 ms late tolerable
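The two uncompressed-bandwidth figures above can be reproduced directly (a quick check using the slide's parameters):

    video_bps = 30 * 640 * 480 * 24           # frames/s x pixels x bits/pixel
    audio_bps = 44_100 * 16 * 2               # samples/s x bits/sample x channels
    print(video_bps / 1e6, audio_bps / 1e6)   # ~221.2 Mb/s video, ~1.4 Mb/s audio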


34
Dependability, Reliability, Availability
  • Reliability: measured by the mean time to
    failure (MTTF). Service interruption is measured
    by mean time to repair (MTTR)
  • Availability: a measure of service
    accomplishment
  • Availability = MTTF / (MTTF + MTTR)
    (illustrated below)
  • To increase MTTF, either improve the quality of
    the components or design the system to continue
    operating in the presence of faulty components
  • Fault avoidance: preventing fault occurrence by
    construction
  • Fault tolerance: using redundancy to correct or
    bypass faulty components (hardware)
  • Fault detection versus fault correction
  • Permanent faults versus transient faults
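As an illustration of the availability formula above (the MTTF and MTTR values here are made up, not from the slide):

    mttf_hours = 50_000.0
    mttr_hours = 24.0
    availability = mttf_hours / (mttf_hours + mttr_hours)   # Availability = MTTF / (MTTF + MTTR)
    print(availability)                                      # ~0.9995, i.e. about 99.95%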


35
RAIDs Disk Arrays
Redundant Array of Inexpensive Disks
  • Arrays of small and inexpensive disks
  • Increase potential throughput by having many disk
    drives
  • Data is spread over multiple disks
  • Multiple accesses are made to several disks at a
    time
  • Reliability is lower than a single disk
  • But availability can be improved by adding
    redundant disks (RAID)
  • Lost information can be reconstructed from
    redundant information
  • MTTR (mean time to repair) is on the order of
    hours
  • MTTF (mean time to failure) of disks is tens of
    years


36
RAID Level 0 (No Redundancy Striping)
(Diagram: sector S0 striped across four disks as
bits S0,b0; S0,b1; S0,b2; S0,b3 - labels show sector
number and bit number)
  • Multiple smaller disks as opposed to one big disk
  • Spreading the data over multiple disks (striping)
    forces accesses to several disks in parallel,
    increasing the performance (see the mapping sketch below)
  • Four times the throughput for a 4 disk system
  • Same cost as one big disk assuming 4 small
    disks cost the same as one big disk
  • No redundancy, so what if one disk fails?
  • Failure of one or more disks is more likely as
    the number of disks in the system increases
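A sketch of how striping maps a logical block onto the array (round-robin placement is assumed here; the slide does not spell out the mapping):

    def stripe_location(logical_block, n_disks=4):
        # Round-robin striping: consecutive logical blocks land on consecutive disks
        disk = logical_block % n_disks
        offset = logical_block // n_disks
        return disk, offset

    print([stripe_location(b) for b in range(8)])
    # [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1), (2, 1), (3, 1)]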


37
RAID Level 1 (Redundancy via Mirroring)
(Diagram: data disks hold S0,b0 .. S0,b3; a second
set of disks holds identical copies - the redundant
(check) data)
  • Uses twice as many disks as RAID 0 (e.g., 8
    smaller disks with second set of 4 duplicating
    the first set) so there are always two copies of
    the data
  • Still four times the throughput
  • # redundant disks = # of data disks, so twice
    the cost of one big disk
  • writes have to be made to both sets of disks, so
    writes would be only 1/2 the performance of RAID
    0
  • What if one disk fails?
  • If a disk fails, the system just goes to the
    mirror for the data


38
RAID Level 2 (Redundancy via ECC)
(Diagram: data disks 3, 5, 6, 7 hold S0,b0 .. S0,b3;
ECC disks 1, 2, 4 each hold the parity of an
overlapping subset of the data disks)
ECC disks 4 and 2 point to either data disk 6 or
7,
but ECC disk 1 says disk 7 is okay,
so disk 6 must be in error
  • ECC disks contain the parity of data on a set of
    distinct overlapping disks
  • Still four times the throughput
  • # redundant disks = log (total # of disks), so
    almost twice the cost of one big disk
  • writes require computing parity to write to the
    ECC disks
  • reads require reading ECC disk and confirming
    parity
  • Can tolerate limited disk failure, since the data
    can be reconstructed


39
RAID Level 3 (Bit-Interleaved Parity)
(Diagram: bit-interleaved data S0,b0 .. S0,b3 on four
data disks plus a parity disk; when one data disk
fails, its bits are reconstructed from the surviving
disks and the parity disk)
  • Cost of higher availability is reduced to 1/N
    where N is the number of disks in a protection
    group
  • Still four times the throughput
  • # redundant disks = 1 x # of protection groups
  • writes require writing the new data to the data
    disk as well as computing the parity, meaning
    reading the other disks, so that the parity disk
    can be updated
  • Can tolerate limited disk failure, since the data
    can be reconstructed
  • reads require reading all the operational data
    disks as well as the parity disk to calculate the
    missing data that was stored on the failed disk


40
RAID Level 4 (Block-Interleaved Parity)
parity disk
  • Cost of higher availability still only 1/N but
    the parity is stored as blocks associated with a
    set of data blocks
  • Still four times the throughput
  • # redundant disks = 1 x # of protection groups
  • Supports small reads and small writes (reads
    and writes that go to just one (or a few) data
    disk in a protection group)
  • by watching which bits change when writing new
    information, need only to change the
    corresponding bits on the parity disk
  • the parity disk must be updated on every write,
    so it is a bottleneck for back-to-back writes
  • Can tolerate limited disk failure, since the data
    can be reconstructed


41
Block Writes
  • RAID 3 block writes
  • Writing new data means writing D0, D1, D2, D3 and
    the parity P: 5 writes involving all the disks
  • RAID 4 small writes
  • Read the old data and old parity, then write the
    new data and new parity: 2 reads and 2 writes
    involving just two disks
42
RAID Level 5 (Distributed Block-Interleaved
Parity)
  • Cost of higher availability still only 1/N but
    the parity is spread throughout all the disks so
    there is no single bottleneck for writes
  • Still four times the throughput
  • # redundant disks = 1 x # of protection groups
  • Supports small reads and small writes (reads
    and writes that go to just one (or a few) data
    disk in a protection group)
  • Allows multiple simultaneous writes as long as
    the accompanying parity blocks are not located on
    the same disk
  • Can tolerate limited disk failure, since the data
    can be reconstructed


43
Problems of Disk Arrays: Block Writes
RAID-5 Small Write Algorithm
1 Logical Write = 2 Physical Reads + 2 Physical
Writes
(Diagram: to replace D0 with D0' - 1. read old data
D0; 2. read old parity P; XOR the old data, new data
and old parity to form the new parity P'; 3. write
D0'; 4. write P'. D1, D2, D3 are untouched. The
parity update is sketched below.)
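A minimal sketch of that small-write parity update (single-byte "blocks" for illustration; the function name is mine, not the slide's):

    from functools import reduce

    def raid5_parity_update(old_data, new_data, old_parity):
        # New parity = old parity XOR old data XOR new data
        # (2 physical reads: old data, old parity; 2 physical writes: new data, new parity)
        return old_parity ^ old_data ^ new_data

    blocks = [0x11, 0x22, 0x33, 0x44]                    # D0..D3
    parity = reduce(lambda a, b: a ^ b, blocks)          # full-stripe parity P

    new_d0 = 0x99                                        # small write to D0
    parity = raid5_parity_update(blocks[0], new_d0, parity)
    blocks[0] = new_d0

    assert parity == reduce(lambda a, b: a ^ b, blocks)  # parity still covers the stripe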
44
Distributing Parity Blocks
RAID 4                     RAID 5
0   1   2   3   P0         0   1   2   3   P0
4   5   6   7   P1         4   5   6   P1  7
8   9   10  11  P2         8   9   P2  10  11
12  13  14  15  P3         12  P3  13  14  15
  • By distributing parity blocks to all disks, some
    small writes can be performed in parallel

45
Disks Summary
  • Four components of disk access time
  • Seek Time: advertised to be 3 to 14 ms but lower
    in real systems
  • Rotational Latency: 5.6 ms at 5400 RPM and 2.0
    ms at 15000 RPM
  • Transfer Time: 10 to 80 MB/s
  • Controller Time: typically less than 0.2 ms
  • RAIDs can be used to improve availability
  • RAID 0 and RAID 5 widely used in servers; one
    estimate is that 80% of disks in servers are
    RAIDs
  • RAID 1 (mirroring): EMC, Tandem, IBM
  • RAID 3: Storage Concepts
  • RAID 4: Network Appliance
  • RAIDS have enough redundancy to allow continuous
    operation


46
Computer System (Idealized)
(Diagram: CPU, Memory, Disk Controller, Disk)
47
Next Topic
  • Buses


48
What is a bus?
  • A Bus Is
  • shared communication link
  • single set of wires used to connect multiple
    subsystems
  • A Bus is also a fundamental tool for composing
    large, complex systems
  • systematic means of abstraction

49
Bridge-Based Bus Architecture
  • Bridging with dual Pentium II Xeon processors on
    Slot 2.
  • (Source: http://www.intel.com)

50
Buses
51
Advantages of Buses
I/O Device
I/O Device
I/O Device
  • Versatility
  • New devices can be added easily
  • Peripherals can be moved between computer systems
    that use the same bus standard
  • Low Cost
  • A single set of wires is shared in multiple ways

52
Disadvantage of Buses
I/O Device
I/O Device
I/O Device
  • It creates a communication bottleneck
  • The bandwidth of that bus can limit the maximum
    I/O throughput
  • The maximum bus speed is largely limited by
  • The length of the bus
  • The number of devices on the bus
  • The need to support a range of devices with
  • Widely varying latencies
  • Widely varying data transfer rates

53
The General Organization of a Bus
Control Lines
Data Lines
  • Control lines
  • Signal requests and acknowledgments
  • Indicate what type of information is on the data
    lines
  • Data lines carry information between the source
    and the destination
  • Data and Addresses
  • Complex commands

54
Master versus Slave
Master issues command
Bus Master
Bus Slave
Data can go either way
  • A bus transaction includes two parts
  • Issuing the command (and address) request
  • Transferring the data (action)
  • Master is the one who starts the bus transaction
    by
  • issuing the command (and address)
  • Slave is the one who responds to the address by
  • Sending data to the master if the master asks for
    data
  • Receiving data from the master if the master
    wants to send data

55
Types of Buses
  • Processor-Memory Bus (design specific)
  • Short and high speed
  • Only need to match the memory system
  • Maximize memory-to-processor bandwidth
  • Connects directly to the processor
  • Optimized for cache block transfers
  • I/O Bus (industry standard)
  • Usually is lengthy and slower
  • Need to match a wide range of I/O devices
  • Connects to the processor-memory bus or backplane
    bus
  • Backplane Bus (standard or proprietary)
  • Backplane: an interconnection structure within
    the chassis
  • Allows processors, memory, and I/O devices to
    coexist
  • Cost advantage: one bus for all components

56
Example: Pentium System Organization
Processor/Memory Bus: design specific
Backplane Bus: PCI (PCI devices, graphics, I/O
control)
I/O Busses: IDE, USB, SCSI
57
Standard Intel Pentium Read and Write Bus Cycles
58
Intel Pentium Burst Read Bus Cycle
59
A Computer System with One Bus: the Backplane Bus
Backplane Bus
Processor
Memory
I/O Devices
  • A single bus (the backplane bus) is used for
  • Processor to memory communication
  • Communication between I/O devices and memory
  • Advantages: Simple and low cost
  • Disadvantages: slow, and the bus can become a
    major bottleneck
  • Example: IBM PC-AT

60
A Two-Bus System
  • I/O buses tap into the processor-memory bus via
    bus adaptors to speed match between bus types
  • Processor-memory bus mainly for processor-memory
    traffic
  • I/O buses provide expansion slots for I/O
    devices
  • Apple Macintosh-II
  • NuBus Processor, memory, and a few selected I/O
    devices
  • SCSI Bus the rest of the I/O devices

61
A Three-Bus System (+ backside cache)
  • A small number of backplane buses tap into the
    processor-memory bus
  • Processor-memory bus focus on traffic to/from
    memory
  • I/O buses are connected to the backplane bus
  • Advantage: loading on the processor bus is
    greatly reduced; busses run at different speeds

62
Main components of Intel Chipset: Pentium II/III
  • Northbridge
  • Handles memory
  • Graphics
  • Southbridge: I/O
  • PCI bus
  • Disk controllers
  • USB controllers
  • Audio (AC97)
  • Serial I/O
  • Interrupt controller
  • Timers

63
What defines a bus?
Transaction Protocol
Timing and Signaling Specification
Bunch of Wires
Electrical Specification
Physical / Mechanical Characteristics: the
connectors
64
Synchronous and Asynchronous Bus
  • Synchronous Bus
  • Includes a clock in the control lines
  • A fixed protocol for communication that is
    relative to the clock
  • Advantage: involves very little logic and can run
    very fast
  • Disadvantages
  • Every device on the bus must run at the same
    clock rate
  • To avoid clock skew, they cannot be long if they
    are fast
  • Asynchronous Bus
  • It is not clocked
  • It can accommodate a wide range of devices
  • It can be lengthened without worrying about clock
    skew
  • It requires a handshaking protocol

65
Busses so far
Master
Slave

Control Lines
Address Lines
Data Lines
Bus Master: has ability to control the bus,
initiates transactions. Bus Slave: module
activated by the transaction. Bus Communication
Protocol: specification of the sequence of events
and timing requirements in transferring
information. Asynchronous Bus Transfers:
control lines (req, ack) serve to orchestrate
sequencing. Synchronous Bus Transfers: sequence
relative to a common clock.
66
Simplest bus paradigm
  • All agents operate synchronously
  • All can source / sink data at same rate
  • => simple protocol
  • just manage the source and target

67
Simple Synchronous Protocol
BReq
BG
R/W Address
CmdAddr
Data1
Data2
Data
  • Even memory busses are more complex than this
  • memory (slave) may take time to respond
  • it may need to control data rate

68
Typical Synchronous Protocol
BReq
BG
R/W Address
CmdAddr
Wait
Data1
Data2
Data1
Data
  • Slave indicates when it is prepared for data xfer
  • Actual transfer goes at bus rate

69
Asynchronous Handshake
Write Transaction
Address Data Read Req Ack
Master Asserts Address
Next Address
Master Asserts Data
t0 t1 t2 t3 t4
t5
t0: Master has obtained control and asserts
address, direction, data; waits a specified
amount of time for slaves to decode target.
t1: Master asserts request line.
t2: Slave asserts ack, indicating data received.
t3: Master releases req.
t4: Slave releases ack.
70
Read Transaction
Address Data Read Req Ack
Master Asserts Address
Next Address
Slave Data
t0 t1 t2 t3 t4
t5
t0: Master has obtained control and asserts
address, direction, data; waits a specified
amount of time for slaves to decode target.
t1: Master asserts request line.
t2: Slave asserts ack, indicating ready to transmit data.
t3: Master releases req, data received.
t4: Slave releases ack.
71
What is DMA (Direct Memory Access)?
  • Typical I/O devices must transfer large amounts
    of data to the memory of the processor
  • Disk must transfer complete block (4K? 16K?)
  • Large packets from network
  • Regions of video frame buffer
  • DMA gives an external device the ability to write
    memory directly, with much lower overhead than
    having the processor request one word at a time.
  • Processor (or at least memory system) acts like
    slave
  • Issue: Cache coherence
  • What if I/O devices write data that is currently
    in processor Cache?
  • The processor may never see new data!
  • Solutions
  • Flush cache on every I/O operation (expensive)
  • Have hardware invalidate cache lines

72
Bus Transaction
Arbitration: Who gets the bus. Request: What do
we want to do. Action: What happens in response.
73
Arbitration Obtaining Access to the Bus
Control Master initiates requests
Bus Master
Bus Slave
Data can go either way
  • One of the most important issues in bus design
  • How is the bus reserved by a device that wishes
    to use it?
  • Chaos is avoided by a master-slave arrangement
  • Only the bus master can control access to the
    bus
  • It initiates and controls all bus requests
  • A slave responds to read and write requests
  • The simplest system
  • Processor is the only bus master
  • All bus requests must be controlled by the
    processor
  • Major drawback: the processor is involved in
    every transaction

74
Multiple Potential Bus Masters the Need for
Arbitration
  • Bus arbitration scheme
  • A bus master wanting to use the bus asserts the
    bus request
  • A bus master cannot use the bus until its request
    is granted
  • A bus master must signal to the arbiter after it
    finishes using the bus
  • Bus arbitration schemes usually try to balance
    two factors
  • Bus priority: the highest priority device should
    be serviced first
  • Fairness: even the lowest priority device should
    never be completely locked out
    from the bus
  • Bus arbitration schemes can be divided into four
    broad classes
  • Daisy chain arbitration
  • Centralized, parallel arbitration
  • Distributed arbitration by self-selection: each
    device wanting the bus places a code indicating
    its identity on the bus.
  • Distributed arbitration by collision detection:
    each device just goes for it; problems are
    found after the fact.

75
The Daisy Chain Bus Arbitration Scheme
Device 1 Highest Priority
Device N Lowest Priority
Device 2
Grant
Grant
Grant
Release
Bus Arbiter
Request
wired-OR
  • Advantage: simple (modeled in the sketch below)
  • Disadvantages
  • Cannot assure fairness: a low-priority
    device may be locked out indefinitely
  • The use of the daisy chain grant signal also
    limits the bus speed
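A toy model of the daisy-chain priority behaviour described above (only the grant ordering; real arbitration happens in hardware as the grant ripples down the chain):

    def daisy_chain_grant(requests):
        # The grant propagates from device 1 (highest priority) down the chain;
        # the first requesting device takes it, so devices further down never see it
        for device, wants_bus in enumerate(requests, start=1):
            if wants_bus:
                return device
        return None

    print(daisy_chain_grant([False, True, True]))   # 2: device 3 stays locked out while device 2 keeps asking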

76
Centralized Parallel Arbitration
Device 1
Device N
Device 2
Req
Grant
Bus Arbiter
  • Used in essentially all processor-memory busses
    and in high-speed I/O busses

77
Increasing the Bus Bandwidth
  • Separate versus multiplexed address and data
    lines
  • Address and data can be transmitted in one bus
    cycle if separate address and data lines are
    available
  • Cost: (a) more bus lines, (b) increased
    complexity
  • Data bus width
  • By increasing the width of the data bus,
    transfers of multiple words require fewer bus
    cycles
  • Example: SPARCstation 20's memory bus is 128 bits
    wide
  • Cost: more bus lines
  • Block transfers
  • Allow the bus to transfer multiple words in
    back-to-back bus cycles
  • Only one address needs to be sent at the
    beginning
  • The bus is not released until the last word is
    transferred
  • Cost: (a) increased complexity, (b)
    decreased response time for other requests

78
Increasing Transaction Rate on Multimaster Bus
  • Overlapped arbitration
  • perform arbitration for next transaction during
    current transaction
  • Bus parking
  • master can hold onto the bus and perform multiple
    transactions as long as no other master makes a
    request
  • Overlapped address / data phases (prev. slide)
  • requires one of the above techniques
  • Split-phase (or packet switched) bus
  • completely separate address and data phases
  • arbitrate separately for each
  • address phase yields a tag which is matched with
    the data phase
  • All of the above in most modern buses

79
PCI Read/Write Transactions
  • All signals sampled on rising edge
  • Centralized Parallel Arbitration
  • overlapped with previous transaction
  • All transfers are (unlimited) bursts
  • Address phase starts by asserting FRAME
  • Next cycle initiator asserts cmd and address
  • Data transfers happen when
  • IRDY asserted by master when ready to transfer
    data
  • TRDY asserted by target when ready to transfer
    data
  • transfer when both asserted on rising edge
  • FRAME deasserted when master intends to complete
    only one more data transfer

80
PCI Read Transaction
Turn-around cycle on any signal driven by more
than one agent
81
PCI Write Transaction
82
PCI Optimizations
  • Push bus efficiency toward 100% under common
    simple usage
  • like RISC
  • Bus Parking
  • retain bus grant for previous master until
    another makes request
  • granted master can start next transfer without
    arbitration
  • Arbitrary Burst length
  • initiator and target can exert flow control with
    xRDY
  • target can disconnect request with STOP (abort or
    retry)
  • master can disconnect by deasserting FRAME
  • arbiter can disconnect by deasserting GNT
  • Delayed (pended, split-phase) transactions
  • free the bus after request to slow device

83
Summary
  • Buses are an important technique for building
    large-scale systems
  • Their speed is critically dependent on factors
    such as length, number of devices, etc.
  • Critically limited by capacitance
  • Important terminology
  • Master The device that can initiate new
    transactions
  • Slaves Devices that respond to the master
  • Two types of bus timing
  • Synchronous bus includes clock
  • Asynchronous no clock, just REQ/ACK strobing
  • Direct Memory Access (DMA) allows fast, burst
    transfer into the processor's memory
  • The processor's memory acts like a slave
  • Probably requires some form of cache-coherence so
    that DMAed memory can be invalidated from cache.

84
The Big Picture: Where are We Now?
  • The Five Classic Components of a Computer
  • Next Topic
  • Locality and Memory Hierarchy
  • SRAM Memory Technology
  • DRAM Memory Technology
  • Memory Organization

Processor
Input
Control
Memory
Datapath
Output
85
Technology Trends
          Capacity         Speed (latency)
Logic     2x in 3 years    2x in 3 years
DRAM      4x in 3 years    2x in 10 years
Disk      4x in 3 years    2x in 10 years

DRAM   Year    Size     Cycle Time
       1980    64 Kb    250 ns
       1983    256 Kb   220 ns
       1986    1 Mb     190 ns
       1989    4 Mb     165 ns
       1992    16 Mb    145 ns
       1995    64 Mb    120 ns

1000:1 (capacity) vs. 2:1 (cycle time)!
86
Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)
(Plot: relative performance, log scale 1 to 1000,
versus time from 1980 to 2000. CPU performance
("Moore's Law") grows at about 60%/yr (2X/1.5yr);
DRAM performance grows at about 9%/yr (2X/10 yrs);
the processor-memory performance gap grows about
50% per year.)
87
Today's Situation: Microprocessor
  • Rely on caches to bridge gap
  • Microprocessor-DRAM performance gap
  • = time of a full cache miss in instructions
    executed
  • 1st Alpha (7000): 340 ns / 5.0 ns = 68 clks x 2
    or 136 instructions
  • 2nd Alpha (8400): 266 ns / 3.3 ns = 80 clks x 4
    or 320 instructions
  • 3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks x 6
    or 648 instructions
  • 1/2X latency x 3X clock rate x 3X Instr/clock =>
    about 5X

88
Cache Performance
  • CPU time = (CPU execution clock cycles +
    Memory stall clock cycles) x clock cycle time
  • Memory stall clock cycles = (Reads x Read miss
    rate x Read miss penalty + Writes x Write miss
    rate x Write miss penalty)
  • Memory stall clock cycles = Memory accesses x
    Miss rate x Miss penalty (see the sketch below)
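Those formulas as a small calculation (a sketch; the numbers plugged in are illustrative, not from the slide):

    def cpu_time(exec_cycles, mem_accesses, miss_rate, miss_penalty, cycle_time_s):
        # Memory stall cycles = memory accesses x miss rate x miss penalty
        stall_cycles = mem_accesses * miss_rate * miss_penalty
        # CPU time = (CPU execution cycles + memory stall cycles) x clock cycle time
        return (exec_cycles + stall_cycles) * cycle_time_s

    print(cpu_time(1_000_000, 300_000, 0.02, 50, 5e-9))   # seconds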

89
Impact on Performance
  • Suppose a processor executes at
  • Clock Rate = 200 MHz (5 ns per cycle)
  • Base CPI = 1.1
  • 50% arith/logic, 30% ld/st, 20% control
  • Suppose that 10% of memory operations get a 50
    cycle miss penalty
  • Suppose that 1% of instructions get the same miss
    penalty
  • CPI = Base CPI + average stalls per
    instruction = 1.1 (cycles/ins)
    + 0.30 (DataMops/ins) x 0.10 (miss/DataMop) x
    50 (cycles/miss) + 1 (InstMop/ins) x 0.01
    (miss/InstMop) x 50 (cycles/miss) = (1.1 +
    1.5 + 0.5) cycles/ins = 3.1
  • About 65% of the time the proc is stalled waiting
    for memory! (computed below)
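Reproducing the CPI arithmetic above (the parameters are the slide's):

    base_cpi = 1.1
    data_ops_per_instr = 0.30      # 30% loads/stores
    data_miss_rate = 0.10          # 10% of memory operations miss
    instr_miss_rate = 0.01         # 1% of instructions miss
    miss_penalty = 50              # cycles

    cpi = (base_cpi
           + data_ops_per_instr * data_miss_rate * miss_penalty   # 1.5
           + 1.0 * instr_miss_rate * miss_penalty)                # 0.5
    print(cpi)                              # 3.1
    print((cpi - base_cpi) / cpi)           # ~0.645: about 65% of cycles are memory stalls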

90
The Goal: illusion of large, fast, cheap memory
  • Fact: Large memories are slow, fast memories are
    small
  • How do we create a memory that is large, cheap
    and fast (most of the time)?
  • Hierarchy
  • Parallelism

91
Why hierarchy works
  • The Principle of Locality
  • Programs access a relatively small portion of the
    address space at any instant of time.

92
Memory Hierarchy How Does it Work?
  • Temporal Locality (Locality in Time)
  • => Keep most recently accessed data items closer
    to the processor
  • Spatial Locality (Locality in Space)
  • => Move blocks consisting of contiguous words to
    the upper levels

93
Memory Hierarchy Terminology
  • Hit: data appears in some block in the upper
    level of the hierarchy (example: Block X is
    found in the L1 cache)
  • Hit Rate: the fraction of memory accesses found in
    the upper level
  • Hit Time: Time to access the upper level, which
    consists of
  • RAM access time + Time to determine hit/miss
  • Miss: data needs to be retrieved from a block in
    the lower level of the hierarchy (Block Y is not
    in L1 cache and must be fetched from main memory)
  • Miss Rate = 1 - (Hit Rate)
  • Miss Penalty: Time to replace a block in the
    upper level +
  • Time to deliver the block to the processor
  • Hit Time << Miss Penalty

Lower Level Memory
Upper Level Memory
To Processor
Blk X
From Processor
Blk Y
94
Memory Hierarchy of a Modern Computer System
  • By taking advantage of the principle of locality
  • Present the user with as much memory as is
    available in the cheapest technology.
  • Provide access at the speed offered by the
    fastest technology.

(Diagram: the hierarchy runs from the processor's
Registers and On-Chip Cache, through the Second
Level Cache (SRAM) and Main Memory (DRAM), to
Secondary Storage (Disk). Speed ranges from about
1 ns at the registers to 10,000,000s of ns (10s of
ms) at the disk; size ranges from 100s of bytes
through Ks and Ms up to Gs and Ts.)
95
How is the hierarchy managed?
  • Registers <-> Memory
  • by compiler (programmer?)
  • cache <-> memory
  • by the hardware
  • memory <-> disks
  • by the hardware
  • by the operating system (disk caches, virtual
    memory)
  • by the programmer (files)

96
Memory Hierarchy Technology
  • Random Access
  • Random is good: access time is the same for all
    locations
  • DRAM: Dynamic Random Access Memory
  • High density, low power, cheap, slow
  • Dynamic: needs to be refreshed regularly (1-2%
    of cycles)
  • SRAM: Static Random Access Memory
  • Low density, high power, expensive, fast
  • Static: content will last forever (until power is
    lost)
  • Not-so-random Access Technology
  • Access time varies from location to location and
    from time to time
  • Examples: Disk, CDROM
  • Sequential Access Technology: access time linear
    in location (e.g., Tape)
  • We will concentrate on random access technology
  • The Main Memory: DRAMs; Caches: SRAMs

97
Main Memory Background
  • Performance of Main Memory
  • Latency: Cache Miss Penalty
  • Access Time: time between the request and when
    the word arrives
  • Cycle Time: time between requests
  • Bandwidth: I/O and Large Block Miss Penalty (L2)
  • Main Memory is DRAM: Dynamic Random Access
    Memory
  • Dynamic since it needs to be refreshed periodically
    (8 ms)
  • Addresses divided into 2 halves (Memory as a 2D
    matrix)
  • RAS or Row Access Strobe
  • CAS or Column Access Strobe
  • Cache uses SRAM: Static Random Access Memory
  • No refresh (6 transistors/bit vs. 1 transistor)
    Size: DRAM/SRAM = 4-8; Cost/Cycle
    time: SRAM/DRAM = 8-16

98
Random Access Memory (RAM) Technology
  • Why do computer designers need to know about RAM
    technology?
  • Processor performance is usually limited by
    memory bandwidth
  • As IC densities increase, lots of memory will fit
    on processor chip
  • Tailor on-chip memory to specific needs
  • Instruction cache
  • Data cache
  • Write buffer
  • What makes RAM different from a bunch of
    flip-flops?
  • Density: RAM is much denser

99
Main Memory Deep Background
  • Out-of-Core, In-Core, Core Dump?
  • Core memory?
  • Non-volatile, magnetic
  • Lost to 4 Kbit DRAM (today using 64Mbit DRAM)
  • Access time 750 ns, cycle time 1500-3000 ns

100
Static RAM Cell
6-Transistor SRAM Cell
(Diagram: a 6-T cell sits between the bit and
bit-bar lines; the word line (row select) connects
the cell's two complementary storage nodes to the
bit lines)
  • Write
  • 1. Drive bit lines (bit = 1, bit-bar = 0)
  • 2. Select row
  • Read
  • 1. Precharge bit and bit-bar to Vdd or Vdd/2 =>
    make sure equal!
  • 2. Select row
  • 3. Cell pulls one line low
  • 4. Sense amp on column detects difference between
    bit and bit-bar

bit
bit
replaced with pullup to save area
101
Typical SRAM Organization 16-word x 4-bit
(Diagram: a 16 x 4 array of SRAM cells. An address
decoder on A0-A3 drives word lines Word 0 .. Word 15;
data inputs Din 0-3 and outputs Dout 0-3 connect to
the bit lines through the WrEn and Precharge
circuitry.)
102
Logic Diagram of a Typical SRAM
  • Write Enable is usually active low (WE_L)
  • Din and Dout are combined to save pins
  • A new control signal, output enable (OE_L) is
    needed
  • WE_L is asserted (Low), OE_L is deasserted
    (High)
  • D serves as the data input pin
  • WE_L is deasserted (High), OE_L is asserted
    (Low)
  • D is the data output pin
  • Both WE_L and OE_L are asserted
  • Result is unknown. Don't do that!!!

103
Typical SRAM Timing
(Waveforms - Write Timing: A carries the Write
Address and D carries Data In while WE_L pulses low,
with the required write setup and hold times.
Read Timing: A carries a Read Address with OE_L low;
D goes from high-Z to Data Out after the read access
time.)
104
1-Transistor Memory Cell (DRAM)
  • Write
  • 1. Drive bit line
  • 2. Select row
  • Read
  • 1. Precharge bit line to Vdd
  • 2. Select row
  • 3. Cell and bit line share charges
  • Very small voltage changes on the bit line
  • 4. Sense (fancy sense amp)
  • Can detect changes of 1 million electrons
  • 5. Write restore the value
  • Refresh
  • 1. Just do a dummy read to every cell.

row select
bit
105
Classical DRAM Organization (square)
(Diagram: a square RAM cell array where each
intersection of a word (row) select line and a bit
(data) line is a 1-T DRAM cell. A row decoder driven
by the row address selects the row; a column
selector / I/O circuit driven by the column address
picks the data bit.)
  • Row and Column Address together
  • Select 1 bit at a time
106
Logic Diagram of a Typical DRAM
(Diagram: a 256K x 8 DRAM with a 9-bit multiplexed
address bus A, an 8-bit data bus D, and active-low
controls RAS_L, CAS_L, WE_L, OE_L.)
  • Control Signals (RAS_L, CAS_L, WE_L, OE_L) are
    all active low
  • Din and Dout are combined (D)
  • WE_L is asserted (Low), OE_L is deasserted
    (High)
  • D serves as the data input pin
  • WE_L is deasserted (High), OE_L is asserted
    (Low)
  • D is the data output pin
  • Row and column addresses share the same pins (A)
  • RAS_L goes low: Pins A are latched in as the row
    address
  • CAS_L goes low: Pins A are latched in as the column
    address
  • RAS/CAS edge-sensitive

107
DRAM Read Timing
  • Every DRAM access begins at
  • The assertion of RAS_L
  • 2 ways to read: early or late v. CAS

(Waveform - DRAM Read Cycle Time: RAS_L falls with
the row address on A, then CAS_L falls with the
column address; Data Out appears on D after the read
access time, subject to the output enable delay on
OE_L, and returns to high-Z between cycles.)
Early Read Cycle: OE_L asserted before CAS_L
Late Read Cycle: OE_L asserted after CAS_L
108
DRAM Write Timing
  • Every DRAM access begins at
  • The assertion of RAS_L
  • 2 ways to write: early or late v. CAS

(Waveform - DRAM WR Cycle Time: RAS_L falls with the
row address on A, then CAS_L falls with the column
address; Data In is driven on D and WE_L strobes the
write, with the WR access time measured to CAS_L.)
Early Wr Cycle: WE_L asserted before CAS_L
Late Wr Cycle: WE_L asserted after CAS_L
109
Key DRAM Timing Parameters
  • tRAC: minimum time from RAS line falling to the
    valid data output.
  • Quoted as the speed of a DRAM
  • A fast 4Mb DRAM: tRAC = 60 ns
  • tRC: minimum time from the start of one row
    access to the start of the next.
  • tRC = 110 ns for a 4Mbit DRAM with a tRAC of 60
    ns
  • tCAC: minimum time from CAS line falling to valid
    data output.
  • 15 ns for a 4Mbit DRAM with a tRAC of 60 ns
  • tPC: minimum time from the start of one column
    access to the start of the next.
  • 35 ns for a 4Mbit DRAM with a tRAC of 60 ns

110
DRAM Performance
  • A 60 ns (tRAC) DRAM can
  • perform a row access only every 110 ns (tRC)
  • perform column access (tCAC) in 15 ns, but time
    between column accesses is at least 35 ns (tPC).
  • In practice, external address delays and turning
    around buses make it 40 to 50 ns
  • These times do not include the time to drive the
    addresses off the microprocessor nor the memory
    controller overhead.
  • Drive parallel DRAMs, external memory controller,
    bus to turn around, SIMM module, pins
  • 180 ns to 250 ns latency from processor to memory
    is good for a 60 ns (tRAC) DRAM

111
Main Memory Performance
  • Wide
  • CPU/Mux: 1 word; Mux/Cache, Bus, Memory: N words
    (Alpha: 64 bits and 256 bits)
  • Simple
  • CPU, Cache, Bus, Memory: same width (32 bits)
  • Interleaved
  • CPU, Cache, Bus: 1 word; Memory: N modules (4
    modules); example is word interleaved

112
Main Memory Performance
Cycle Time
Access Time
Time
  • DRAM (Read/Write) Cycle Time >> DRAM
    (Read/Write) Access Time
  • 2:1; why?
  • DRAM (Read/Write) Cycle Time
  • How frequent can you initiate an access?
  • Analogy A little kid can only ask his father for
    money on Saturday
  • DRAM (Read/Write) Access Time
  • How quickly will you get what you want once you
    initiate an access?
  • Analogy As soon as he asks, his father will give
    him the money
  • DRAM Bandwidth Limitation analogy
  • What happens if he runs out of money on Wednesday?

113
Increasing Bandwidth - Interleaving
(Diagram - Access Pattern without Interleaving: the
CPU must wait for D1 to become available from the
single memory bank before starting the access for D2.
Access Pattern with 4-way Interleaving: accesses to
Memory Banks 0-3 are overlapped, and Bank 0 can be
accessed again as soon as its cycle completes.)
114
Main Memory Performance
  • Timing model
  • 1 to send address,
  • 4 for access time, 10 cycle time, 1 to send data
  • Cache Block is 4 words
  • Simple M.P. = 4 x (1 + 10 + 1) = 48
  • Wide M.P. = 1 + 10 + 1 = 12
  • Interleaved M.P. = 1 + 10 + 1 + 3 = 15
    (computed in the sketch below)
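The three miss penalties above, computed (a sketch using the slide's timing model):

    addr, cycle, xfer, words = 1, 10, 1, 4            # cycles to send address, memory cycle
                                                      # time, cycles per word sent, words/block
    simple      = words * (addr + cycle + xfer)       # 4 x (1 + 10 + 1) = 48
    wide        = addr + cycle + xfer                 # 1 + 10 + 1      = 12
    interleaved = addr + cycle + xfer + (words - 1)   # 1 + 10 + 1 + 3  = 15
    print(simple, wide, interleaved)                  # 48 12 15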

115
Independent Memory Banks
  • How many banks?
  • number of banks >= number of clocks to access a
    word in a bank
  • For sequential accesses, otherwise will return to
    original bank before it has next word ready
  • Increasing DRAM size => fewer chips => harder to
    have banks
  • Growth of bits/chip DRAM: 50-60%/yr
  • Nathan Myhrvold (M/S): mature software growth
    (33%/yr for NT) is close to the growth of MB/$
    of DRAM (25-30%/yr)

116
Fewer DRAMs/System over Time
(from Pete MacWilliams, Intel)
(Chart: DRAM generations 1 Mb ('86), 4 Mb ('89),
16 Mb ('92), 64 Mb ('96), 256 Mb ('99), 1 Gb ('02)
plotted against minimum PC memory size, 4 MB up to
256 MB. Memory per DRAM grows at about 60%/year
while memory per system grows at 25-30%/year, so the
number of DRAMs per system shrinks over time.)
117
Fast Page Mode Operation
Column Address
  • Regular DRAM Organization
  • N rows x N columns x M bits
  • Read/Write M bits at a time
  • Each M-bit access requires a RAS / CAS cycle
  • Fast Page Mode DRAM
  • N x M SRAM to save a row
  • After a row is read into the register
  • Only CAS is needed to access other M-bit blocks
    on that row
  • RAS_L remains asserted while CAS_L is toggled

DRAM
Row Address
N rows
N x M SRAM
M bits
M-bit Output
118
FP Mode DRAM
  • Fast page mode DRAM
  • In page mode, a row of the DRAM can be kept
    "open", so that successive reads or writes within
    the row do not suffer the delay of precharge and
    accessing the row. This increases the performance
    of the system when reading or writing bursts of
    data.

119
Key DRAM Timing Parameters
  • tRAC: minimum time from RAS line falling to the
    valid data output.
  • Quoted as the speed of a DRAM
  • A fast 4Mb DRAM: tRAC = 60 ns
  • tRC: minimum time from the start of one row
    access to the start of the next.
  • tRC = 110 ns for a 4Mbit DRAM with a tRAC of 60
    ns
  • tCAC: minimum time from CAS line falling to valid
    data output.
  • 15 ns for a 4Mbit DRAM with a tRAC of 60 ns
  • tPC: minimum time from the start of one column
    access to the start of the next.
  • 35 ns for a 4Mbit DRAM with a tRAC of 60 ns

120
SDRAM: Synchronous DRAM
  • More complicated, on-chip controller
  • Operations synchronized to clock
  • So, give row address one cycle
  • Column address some number of cycles later (say
    3)
  • Data comes out later (say 2 cycles later)
  • Burst modes
  • Typical might be 1,2,4,8, or 256 length burst
  • Thus, only give RAS and CAS once for all of these
    accesses
  • Multi-bank operation (on-chip interleaving)
  • Lets you overlap startup latency (5 cycles above)
    of two banks
  • Careful of timing specs!
  • 10ns SDRAM may still require 50ns to get first
    data!
  • 50ns DRAM means first data out in 50ns

121
Other Types of DRAM
  • Extended data out (EDO) DRAM
  • similar to Fast Page Mode DRAM
  • additional feature that a new access cycle can be
    started while keeping the data output of the
    previous cycle active. This allows a certain
    amount of overlap in operation (pipelining),
    allowing somewhat improved speed. It was about 5%
    faster than Fast Page Mode DRAM, which it began
    to replace in 1993.

122
Other Types of DRAM
  • Double data rate (DDR) SDRAM
  • Double data rate (DDR) SDRAM is a later
    development of SDRAM, used in PC memory from 2000
    onwards. All types of SDRAM use a clock signal
    that is a square wave.
  • This means that the clock alternates regularly
    between one voltage (low) and another (high),
    usually millions of times per second. Plain
    SDRAM, like most synchronous logic circuits, acts
    on the low-to-high transition of the clock and
    ignores the opposite transition. DDR SDRAM acts
    on both transitions, thereby halving the required
    clock rate for a given data transfer rate.

123
Memory Systems: Delay more than the raw DRAM
(Diagram: an n-bit address goes through the DRAM
controller and memory timing controller to a 2^n x 1
DRAM chip, and the data passes through w-bit bus
drivers.)
Tc = Tcycle + Tcontroller + Tdriver
124
DRAMs over Time
DRAM Generation
1st Gen. Sample           '84     '87    '90    '93    '96     '99
Memory Size               1 Mb    4 Mb   16 Mb  64 Mb  256 Mb  1 Gb
Die Size (mm2)            55      85     130    200    300     450
Memory Area (mm2)         30      47     72     110    165     250
Memory Cell Area (µm2)    28.84   11.1   4.26   1.64   0.61    0.23
(from Kazuhiro Sakashita, Mitsubishi)
125
Summary
  • Two Different Types of Locality
  • Temporal Locality (Locality in Time) If an item
    is referenced, it will tend to be referenced
    again soon.
  • Spatial Locality (Locality in Space) If an item
    is referenced, items whose addresses are close by
    tend to be referenced soon.
  • By taking advantage of the principle of locality
  • Present the user with as much memory as is
    available in the cheapest technology.
  • Provide access at the speed offered by the
    fastest technology.
  • DRAM is slow but cheap and dense
  • Good choice for presenting the user with a BIG
    memory system
  • SRAM is fast but expensive and not very dense
  • Good choice for providing the user FAST access
    time.