Title: Storage Systems
1. Storage Systems
2. Memory Hierarchy III: I/O System
- Boring, but important
- I/O has been the orphan of computer architecture
- The most widely used performance measure, CPU time, by definition ignores I/O
- Second-class citizenship: the term "peripheral" is applied to I/O devices
- Common sense: response time is the better measure, and I/O is a big part of it!
- The customer who pays for it cares, even if the computer designer does not
[Figure: memory hierarchy — registers, instruction (I) and data (D) caches, L2 cache, L3 cache, main memory, disk (swap)]
3. I/O (Disk) Performance
- Who cares? You do
- Remember Amdahl's Law
- Want fast disk access (fast swap, fast file reads)
- I/O performance
- Bandwidth: I/Os per second (IOPS)
- Latency: response time
- Is I/O (disk) latency important? Why not just context-switch?
- Context-switching requires more memory
- Context-switching requires jobs to context-switch to
- Context-switching annoys users (productivity = f(1/response time))
4. I/O Device Characteristics
- Type
- Input: read only
- Output: write only
- Storage: both
- Partner
- Human
- Machine
- Data rate
- Peak transfer rate
5. Disk Parameters
- 1-20 platters (data on both sides)
- Magnetic iron-oxide coating
- 1 read/write head per side
- 500-2,500 tracks per platter
- 32-128 sectors per track
- Sometimes fewer on inside tracks
- 512-2,048 bytes per sector
- Usually fixed length
- Each sector: data + ECC (parity) + gap
- 4-24 GB total
- 3,000-10,000 RPM
[Figure: disk assembly — platters on a spindle, one read/write head per surface, tracks and sectors, with an R/W cache and controller]
6. Disk Performance
- Service (response) time:
- t_disk = t_seek + t_rotation + t_transfer + t_controller + t_queuing (see the sketch below)
- t_seek (seek time): move the head to the track
- t_rotation (rotational latency): wait for the sector to come around
- average t_rotation = 0.5 / RPS (RPS = RPM/60)
- t_transfer (transfer time): read the disk
- rate_transfer = bytes/sector × sectors/track × RPS
- t_transfer = bytes transferred / rate_transfer
- t_controller (controller delay): wait for the controller to do its thing
- t_queuing (queuing delay): wait for older requests to finish
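A minimal sketch (not from the slides): the service-time equation above as a small Python calculator; the parameter names simply mirror the terms of the formula.

```python
def disk_service_time(seek_ms, rpm, sectors_per_track, bytes_per_sector,
                      request_bytes, controller_ms=0.0, queuing_ms=0.0):
    """t_disk = t_seek + t_rotation + t_transfer + t_controller + t_queuing (in ms)."""
    rps = rpm / 60.0                                    # revolutions per second
    t_rotation_ms = 0.5 / rps * 1000                    # wait half a revolution on average
    rate = bytes_per_sector * sectors_per_track * rps   # sustained transfer rate, bytes/s
    t_transfer_ms = request_bytes / rate * 1000
    return seek_ms + t_rotation_ms + t_transfer_ms + controller_ms + queuing_ms
```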
7. Example: Seagate Cheetah 73LP
- Model number: ST373405LW
- Capacity: 73.4 GB
- Speed: 10,000 RPM
- Seek time: 5.1 ms avg
- Interface: Ultra160 SCSI
- Suggested resale price: $980.00
8. Disk Performance Example
- Parameters
- 3,600 RPM → 60 RPS
- Avg seek time 9 ms
- 100 sectors per track, 512 bytes per sector
- Controller + queuing delays: 1 ms
- Q: average time to read 1 sector?
- rate_transfer = 100 sectors/track × 512 B/sector × 60 RPS ≈ 24 Mb/s (≈ 3 MB/s)
- t_transfer = 512 B / 24 Mb/s ≈ 0.16 ms
- t_rotation = 0.5 / 60 RPS ≈ 8.3 ms
- t_disk = 9 ms + 8.3 ms + 0.2 ms + 1 ms ≈ 18.5 ms (see the sketch below)
- t_transfer is only a small component!
- End of story? No! t_queuing is not fixed (it gets longer with more requests)
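Plugging the slide's parameters into the same arithmetic as a standalone sketch (values rounded as on the slide):

```python
rpm, sectors_per_track, bytes_per_sector = 3600, 100, 512
seek_ms, controller_queue_ms = 9.0, 1.0

rps = rpm / 60.0                                      # 60 revolutions per second
rate = sectors_per_track * bytes_per_sector * rps     # ~3.07 MB/s (~24 Mb/s)
t_transfer = bytes_per_sector / rate * 1000           # ~0.17 ms
t_rotation = 0.5 / rps * 1000                         # ~8.3 ms
t_disk = seek_ms + t_rotation + t_transfer + controller_queue_ms
print(f"t_disk = {t_disk:.1f} ms")                    # ~18.5 ms
```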
9. Disk Performance: Queuing Theory
- I/O is a queuing system
- Equilibrium: rate_arrival = rate_departure
- Total time: t_system = t_queue + t_server
- rate_arrival × t_system = length_system (Little's Law)
- utilization_server = t_server × rate_arrival
- The important result (derivation in H&P):
- t_system = t_server / (1 - utilization_server) (see the sketch below)
- If the server is highly utilized, t_system gets VERY HIGH
- Lesson: keep utilization low (below 75%)
- Q: what is the new t_disk if the disk is 50% utilized?
- t_disk_new = t_disk_old / (1 - 0.5) = 37 ms
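A small sketch of the queuing result quoted above (the sample utilizations are assumed, not from the slides): response time explodes as utilization approaches 1.

```python
def response_time(t_server_ms, utilization):
    """t_system = t_server / (1 - utilization_server)."""
    assert 0 <= utilization < 1, "server must not be fully utilized"
    return t_server_ms / (1.0 - utilization)

for u in (0.0, 0.25, 0.5, 0.75, 0.9):
    print(f"utilization {u:.2f}: t_system = {response_time(18.5, u):.1f} ms")
# at 50% utilization the 18.5 ms disk behaves like a 37 ms disk, as on the slide
```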
10. Disk Usage Models
- Data mining, supercomputing
- Large files, sequential reads
- Raw data transfer rate is most important
- Transaction processing
- Large files, but random access, many small requests
- IOPS is most important
- Time sharing, filesystems
- Small files, sequential accesses, potential for file caching
- IOPS is most important
- Must design the disk (I/O) system based on the target workload
- Use disk benchmarks (they exist)
11. Disk Alternatives
- Solid state disk (SSD)
- DRAM + battery backup with a standard disk interface
- Fast: no seek time, no rotation time, fast transfer rate
- Expensive
- FLASH memory
- Fast: no seek time, no rotation time, fast transfer rate
- Non-volatile
- Slow bulk erase before write
- Wears out over time
- Optical disks (CDs)
- Cheap if write-once, expensive if write-multiple
- Slow
12. Extensions to Conventional Disks
- Increasing density: more sensitive heads, finer control
- Increases cost
- Fixed head: one head per track
- Seek time eliminated
- Low track density
- Parallel transfer: simultaneous read from multiple platters
- Difficulty in locking onto different tracks on multiple surfaces
- Lower-cost alternatives possible (disk arrays)
13. More Extensions to Conventional Disks
- Disk caches: disk-controller RAM buffers data
- Fast writes: RAM acts as a write buffer
- Better utilization of the host-to-device path
- High miss rate increases request latency
- Disk scheduling: schedule requests to reduce latency (see the sketch below)
- e.g., schedule the request with the shortest seek time
- e.g., elevator algorithm for seeks (head sweeps back and forth)
- Works best for unlikely cases (long queues)
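As an illustration of the elevator idea (a hypothetical helper, not part of the slides): serve all pending tracks at or above the head position in ascending order, then sweep back down.

```python
def elevator_order(head, pending_tracks):
    """Return pending track requests in one up-then-down sweep from `head`."""
    up = sorted(t for t in pending_tracks if t >= head)                    # sweep outward
    down = sorted((t for t in pending_tracks if t < head), reverse=True)   # sweep back
    return up + down

print(elevator_order(50, [95, 10, 60, 33, 70]))  # [60, 70, 95, 33, 10]
```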
14. Disk Arrays
- Collection of individual disks (D = number of disks)
- Distribute data across the disks
- Access them in parallel for higher bandwidth (IOPS)
- Issue: data distribution → load balancing
- e.g., 3 disks, 3 files (A, B, and C), each 2 sectors long
[Figure: three layouts of files A, B, C (sectors A0/A1, B0/B1, C0/C1) on 3 disks —
undistributed: disk 1 holds A0,A1; disk 2 holds B0,B1; disk 3 holds C0,C1;
coarse-grain striping: whole sectors interleaved across the disks (A0,B1 / A1,C0 / B0,C1);
fine-grain striping: every sector is split across all three disks]
15. Disk Arrays: Stripe Width
- Fine-grain striping
- D × stripe width evenly divides the smallest accessible unit of data (a sector)
- Only one request is served at a time
- Perfect load balance
- Effective transfer rate approx. D times better than a single disk
- Access time can go up, unless the disks are synchronized (disk skew)
- Coarse-grain striping
- Data-transfer parallelism for large requests
- Concurrency for small requests (several small requests served at once)
- Statistical load balance
- Must consider the workload to determine the stripe width (see the sketch below)
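A minimal sketch (illustrative only) of how a logical block maps to a disk under coarse-grain striping; stripe_unit is the knob the last bullet refers to, and stripe_unit=1 degenerates to interleaving single blocks.

```python
def locate(block, num_disks, stripe_unit):
    """Map a logical block number to (disk, offset) under coarse-grain striping."""
    stripe = block // stripe_unit                # which stripe unit the block falls in
    disk = stripe % num_disks                    # stripe units go round-robin over disks
    offset = (stripe // num_disks) * stripe_unit + block % stripe_unit
    return disk, offset

# blocks 0..7 on 4 disks, 2 blocks per stripe unit
print([locate(b, 4, 2) for b in range(8)])
# [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1), (3, 0), (3, 1)]
```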
16. Disk Redundancy and RAIDs
- Disk failures are a significant fraction of all hardware failures
- Electrical failures are rare; mechanical failures are more common
- Striping increases the number of files touched by a failure
- Fix with replication and/or parity protection
- RAID: redundant array of inexpensive disks
- Arrays of cheap disks provide high performance and dependability
- If MTTF is high and MTTR is low → redundancy can increase dependability significantly
17. Reliability, Availability, and Dependability
- Reliability and availability are measures of dependability
- Reliability: a measure of continuous service accomplishment
- MTTF: mean time to failure
- Rate of failure = 1/MTTF
- Service interruption is measured as MTTR: mean time to repair
- If a collection of modules have exponentially distributed lifetimes, the overall failure rate is the sum of the failure rates of the modules
- Availability: a measure of service accomplishment
- Availability = MTTF / (MTTF + MTTR) (see the sketch below)
- MTBF: mean time between failures = MTTF + MTTR
- Widely used
- MTTF is often more appropriate
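For instance, a quick sketch of the availability formula above with assumed numbers (MTTF = 1,200,000 hours, MTTR = 24 hours):

```python
mttf_hours, mttr_hours = 1_200_000, 24           # assumed example values
availability = mttf_hours / (mttf_hours + mttr_hours)
print(f"availability = {availability:.5%}")       # ~99.998%
```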
18. Array Reliability
Reliability of N disks = reliability of 1 disk ÷ N
  = 1,200,000 hours ÷ 100 disks = 12,000 hours
  (1 year = 365 × 24 ≈ 8,700 hours)
Disk-system MTTF drops from about 140 years to about 1.5 years!
Arrays (without redundancy) are too unreliable to be useful!
19. Redundant Arrays of Independent Disks
Files are "striped" across multiple spindles.
Redundancy yields high data availability:
- Disks will fail!
- Contents are reconstructed from data redundantly stored in the array
- Capacity penalty to store the redundant data
- Bandwidth penalty to update it
Techniques: mirroring/shadowing (high cost) and parity
20. RAID Levels
- 6 levels of RAID, depending on redundancy/concurrency (D = number of data disks, C = number of check disks)
- Level 0: non-redundant striping (data disks only, C = 0); widely used
- Level 1: full mirroring (C = D)
- Level 2: memory-style ECC (e.g., D = 8, C = 4); not used
- Level 3: bit-interleaved parity (e.g., D = 8, C = 1)
- Level 4: block-interleaved parity (e.g., D = 8, C = 1)
- Level 5: block-interleaved distributed parity (e.g., D = 8, C = 1); most widely used
- Level 6: two-dimensional error bits (e.g., D = 8, C = 2); not presently available
21. RAID 1: Disk Mirroring/Shadowing
[Figure: each disk is paired with its shadow copy in a recovery group]
Each disk is fully duplicated onto its "shadow" → high availability.
Bandwidth sacrifice on write: one logical write → two physical writes.
Reads may be optimized.
Most expensive solution: 100% capacity overhead.
Targeted for high-I/O-rate, high-availability environments.
Probability of failure (assuming a 24-hour MTTR):
  24 / (1.2 × 10^6 × 1.2 × 10^6) ≈ 6.9 × 10^-13 → roughly 170,000,000 years
22. RAID 3: Parity Disk
[Figure: a logical record (10010011 11001101 10010011 ...) striped bit/byte-wise across the data disks as physical records, with a parity disk P covering the recovery group]
Parity is computed across the recovery group to protect against hard-disk failures.
33% capacity cost for parity in this configuration; wider arrays reduce the capacity cost but decrease expected availability and increase reconstruction time.
Arms are logically synchronized and spindles rotationally synchronized, so the array behaves logically as a single high-capacity, high-transfer-rate disk.
Targeted for high-bandwidth applications: scientific computing, image processing.
23. RAID 3: Write Update
RAID-3 small write algorithm: 1 logical write → 3 reads + 2 writes
Involves all the disks: read the unchanged data blocks (D1, D2, D3), XOR them with the new data D0' to form the new parity P', then write D0' and P'.
[Figure: new data D0' XORed with D1, D2, D3 to produce P'; D0' and P' are written back]
24. RAID 4: Write Update
RAID-4/5 small write algorithm: 1 logical write → 2 reads + 2 writes (see the sketch below)
[Figure: read the old data D0 and the old parity P, XOR both with the new data D0', then write D0' and the new parity P']
- Involves just two disks
- Fewer read/write operations
- Increasing the size of the parity group increases the savings
- Bottleneck: the parity disk is updated on every write
- Spread the parity info across all disks → RAID 5
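A sketch (illustrative, not the slides' notation) of the small-write parity update described above: the new parity is the old parity XORed with the old and the new data, so only two disks are read and two are written.

```python
def raid45_small_write_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """new_parity = old_parity XOR old_data XOR new_data, byte by byte."""
    return bytes(p ^ od ^ nd for p, od, nd in zip(old_parity, old_data, new_data))

# toy 4-byte blocks
d0_old, d0_new = b"\x12\x34\x56\x78", b"\xff\x00\xff\x00"
p_old = b"\xaa\xbb\xcc\xdd"
p_new = raid45_small_write_parity(d0_old, d0_new, p_old)

# the parity changes by exactly the change in D0
assert bytes(a ^ b for a, b in zip(p_new, p_old)) == \
       bytes(a ^ b for a, b in zip(d0_new, d0_old))
```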
25. RAID 5: High I/O Rate Parity
[Figure: stripe units D0-D23 laid out across the disk columns in order of increasing logical disk address; the parity block P of each stripe rotates to a different disk]
A logical write becomes four physical I/Os; multiple writes can occur simultaneously as long as their stripe units (and parity blocks) are not located on the same disks.
Targeted for mixed applications.
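A sketch of one common way to rotate parity across the disks (a left-symmetric-style layout; real controllers may rotate differently), just to make "parity on a different disk each stripe" concrete.

```python
def raid5_layout(num_stripes, num_disks):
    """Per-stripe placement of data blocks and the parity block P, rotating left."""
    layout, block = [], 0
    for stripe in range(num_stripes):
        parity_disk = (num_disks - 1 - stripe) % num_disks   # rotate parity each stripe
        row = []
        for disk in range(num_disks):
            if disk == parity_disk:
                row.append("P")
            else:
                row.append(f"D{block}")
                block += 1
        layout.append(row)
    return layout

for row in raid5_layout(3, 5):
    print(row)
# ['D0', 'D1', 'D2', 'D3', 'P']
# ['D4', 'D5', 'D6', 'P', 'D7']
# ['D8', 'D9', 'P', 'D10', 'D11']
```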
26. Subsystem Organization
[Figure: host → host adapter (interface to host, DMA) → array controller (control, buffering, parity logic) with cache → single-board disk controllers (physical device control) → disks]
No application modifications; no reduction of host performance.
27. System Availability: Orthogonal RAIDs
Fault-tolerant scheme to protect against string faults as well as disk faults.
28. System-Level Availability
Fully redundant: no single point of failure.
[Figure: duplicated hosts, I/O controllers, and cache/array controllers, all connected to shared recovery groups of disks]
With duplicated paths, higher performance when there are no failures.
29. Basic Computer Structure
[Figure: CPU and memory on the memory (system) bus, connected through a bridge to the I/O bus with devices such as a NIC]
30. A Typical PC Bus Structure
31. PC Bus View
[Figure: processor/memory bus at the top, bridged to a PCI bus, which in turn connects to lower-speed I/O buses]
32. I/O and Memory Buses
Memory buses are designed for speed (usually custom designs); I/O buses are designed for compatibility (usually industry standards) and low cost.
33. Buses

                 Network                Channel              Backplane
Connects         Machines               Devices              Chips
Distance         > 1000 m               10-100 m             0.1 m
Bandwidth        10-1000 Mb/s           40-1000 Mb/s         320-2000 Mb/s
Latency          high (> 1 ms)          medium               low (nanoseconds)
Reliability      low                    medium               high
Error handling   extensive CRC          byte parity          byte parity

Networks and channels are message-based, with narrow pathways and distributed arbitration; backplanes are memory-mapped, with wide pathways and centralized arbitration.
34. What Defines a Bus?
- Transaction protocol
- Timing and signaling specification
- A bunch of wires
- Electrical specification
- Physical/mechanical characteristics: the connectors
- Glue that interfaces the components of a computer system
35. Bus Issues
- Clocking: is the bus clocked?
- Synchronous
- Clocked, little logic → fast (includes a clock in the control lines)
- All devices need to run at the same clock rate
- To avoid clock skew, buses cannot be long if they are fast
- Asynchronous
- No clock, use handshaking instead → slower
- Can accommodate a wide range of devices
- Can be lengthened without worrying about clock skew
- Switching: when is control of the bus acquired and released?
- Atomic: bus held until the request completes → slow
- Split-transaction: bus free between request and reply → fast
36. More Bus Issues
- Arbitration: deciding who gets the bus next
- Chaos is avoided by a master-slave arrangement
- Processor as the only bus master → CPU involved in every transaction
- Overlap arbitration for the next master with the current transfer
- Daisy chain: closer devices have priority → simple but slow
- Distributed: wired-OR, low-priority back-off → medium
- Other issues
- Split data/address lines, width, burst transfer
[Figure: daisy-chain arbitration — the bus arbiter's grant line passes from Device 1 (highest priority) through Device 2 to Device N (lowest priority); devices share common request and release lines; order: request, grant, release]
37. Asynchronous Handshake: Write Operation
[Timing diagram: address, data, read, request, and acknowledge lines over t0-t5; the master asserts the address and data, then moves on to the next address]
- t0: Master has obtained control and asserts address, direction (not read), and data; waits a specified amount of time for slaves to decode the target
- t1: Master asserts the request line
- t2: Slave asserts ack, indicating the data has been received
- t3: Master releases req
- t4: Slave releases ack
38. Asynchronous Handshake: Read Operation
[Timing diagram: address, data, read, req, and ack lines over t0-t5; the master asserts the address, then the slave drives the data]
- t0: Master has obtained control and asserts address and direction; waits a specified amount of time for slaves to decode the target
- t1: Master asserts the request line
- t2: Slave asserts ack, indicating it is ready to transmit data
- t3: Master releases req once the data is received
- t4: Slave releases ack
39. Example: PCI Read/Write Transactions
- All signals are sampled on the rising edge
- Centralized parallel arbitration, overlapped with the previous transaction
- All transfers are (unlimited) bursts
- The address phase starts by asserting FRAME
- In the next cycle, the initiator asserts the command and address
- Data transfers happen when
- IRDY is asserted by the master when ready to transfer data
- TRDY is asserted by the target when ready to transfer data
- A transfer occurs when both are asserted on a rising edge
- FRAME is de-asserted when the master intends to complete only one more data transfer
40. Example: PCI Read Transaction
A turn-around cycle is required on any signal driven by more than one agent.
41. Who Does I/O?
- Main CPU (programmed I/O)
- Explicitly executes all I/O operations
- High overhead, potential cache pollution
- Memory-mapped I/O
- Physical addresses are set apart (no real memory behind them!)
- When the processor sees these addresses, it steers the access to the I/O device/processor
- I/O processor (IOP, or channel processor)
- Special- or general-purpose processor dedicated to I/O operations
- Fast
- May be overkill; cache-coherence problems
- DMAC (direct memory access controller)
- Can transfer data to/from memory given a start address
- Fast, usually simple
- May still have coherence problems; must be on the memory bus
42. Programmed I/O vs. DMA
- Programmed I/O is fine for sending commands, receiving status, and communicating small amounts of data
- Inefficient for large amounts of data
- Keeps the CPU busy during the transfer
- Programmed I/O → memory operations → slow
- Direct memory access
- The device reads/writes directly from/to memory
- Memory → device transfers are typically initiated by the CPU
- Device → memory transfers can be initiated by either the device or the CPU
43. Programmed I/O vs. DMA
[Figure: three CPU-memory-interconnect configurations — programmed I/O (all data passes through the CPU), DMA, and DMA with direct device-to-memory transfers]
44. Six Steps to Perform a DMA Transfer
45. Communicating with I/O Processors
- Not an issue if the main CPU performs I/O by itself
- I/O control: how to initialize the DMAC/IOP?
- Memory-mapped, VM-protected addresses
- Privileged I/O instructions
- I/O completion: how does the CPU know the DMAC/IOP is finished?
- Polling: periodically check a status bit → slow
- Interrupt: I/O completion interrupts the CPU → fast
- Q: do the DMAC/IOP use physical or virtual addresses?
- Physical: simpler, but can only transfer 1 page at a time. Why? (see the sketch below)
- Virtual: more powerful, but the DMAC/IOP needs a TLB
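The "1 page at a time" limit exists because a buffer that is contiguous in virtual memory may be scattered across physical page frames. A small sketch (hypothetical page size and page table) of how an OS would split one logical transfer into per-page segments for a physically addressed DMA engine:

```python
PAGE_SIZE = 4096  # assumed page size

def dma_segments(vaddr, length, page_table):
    """Split a virtually contiguous buffer into (physical_address, length) chunks."""
    segments = []
    while length > 0:
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        chunk = min(length, PAGE_SIZE - offset)        # stop at the page boundary
        segments.append((page_table[vpage] * PAGE_SIZE + offset, chunk))
        vaddr, length = vaddr + chunk, length - chunk
    return segments

# virtual pages 0x10 and 0x11 map to non-adjacent physical frames
for paddr, n in dma_segments(0x10F00, 0x300, {0x10: 0x80, 0x11: 0x23}):
    print(hex(paddr), hex(n))   # 0x80f00 0x100, then 0x23000 0x200
```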
46. Polling vs. Interrupts
- Polling
- Busy-wait cycle to wait for I/O from the device
- Inefficient unless the device is very fast
- Interrupts
- A CPU interrupt-request line is triggered by the I/O device
- The interrupt handler receives interrupts
- Maskable to ignore or delay some interrupts
- An interrupt vector dispatches the interrupt to the correct handler
- Based on priority
- Some are non-maskable
- The interrupt mechanism is also used for exceptions
47. Example: Interrupt vs. DMA
- 1000 transfers of 1000 bytes each (see the sketch below)
- Interrupt mechanism
- 1000 interrupts @ 2 microsec per interrupt
- 1000 interrupt services @ 98 microsec each
- → 0.1 CPU seconds
- Device transfer rate 10 MB/s → 1000 bytes take 100 microsec
- 1000 transfers × 100 microsec = 0.1 CPU seconds
- Total: 0.2 CPU seconds (≈ 50% overhead due to interrupt handling)
- DMA
- 1 DMA setup sequence @ 50 microsec
- 1 interrupt @ 2 microsec
- 1 interrupt service sequence @ 48 microsec
- Total: 0.0001 seconds of CPU time
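The same arithmetic as a quick sketch (all figures are the slide's assumed values):

```python
US = 1e-6  # microseconds to seconds

# Interrupt-driven: per-transfer interrupt + service routine, plus the copy itself
transfers = 1000
interrupt_cpu = transfers * (2 + 98) * US        # 0.1 s of interrupt handling
copy_cpu = transfers * (1000 / 10e6)             # 1000 B at 10 MB/s = 100 us each -> 0.1 s
print("interrupt-driven:", interrupt_cpu + copy_cpu, "CPU seconds")   # 0.2

# DMA: one setup sequence, one completion interrupt, one service routine
dma_cpu = (50 + 2 + 48) * US
print("DMA:", dma_cpu, "CPU seconds")            # 0.0001
```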
48. Other Issues/Trends
- Block servers versus filers: where is the file illusion maintained?
- Traditional answer: the server
- Access storage as disk blocks and maintain the metadata
- Use a file cache
- Alternative: the disk subsystem maintains the file abstraction
- The server uses a file-system protocol to communicate
- Ex: Network File System (NFS) for UNIX, Common Internet File System (CIFS) for Windows
- Referred to as network-attached storage (NAS) devices
- Switches replacing buses
- Moore's Law is driving the cost of switch components down
- Replace the bus with point-to-point links and switches
- Switched networks provide higher aggregate bandwidth
- Point-to-point links can be longer
- Ex: InfiniBand delivers 2-24 Gbps, and a link can go up to 17 m (PCI: 0.5 m)
49. I/O System Example
- Given
- 500 MIPS CPU
- 16-byte-wide, 100 ns memory system
- 10,000 instructions per I/O
- 16 KB per I/O
- 200 MB/s I/O bus, with room for 20 SCSI-2 controllers
- SCSI-2 strings: 20 MB/s with 15 disks per bus
- SCSI-2: 1 ms overhead per I/O
- 7,200 RPM (120 RPS), 8 ms avg seek, 6 MB/s transfer disks
- 200 GB total storage
- Q: choose 2 GB or 8 GB disks for maximum IOPS?
- How should disks and controllers be arranged?
A similar example is in the book on page 744.
50. I/O System Example (cont'd)
- Step 1: calculate the CPU, memory, and I/O bus peak IOPS
- CPU: 500 MIPS / (10,000 instructions/IO) = 50,000 IOPS
- Memory: (16 bytes / 100 ns) / (16 KB/IO) = 10,000 IOPS
- I/O bus: (200 MB/s) / 16 KB = 12,500 IOPS
- The memory bus is the bottleneck at 10,000 IOPS!
- Step 2: calculate disk IOPS
- t_disk = 8 ms + 0.5/120 RPS + 16 KB / (6 MB/s) ≈ 15 ms
- Disk: 1 / 15 ms = 67 IOPS
- 8 GB disks → need 25 → 25 × 67 IOPS = 1,675 IOPS
- 2 GB disks → need 100 → 100 × 67 IOPS = 6,700 IOPS
- The 100 2-GB disks, at 6,700 IOPS, are the new bottleneck!
- Answer I: 100 2-GB disks!
51. I/O System Example (cont'd)
- Step 3: calculate the SCSI-2 controller peak IOPS
- t_SCSI-2 = 1 ms + 16 KB / (20 MB/s) ≈ 1.8 ms
- SCSI-2: 1 / 1.8 ms ≈ 556 IOPS
- Step 4: how many disks per controller?
- 556 IOPS / 67 IOPS ≈ 8 disks per controller
- Step 5: how many controllers?
- 100 disks / (8 disks/controller) ≈ 13 controllers
- Answer II: 13 controllers with 8 disks each (see the sketch below)
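Putting steps 1-5 together as a quick sketch (values from the slides; rounding matches the slides):

```python
KB, MB = 1024, 1_000_000
io_size = 16 * KB

# Step 1: peak IOPS of CPU, memory system, and I/O bus
cpu_iops = 500e6 / 10_000                         # 50,000
mem_iops = (16 / 100e-9) / io_size                # ~10,000  (the bottleneck)
bus_iops = 200 * MB / io_size                     # ~12,500

# Step 2: per-disk IOPS and the disk count for 200 GB of 2 GB disks
t_disk = 8e-3 + 0.5 / 120 + io_size / (6 * MB)    # ~15 ms
disk_iops = 1 / t_disk                            # ~67
disks = 100                                       # 100 x 2 GB disks
array_iops = min(disks * disk_iops, cpu_iops, mem_iops, bus_iops)

# Steps 3-5: SCSI-2 controllers
t_ctrl = 1e-3 + io_size / (20 * MB)               # ~1.8 ms
ctrl_iops = 1 / t_ctrl                            # ~556
disks_per_ctrl = int(ctrl_iops // disk_iops)      # 8
controllers = -(-disks // disks_per_ctrl)         # ceil(100 / 8) = 13

print(f"~{array_iops:.0f} IOPS, {controllers} controllers x {disks_per_ctrl} disks")
```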
52. Summary
- Disks
- Parameters, performance, RAID
- Buses
- I/O vs. memory
- I/O system architecture
- CPU vs. DMAC vs. IOP