Title: StarFish: highly-available block storage
1StarFish highly-available block storage
- Eran Gabber, Jeff Fellin, Michael Flaster, Fengrui Gu, Bruce Hillyer, Wee Teck Ng, Banu Özden, and Elizabeth Shriver
- Computing Sciences Research
- Lucent Technologies, Bell Laboratories
2Your Data
3Your Data
4Your Data
5How should you store your data?
???
6StarFish
- Geographically distributed on-the-fly replication.
- Dynamic recovery from element failures
- Works with any application / OS / file-system: looks like a SCSI disk
- Built on commodity hardware
7What StarFish is Not
- Distributed file system
- Solution to the multiple writer problem
- iSCSI
8Status
- Implemented on FreeBSD 4.x.
- Live system working in lab since February 28th, 2001.
- No production data was ever lost due to software error. However, there were operator errors.
- Source available to the public at http://www.bell-labs.com/topic/swdist/
9Outline
- How Does it Work?
- When Things Go Wrong
- What's a Good Configuration?
- Performance Measurements
10StarFish Architecture
Host
Disk
SCSI
11StarFish Architecture
Host
Star Fish
SCSI
12StarFish Architecture
Star Fish
Network
13How it Works: Writes
HE (host element) assigns a sequence number, then propagates the write to the SEs (storage elements)
SCSI Write
SCSI Ack
HE waits for a quorum of acks
14How it Works: Reads
- Depends on Read Policy
- SendAll: Read is sent to all active SEs, and the HE responds to the host on the first reply.
- SendOne: Read is sent to the lowest-latency SE. The HE retries if there is no response.
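The two read policies can be sketched as a small simulation. This is a minimal illustration, not the StarFish code: the `SE` class, its `latency_ms` and `alive` fields, and the rule that a SendOne retry goes to the next-lowest-latency SE are all assumptions made for the example. Each function also returns the number of read requests sent, which is where the two policies differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SE:
    """Hypothetical stand-in for a storage element."""
    name: str
    latency_ms: float
    alive: bool = True
    blocks: Dict[int, bytes] = field(default_factory=dict)


def send_all(ses: List[SE], block: int) -> Tuple[str, bytes, int]:
    """SendAll: send the read to every SE and answer with the first reply."""
    responders = [se for se in ses if se.alive]
    if not responders:
        raise IOError("no SE responded")
    # The first reply to arrive comes from the lowest-latency live SE.
    first = min(responders, key=lambda se: se.latency_ms)
    return first.name, first.blocks[block], len(ses)


def send_one(ses: List[SE], block: int) -> Tuple[str, bytes, int]:
    """SendOne: ask the lowest-latency SE; on no response, retry elsewhere.

    Assumption: the retry goes to the next-lowest-latency SE.
    """
    requests = 0
    for se in sorted(ses, key=lambda se: se.latency_ms):
        requests += 1
        if se.alive:
            return se.name, se.blocks[block], requests
    raise IOError("no SE responded")
```

In this model SendAll always costs one request per SE but never waits for a retry, while SendOne costs a single request in the common case and pays extra only when the preferred SE is silent.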
15Outline
- How Does it Work?
- When Things Go Wrong
- What's a Good Configuration?
- Performance Measurements
16When Things Go Wrong: SE Failures
- If an out-of-date SE restarts, one of three types of recovery will take place:
- Quick (HE)
- Replay (SE)
- Full (SE)
17When Things Go Wrong: HE Failures
- Manual switchover to backup HE via SNMP command to SEs. SEs will then reconnect to secondary HE.
- SCSI connection is a single point of failure.
- Partial implementation of automatic redundant host architecture, using controllable SCSI switch.
18When Things Slow Down: Throttling
- One SE will always be slower than the others.
- When queues build up, the HE will delay SCSI processing to allow SEs to keep up.
- The HE will make sure that a quorum of SEs can keep up. Extra SEs that are too slow, even after some throttling, are dropped.
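One way to picture the throttling decision is the sketch below. It is only an illustration under stated assumptions: the slides do not give the actual rule, so the per-SE backlog counts and the two thresholds (`high_water`, `drop_limit`) are hypothetical.

```python
from typing import Dict, List, Tuple


def throttle_step(backlog: Dict[str, int], quorum: int,
                  high_water: int, drop_limit: int) -> Tuple[bool, List[str]]:
    """Decide whether to delay host I/O and which extra SEs to drop.

    backlog    -- outstanding writes queued per SE (hypothetical metric)
    high_water -- backlog above which the HE starts delaying (assumed)
    drop_limit -- backlog above which an extra SE is dropped (assumed)
    """
    # Least-backlogged SEs first: the first `quorum` of them are protected.
    ordered = sorted(backlog, key=lambda name: backlog[name])
    extras = ordered[quorum:]
    # Only SEs beyond the quorum may be dropped for being too slow.
    dropped = [name for name in extras if backlog[name] > drop_limit]
    remaining = [name for name in ordered if name not in dropped]
    # Delay SCSI processing while any remaining SE is falling behind.
    delay_host = any(backlog[name] > high_water for name in remaining)
    return delay_host, dropped
```

For example, with backlogs `{"local": 2, "near": 40, "far": 500}`, a quorum of 2, `high_water=32` and `drop_limit=256`, the sketch drops `far` and still delays the host so that `near` can catch up.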
19Outline
- How Does it Work?
- When Things Go Wrong
- What's a Good Configuration?
- Performance Measurements
20Availability Definitions
- Read Availability: an up-to-date version of the data is available for reading.
- Write Availability: the system is available to accept new writes.
21Choosing a Quorum Size: Read Availability
22Choosing a Quorum Size: Write Availability
23Choosing the Number of SEs
Plot (Q = 1): Write Availability vs. SE Availability
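The relationship between SE availability, N, and Q for writes can be worked through with a short calculation. This is a sketch, not the model behind the plots: it assumes SEs fail independently, each is up with the same probability `p`, and the HE is always up, so the system accepts writes exactly when at least Q of the N SEs are up.

```python
from math import comb


def at_least_k_up(n: int, k: int, p: float) -> float:
    """Probability that at least k of n independent SEs are up."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def write_availability(n: int, q: int, p: float) -> float:
    """Writes need Q acks, so at least Q of the N SEs must be up.

    Assumptions: independent SE failures, HE always available.
    """
    return at_least_k_up(n, q, p)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(n, round(write_availability(n, 1, 0.9), 6))
```

With Q = 1 the expression reduces to 1 - (1 - p)^N, so each added SE raises write availability; with a larger Q the same function shows how the quorum requirement trades some of that gain away.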
24Outline
- How Does it Work?
- When Things Go Wrong
- What's a Good Configuration?
- Performance Measurements
25Measurements testbed
- Local SE: different machine connected to the same GbE switch; no artificial delay
- Near SE: artificial latency increase simulated through dummynet
- Far SE: using dummynet, simulated increased latency, with and without bandwidth restrictions
26Experimental Cases
- Dark Fiber
- Dedicated bandwidth
- 1ms delay is 200km, e.g. distance to neighboring city
- Internet
- TCP/IP over fractional OC-3
- 1/3 of an OC-3 link is 51 Mbps
- 65ms latency reflects latency on the AT&T backbone between NY and LA.
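The figures on this slide follow from simple arithmetic, sketched below. The constants are assumptions of the example rather than values taken from the slides: light in fiber is taken to cover about 200 km per millisecond (roughly two thirds of the speed of light in vacuum), and the OC-3 line rate is taken as 155.52 Mbps. The `transfer_ms` helper and its 8 KB example are hypothetical.

```python
FIBER_KM_PER_MS = 200.0   # assumed propagation speed in fiber
OC3_MBPS = 155.52         # OC-3 line rate


def propagation_delay_ms(distance_km: float) -> float:
    """Propagation delay over a fiber link of the given length."""
    return distance_km / FIBER_KM_PER_MS


def fractional_oc3_mbps(fraction: float) -> float:
    """Bandwidth of a fraction of an OC-3 link."""
    return OC3_MBPS * fraction


def transfer_ms(nbytes: int, mbps: float, latency_ms: float) -> float:
    """Latency plus serialization time for one message (hypothetical helper)."""
    # bits / (Mbps * 1000) gives milliseconds.
    return latency_ms + nbytes * 8 / (mbps * 1000)


if __name__ == "__main__":
    print(propagation_delay_ms(200))             # 1 ms for 200 km
    print(round(fractional_oc3_mbps(1 / 3), 2))  # about 51.84 Mbps
    print(round(transfer_ms(8192, fractional_oc3_mbps(1 / 3), 65), 2))
```

On the simulated Internet link the 65 ms latency dominates: serializing an 8 KB block at one third of an OC-3 adds only a little over a millisecond.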
27Postmark
28Performance During Recoveries
29Related Work
- High end: EMC SRDF
- Mid range: NetApp SnapMirror
- DataCore SANsymphony
- Petal
30Concluding Points
- N = 3, Q = 2 is good.
- Replicas not in the quorum can have a high-latency / low-bandwidth connection.
- Recovery activity does not significantly degrade performance.
31Questions?
32Backup Slides
33Making a Good Configuration: definitions
- Consistency: Writes, once acknowledged, are not forgotten.
- Write availability: The system is able to accept and acknowledge writes.
- Read availability: The system is able to respond to read requests.
34Choosing a Quorum Size: Consistency
- If Q > N/2, StarFish can only lose data if the HE and Q SEs fail simultaneously.
- Even if the Q up-to-date SEs fail after the HE has acked the write, as long as the HE is up, it will ensure that all SEs will get the most recent writes.
- If Q < N/2, on restart, the HE might not have access to current data even if Q SEs are available.
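The reasoning behind the Q > N/2 condition can be checked by brute force: if every pair of Q-sized SE sets shares at least one member, then any Q SEs available at restart include one that acknowledged the latest write. The sketch below enumerates the sets for small N; the function name is made up for the example.

```python
from itertools import combinations


def quorums_always_overlap(n: int, q: int) -> bool:
    """True if every two sets of q SEs (out of n) share at least one SE."""
    ses = range(n)
    return all(set(a) & set(b)
               for a in combinations(ses, q)
               for b in combinations(ses, q))


if __name__ == "__main__":
    for n in range(1, 6):
        for q in range(1, n + 1):
            print(n, q, quorums_always_overlap(n, q))
```

For N = 3 the check passes with Q = 2 and fails with Q = 1: with Q = 1 the single SE that acknowledged a write and the single SE available after a restart can be different machines.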
35Choosing the Number of SEs
For a highly available system, where Q = floor(N/2) + 1
36How it Works: Writes
- Single-owner (the HE) access semantics
- Writes are assigned sequence numbers by the HE.
- Each SE applies them in order. Any gap necessitates an SE recovery.
- No reordering/coalescing/optimizations
- HE acks the write to the host once a quorum of SEs report it's been committed.
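The write rules above can be sketched as a toy model. This is an illustration under assumptions, not the StarFish implementation: the `HE`, `SE`, and `NeedsRecovery` names are invented for the example, the calls are synchronous rather than networked, and recovery itself is not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class NeedsRecovery(Exception):
    """Raised when an SE sees a gap in the sequence numbers."""


@dataclass
class SE:
    name: str
    up: bool = True
    last_seq: int = 0
    blocks: Dict[int, bytes] = field(default_factory=dict)

    def apply(self, seq: int, block: int, data: bytes) -> bool:
        if not self.up:
            return False                  # no ack from a down SE
        if seq != self.last_seq + 1:      # writes must arrive in order
            raise NeedsRecovery(f"{self.name}: expected {self.last_seq + 1}, got {seq}")
        self.blocks[block] = data
        self.last_seq = seq
        return True                       # ack: write committed


@dataclass
class HE:
    ses: List[SE]
    quorum: int
    next_seq: int = 1

    def write(self, block: int, data: bytes) -> bool:
        """Assign a sequence number, propagate, ack once a quorum commits."""
        seq = self.next_seq
        self.next_seq += 1
        acks = 0
        for se in self.ses:
            try:
                if se.apply(seq, block, data):
                    acks += 1
            except NeedsRecovery:
                pass                      # out-of-date SE must recover first
        return acks >= self.quorum
```

With three SEs and a quorum of 2, a write still succeeds while one SE is down; when that SE comes back it detects the sequence gap on the next write and does not ack until it has recovered.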
37Read Performance
38Write Latency: N = 3