StarFish: highly-available block storage

1
StarFish: highly-available block storage

Eran Gabber, Jeff Fellin, Michael Flaster, Fengrui Gu,
Bruce Hillyer, Wee Teck Ng, Banu Özden, and Elizabeth Shriver
Computing Sciences Research
Lucent Technologies, Bell Laboratories

2
Your Data
3
Your Data
4
Your Data
5
How should you store your data?
???
6
StarFish

Geographically distributed on-the-fly
replication.
Dynamic recovery from element failures

Works with any application / OS / file system:
it looks like a SCSI disk
Built on commodity hardware

7
What StarFish is Not

Distributed file system
Solution to the multiple writer problem
iSCSI

8
Status

Implemented on FreeBSD 4.x.
Live system working in lab since February 28th,
2001.
No production data was ever lost due to software
error. However, there were operator errors.
Source available to the public at
http://www.bell-labs.com/topic/swdist/

9
Outline

How Does it Work?
When Things Go Wrong
What's a Good Configuration?
Performance Measurements

10
StarFish Architecture
Host
Disk
SCSI
11
StarFish Architecture
Host
StarFish
SCSI
12
StarFish Architecture
StarFish
Network
13
How it Works: Writes
The HE (host element) assigns a sequence number, then propagates the write to the SEs (storage elements)
SCSI Write
SCSI Ack
HE waits for a quorum of acks
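
The write path on this slide can be sketched in a few lines. This is a minimal, synchronous, in-memory illustration only: the class names, the `log` dictionary, and the boolean ack are assumptions made for the example, not the StarFish implementation, which sends writes over a network.

```python
from dataclasses import dataclass, field


@dataclass
class StorageElement:
    """Hypothetical in-memory stand-in for an SE."""
    name: str
    up: bool = True
    log: dict = field(default_factory=dict)  # seq num -> (block, data)

    def write(self, seq, block, data):
        """Apply a write; return True as the ack, False if the SE is down."""
        if not self.up:
            return False
        self.log[seq] = (block, data)
        return True


class HostElement:
    """Hypothetical HE: orders writes and waits for a quorum of acks."""

    def __init__(self, ses, quorum):
        self.ses = ses
        self.quorum = quorum
        self.next_seq = 0

    def scsi_write(self, block, data):
        seq = self.next_seq  # the HE assigns the sequence number
        self.next_seq += 1
        # Propagate to every SE and count the acks that come back.
        acks = sum(se.write(seq, block, data) for se in self.ses)
        # The SCSI ack goes to the host only if a quorum of SEs acked.
        return acks >= self.quorum
```

With three SEs and a quorum of two, a write still succeeds when one SE is down and fails when two are down.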
14
How it Works: Reads

Depends on Read Policy
SendAll: Read is sent to all active SEs, and the
HE responds to the host on the first reply.
SendOne: Read is sent to the lowest-latency SE. The
HE retries if there is no response.
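
The two read policies can be compared with a small sketch. It models each SE as a `(latency_ms, data)` pair, with `data` set to `None` when the SE does not respond. That representation, the function names, and the choice to retry SendOne on the next-lowest-latency SE are assumptions for illustration; the slide does not say where the retry is sent.

```python
def read_send_all(replies):
    """SendAll: ask every active SE and use the first reply to arrive.

    Returns (data, requests_sent); data is None if no SE answered.
    """
    answered = [r for r in replies if r[1] is not None]
    if not answered:
        return None, len(replies)
    fastest = min(answered, key=lambda r: r[0])
    return fastest[1], len(replies)


def read_send_one(replies):
    """SendOne: ask the lowest-latency SE, retrying on the next one if silent.

    Returns (data, requests_sent); data is None if no SE answered.
    """
    sent = 0
    for _latency, data in sorted(replies, key=lambda r: r[0]):
        sent += 1
        if data is not None:
            return data, sent
    return None, sent
```

Both policies return the same data in this model. The difference is the number of requests sent: SendAll always contacts every SE, while SendOne contacts more than one only when a retry is needed.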

15
Outline

How Does it Work?
When Things Go Wrong
What's a Good Configuration?
Performance Measurements

16
When Things Go Wrong: SE Failures

If an out-of-date SE restarts, one of three types
of recovery will take place:
Quick (HE)
Replay (SE)
Full (SE)

17
When Things Go Wrong: HE Failures

Manual switchover to backup HE via SNMP command
to SEs. SEs will then reconnect to secondary HE.
SCSI connection is a single point of failure.
Partial implementation of automatic redundant
host architecture, using controllable SCSI switch.

18
When Things Slow Down: Throttling

One SE will always be slower than the others.
When queues build up, the HE will delay SCSI
processing to allow SEs to keep up.
The HE will make sure that a quorum of SEs can
keep up. Extra SEs that are still too slow, even
after some throttling, are dropped.

19
Outline

How Does it Work?
When Things Go Wrong
What's a Good Configuration?
Performance Measurements

20
Availability Definitions

Read Availability: an up-to-date version of the
data is available for reading.
Write Availability: the system is available to
accept new writes.

21
Choosing a Quorum Size: Read Availability
22
Choosing a Quorum Size: Write Availability
23
Choosing the Number of SEs
[Plot: Write Availability vs. SE Availability; curve label: Q=1]
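
One way to explore how write availability depends on the number of SEs is a simple binomial model. This sketch assumes that SEs fail independently, that each is up with the same probability `p`, and that the HE itself is always up; these are simplifications for illustration, not the model behind the plots. The function name is hypothetical.

```python
from math import comb


def write_availability(n, q, p):
    """Probability that at least q of n SEs are up.

    Each SE is assumed to be up independently with probability p.
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(q, n + 1))
```

For example, `write_availability(3, 2, 0.99)` evaluates the chance that at least two of three SEs are up when each SE is 99% available.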
24
Outline

How Does it Work?
When Things Go Wrong
What's a Good Configuration?
Performance Measurements

25
Measurements: testbed

Local SE: a different machine connected to the
same GbE switch; no artificial delay.
Near SE: artificial latency increase simulated
through dummynet.
Far SE: increased latency simulated using dummynet,
with and without bandwidth restrictions.

26
Experimental Cases

Dark Fiber
Dedicated bandwidth
1 ms delay corresponds to 200 km, e.g., the distance
to a neighboring city
Internet
TCP/IP over fractional OC-3
1/3 of an OC-3 link is 51 Mbps
65 ms latency reflects the latency on the AT&T
backbone between NY and LA.
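
The figures on this slide can be checked with a little arithmetic. The sketch below uses the standard OC-3 line rate of 155.52 Mbit/s and the slide's rule of thumb of 200 km per millisecond. The `write_time_ms` model (serialization time plus one round trip for the ack) and all function names are assumptions for illustration, not measurements from the testbed.

```python
OC3_MBPS = 155.52        # SONET OC-3 line rate in Mbit/s
FIBER_KM_PER_MS = 200.0  # slide's rule of thumb: 1 ms of delay per 200 km


def one_way_delay_ms(distance_km):
    """Propagation delay over fiber for the given distance."""
    return distance_km / FIBER_KM_PER_MS


def fractional_oc3_mbps(fraction):
    """Bandwidth of a fraction of an OC-3 link, in Mbit/s."""
    return OC3_MBPS * fraction


def write_time_ms(nbytes, bandwidth_mbps, one_way_latency_ms):
    """Assumed model: time to push the bytes onto the link plus one round trip."""
    serialization_ms = nbytes * 8 / (bandwidth_mbps * 1e6) * 1e3
    return serialization_ms + 2 * one_way_latency_ms
```

Here `fractional_oc3_mbps(1 / 3)` comes to about 51.84 Mbit/s, which the slide rounds to 51 Mbps, and `one_way_delay_ms(200)` gives the 1 ms used for the dark-fiber case.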

27
Postmark
28
Performance During Recoveries
29
Related Work

High end: EMC SRDF
Mid range: NetApp SnapMirror
DataCore SANsymphony
Petal

30
Concluding Points

N=3, Q=2 is good.
Replicas not in the quorum can have a high-latency
/ low-bandwidth connection.
Recovery activity does not significantly degrade
performance.

31
Questions?
32
Backup Slides
33
Making a Good Configuration: definitions

Consistency: Writes, once acknowledged, are not
forgotten.
Write Availability: The system is able to accept
and acknowledge writes.
Read Availability: The system is able to respond
to read requests.

34
Choosing a Quorum Size: Consistency

If Q > N/2, StarFish can only lose data if the HE
and Q SEs fail simultaneously.
Even if the Q up-to-date SEs fail after the HE
has acked the write, as long as the HE is up, it
will ensure that all SEs will get the most recent
writes.
If Q < N/2, on restart, the HE might not have
access to current data even if Q SEs are
available.
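
The reason Q > N/2 matters is that any two groups of Q SEs then share at least one member, so whichever Q SEs are available on restart must include one that acked the latest write. The brute-force check below illustrates that property for small N. The function name is hypothetical and the code is a sketch of the counting argument, not part of StarFish.

```python
from itertools import combinations


def quorums_always_intersect(n, q):
    """True if every pair of q-sized groups drawn from n SEs shares an SE."""
    ses = range(n)
    return all(
        set(a) & set(b)
        for a in combinations(ses, q)
        for b in combinations(ses, q)
    )
```

With N=3, the check passes for Q=2 and fails for Q=1, where two disjoint single-SE groups exist and one of them may hold stale data.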

35
Choosing the Number of SEs
For a highly available system, where Q
floor(N/2)
36
How it Works: Writes