1. DStore: An Easy-to-Manage Persistent State Store
Andy Huang and Armando Fox, Stanford University
2. Outline
- Project overview
- Consistency guarantees
- Failure detection
- Benchmarks
- Next steps and bigger picture
3. Background: Scalable CHTs

(Diagram: frontends, app servers, and DBs in a three-tier cluster)

Cluster hash tables (CHTs)
- Single-key-lookup data
  - Yahoo! user profiles
  - Amazon catalog metadata
- Underlying storage layer
  - Inktomi: wordID -> docID list, docID -> document metadata
  - DDS/Ninja: atomic compare-and-swap
4. DStore: An easy-to-manage CHT

Challenges
- Capacity planning: high scaling costs necessitate accurate load prediction
- Failure detection: fast detection is at odds with accurate detection

Benefits
- Cheap recovery: predictably fast, with a predictably small impact on availability/performance
- Our online repartitioning algorithm lowers scaling cost; reactive scaling adjusts capacity to match current load
- Lower cost of acting on false positives: effective failure detection is not contingent on accuracy

Manage like stateless frontends
5. Cheap recovery: Principles and costs

Techniques
- Single-phase writes: no locking and no transactional logging
- Quorums: no recovery code to freeze writes and copy missed updates (see the sketch below)

Costs
- Sacrifice some consistency: well-defined guarantees that provide consistent ordering
- Higher replication factor: 2N+1 bricks to tolerate N failures (vs. N+1 in ROWA)

Trade storage and consistency for cheap recovery
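
To make the technique concrete, here is a minimal sketch of a single-phase quorum write over in-memory Brick replicas; the names, the timestamp scheme, and the return values are illustrative stand-ins, not DStore's actual interfaces:

    import time

    class Brick:
        """One replica: an in-memory hash table of key -> (timestamp, value)."""
        def __init__(self):
            self.store = {}    # key -> (timestamp, value)
            self.cookies = {}  # key -> timestamp, used by the cookie sketch below

        def put(self, key, ts, value):
            # Apply the write only if it is newer than what this replica holds.
            cur = self.store.get(key)
            if cur is None or ts > cur[0]:
                self.store[key] = (ts, value)
            return True

    def quorum_write(bricks, key, value):
        """Single-phase write: send to all 2N+1 replicas, succeed on a majority.

        There is no lock and no transaction log; a failure mid-write can
        leave a minority of replicas stale, which read-repair fixes lazily.
        """
        ts = time.time()  # writes to a key are ordered by timestamp
        acks = 0
        for b in bricks:
            try:
                if b.put(key, ts, value):
                    acks += 1
            except IOError:   # an unreachable brick simply does not ack
                pass
        majority = len(bricks) // 2 + 1
        return "SUCCESS" if acks >= majority else "UNKNOWN"

With 2N+1 bricks, any two majorities intersect, so a majority read always sees at least one copy of the latest successful write; this is why tolerating N failures costs 2N+1 replicas instead of ROWA's N+1.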
6. Nothing new under the sun, but...

Technique           | Prior work                                                   | DStore
CHT                 | Scalable performance                                         | Ease of management
Quorums             | Availability during network partitions and Byzantine faults  | Availability during failures and recovery
Relaxed consistency | Availability and performance while nodes are unavailable     | Availability during failures and recovery
Result              | High availability and performance (end goal)                 | Cheap recovery (but that's just the start)
7. Cheap recovery simplifies state management

Challenge             | Prior work                                                          | DStore
Failure detection     | Difficult to make fast and accurate                                 | Effective even if it is not highly accurate
Online repartitioning | Relatively new area (Aqueduct)                                      | Duration and impact are predictably small
Capacity planning     | Predict future load                                                 | Scale reactively based on current load
Data reconstruction   | RAID                                                                | Future work
Result                | State management is costly (administration- and availability-wise) | Manage state with techniques used for stateless frontends
8. Outline
- Project overview
- Consistency guarantees
- Failure detection
- Benchmarks
- Next steps and bigger picture
9. Consistency guarantees

- Usage model
- Guarantee: for a key k, DStore enforces a global order of operations that is consistent with the order seen by individual clients.
- C1 issues w1(k, vnew) to replace the current hash table entry (k, vold)
  - w1 returns SUCCESS: subsequent reads return vnew
  - w1 returns FAIL: subsequent reads return vold
  - w1 returns UNKNOWN (due to Dlib failure): two cases, illustrated below
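
The contract, restated as a hypothetical client-side snippet (dlib, put, and get are made-up stand-ins for the Dlib interface):

    def illustrate_guarantee(dlib, k, v_old, v_new):
        """dlib is a stand-in for the client-side Dlib library."""
        result = dlib.put(k, v_new)           # w1(k, v_new) replacing (k, v_old)
        if result == "SUCCESS":
            assert dlib.get(k) == v_new       # all subsequent reads return v_new
        elif result == "FAIL":
            assert dlib.get(k) == v_old       # all subsequent reads return v_old
        else:                                 # "UNKNOWN": Dlib failed mid-write
            assert dlib.get(k) in (v_old, v_new)  # resolved by the two cases below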
10. Case 1: Another user U2 performs a read

(Diagram: U1's interrupted write of (k1, vnew) over (k1, vold) reaches only some of bricks B1-B3; U2 then reads k1)

- If U2's r(k1) returns vold, no user has read vnew; if it returns vnew, no user will later read vold
- A Dlib failure can cause a partial write, violating the quorum property
- If the timestamps returned by the bricks differ, read-repair restores the majority invariant (see the sketch below)
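
A sketch of read-repair over the Brick replicas from the earlier write sketch; the majority read and the timestamp comparison are the essence, the rest is illustrative:

    def quorum_read(bricks, key):
        """Read from a majority and repair stale replicas in passing.

        After a partial write the replicas disagree on the timestamp;
        writing the newest version back restores the majority invariant,
        spreading recovery over ordinary reads instead of a recovery phase.
        """
        majority = len(bricks) // 2 + 1
        replies = []  # (brick, (timestamp, value)) pairs
        for b in bricks:
            entry = b.store.get(key)
            if entry is not None:
                replies.append((b, entry))
            if len(replies) >= majority:
                break
        if not replies:
            return None
        newest_ts, newest_val = max((e for _, e in replies), key=lambda e: e[0])
        for b, (ts, _) in replies:
            if ts < newest_ts:  # this replica missed the latest write
                b.put(key, newest_ts, newest_val)
        return newest_val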
11. Case 2: U1 performs a read

(Diagram: U1 reads k1 from bricks B1-B3 after its own interrupted write)

- On U1's r(k1), the write is immediately committed or aborted: all future readers see either vold or vnew, consistently
- A write-in-progress cookie can be used to detect partial writes and commit/abort on the next read (see the sketch below)
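
One plausible reading of the cookie mechanism, reusing the Brick.cookies field from the first sketch; the commit-on-read policy shown is illustrative, not necessarily DStore's exact protocol:

    import time

    def cookie_write(bricks, key, value):
        """Leave a cookie on every replica for the duration of the write."""
        ts = time.time()
        for b in bricks:
            b.cookies[key] = ts   # marks: this write may be partial
            b.put(key, ts, value)
        for b in bricks:          # reached only if the writer survives
            b.cookies.pop(key, None)

    def cookie_read(bricks, key):
        """A surviving cookie means the writer died mid-write. Finish the
        write now, committing the newest version everywhere, so every
        later reader sees the same outcome. Assumes the key exists."""
        pending = any(key in b.cookies for b in bricks)
        entries = [b.store[key] for b in bricks if key in b.store]
        ts, val = max(entries, key=lambda e: e[0])
        if pending:
            for b in bricks:
                b.put(key, ts, val)
                b.cookies.pop(key, None)
        return val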
12. Consistency guarantees

- C1 issues w1(k, vnew) to replace the current hash table entry (k, vold)
  - w1 returns SUCCESS: subsequent reads return vnew
  - w1 returns FAIL: subsequent reads return vold
  - w1 returns UNKNOWN (due to Dlib failure):
    - U1 reads: w1 is immediately committed or aborted
    - U2 reads: if vold is returned, no user has read vnew; if vnew is returned, no user will later read vold
13. Versus sequential consistency

(Diagram: bricks B1-B3 hold (k1, vold) and (k2, vold); U1 issues w1(k1, vnew); U2 reads)

- Sequential consistency requires two conditions: atomicity and consistent ordering
- UNKNOWN results cause non-atomic writes, so DStore provides consistent ordering rather than full sequential consistency
14. Two-phase commit vs. single-phase writes

Property     | 2-phase commit                                       | Single-phase writes
Consistency  | Sequential consistency                               | Consistent ordering
Recovery     | Read log to complete in-progress transactions        | No special-case recovery
Availability | Locking may cause requests to block during failures  | No locking
Performance  | 2 synchronous log writes, 2 roundtrips               | 1 synchronous update, 1 roundtrip
Other costs  | None                                                 | Read-repair (spreads the cost of 2PC over later reads, making the common case faster); write-in-progress cookie (spreads the responsibility of 2PC)
15. Recovery behavior
Predictably fast and small impact
16. Application-generic failure detection

Detection pipeline:
- Bricks send operating statistics (CPU load, requests processed, etc.) to a beacon listener
- Median absolute deviation flags anomalies (deviation > threshold)
- An anomalous brick is rebooted

Simple detection techniques work because the resolution mechanism is cheap (see the sketch below)
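
A minimal sketch of the median-absolute-deviation check over per-brick statistics; the threshold value is illustrative:

    import statistics

    def mad_outliers(stats, threshold=3.0):
        """Flag bricks whose statistic strays from the cluster median by
        more than `threshold` median-absolute-deviations. The threshold
        needs no careful tuning: a false positive only costs a reboot."""
        med = statistics.median(stats.values())
        mad = statistics.median(abs(v - med) for v in stats.values())
        if mad == 0:
            return []  # all bricks agree; nothing to flag
        return [brick for brick, v in stats.items()
                if abs(v - med) / mad > threshold]

    # e.g. requests processed in the last window, per brick:
    print(mad_outliers({"b1": 510, "b2": 495, "b3": 505, "b4": 120}))
    # -> ['b4'], a reboot candidate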
17. Failure detection and repartitioning behavior

- Aggressive failure detection
- Low scaling cost
- Low cost of acting on false positives
18. Bigger picture: What is self-managing?

- Indicator: brick performance, a sign of system health
- Monitoring: tests for potential problems
- Treatment: a low-impact resolution mechanism
20. Bigger picture: What is self-managing?

- Indicators: brick performance, system load, disk failures
- Simple detection mechanisms and policies
- Key: low-cost mechanisms
- Constant recovery
25. Big picture

- Use simple metrics to trigger scaling (see the sketch below)
  - Brick load
  - Cache hit rate
- Online data reconstruction
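
A sketch of a reactive scaling policy driven by those metrics; the thresholds and action names are made up for illustration:

    def scaling_action(avg_brick_load, cache_hit_rate,
                       load_high=0.8, hit_low=0.6):
        """React to current load rather than predicting future load;
        cheap online repartitioning makes a wrong decision inexpensive
        to reverse."""
        if avg_brick_load > load_high or cache_hit_rate < hit_low:
            return "add a brick and repartition"
        return "no change"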
26. Simple, aggressive failure detection

- Bricks send operating statistics
  - CPU load, average queue delay, number of requests processed, etc.
- Statistical methods
  - Median absolute deviation compares one brick's behavior with the current behavior of the rest of the bricks
  - Tarzan incorporates the past behavior of each brick and detects anomalies in the patterns of operating statistics
- Why these techniques are effective
  - They are not the best failure detection mechanisms
  - Parameters are not highly tuned
  - Simple, application-generic techniques work because of the low cost of acting on false positives