Title: Replica Placement Strategy for Wide-Area Storage Systems
1 Replica Placement Strategy for Wide-Area Storage Systems
- Byung-Gon Chun and Hakim Weatherspoon
- RADS Final Presentation
- December 9, 2004
2 Environment
- Store large quantities of data persistently and availably
- Storage Strategy
  - Redundancy - duplicate data to protect against data loss
  - Place data throughout the wide area for availability and durability
  - Avoid correlated failures
  - Continuously repair lost redundancy as needed
  - Detect permanent node failures and trigger data recovery
3 Assumptions
- Data is maintained on nodes, in the wide area, and in well-maintained sites.
- Sites contribute resources
  - Nodes (storage, CPU)
  - Network (bandwidth)
- Nodes collectively maintain data
  - Adaptive - constant change, self-organizing, self-maintaining
- Costs
  - Data recovery - the process of maintaining data availability
  - Limit wide-area bandwidth used to maintain data
4 Challenge
- Avoid correlated failures/downtime with careful data placement
- Minimize the cost of resources used to maintain data
  - Storage
  - Bandwidth
- Maximize data availability
5 Outline
- Analysis of correlated failures
  - Show that correlated failures exist and are significant
  - Effects of a common subnet (admin area, geographic location, etc.)
- Pick a threshold and extra redundancy
  - Effects of extra redundancy
  - Vary extra redundancy
- Compare random, random with constraint, and oracle placement
  - Show that the margin between oracle and random is small
6 Analysis of PlanetLab Trace Characteristics
- Trace-driven simulation
  - Model maintaining data on PlanetLab
  - Create a trace using all-pairs pings
  - Collected from February 16, 2003 to October 6, 2004
- Measure
  - Correlated failures vs. time
  - Probability of k nodes down simultaneously
  - 5th percentile and median number of available replicas vs. time
  - Cumulative number of triggered data recoveries vs. time
- Ping data: Jeremy Stribling, http://infospect.planet-lab.org/pings
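The "probability of k nodes down simultaneously" metric above can be sketched against a binary availability trace (one row per time slot, one 0/1 flag per node, as an all-pairs-ping trace would yield). This is an illustrative sketch, not the talk's actual analysis code; the function name and toy trace are assumptions.

```python
from collections import Counter

def down_count_distribution(trace):
    """Given a list of time slots, each a list of 0/1 node-availability
    flags (1 = node responded to all-pairs pings), return P(exactly k
    nodes down in one slot) as a dict k -> probability."""
    counts = Counter(row.count(0) for row in trace)
    total = len(trace)
    return {k: c / total for k, c in counts.items()}

# Toy trace: 4 slots x 3 nodes.
trace = [
    [1, 1, 1],  # all up
    [1, 0, 1],  # 1 down
    [0, 0, 1],  # 2 down simultaneously (a correlated slot)
    [1, 0, 1],  # 1 down
]
dist = down_count_distribution(trace)
# dist == {0: 0.25, 1: 0.5, 2: 0.25}
```

A heavier-than-independent tail in this distribution (k down more often than independent failures would predict) is what signals correlated failure.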
7 Analysis of PlanetLab II - Correlated Failures
8 Analysis I - Node Characteristics
9 Analysis II - Correlated Failures
10 Correlated Failures
11 Correlated Failures (machine with downtime 1000 slots)
12 Availability Trace
13 Replica Placement Strategies
- Random
- RandomSite
  - Avoid placing multiple replicas in the same site
  - A site in PlanetLab is identified by a 2-byte IP address prefix.
- RandomBlacklist
  - Avoid using machines on the blacklist: the top k machines with the longest downtime
- RandomSiteBlacklist
  - Combine RandomSite and RandomBlacklist
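The four strategies above can be sketched as one selection loop with optional constraints. This is a minimal sketch, assuming nodes are dicts with an `ip` key and that a site is approximated by the first two bytes of the IP; the function names and sample addresses are hypothetical.

```python
import random

def site_of(ip):
    """A PlanetLab site is approximated by the 2-byte IP prefix."""
    return tuple(ip.split(".")[:2])

def place(nodes, n, strategy="random", blacklist=frozenset()):
    """Choose n replica holders from nodes. 'random' ignores
    constraints; 'site' avoids placing two replicas in the same site;
    'blacklist' skips listed machines (the top-k machines with the
    longest downtime); 'site+blacklist' applies both constraints."""
    chosen, used_sites = [], set()
    candidates = nodes[:]
    random.shuffle(candidates)
    for node in candidates:
        if len(chosen) == n:
            break
        if "blacklist" in strategy and node["ip"] in blacklist:
            continue
        site = site_of(node["ip"])
        if "site" in strategy and site in used_sites:
            continue
        chosen.append(node)
        used_sites.add(site)
    return chosen

# Six nodes across three hypothetical 2-byte-prefix sites.
nodes = [{"ip": "128.1.0.1"}, {"ip": "128.1.0.2"},
         {"ip": "128.2.0.1"}, {"ip": "128.2.0.2"},
         {"ip": "128.3.0.1"}, {"ip": "128.3.0.2"}]
replicas = place(nodes, 3, "site")  # at most one replica per site
```

RandomSiteBlacklist corresponds to `place(nodes, n, "site+blacklist", blacklist)`.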
14 Comparison of simple strategies (m=1, th=9, n=14, blacklist=35)
15 Simulation Setup
- Placement algorithm
  - Random vs. Oracle
  - Oracle strategies
    - Max-Lifetime-Availability
    - Min-Max-TTR, Min-Sum-TTR, Min-Mean-TTR
- Simulation parameters
  - Replication m = 1, threshold th = 9, total replicas n = 15
  - Initial repository size: 2 TB
  - Write rate: 1 Kbps per node and 10 Kbps per node
  - 300 storage nodes
  - System grows at a rate of 3 TB and 30 TB per year, respectively.
- Metrics
  - Number of available nodes
  - Number of data repairs
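The repair policy implied by these parameters (n = 15 total replicas, repair threshold th = 9) can be sketched per object: when the number of reachable replicas drops below the threshold, trigger a repair that restores the count to n. This is a hedged sketch of that trigger rule, not the simulator itself; the function name and cost accounting are assumptions.

```python
def maintenance_step(available, n=15, th=9):
    """One simulation step for a single object. 'available' is the
    number of replicas currently reachable. A repair fires when
    availability falls below th; it recreates replicas up to n, and
    its bandwidth cost is the number of new copies made.
    Returns (replicas_after_step, copies_created)."""
    if available < th:
        return n, n - available  # repair back up to n replicas
    return available, 0          # above threshold: do nothing

# 10 replicas reachable: above threshold, no repair triggered.
assert maintenance_step(10) == (10, 0)
# A correlated failure leaves 6 reachable: repair to 15, 9 new copies.
assert maintenance_step(6) == (15, 9)
```

The gap n - th is the extra redundancy margin: the more of it there is, the larger a correlated failure burst must be before any repair bandwidth is spent.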
16 Comparison of simple strategies (m=1, th=9)
17 Results - Random Placement (1 Kbps)
18 Results - Oracle Max-Lifetime-Avail (1 Kbps)
19 Results - Breakdown of Random (1 Kbps)
20 Results - Random (10 Kbps)
21 Results - Breakdown of Random (10 Kbps)
22 Conclusion
- Correlated downtimes do exist.
- Random placement is sufficient.
- A minimum data availability threshold plus extra redundancy is sufficient to absorb most correlation.