Title: A Locality Preserving Decentralized File System
Slide 1: A Locality Preserving Decentralized File System
- Jeffrey Pang
- Haifeng Yu
- Phil Gibbons
- Michael Kaminsky
- Srini Seshan
Slide 2: Project Intro
- Defragmenting the DHT data layout for:
  - Improved availability for entire tasks
  - Amortized data lookup latency
- [Figure: current DHT data layout (random placement) vs. defragmented DHT data layout (sequential placement)]
- Typical task/operation sizes:
  - 30-65% access >10 8KB blocks
  - 8-30% access >100 8KB blocks
Slide 3: Background
- Existing DHT storage systems (see the sketch below):
  - Each server is responsible for a pseudo-random range of the ID space
  - Objects are given pseudo-random IDs
- [Figure: DHT ring with servers owning ID ranges (e.g., 150-210, 211-400, 401-513, 800-999) and objects assigned pseudo-random IDs (e.g., 160, 324, 987)]
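The slide summarizes how existing DHT storage systems place data; below is a minimal sketch of that pseudo-random placement, assuming a Chord-style ring with SHA-1 keys and successor lookup. The server names and the example blocks are made up for illustration.

```python
# Minimal sketch of pseudo-random placement in existing DHT storage systems,
# assuming a Chord-style ring: object ID = SHA-1(name), and the object lives
# on the first server whose ID follows it clockwise. Names are illustrative.
import hashlib
from bisect import bisect_left

def object_id(name: str) -> int:
    """Pseudo-random 160-bit ID: the SHA-1 hash of the object name."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

def responsible_server(obj_id: int, server_ids: list[int]) -> int:
    """Successor lookup: the first server ID at or after the object ID."""
    servers = sorted(server_ids)
    i = bisect_left(servers, obj_id)
    return servers[i % len(servers)]  # wrap around the ring

# Blocks of the same file land on unrelated servers under random placement:
servers = [object_id(f"server-{n}") for n in range(8)]
for block in ("/home/bob/docs/paper.txt#0", "/home/bob/docs/paper.txt#1"):
    print(block, "->", hex(responsible_server(object_id(block), servers))[:12])
```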
Slide 4: Project Overview
- Goal: produce a decentralized, read-mostly filesystem with the following properties:
  - Sequential layout of related data
  - Amortized lookup latency
  - Improved availability
- Some challenges:
  - Load balancing
  - Download throughput
- Project focus:
  - System design and implementation
- [Figure: current DHT data layout (random placement) vs. defragmented DHT data layout (sequential placement)]
Slide 5: Overview
- Background & Motivation
- Preserving Object Locality
- Dynamic Load Balancing
- Results
- Future Work
Slide 6: Preserving Object Locality
- Motivation:
  - Fate sharing: all objects in a single operation are more likely to be available at once
  - Effective caching/prefetching: servers I've contacted recently are more likely to have what I want next
- Design options:
  - Namespace locality (e.g., filesystem hierarchy)
  - Dynamic clustering (e.g., based on observed access patterns)
Slide 7: Is Namespace Locality Good Enough?
- Initial trace evaluation
- Workloads:
  - HP block-level disk trace (1999)
  - Harvard research NFS trace (2003)
  - NLANR webcache trace (2003)
- Setup:
  - Order files alphabetically by filepath
  - 10,000 data blocks/server
  - Calculate the failure probability of each operation (see the sketch below)
  - Node failure probability of 5%
  - 3 replicas
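The slides do not spell out the availability calculation, so the sketch below shows one plausible reading of it: an operation fails if any block it touches has all 3 replicas on failed servers, with independent 5% node failures. The grouping of blocks into replica groups is an assumption, not necessarily the authors' exact method.

```python
# Hedged sketch of the per-operation availability estimate, assuming an
# operation fails if ANY block it touches has all 3 replicas on failed
# servers, with independent node failures (p = 0.05). One plausible reading
# of the slide, not necessarily the authors' exact calculation.
P_FAIL = 0.05      # node failure probability (5%)
REPLICAS = 3

def op_failure_prob(num_replica_groups: int) -> float:
    """Failure probability of an operation touching `num_replica_groups`
    distinct replica groups: it succeeds only if every group has at least
    one live replica, and a group is fully down with probability p^r."""
    p_group_down = P_FAIL ** REPLICAS                 # 0.05^3 = 1.25e-4
    return 1.0 - (1.0 - p_group_down) ** num_replica_groups

# Sequential layout packs an operation's blocks into few replica groups,
# while random placement scatters them over many:
print(op_failure_prob(1))    # ~1.3e-4  (all blocks co-located)
print(op_failure_prob(100))  # ~1.2e-2  (blocks spread over 100 groups)
```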
Slide 8: Estimated Availability Across Workloads
Slide 9: Encoding Object Names
- [Figure: traditional DHT key encoding: the 160-bit key is SHA1(data), i.e., the SHA-1 hash of the data]
- Leverage:
  - Large key space (amortized cost over the wide area is minimal)
  - Workload properties (e.g., 99% of the time, directory depth < 12)
- Corner cases:
  - Depth or width overflow: use 1 bit to signal the overflow region and just use SHA1(filepath)
Slide 10: Encoding Object Names
- [Figure: example encoding in which each key is built from a userid field, a path-encoding field, and a blockid field. User "Bill" (userid 6) and user "Bob" (userid 7) get distinct key prefixes, so Bill's files (e.g., under his "Docs" directory) map to contiguous ID ranges such as 570-600, 601-660, and 661-700. A sketch of such an encoding follows below.]
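The slides give only the field layout (userid, path encoding, blockid) and the overflow rule from slide 9; the sketch below shows one way such a locality-preserving key might be packed into a 160-bit ID. All field widths, the depth/width limits, and the function names are illustrative assumptions, not the authors' actual format.

```python
# Hedged sketch of a locality-preserving key: pack a userid, per-level
# directory indices, and a block id into a fixed-width integer so that blocks
# of a file, and files in a directory, get adjacent keys. Field widths,
# depth/width limits, and names are assumptions, not the authors' format.
import hashlib

KEY_BITS, LEVEL_BITS, MAX_DEPTH, BLOCK_BITS = 160, 8, 12, 32
OVERFLOW_FLAG = 1 << (KEY_BITS - 1)   # 1 bit signals the overflow region

def encode_key(userid: int, dir_indices: list[int], block_id: int, path: str) -> int:
    """Return a key that sorts by (user, directory path, block id)."""
    too_deep = len(dir_indices) > MAX_DEPTH
    too_wide = any(i >= (1 << LEVEL_BITS) for i in dir_indices)
    if too_deep or too_wide:
        # Depth or width overflow: fall back to SHA1(filepath) in the overflow region.
        h = int.from_bytes(hashlib.sha1(path.encode()).digest(), "big")
        return OVERFLOW_FLAG | (h >> 1)
    key = userid
    for level in range(MAX_DEPTH):                    # fixed depth keeps keys comparable
        idx = dir_indices[level] if level < len(dir_indices) else 0
        key = (key << LEVEL_BITS) | idx
    return (key << BLOCK_BITS) | block_id             # adjacent blocks -> adjacent keys

# Blocks of the same file (and files in the same directory) get nearby keys:
print(encode_key(6, [1, 0], 1, "/Bill/Docs/a.txt"))
print(encode_key(6, [1, 0], 2, "/Bill/Docs/a.txt"))
```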
Slide 11: Dynamic Load Balancing
- Motivation:
  - The hash function is no longer uniform
  - Uniform ID assignment to nodes leads to load imbalance
- Design options:
  - Simple item balancing (MIT)
  - Mercury (CMU)
- [Figure: storage load vs. node number, showing load balance with 1024 nodes using the Harvard trace]
Slide 12: Load Balancing Algorithm
- Basic idea (see the sketch below):
  - Contact a random node in the ring
  - If myLoad > delta * hisLoad (or vice versa), the lighter node changes its ID to move just before the heavier node; the heavy node's load splits in two
  - Node loads are within a factor of 4 of each other after O(log n) steps
- Mercury optimizations:
  - Continuous sampling of load around the ring
  - Use the estimated load histogram to do informed probes
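A minimal sketch of one item-balancing step as described above, assuming node load is the number of stored items and a node's ID equals the largest key it owns; DELTA and the data structures are illustrative, not the authors' implementation.

```python
# Hedged sketch of one item-balancing step as described on the slide: a light
# node re-joins just before a heavy node and takes half of its items. Load is
# taken to be the item count, and a node's ID is assumed to equal the largest
# key it owns; DELTA and the data structures are illustrative.
import random

DELTA = 2.0  # imbalance threshold (assumed value)

def balance_step(nodes: dict[int, list[int]]) -> None:
    """nodes maps node ID -> sorted list of item keys it currently stores."""
    a, b = random.sample(sorted(nodes), 2)            # contact a random peer
    heavy, light = (a, b) if len(nodes[a]) >= len(nodes[b]) else (b, a)
    if len(nodes[heavy]) <= DELTA * max(len(nodes[light]), 1):
        return                                        # loads already within threshold
    ring = sorted(nodes)
    successor = ring[(ring.index(light) + 1) % len(ring)]
    old_items = nodes.pop(light)
    nodes[successor] = sorted(nodes[successor] + old_items)  # hand off old items
    items = nodes[heavy]
    mid = len(items) // 2
    # The light node changes its ID to sit just before the heavy node,
    # taking the first half of its items: the heavy node's load splits in two.
    nodes[heavy] = items[mid:]
    nodes[items[mid - 1]] = items[:mid]
```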
Slide 13: Handling Temporary Resource Constraints
- Drastic storage distribution changes can cause frequent data movement
- Node storage can be temporarily constrained (e.g., no more disk space)
- Solution: lazy data movement (see the sketch below)
  - The node responsible for a key keeps a pointer to the actual data blocks
  - Data blocks can be stored anywhere in the system
Slide 14: Handling Temporary Resource Constraints
- [Figure: a WRITE reaches the node responsible for the key, which records a pointer while the data blocks themselves are stored on other nodes]
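The slides state only that the responsible node keeps a pointer to the actual data blocks; the sketch below illustrates that indirection with an assumed two-step lookup. The class, method, and host names are hypothetical.

```python
# Hedged sketch of lazy data movement via indirection: the node responsible
# for a key stores only a small pointer record, so the bulky data block can
# live on any node with spare disk and be migrated later. The class, method,
# and host names are hypothetical.
from dataclasses import dataclass

@dataclass
class Pointer:
    block_key: int
    holder: str            # address of the node actually storing the block

class OwnerNode:
    """Node responsible for a key range; its own disk may be full."""
    def __init__(self) -> None:
        self.pointers: dict[int, Pointer] = {}

    def write(self, key: int, holder: str) -> None:
        # Record only the pointer; the block itself was written to `holder`.
        self.pointers[key] = Pointer(key, holder)

    def lookup(self, key: int) -> str:
        # Clients resolve key -> holder here, then fetch the block directly.
        return self.pointers[key].holder

owner = OwnerNode()
owner.write(601, holder="node-17.example.net")   # block stored elsewhere
print(owner.lookup(601))                         # -> node-17.example.net
```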
Slide 15: Results
- How much improvement in availability and lookup latency can we expect?
- Setup:
  - Trace-based simulation with the Harvard trace
  - File blocks named using our encoding scheme
  - Same availability calculation as before
  - Clients keep open connections to the 1-100 most recently contacted data servers (see the sketch below)
  - 1024 servers
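As a rough illustration of how keeping connections to recently contacted servers amortizes lookups, the sketch below counts DHT lookups for a trace of per-block responsible servers using an LRU set of K open connections. K and the accounting are assumptions about the simulation.

```python
# Hedged sketch of how "potential reduction in lookups" could be measured:
# a client keeps open connections to the K most recently contacted servers,
# and a DHT lookup is only needed when the responsible server isn't cached.
# K and the accounting below are assumptions about the simulation.
from collections import OrderedDict

def count_lookups(server_sequence: list[str], k: int) -> int:
    """Number of DHT lookups for a trace of per-block responsible servers."""
    recent: OrderedDict[str, None] = OrderedDict()   # LRU set of open connections
    lookups = 0
    for server in server_sequence:
        if server in recent:
            recent.move_to_end(server)               # connection reused, no lookup
        else:
            lookups += 1                             # must look up and connect
            recent[server] = None
            if len(recent) > k:
                recent.popitem(last=False)           # close least recently used
    return lookups

# Sequential layout keeps consecutive blocks on the same few servers:
print(count_lookups(["s1"] * 50 + ["s2"] * 50, k=10))            # 2 lookups
print(count_lookups([f"s{i % 40}" for i in range(100)], k=10))   # 100 lookups
```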
Slide 16: Potential Reduction in Lookups
Slide 17: Potential Availability Improvement
- [Figure: failure probability under random (expected), ordered (uniform), and optimal placement]
- Our encoding has nearly the same failure probability as the alphabetical ordering (differs by 0.0002)
Slide 18: Results
- What is the overhead of load balancing?
- Setup:
  - Simulated load balancing with the Harvard trace
  - 1024 servers
  - Each load balancing step uses a histogram estimated from 4 random samples (see the sketch below)
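The slide mentions a load histogram estimated from 4 random samples (the Mercury-style informed probes from slide 12); below is a rough sketch of such an estimate. The bucketing of the ID space and the averaging are assumptions.

```python
# Rough sketch of estimating the ring-wide load distribution from a few random
# probes (the slide says 4), in the spirit of Mercury's informed probes.
# The ID-space bucketing and averaging are assumptions.
import random

def sampled_load_histogram(node_loads: dict[int, int], id_space: int,
                           probes: int = 4, buckets: int = 8) -> list[float]:
    """Average sampled load per ID-space bucket; heavy buckets tell a light
    node where re-joining would relieve the most load."""
    sums, counts = [0.0] * buckets, [0] * buckets
    for node_id in random.sample(sorted(node_loads), probes):
        b = min(node_id * buckets // id_space, buckets - 1)   # region of the ring
        sums[b] += node_loads[node_id]
        counts[b] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]
```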
Slide 19: Load Balance Over Time
Slide 20: Data Migration Overhead
Slide 21: Related Work
- Namespace locality:
  - Cylinder group allocation (FFS)
  - Co-locating data and meta-data (C-FFS)
  - Isolating user data in clusters (Archipelago)
  - Namespace flattening in object-based storage (Self-*)
- Load balancing / data indirection:
  - DHT item balancing (SkipNets, Mercury)
  - Data indirection (Total Recall)
Slide 22: (Near) Future Work
- Finish up the implementation
  - Currently finished: data block storage/retrieval, data indirection, some load balancing
  - Still requires:
    - Some debugging
    - Interfacing with NFS via filesystem loopback
    - Evaluation on real testbeds (Emulab and PlanetLab)
- Targeting submission to NSDI 2006
  - Mid-October deadline