Title: Large-Scale Sharing: GFS and PAST
1. Large-Scale Sharing: GFS and PAST
2. Distributed File Systems
- Traditional Definition
- Data and/or metadata stored at remote locations, accessed by clients over the network.
- Various degrees of centralization, from NFS to xFS.
- GFS and PAST
- Unconventional, specialized functionality
- Large-scale in data and nodes
3. The Google File System
- Specifically designed for Google's backend needs
- Web Spiders append to huge files
- Application data patterns
- Multiple producer multiple consumer
- Many-way merging
- GFS ≠ traditional file systems
4. Design Space Coordinates
- Commodity Components
- Very large files (multi-GB)
- Large sequential accesses
- Co-design of Applications and File System
- Supports small files and random-access reads and writes, but not efficiently
5. GFS Architecture
- Interface
- Usual: create, delete, open, close, etc.
- Special: snapshot and record append
- Files are divided into fixed-size chunks
- Each chunk is replicated across chunkservers
- A single master maintains all metadata
- Master, chunkservers, and clients run as user-level processes on Linux workstations
6. Client File Request
- Client translates the byte offset within the file into a chunk index (see the sketch below)
- Client sends <filename, chunk index> to the Master
- Master returns the chunk handle and chunkserver locations
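A minimal sketch of this lookup from the client side, assuming GFS's fixed 64 MB chunk size; locate_chunk and master.lookup are hypothetical names, not the real GFS interface:

    CHUNK_SIZE = 64 * 1024 * 1024   # GFS uses fixed-size 64 MB chunks

    def locate_chunk(master, filename, offset):
        # Translate the byte offset into a chunk index within the file.
        chunk_index = offset // CHUNK_SIZE
        # Ask the master for the chunk handle and the replica locations.
        # master.lookup is an assumed RPC stub.
        chunk_handle, chunkserver_locations = master.lookup(filename, chunk_index)
        return chunk_handle, chunkserver_locations

Clients cache this mapping, so the master is contacted roughly once per chunk rather than on every read or write.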
7. Design Choices: Master
- Single master maintains all metadata
- Simple Design
- Global decision making for chunk replication and placement
- Bottleneck?
- Single Point of Failure?
8. Design Choices: Master
- Single master maintains all metadata in memory! (sketch below)
- Fast master operations
- Allows background scans of the entire state
- Memory Limit?
- Fault Tolerance?
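A rough Python sketch of why keeping metadata in memory makes background scans cheap; the structures and helper below are illustrative, not the actual GFS layout:

    class MasterMetadata:
        def __init__(self):
            # File namespace: full pathname -> ordered list of chunk handles.
            self.file_to_chunks = {}
            # Chunk handle -> chunkserver locations; these are not persisted
            # but rebuilt by polling chunkservers when the master starts.
            self.chunk_locations = {}

        def under_replicated(self, goal=3):
            # With everything in memory, periodic full scans (e.g. to find
            # chunks that need re-replication) are cheap.
            return [h for h, locs in self.chunk_locations.items()
                    if len(locs) < goal]

The two questions on this slide are addressed in the paper by the small per-chunk metadata footprint and by a replicated, checkpointed operation log.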
9. Relaxed Consistency Model
- File regions can be:
- Consistent: all clients see the same data, no matter which replica they read
- Defined: after a mutation, all clients see exactly what the mutation wrote
- Ordering of concurrent mutations
- For each chunk's replica set, the Master grants one replica a primary lease
- The primary replica decides the order of mutations and sends it to the other replicas
10. Anatomy of a Mutation
- (1, 2) Client gets chunkserver locations from the master
- (3) Client pushes data to the replicas, in a chain
- (4) Client sends the write request to the primary; the primary assigns a sequence number to the write and applies it (sketch below)
- (5, 6) Primary tells the other replicas to apply the write; they acknowledge
- (7) Primary replies to the client
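A small sketch of the primary's part of this flow; PrimaryReplica, apply_locally and the secondaries' apply method are placeholders for the real chunkserver logic:

    class PrimaryReplica:
        def __init__(self, secondaries):
            self.secondaries = secondaries   # the other replicas of this chunk
            self.next_serial = 0             # mutation order for this chunk

        def apply_locally(self, data_id, serial):
            pass  # placeholder: write the already-pushed data in serial order

        def handle_write(self, data_id):
            # The data was already pushed to all replicas in a chain (step 3),
            # so only small control messages flow from here on.
            serial = self.next_serial        # step 4: choose the order
            self.next_serial += 1
            self.apply_locally(data_id, serial)
            acks = [s.apply(data_id, serial)           # steps 5 and 6
                    for s in self.secondaries]
            return "ok" if all(acks) else "error"      # step 7: reply to client

Decoupling the bulky data push (step 3) from the small control messages lets the data flow along a chain chosen for network topology, independent of which replica is primary.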
11. Connection with the Consistency Model
- A secondary replica encounters an error while applying the write (step 5): the region becomes Inconsistent.
- Client code breaks a single large write into multiple small writes, which can interleave with other clients' writes: the region is Consistent, but Undefined.
12. Special Functionality
- Atomic Record Append (sketch below)
- The primary appends the record to its own replica, then tells the other replicas to write at that same offset
- If a secondary replica fails to write the data (step 5):
- duplicates in the successful replicas, padding in the failed ones
- the region is defined where the append succeeded, inconsistent where it failed
- Snapshot
- Copy-on-write: chunks are copied lazily, on the same chunkserver that holds the original replica
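A toy model of the record-append semantics, with a simulated secondary failure to show where duplicates and padding come from; the Replica class is purely illustrative:

    class Replica:
        def __init__(self, fail=False):
            self.data = {}        # offset -> record
            self.fail = fail

        def write_at(self, offset, record):
            if self.fail:
                return False      # step-5 failure: this replica keeps padding here
            self.data[offset] = record
            return True

    def record_append(primary, secondaries, record, chunk_end):
        # The primary picks the offset (the current end of the chunk), writes
        # locally, then tells the secondaries to write at that same offset.
        offset = chunk_end
        primary.write_at(offset, record)
        ok = all(s.write_at(offset, record) for s in secondaries)
        # On failure the client retries, so replicas that succeeded may hold
        # the record twice (duplicates) while failed ones hold padding here.
        return offset if ok else None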
13. Master Internals
- Namespace management
- Replica Placement
- Chunk Creation, Re-replication, Rebalancing
- Garbage Collection
- Stale Replica Detection
14. Dealing with Faults
- High availability
- Fast master and chunkserver recovery
- Chunk replication
- Master state replication; read-only shadow masters
- Data Integrity (sketch below)
- Each chunk is broken into 64 KB blocks, each with a 32-bit checksum
- Checksums are kept in memory and logged to disk
- Checksumming is optimized for appends: the checksum of the last partial block is updated incrementally, without re-verifying it
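A minimal sketch of per-block checksumming, using CRC-32 as a stand-in for whatever 32-bit checksum GFS actually computes:

    import zlib

    BLOCK = 64 * 1024     # each chunk is checksummed in 64 KB blocks

    def block_checksums(chunk_bytes):
        # One 32-bit checksum per 64 KB block, kept in memory and logged.
        return [zlib.crc32(chunk_bytes[i:i + BLOCK])
                for i in range(0, len(chunk_bytes), BLOCK)]

    def verify_block(chunk_bytes, checksums, block_index):
        # A chunkserver verifies the covered blocks before returning data, so
        # corruption is never propagated to clients or other chunkservers.
        block = chunk_bytes[block_index * BLOCK:(block_index + 1) * BLOCK]
        return zlib.crc32(block) == checksums[block_index]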
15. Micro-benchmarks
16. Storage Data for Real Clusters
17. Performance
18. Workload Breakdown
- % of operations for a given size
- % of bytes transferred for a given operation size
19. GFS Conclusion
- Very application-specific: more engineering than research
20. PAST
- Internet-based P2P global storage utility
- Strong persistence
- High availability
- Scalability
- Security
- Not a conventional FS
- Files have unique id
- Clients can insert and retrieve files
- Files are immutable
21. PAST Operations
- Nodes have random, unique nodeIds
- No searching, directory lookup, or key distribution
- Supported operations (see the sketch below):
- Insert(name, key, k, file) → fileId
- Stores the file on the k nodes whose nodeIds are numerically closest to fileId
- Lookup(fileId) → file
- Reclaim(fileId, key)
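A sketch of how a fileId is derived and where a file lands, assuming (as in the PAST paper) a 160-bit SHA-1 hash over the file name, the owner's public key and a random salt; closest_nodes is a hypothetical helper standing in for Pastry routing:

    import hashlib

    def make_file_id(name, owner_pubkey, salt):
        # fileId = secure hash of the file name, the owner's public key and a salt.
        h = hashlib.sha1()
        h.update(name.encode())
        h.update(owner_pubkey)
        h.update(salt)
        return int.from_bytes(h.digest(), "big")   # 160-bit identifier

    def insert(file_id, file_bytes, k, closest_nodes):
        # Store the file on the k nodes whose nodeIds are numerically closest
        # to fileId; closest_nodes(file_id, k) is an assumed routing helper.
        for node in closest_nodes(file_id, k):
            node.store(file_id, file_bytes)
        return file_id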
22. Pastry
- P2P routing substrate
- route(key, msg) delivers the message to the node with nodeId numerically closest to key in fewer than log_{2^b}(N) steps
- Per-node state: (2^b - 1) * log_{2^b}(N) routing-table entries + 2l (see the calculation below)
- b controls the tradeoff between per-node state and lookup cost
- l controls failure tolerance: delivery is guaranteed unless l/2 nodes with adjacent nodeIds fail simultaneously
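To make the state/hops tradeoff concrete, a small calculation of the per-node state and routing bound exactly as stated on the slide:

    import math

    def pastry_state_entries(n_nodes, b, l):
        # Per-node state as on the slide: (2^b - 1) * log_{2^b}(N) routing-table
        # entries plus 2l for the leaf set.
        rows = math.ceil(math.log(n_nodes, 2 ** b))
        return (2 ** b - 1) * rows + 2 * l

    def routing_hops_bound(n_nodes, b):
        # Expected routing cost: fewer than log_{2^b}(N) hops.
        return math.ceil(math.log(n_nodes, 2 ** b))

    # With b = 4, 2250 nodes (the setup on a later slide) and an assumed l = 16:
    # pastry_state_entries(2250, 4, 16) -> 15 * 3 + 32 = 77 entries
    # routing_hops_bound(2250, 4)       -> 3 hops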
23. Routing Table of Node 10233102 (example)
- Leaf set: l/2 numerically larger and l/2 numerically smaller nodeIds
- Routing table entries, one row per shared-prefix length (see the routing sketch below)
- Neighborhood set: the M closest nodes by the proximity metric
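A simplified sketch of how one routing step uses this table, assuming nodeIds and keys are equal-length strings of base-2^b digits (hex digits for b = 4) and ignoring the leaf-set and neighborhood-set cases:

    def shared_prefix_len(a, b):
        # Number of leading digits the two ids have in common.
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def next_hop(key, node_id, routing_table):
        # routing_table[r][d] holds a nodeId that shares r digits with node_id
        # and whose (r+1)-th digit is d, or None if no such node is known.
        r = shared_prefix_len(key, node_id)
        if r == len(node_id):
            return node_id                    # this node is responsible for the key
        entry = routing_table[r][int(key[r], 16)]
        return entry if entry is not None else node_id   # fallback; leaf set omitted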
24. PAST Operations / Security
- Insert
- A certificate is created with the fileId, a hash of the file content, and the replication factor, signed with the owner's private key (sketch below)
- The file and certificate are routed through Pastry
- The first node among the k closest accepts the file and forwards it to the other k-1
- Security: smartcards
- Hold public/private key pairs
- Generate and verify certificates
- Ensure integrity of nodeId and fileId assignments
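A sketch of the file certificate built during Insert; smartcard_sign stands in for the signature produced by the owner's smartcard, and the field layout is illustrative rather than the exact PAST format:

    import hashlib

    def make_file_certificate(file_id, content, k, smartcard_sign):
        # The certificate binds the fileId, a hash of the file content and the
        # replication factor k; the smartcard signs it with the private key.
        cert = {
            "fileId": file_id,
            "content_hash": hashlib.sha1(content).hexdigest(),
            "replicas": k,
        }
        cert["signature"] = smartcard_sign(repr(sorted(cert.items())).encode())
        return cert

Each of the k storing nodes can check the certificate against the owner's public key and the received content before accepting the replica.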
25. Storage Management
- Design Goals
- High global storage utilization
- Graceful degradation near max utilization
- PAST tries to
- Balance free storage space amongst nodes
- Maintain the invariant that replicas reside on the k nodes closest to the fileId
- Storage Load Imbalance
- Variance in number of files assigned to node
- Variance in size distribution of inserted files
- Variance in storage capacity of PAST nodes
26. Storage Management
- Large-capacity storage nodes are assigned multiple nodeIds
- Replica Diversion
- If node A cannot store a file, a leaf-set node B that is not among the k closest stores it instead, and A keeps a pointer to B
- What if A or B fails? A duplicate pointer is kept at the (k+1)-th closest node
- Policies for diverting and accepting replicas: t_pri and t_div thresholds on file size / free space (see the sketch below)
- File Diversion
- If the insert still fails, the client retries with a different fileId
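A minimal sketch of the threshold test, assuming the policy is "reject a file whose size divided by the node's free space exceeds t", with the stricter threshold t_div applied to diverted replicas; the default values follow the later slides:

    def accepts(file_size, free_space, diverted, t_pri=0.1, t_div=0.05):
        # A node rejects a file that is too large relative to its remaining
        # free space. Diverted replicas face the stricter threshold
        # (t_div < t_pri), so nodes are not filled up by other nodes' overflow.
        t = t_div if diverted else t_pri
        return free_space > 0 and (file_size / free_space) <= t

If the file is rejected even after replica diversion, the insert fails and the client falls back to file diversion with a new fileId.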
27. Storage Management
- Maintaining the replication invariant
- On node failures and joins
- Caching
- k-replication in PAST is for availability
- Extra copies are cached to reduce client latency and network traffic
- Unused disk space is used for the cache
- GreedyDual-Size replacement policy (sketch below)
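A compact sketch of GreedyDual-Size with unit cost, which captures the shape of the policy named above; PAST's exact cost function is not spelled out on the slide:

    def gds_value(size, inflation, cost=1.0):
        # GreedyDual-Size assigns each cached file H = L + cost/size, where L
        # is a running "inflation" value; a larger H means keep longer, so
        # small (cheap-to-refetch) files are favoured.
        return inflation + cost / size

    def evict_one(cache):
        # cache: file_id -> (size, H). Evict the entry with the lowest H; its H
        # becomes the new inflation value L for subsequently inserted files.
        victim = min(cache, key=lambda f: cache[f][1])
        size, h = cache.pop(victim)
        return victim, size, h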
28. Performance
- Workloads
- 8 web proxy logs
- Combined file systems
- k = 5, b = 4
- # of nodes: 2250
- Without replica and file diversion:
- 51.1% of insertions failed
- 60.8% global utilization
- 4 normal distributions of node storage sizes
29. Effect of Storage Management
30. Effect of t_pri
Lower t_pri: better utilization, but more insertion failures
(t_div = 0.05, t_pri varied)
31. Effect of t_div
Trend similar to t_pri
(t_pri = 0.1, t_div varied)
32. File and Replica Diversions
Ratio of replica diversions vs utilization
Ratio of file diversions vs utilization
33. Distribution of Insertion Failures
File system trace
Web logs trace
34. Caching
35. Conclusion