Taming Aggressive Replication in the Pangaea Wide-area File System

Provided by: Jas135

Learn more at: https://people.eecs.berkeley.edu


1
Taming Aggressive Replication in the Pangaea
Wide-area File System

Y. Saito, C. Karamanolis, M. Karlsson, M. Mahalingam
Presented by Jason Waddle

2
Pangaea Wide-area File System

Support the daily storage needs of distributed
users.
Enable ad-hoc data sharing.

3
Pangaea Design Goals

Speed
Hide wide-area latency; file access time close to a local
file system
Availability and autonomy
Avoid single point-of-failure
Adapt to churn
Network economy
Minimize use of wide-area network
Exploit physical locality

4
Pangaea Assumptions (Non-goals)

Servers are trusted
Weak data consistency is sufficient (consistency
in seconds)

5
Symbiotic Design
6
Symbiotic Design
Autonomous
Each server operates when disconnected from
network.
7
Symbiotic Design
Autonomous
Each server operates when disconnected from
network.
Cooperative
When connected, servers cooperate to enhance
overall performance and availability.
8
Pervasive Replication

Replicate at file/directory level
Aggressively create replicas whenever a file or
directory is accessed
No single master replica
A replica may be read / written at any time
Replicas exchange updates in a peer-to-peer
fashion

9
Graph-based Replica Management

Replicas connected in a sparse, strongly-
connected, random graph
Updates propagate along edges
Edges used for discovery and removal

10
Benefits of Graph-based Approach

Inexpensive
Graph is sparse; adding/removing replicas is O(1)
Available update distribution
As long as graph is connected, updates reach
every replica
Network economy
High connectivity for close replicas; build
spanning tree along fast edges

11
Optimistic Replica Coordination

Aim for maximum availability over strong
data-consistency
Any node issues updates at any time
Update transmission and conflict resolution
in background

12
Optimistic Replica Coordination

Eventual consistency (on the order of 5 s in tests)
No strong consistency guarantees: no support for
locks, lock-files, etc.

13
Pangaea Structure
Region (< 5 ms RTT)
Server or Node
14
Server Structure
[Diagram labels: I/O request (application); NFS client;
NFS protocol handler; Pangaea server; replication engine;
log; membership; inter-node communication;
user space / kernel space boundary]
15
Server Modules

NFS protocol handler
Receives requests from apps, updates local
replicas, generates requests to

16
Server Modules

NFS protocol handler
Receives requests from apps, updates local
replicas, generates requests to
Replication engine
Accepts local and remote requests
Modifies replicas
Forwards requests to other nodes

17
Server Modules

NFS protocol handler
Receives requests from apps, updates local
replicas, generates requests to
Replication engine
Accepts local and remote requests
Modifies replicas
Forwards requests to other nodes
Log module
Transaction-like semantics for local updates

18
Server Modules

Membership module maintains
List of regions, their members, estimated RTT
between regions
Location of root directory replicas
Information coordinated by gossiping
Landmark nodes bootstrap newly joining nodes

Maintaining RTT information is the main scalability
bottleneck
19
File System Structure

Gold replicas
Listed in directory entries
Form clique in replica graph
Fixed number (e.g., 3)
All replicas (gold and bronze)
Unidirectional edges to all gold replicas
Bidirectional peer-edges
Backpointer to parent directory

20
File System Structure
/joe
/joe/foo
21
File System Structure
struct Replica
  fid: FileID
  ts: TimeStamp
  vv: VersionVector
  goldPeers: Set(NodeID)
  peers: Set(NodeID)
  backptrs: Set(FileID, String)
struct DirEntry
  fname: String
  fid: FileID
  downlinks: Set(NodeID)
  ts: TimeStamp
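The two records above map naturally onto small data classes. The following is only an illustrative sketch: the concrete Python types (strings for node IDs, integers for file IDs, floats for timestamps) and the example values for /joe/foo are assumptions, not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

NodeID = str   # assumed representation
FileID = int   # assumed representation


@dataclass
class Replica:
    fid: FileID
    ts: float = 0.0                                         # timestamp of the last update
    vv: Dict[NodeID, int] = field(default_factory=dict)     # version vector
    gold_peers: Set[NodeID] = field(default_factory=set)    # edges to all gold replicas
    peers: Set[NodeID] = field(default_factory=set)         # bidirectional peer-edges
    backptrs: Set[Tuple[FileID, str]] = field(default_factory=set)  # (parent dir, name)


@dataclass
class DirEntry:
    fname: str
    fid: FileID
    downlinks: Set[NodeID] = field(default_factory=set)     # nodes holding gold replicas
    ts: float = 0.0


# Hypothetical example: file "foo" inside directory /joe.
JOE_DIR, FOO = 1, 2
golds = {"n1", "n2", "n3"}
foo = Replica(fid=FOO, gold_peers=set(golds), backptrs={(JOE_DIR, "foo")})
entry = DirEntry(fname="foo", fid=FOO, downlinks=set(golds))
```

The directory entry lists the gold replicas, while the replica's backpointer names its parent directory, so either side can find the other.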
22
File Creation

Select locations for g gold replicas (e.g., g = 3)
One on current server
Others on random servers from different regions
Create entry in parent directory
Flood updates
Update to parent directory
File contents (empty) to gold replicas
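A minimal sketch of the placement step, assuming a hypothetical `region_of` map from server name to region name. It reads "different regions" as "regions not already used by an earlier gold replica", which is an interpretation rather than something the slide states.

```python
import random
from typing import Dict, List, Optional


def choose_gold_locations(current: str,
                          region_of: Dict[str, str],
                          g: int = 3,
                          rng: Optional[random.Random] = None) -> List[str]:
    """Pick g servers: the current one plus random servers in other regions."""
    rng = rng or random.Random()
    chosen = [current]
    used_regions = {region_of[current]}
    candidates = [n for n in region_of if n != current]
    rng.shuffle(candidates)
    for node in candidates:
        if len(chosen) == g:
            break
        if region_of[node] not in used_regions:
            chosen.append(node)
            used_regions.add(region_of[node])
    # May return fewer than g servers if there are not enough regions.
    return chosen
```

After the locations are chosen, the new directory entry and the empty file contents would be flooded as the slide describes; that part is not modeled here.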

23
Replica Creation

Recursively get replicas for ancestor directories
Find a close replica (shortcutting)
Send request to the closest gold replica
Gold replica forwards request to its neighbor
closest to requester, who then sends

24
Replica Creation

Select m peer-edges (e.g., m = 4)
Include a gold replica (for future shortcutting)
Include closest neighbor from a random gold
replica
Get remaining nodes from random walks starting at
a random gold replica
Create m bidirectional peer-edges
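The selection steps above can be sketched as follows. The graph representation (a dict of neighbor sets), the `rtt` map of estimated round-trip times from the new node, the walk length, and the retry cap are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional, Set


def random_walk(graph: Dict[str, Set[str]], start: str, steps: int,
                rng: random.Random) -> str:
    """Follow `steps` random edges from `start` and return the end node."""
    node = start
    for _ in range(steps):
        nbrs = sorted(graph[node])
        if not nbrs:
            break
        node = rng.choice(nbrs)
    return node


def select_peers(graph: Dict[str, Set[str]], golds: Set[str],
                 rtt: Dict[str, float], m: int = 4, walk_len: int = 3,
                 rng: Optional[random.Random] = None) -> List[str]:
    rng = rng or random.Random()
    peers: List[str] = []

    def add(node: str) -> None:
        if node not in peers:
            peers.append(node)

    gold_list = sorted(golds)
    add(rng.choice(gold_list))                        # a gold replica, for shortcutting
    nbrs = sorted(graph[rng.choice(gold_list)])
    if nbrs:
        add(min(nbrs, key=lambda n: rtt[n]))          # closest neighbor of a random gold
    attempts = 0
    while len(peers) < m and attempts < 100:          # remaining peers via random walks
        add(random_walk(graph, rng.choice(gold_list), walk_len, rng))
        attempts += 1
    return peers


def connect(graph: Dict[str, Set[str]], new: str, peers: List[str]) -> None:
    """Create bidirectional peer-edges between `new` and each chosen peer."""
    graph[new] = set(peers)
    for p in peers:
        graph[p].add(new)
```

In a small graph the walks may keep landing on already chosen nodes, so this sketch can return fewer than m peers once the retry cap is reached.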

25
Bronze Replica Removal

To recover disk space
Using GD-Size algorithm, throw out largest,
least-accessed replica
Drop useless replicas
Too many updates before an access (e.g., 4)
Must notify peer-edges of removal; peers use
random walk to choose new edge
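For readers unfamiliar with GD-Size (GreedyDual-Size), here is a minimal sketch of the general algorithm with a uniform cost of 1, which favors evicting large replicas that have not been accessed recently. How Pangaea actually parameterizes it is not stated on the slide; the class and method names are hypothetical, and only bronze replicas would be tracked this way.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entry:
    size: int
    h: float   # priority: low values are evicted first


class GDSizeCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        self.inflation = 0.0            # the running "L" value of GreedyDual-Size
        self.entries: Dict[str, Entry] = {}

    def access(self, fid: str, size: int, cost: float = 1.0) -> List[str]:
        """Record an access; insert the replica if absent. Returns evicted ids."""
        if size > self.capacity:
            raise ValueError("replica larger than the whole cache")
        evicted: List[str] = []
        if fid in self.entries:
            # A hit refreshes the priority relative to the current inflation.
            self.entries[fid].h = self.inflation + cost / size
            return evicted
        while self.used + size > self.capacity:
            victim = min(self.entries, key=lambda f: self.entries[f].h)
            self.inflation = self.entries[victim].h
            self.used -= self.entries[victim].size
            del self.entries[victim]
            evicted.append(victim)
        self.entries[fid] = Entry(size, self.inflation + cost / size)
        self.used += size
        return evicted
```

Because priority is cost divided by size, a large replica starts with a low priority, and every eviction raises the baseline so that stale entries eventually fall below newer ones.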

26
Replica Updates

Flood entire file to replica graph neighbors
Updates reach all replicas as long as the graph
is strongly connected
Optional: user can block on update until all
neighbors reply (red-button mode)
Network economy???
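A small sketch of naive flooding makes the network-economy concern concrete: every node forwards the whole file to every neighbor except the one it received it from, so densely connected replicas receive redundant copies. The graph representation is an assumption.

```python
from collections import deque
from typing import Dict, Optional, Set, Tuple


def flood(graph: Dict[str, Set[str]], origin: str) -> Tuple[Set[str], int]:
    """Return the replicas reached and the number of full-file messages sent."""
    seen = {origin}
    queue: deque = deque([(origin, None)])
    messages = 0
    while queue:
        node, sender = queue.popleft()
        for nbr in sorted(graph[node]):
            if nbr == sender:
                continue
            messages += 1                 # sent even if nbr already has the update
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, node))
    return seen, messages
```

On a triangle of three replicas, two useful transfers would suffice, but this scheme sends four.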

27
Optimized Replica Updates

Send only differences (deltas)
Include old timestamp, new timestamp
Only apply delta to replica if old timestamp
matches
Revert to full-content transfer if necessary
Merge deltas when possible
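The timestamp check can be sketched as below. The delta format (overwrite `data` at a byte `offset`) is an assumption chosen for brevity; the slides do not say how deltas are encoded.

```python
from dataclasses import dataclass


@dataclass
class Delta:
    old_ts: int     # timestamp the delta was computed against
    new_ts: int     # timestamp of the replica after applying it
    offset: int
    data: bytes


@dataclass
class LocalReplica:
    ts: int
    content: bytes


def apply_delta(replica: LocalReplica, delta: Delta) -> bool:
    """Apply the delta only if the replica is at the expected old timestamp."""
    if replica.ts != delta.old_ts:
        return False    # caller should fall back to a full-content transfer
    end = delta.offset + len(delta.data)
    replica.content = replica.content[:delta.offset] + delta.data + replica.content[end:]
    replica.ts = delta.new_ts
    return True
```

A `False` return leaves the replica untouched, which is the point at which the sender would revert to shipping the entire file.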

28
Optimized Replica Updates

Don't send large (e.g., > 1 KB) updates to each of
m neighbors
Instead, use harbingers to dynamically build a
spanning-tree update graph
Harbinger: small message with the update's timestamps
Send updates along spanning-tree edges
Happens in two phases

29
Optimized Replica Updates

Exploit Physical Topology
Before pushing a harbinger to a neighbor, add a
random delay proportional to RTT (e.g., 10 × RTT)
Harbingers propagate down fastest links first
Dynamically builds an update spanning-tree with
fast edges
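The effect of the RTT-proportional delay can be seen in a small event simulation. This is a sketch under assumptions: the delay is drawn uniformly from [0, 10 × RTT], one-way latency is taken as RTT / 2, and each node adopts the first sender it hears from as its parent in the update tree.

```python
import heapq
import itertools
import random
from typing import Dict, Optional


def build_update_tree(rtt: Dict[str, Dict[str, float]], origin: str,
                      delay_factor: float = 10.0,
                      rng: Optional[random.Random] = None) -> Dict[str, Optional[str]]:
    """Simulate harbinger flooding; return each node's parent in the tree."""
    rng = rng or random.Random()
    parent: Dict[str, Optional[str]] = {origin: None}
    events = []                      # (arrival time, tie-breaker, receiver, sender)
    counter = itertools.count()

    def forward(node: str, now: float) -> None:
        for nbr, r in sorted(rtt[node].items()):
            delay = rng.uniform(0.0, delay_factor * r)
            heapq.heappush(events, (now + delay + r / 2, next(counter), nbr, node))

    forward(origin, 0.0)
    while events:
        now, _, node, sender = heapq.heappop(events)
        if node in parent:
            continue                 # duplicate harbinger: ignore it
        parent[node] = sender
        forward(node, now)
    return parent
```

With a fast A-B and B-C link and a slow A-C link, C hears the harbinger through B long before the direct copy from A arrives, so the slow edge stays out of the tree.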

30-35
Update Example (Phase 1)
[Figure sequence: replica graph with nodes A, B, C, D, E, F]
36-38
Update Example (Phase 2)
[Figure sequence: replica graph with nodes A, B, C, D, E, F]
39
Conflict Resolution

Use a combination of version vectors and
last-writer wins to resolve
If timestamps mismatch, full-content is
transferred
Missing update: just overwrite the replica
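A sketch of how version vectors and last-writer-wins can be combined: if one vector dominates, the other replica merely missed updates and is overwritten; only truly concurrent versions fall back to timestamps. The `Version` tuple and the tie-breaking rule (prefer the first argument on equal timestamps) are assumptions for illustration.

```python
from typing import Dict, NamedTuple


class Version(NamedTuple):
    vv: Dict[str, int]   # version vector: node id -> update counter
    ts: float            # update timestamp (needs synchronized clocks)
    content: bytes


def compare(vv_a: Dict[str, int], vv_b: Dict[str, int]) -> str:
    nodes = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(n, 0) > vv_b.get(n, 0) for n in nodes)
    b_ahead = any(vv_b.get(n, 0) > vv_a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"


def resolve(a: Version, b: Version) -> Version:
    order = compare(a.vv, b.vv)
    if order in ("a_newer", "equal"):
        return a
    if order == "b_newer":
        return b
    return a if a.ts >= b.ts else b      # concurrent updates: last writer wins
```

Note that a dominating version wins even when its timestamp is older, since the timestamp is consulted only for concurrent updates.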

40
Regular File Conflict (Three Solutions)

Last-writer-wins, using update timestamps
Requires server clock synchronization
Concatenate both updates
Make the user fix it
Possibly application-specific resolution

41
Directory Conflict
alice: mv /foo /alice/foo
bob: mv /foo /bob/foo
42
Directory Conflict
alice: mv /foo /alice/foo
bob: mv /foo /bob/foo
/bob replica set
/alice replica set
43
Directory Conflict
alice: mv /foo /alice/foo
bob: mv /foo /bob/foo
Let the child (foo) decide!

Implement mv as a change to the file's
backpointer
Single file resolves conflicting updates
File then updates affected directories

44
Temporary Failure Recovery

Log outstanding remote operations
Update, random walk, edge addition, etc.
Retry logged updates
On reboot
On recovery of another node
Can create superfluous edges
Retains m-connectedness

45
Permanent Failures

A garbage collector (GC) scans for failed nodes
Bronze replica on failed node
GC causes the replica's neighbors to replace the link
with a new peer using random walk

46
Permanent Failure

Gold replica on failed node
Discovered by another gold (clique)
Chooses new gold by random walk
Flood choice to all replicas
Update parent directory to contain new gold
replica nodes
Resolve conflicts with last-writer-wins
Expensive!

47
Performance: LAN
Andrew-Tcl benchmarks, time in seconds
48
Performance: Slow Link
The importance of local replicas
49
Performance: Roaming
Compile on C1, then time the compile on C2. Pangaea
utilizes fast links to a peer's replicas.
50
Performance: Non-uniform Net
A model of HP's corporate network.
51
Performance: Non-uniform Net
52
Performance: Update Propagation
Harbinger time is the window of inconsistency.
53
Performance: Large Scale
HP: 3000-node, 7-region HP network. U: 500 regions,
6 nodes per region, 200 ms RTT, 5 Mb/s.
Latency improves with more replicas.
54
Performance: Large Scale
HP: 3000-node, 7-region HP network. U: 500 regions,
6 nodes per region, 200 ms RTT, 5 Mb/s.
Network economy improves with more replicas.
55
Performance: Availability
Numbers in parentheses are relative storage
overhead.
