Title: The Design of a Multicast-based Distributed File System
1The Design of a Multicast-based Distributed File System
- Björn Grönvall
- Assar Westerlund
- Stephen Pink
- Swedish Institute of Computer Science
- Luleå University of Technology
2Introduction
- Targeted at personal computing
- local FS performance
- Designed for ubiquitous file access
- local and wide area networks
- wireless networks, satellite links
- Peer-to-peer multicast communication
- no remote procedure calls
- Clients are servers
3Overview of Presentation
- General approach
- IP multicast
- Scalable Reliable Multicast (SRM)
- JetFile protocol
- Measurements
- Future work
- Summary
4Protocol approach to DFS design
- Attempt to hide the effects of
- propagation delays
- retransmission induced delays
- bandwidth induced delays
- In general, minimize traffic
- If possible, localize traffic
5Methods
- Optimistic algorithms
- Replication and multicast
- Clients act as servers
- Hoarding and prefetching
- are future work
6IP Multicast
- Shared communication channel
- Receivers announce interest
- Sender transmits only once
- Only best-effort
- unreliable delivery
- out of order delivery
7IP Multicast Comm. Channel
Router
Host
Uninterested host
8Scalable Reliable Multicast (SRM)
- By S. Floyd, V. Jacobson, et al.
- Layered on top of IP multicast
- Scalable (like IP multicast)
- need not track receiver set
- Minimal definition of reliable
- only eventually reliable
- uses version numbers to detect packet loss
9SRM, Requests and Repairs
- Receiver-oriented protocol
- Receiver makes multicast request
- If you have the requested data
- set randomized timer
- if someone responds, cancel timer
- if timer expires, multicast repair
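The repair rule above can be sketched as a toy simulation. This is not SRM itself: the host names, the uniform timer range, and the assumption that every holder hears the first repair before its own timer fires are all illustrative.

```python
import random

def simulate_repair(holders, max_delay=1.0, seed=None):
    """Return (responder, delay, suppressed) for one multicast request.

    holders: names of hosts that have the requested data (hypothetical).
    Assumes the first repair reaches all other holders before their
    timers expire, so exactly one repair is sent.
    """
    rng = random.Random(seed)
    # Each holder sets a randomized timer when it sees the request.
    timers = {host: rng.uniform(0.0, max_delay) for host in holders}
    # The holder whose timer expires first multicasts the repair.
    responder = min(timers, key=timers.get)
    # Everyone else hears that repair and cancels its own timer.
    suppressed = sorted(h for h in holders if h != responder)
    return responder, timers[responder], suppressed

responder, delay, suppressed = simulate_repair(["a", "b", "c"], seed=1)
```

In a real network, repairs can cross in flight, so duplicates are reduced rather than eliminated.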
10JetFile Bird's-eye view
File managers (clients)
Storage server
Versioning servers
Network
Key server
11The JetFile instantiation of SRM
- Files are versioned
- Files are named using tuples (organization, volume, file number, version)
- Hash (organization, volume, file number) to multicast channel
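A minimal sketch of such a mapping follows. The slides do not say which hash function or address range JetFile uses, so SHA-256 and the administratively scoped range 239.0.0.0/8 are assumptions, as are the function names. The point shown is that the version is left out of the hash, so all versions of a file share one channel.

```python
import hashlib
import ipaddress

# Assumption: channels come from the administratively scoped IPv4 range
# 239.0.0.0/8; the slides do not name the range JetFile uses.
BASE = int(ipaddress.IPv4Address("239.0.0.0"))

def channel_for(organization, volume, file_number):
    """Map a file name (without its version) to a multicast address."""
    key = f"{organization}/{volume}/{file_number}".encode()
    digest = hashlib.sha256(key).digest()
    # Use 24 bits of the digest as the offset within 239.0.0.0/8.
    offset = int.from_bytes(digest[:3], "big")
    return str(ipaddress.IPv4Address(BASE + offset))

def channel_for_name(name):
    """name is the full tuple (organization, volume, file_number, version)."""
    organization, volume, file_number, _version = name
    return channel_for(organization, volume, file_number)
```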
12The Basic JetFile Protocol
- Deals with data units
- status-object
- data-object
- Status-request, status-repair
- Data-request, data-repair
- additional name (offset, length)
- Version-request, version-repair
13File Contents Retrieval
- Receiver multicasts initial data-request
- Source multicasts initial data-repair
- Receiver now knows a source for data
- remaining data transferred using same protocol
but with unicast requests and repairs
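The retrieval steps above might be planned as follows. This is a sketch under assumptions: the chunk size, the function name, and the tuple layout are hypothetical, and only the (offset, length) naming of data units comes from the slides.

```python
def plan_requests(file_size, chunk=8192):
    """List (transport, offset, length) data-requests for one file.

    The first request is multicast; once its data-repair reveals a
    source, the remaining ranges are requested from it by unicast.
    The chunk size is an illustrative assumption.
    """
    plan = []
    offset = 0
    while offset < file_size:
        length = min(chunk, file_size - offset)
        transport = "multicast" if offset == 0 else "unicast"
        plan.append((transport, offset, length))
        offset += length
    return plan
```

For a 20000-byte file this yields one multicast request followed by two unicast requests.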
14File Updates
- Write-on-close semantics
- File update implies new version number
- Commit file after first request only
- no sharing => no immediate commit
- reuse same version number for multiple updates
- Client is server for new file version
- avoid write-through
- no sharing => no communication
15New Version Numbers
- Versioning server assigns new version numbers to order file updates
- concurrently updated files get different version numbers
- unexpected version number => update conflict
- data is never lost
- Best-effort multicast callbacks
- version-request is a callback
- version-repair is a callback
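The ordering and conflict rules above can be illustrated with a toy versioning server. The class and function names are hypothetical, and treating "unexpected" as "anything other than base version + 1" is an assumption about how a client would check.

```python
class VersioningServer:
    """Toy stand-in for a versioning server (names are hypothetical)."""

    def __init__(self):
        self.current = {}  # file id -> latest assigned version

    def new_version(self, file_id):
        """Assign the next version number, which orders updates to file_id."""
        version = self.current.get(file_id, 0) + 1
        self.current[file_id] = version
        return version

def is_conflict(base_version, assigned_version):
    """Assumption: an update based on version N expects to receive N + 1.
    Any other number means another update was ordered in between."""
    return assigned_version != base_version + 1

server = VersioningServer()
server.current["X"] = 5
first = server.new_version("X")   # update by one client, based on 5
second = server.new_version("X")  # concurrent update, also based on 5
```

Both updates survive under distinct version numbers; only the second client sees an unexpected number and learns of the conflict.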
16Versioning example 1
(Figure: message timeline between File manager, Network, and Versioning server.)
- File manager: open(X), write(X), write(X), close(X)
- Versioning server: version is 5; new version (6) is assigned and returned; version is 6
17Versioning example 2
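(Figure: message timeline between File manager, Network, and Versioning server, with a message lost in the network.)
- File manager: open(X), write(X), write(X), close(X); local version is -6; open(X), version -6 is opened; read(X), close(X)
- Network: lost
- Versioning server: version is 5; new version (6) is assigned and returned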
18Current Table
- Current file version numbers can be retrieved from this table
- Distributed with SRM
- table lifetime limits file staleness
- table circulates until lifetime expires
- new table is generated by versioning server
- File consistency usually as good as AFS
- can be as poor as NFS
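One way to picture how the table lifetime bounds staleness is the sketch below. The class, its fields, and the three-valued answer are hypothetical; the slides only state that the table has a lifetime and is regenerated by the versioning server.

```python
import time

class CurrentTable:
    """Snapshot of current version numbers with a limited lifetime."""

    def __init__(self, versions, lifetime, issued_at=None):
        self.versions = dict(versions)  # file id -> current version
        self.lifetime = lifetime        # seconds; the value is an assumption
        self.issued_at = time.time() if issued_at is None else issued_at

    def expired(self, now=None):
        now = time.time() if now is None else now
        return now - self.issued_at >= self.lifetime

    def is_stale(self, file_id, cached_version, now=None):
        """True or False if the table can decide; None if the table has
        expired or does not list the file, so a newer table is needed."""
        if self.expired(now) or file_id not in self.versions:
            return None
        return cached_version < self.versions[file_id]
```

A cached file can then be trusted for at most one table lifetime without any further communication.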
19Andrew Benchmark, Hot Caches
Phase      UFS    JetFile/LAN
MakeDir    1.55    1.22
CopyAll    2.68    1.56
StatAll    2.60    2.59
ReadAll    4.99    5.01
Compile   11.16   11.05
Sum       22.98   21.43
20Andrew Benchmark, Hot Caches
- Performs as expected
- performance similar to local FS
- file and dir creation different
- No synchronous network communication
- File creation decoupled from server
- independent of delays
21Andrew Benchmark, Cold Caches
Phase      LAN    E-WAN (rtt 0.5s)
MakeDir    1.25    1.28
CopyAll    3.71   50.86
StatAll    2.60    2.58
ReadAll    5.01    5.02
Compile   11.08   11.04
Sum       23.65   70.79
22Andrew Benchmark, Cold Caches
- Only CopyAll requires synchronous comm.
- Emulated WAN has round trip time 0.5s
- takes 0.5s to retrieve a one-byte file!
- must fetch files before reference to get
reasonable performance (future work)
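As a rough check on the CopyAll figures above, the extra time on the emulated WAN can be converted into round trips. This assumes the benchmark times are in seconds and that the whole difference is spent waiting on synchronous round trips, neither of which the slides state outright.

```python
lan_copyall = 3.71    # CopyAll on the LAN, cold cache
ewan_copyall = 50.86  # CopyAll on the emulated WAN, cold cache
rtt = 0.5             # emulated round trip time in seconds

extra = ewan_copyall - lan_copyall
round_trips = extra / rtt
print(f"extra time: {extra:.2f} s, about {round_trips:.0f} round trips")
```

Under those assumptions the phase waits on roughly 94 round trips, which is why fetching files before they are referenced matters.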
23Future Work
- Storage server
- Security and Privacy
- Hoarding and Prefetching
- Coda, SEER
- Utilize IP Multicast scope
24Summary
- Performance similar to local FS (hot cache)
- even when writing
- Protocol approach to DFS design
- SRM
- Clients are servers
25More information
- bg, assar, steve_at_sics.se
- http://www.sics.se/cna/dist_app.html
26Multicast address space usage
- Routers must manage a large number of multicast addresses
- Reduce address usage by
- hashing many files to same multicast address
- must filter out unwanted traffic
- potentially wastes bandwidth
- Use wakeup messages to reduce number of used
multicast addresses
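When many files share one multicast address, each receiver has to discard traffic for files it does not care about. A minimal sketch of such a filter follows; the function names and the tuple representation of names are assumptions made for illustration.

```python
def make_filter(interesting):
    """Build a packet filter for one host.

    interesting: iterable of (organization, volume, file_number) tuples
    this host cares about (hypothetical representation).
    """
    wanted = set(interesting)

    def accept(packet_name):
        # packet_name is the full name, including the version.
        organization, volume, file_number, _version = packet_name
        return (organization, volume, file_number) in wanted

    return accept

accept = make_filter([("sics", 1, 42)])
```

Rejected packets have already crossed the network, which is the bandwidth cost the slide refers to.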
27Andrew Benchmark over LAN
File sys.  Warm cache     Cold cache
UFS        22.98, 100%    N/A
JetFile    21.43,  93%    23.65, 103%
AFS        26.49, 115%    28.40, 124%
NFS        29.54, 129%    30.20, 131%
28IP Multicast
- Channel/group represented by IP-address
- Single sender - multiple receivers
- Sender transmits only once
- efficient bandwidth utilization
- Only best effort
- packets are not ACKed
- scales well to large groups
29IP Multicast, cont.
- Routers conspire to deliver packets to networks with receivers only
- Must Join a multicast group to receive
- graft distribution tree
- Explicit Leave to leave group
- prune distribution tree
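On a host, join and leave are typically expressed as socket options. The sketch below uses Python's standard `socket` module; the helper names and the example group address are assumptions, and the grafting and pruning themselves are done by the routers, not by this code.

```python
import socket
import struct

def membership_request(group, interface="0.0.0.0"):
    """Build the ip_mreq structure: group address plus local interface."""
    return struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton(interface))

def join(sock, group):
    # Join: the host reports membership so routers can graft this
    # network onto the distribution tree.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))

def leave(sock, group):
    # Explicit leave: routers may prune the tree if no receivers remain.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP,
                    membership_request(group))
```

A receiver would create a UDP socket, bind it to the chosen port, and call `join(sock, "239.1.2.3")` before reading; the address here is only an example.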
30Scalable Reliable Multicast (SRM)
- Low overhead (like IP multicast)
- need not track receiver set
- no ACKs nor NACKs are used
- Application Level Framing (ALF) oriented
- protocol extends into application
- communication performed in application-defined data units
- application responsible for error recovery,
31SRM, Requirements
- All data units have persistent names
- names can't be reused
- The name always refers to the same data
- new data implies new version number
- Detect missing data through gaps in the version number sequence
- only eventually reliable
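Gap detection can be sketched as a small receiver that tracks the highest version seen. The class name is hypothetical, and starting the numbering at 1 is an assumption.

```python
class Receiver:
    """Tracks which version numbers are known to be missing."""

    def __init__(self):
        self.highest = 0      # assumption: version numbers start at 1
        self.missing = set()

    def on_data(self, version):
        """Record an arriving data unit; return the sorted missing versions."""
        if version > self.highest:
            # Every number skipped over is a detected loss.
            self.missing.update(range(self.highest + 1, version))
            self.highest = version
        else:
            # A late packet or a repair fills a known gap.
            self.missing.discard(version)
        return sorted(self.missing)
```

A lost unit is noticed only when a later one arrives, which is one reason delivery is only eventually reliable.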
32SRM, Request and Repair Messages
- Receiver oriented protocol. Request is multicast,
- One message can repair many hosts
- no selective retransmits
- avoids multiple requests
- can improve repair response time
33SRM, Duplicate Suppression
- Uses randomized timers to suppress simultaneous
- Deterministic suppression
- closest host responds first
- Probabilistic suppression
- de-synchronize hosts at similar distances
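Both kinds of suppression can come from a single timer drawn from an interval that scales with distance, as in the SRM design. In the sketch below the constants `d1` and `d2` and the function name are illustrative assumptions, not values taken from the slides.

```python
import random

def repair_delay(distance, d1=1.0, d2=1.0, rng=random):
    """Draw a repair timer from [d1*distance, (d1+d2)*distance].

    distance: estimated one-way delay to the requester, in seconds.
    The lower bound grows with distance, so nearer hosts tend to fire
    first (deterministic suppression). The random width spreads out
    hosts at similar distances (probabilistic suppression).
    """
    return rng.uniform(d1 * distance, (d1 + d2) * distance)

rng = random.Random(0)
near = repair_delay(0.01, rng=rng)  # host 10 ms from the requester
far = repair_delay(0.10, rng=rng)   # host 100 ms from the requester
```

With these constants a host ten times closer always fires first, because its whole interval ends before the farther host's interval begins.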