Title: Wide-Area Cooperative Storage with CFS
1. Wide-Area Cooperative Storage with CFS
- Robert Morris
- Frank Dabek, M. Frans Kaashoek,
- David Karger, Ion Stoica
- MIT and Berkeley
2. Target CFS Uses
[Diagram: nodes connected through the Internet]
- Serving data with inexpensive hosts
- open-source distributions
- off-site backups
- tech report archive
- efficient sharing of music
3. How to mirror open-source distributions?
- Multiple independent distributions
- Each has high peak load, low average
- Individual servers are wasteful
- Solution: aggregate
- Option 1: single powerful server
- Option 2: distributed service
- But how do you find the data?
4. Design Challenges
- Avoid hot spots
- Spread storage burden evenly
- Tolerate unreliable participants
- Fetch speed comparable to whole-file TCP
- Avoid O(participants) algorithms
- Centralized mechanisms (Napster), broadcasts (Gnutella)
- CFS solves these challenges
5. Why Blocks Instead of Files?
- Cost: one lookup per block
- Can tailor cost by choosing good block size
- Benefit: load balance is simple
- For large files
- Storage cost of large files is spread out
- Popular files are served in parallel
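A minimal sketch (illustrative Python, not CFS's actual code) of how a publisher could split a file into fixed-size blocks named by the SHA-1 hash of their contents; the 8 KByte block size matches the evaluation setup later in the talk:

    import hashlib

    BLOCK_SIZE = 8 * 1024  # 8 KByte blocks, as in the evaluation setup

    def split_into_blocks(data: bytes):
        """Return a list of (block_id, block_bytes) pairs; block_id = SHA-1 of contents."""
        blocks = []
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            blocks.append((hashlib.sha1(block).hexdigest(), block))
        return blocks

Because block IDs are content hashes, the blocks of a large file scatter uniformly over the ID space, which is what makes the load balancing above simple.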
6. The Rest of the Talk
- Software structure
- Chord distributed hashing
- DHash block management
- Evaluation
7CFS Architecture
client
server
client
server
Internet
node
node
- Each node is a client and a server (like xFS)
- Clients can support different interfaces
- File system interface
- Music key-word search (like Napster and Gnutella)
8. Client-server interface
[Diagram: an FS client inserts and looks up file f; the servers insert and look up individual blocks]
- Files have unique names
- Files are read-only (single writer, many readers)
- Publishers split files into blocks
- Clients check files for authenticity
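A minimal sketch (assumed helper names) of the authenticity check: since a block's ID is the SHA-1 hash of its contents, a client can verify whatever a server returns without trusting that server:

    import hashlib

    def fetch_and_verify(block_id: str, fetch) -> bytes:
        """fetch(block_id) is any function returning candidate block bytes."""
        data = fetch(block_id)
        if hashlib.sha1(data).hexdigest() != block_id:
            raise ValueError("block contents do not match the requested content hash")
        return data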
9. Server Structure
[Diagram: each node (Node 1, Node 2) runs a DHash layer on top of a Chord layer]
- DHash stores, balances, replicates, and caches blocks
- DHash uses Chord [SIGCOMM 2001] to locate blocks
10. Chord Hashes a Block ID to its Successor
[Diagram: circular ID space; e.g. N10 holds B112, B120, ..., B10; N32 holds B11, B30; N60 holds B33, B40, B52; N80 holds B65, B70; N100 holds B100]
- Nodes and blocks have randomly distributed IDs
- Successor: the node with the next-highest ID
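A minimal sketch (illustrative Python, not Chord's routing code) of mapping a block ID to its successor on the circular ID space:

    import bisect

    def successor(block_id: int, node_ids: list[int]) -> int:
        """Return the first node ID at or after block_id, wrapping around the ring."""
        ring = sorted(node_ids)
        i = bisect.bisect_left(ring, block_id)
        return ring[i % len(ring)]  # wrap to the smallest ID when past the end

    # Examples matching the diagram: block 65 maps to N80, block 112 wraps to N10.
    assert successor(65, [10, 32, 60, 80, 100]) == 80
    assert successor(112, [10, 32, 60, 80, 100]) == 10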
11. Successor Lists Ensure Robust Lookup
[Diagram: ring of nodes N5, N10, N20, N32, N40, N60, N80, N99, N110; each node stores its next three successors, e.g. N5 -> 10, 20, 32]
- Each node stores r successors, r = 2 log N
- Lookup can skip over dead nodes to find blocks
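A minimal sketch (assumed helpers) of how a lookup can fall through the successor list when nodes have died:

    def first_live_successor(successor_list: list[int], is_alive) -> int:
        """successor_list holds a node's r successor IDs; is_alive(node_id) -> bool."""
        for node in successor_list:
            if is_alive(node):
                return node
        raise RuntimeError("all r successors unreachable; the lookup must be retried")

    # Example from the diagram: if N10 has failed, N5 forwards the lookup to N20.
    assert first_live_successor([10, 20, 32], lambda n: n != 10) == 20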
12. Finger Tables Aid Efficient Lookup
- For an m-bit key space, each node n keeps a finger table with m entries
- The target of each entry grows as a power of two: entry i points to the successor of n + 2^i
- This reduces the number of message exchanges per lookup to O(log N)
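A minimal sketch (illustrative, 0-indexed fingers) of building such a table; each hop via a finger covers at least half the remaining distance to the target, which is where the O(log N) bound comes from:

    import bisect

    def build_finger_table(n: int, node_ids: list[int], m: int) -> list[int]:
        """finger[i] = successor of (n + 2**i) mod 2**m on the ring."""
        ring = sorted(node_ids)

        def succ(ident: int) -> int:
            i = bisect.bisect_left(ring, ident % (2 ** m))
            return ring[i % len(ring)]

        return [succ(n + 2 ** i) for i in range(m)]

    # Tiny 7-bit example ring reusing the node IDs from the earlier diagram.
    print(build_finger_table(32, [10, 32, 60, 80, 100], m=7))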
13. DHash/Chord Interface
[Diagram: the DHash server calls Chord's Lookup(blockID), which returns a list of <node-ID, IP address> pairs; Chord maintains a finger table of <node ID, IP address> entries]
- lookup() returns a list of node IDs closer in ID space to the block ID
- Sorted, closest first
14. Replicate blocks at r successors
[Diagram: ring of nodes; Block 17 is stored at its successor and replicated at the following successors]
- Node IDs are SHA-1 of IP Address
- Ensures independent replica failure
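A minimal sketch (illustrative helpers) of replica placement: a block lives at its successor and at the next r-1 nodes on the ring, and because node IDs come from SHA-1 of the IP address, those replica holders are unrelated machines:

    import bisect
    import hashlib

    def node_id(ip: str, bits: int = 160) -> int:
        """Derive a node's ID from the SHA-1 hash of its IP address."""
        return int.from_bytes(hashlib.sha1(ip.encode()).digest(), "big") % (2 ** bits)

    def replica_holders(block_id: int, node_ids: list[int], r: int) -> list[int]:
        """The block's successor plus the next r-1 nodes on the ring."""
        ring = sorted(node_ids)
        start = bisect.bisect_left(ring, block_id)
        return [ring[(start + k) % len(ring)] for k in range(r)]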
15. DHash Copies to Caches Along Lookup Path
[Diagram: Lookup(BlockID = 45) traverses the ring; RPCs: 1. Chord lookup, 2. Chord lookup, 3. Block fetch, 4. Send to cache]
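A minimal sketch (assumed helpers) of the idea: after a successful fetch, a copy of the block is sent to the nodes contacted during the lookup, so later lookups for a popular block are satisfied before they reach its successor:

    def fetch_with_path_caching(block_id, path_nodes, fetch_from, send_to_cache):
        """path_nodes: the nodes contacted by the Chord lookup, successor last."""
        block = fetch_from(path_nodes[-1], block_id)   # 3. block fetch from the successor
        for node in path_nodes[:-1]:
            send_to_cache(node, block_id, block)       # 4. send copies along the lookup path
        return block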
16. Caching at Fingers Limits Load
[Diagram: node N32 and the fingers that point at it]
- Only O(log N) nodes have fingers pointing to N32
- This limits the single-block load on N32
17. Load Balance with Virtual Nodes
[Diagram: physical hosts A and B each run several virtual nodes (e.g. N5, N10, N60, N101)]
- Hosts may differ in disk/net capacity
- Hosts may advertise multiple IDs
- Chosen as SHA-1(IP Address, index)
- Each ID represents a virtual node
- Host load is proportional to its number of virtual nodes
- Manually controlled; could be made adaptive
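A minimal sketch of deriving virtual-node IDs as SHA-1(IP address, index), as on the slide; a host with twice the disk or network capacity could simply advertise twice as many IDs:

    import hashlib

    def virtual_node_ids(ip: str, count: int, bits: int = 160) -> list[int]:
        """One ID per virtual node, derived from SHA-1 of (IP address, index)."""
        ids = []
        for index in range(count):
            digest = hashlib.sha1(f"{ip},{index}".encode()).digest()
            ids.append(int.from_bytes(digest, "big") % (2 ** bits))
        return ids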
18. Quotas
- Malicious injection of large quantities of data can use up all disk space
- To prevent this, we have quotas for each publisher
- E.g., only 2% of storage space for requests from a particular IP address
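A minimal sketch (illustrative policy; the 2%-per-IP figure is just the slide's example) of the admission check a server might perform before accepting an insert:

    QUOTA_FRACTION = 0.02  # example: each publisher IP may use 2% of local storage

    def admit_insert(used_by_ip: int, block_size: int, disk_capacity: int) -> bool:
        """Accept the insert only if this IP stays within its storage quota."""
        return used_by_ip + block_size <= QUOTA_FRACTION * disk_capacity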
19. Aging and Deletion
- CFS ages out and deletes blocks that have not been refreshed recently
- Publishers must periodically refresh their blocks if they don't want CFS to delete them
20. How Things Work
- Read operation
- To get the 1st block of /foo:
  - get(public key) -> returns the root block
  - Read the content hash of foo's inode from the root block
  - get(hash(foo's inode)) -> returns foo's inode
  - Read the content hash of the 1st block from the inode
  - get(hash(1st block)) -> returns the 1st block
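A minimal sketch (assumed helper names, not the CFS client API) of that read path: the root block is named by the publisher's public key, and every other block is reached through the content hash stored in its parent:

    def read_first_block(get, public_key, root_lookup, inode_block_hash):
        """get(key) fetches a block; the two *_lookup helpers parse block contents."""
        root = get(public_key)                   # root block, signed by the publisher
        inode_hash = root_lookup(root, "/foo")   # content hash of foo's inode
        inode = get(inode_hash)                  # foo's inode block
        first_hash = inode_block_hash(inode, 0)  # content hash of the 1st data block
        return get(first_hash)                   # the 1st block itself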
21. Experimental Setup (12 nodes)
[Map: testbed hosts spread over the Internet, including links to vu.nl, lulea.se, ucl.uk, kaist.kr, and .ve]
- One virtual node per host
- 8Kbyte blocks
- RPCs use UDP
- Caching turned off
- Proximity routing turned off
22. CFS Fetch Time for 1MB File
[Plot: fetch time (seconds) vs. prefetch window (KBytes)]
- Average over the 12 hosts
- No replication, no caching; 8 KByte blocks
23. Distribution of Fetch Times for 1MB
[Plot: fraction of fetches vs. time (seconds) for 8, 24, and 40 KByte prefetch windows]
24. CFS Fetch Time vs. Whole File TCP
[Plot: fraction of fetches vs. time (seconds) for CFS with a 40 KByte prefetch window and for whole-file TCP]
25. CFS Summary
- CFS provides peer-to-peer r/o storage
- Structure: DHash and Chord
- It is efficient, robust, and load-balanced
- It uses block-level distribution
- The prototype is as fast as whole-file TCP
- http://www.pdos.lcs.mit.edu/chord
26. Thank you!