1. Functionally Homogeneous Clustering: A New Architecture for Scalable Data-intensive Internet Services
- Yasushi Saito
- yasushi_at_cs.washington.edu
- University of Washington, Department of Computer Science and Engineering, Seattle, WA
2. Goals
- Use cheap, unreliable hardware components to build scalable data-intensive Internet services.
- Data-intensive Internet services: email, BBS, calendar, etc.
- Three facets of scalability:
  - Performance: linear increase with system size.
  - Manageability: react to changes automatically.
  - Availability: survive failures gracefully.
3. Contributions
- Functionally homogeneous clustering:
  - Dynamic data and function distribution.
  - Exploitation of application semantics.
- Three techniques:
  - Naming and automatic recovery.
  - High-throughput optimistic replication.
  - Load balancing.
- Email as the first target application.
- Evaluation of the architecture using Porcupine.
4. Presentation Outline
- Introduction:
  - What are data-intensive Internet services?
  - Existing solutions and their problems.
- Functionally homogeneous clustering.
- Challenges and solutions:
  - Scaling performance.
  - Reacting to failures and recoveries.
  - Deciding on data placement.
- Conclusion.
5. Data-intensive Internet Services
- Examples: email, Usenet, BBS, calendar, Internet collaboration (photobook, equill.com, crit.org).
- Growing rapidly as demand for personal services grows.
- High update frequency.
- Low access locality.
- Web techniques (caching, stateless data transformation) not effective.
- Weak data consistency requirements.
- Well-defined, structured data access paths.
- Embarrassingly parallel.
- ⇒ RDB is overkill.
6. Rationale for Email
- Email as the first target application.
- Most important among data-intensive services.
- Service concentration (Hotmail, AOL, ...).
- ⇒ Practical demands.
- The most update-intensive.
- No access locality.
- ⇒ Challenging application.
- Prototype implementation: the Porcupine email server.
7. Conventional Solutions: Big Iron
- Just buy a big machine.
- Pros: easy deployment, easy management.
- Cons: limited scalability, single failure domain, really expensive.
8. Conventional Solutions: Clustering
- Connect many small machines.
- Pros: cheap, incremental scalability, natural failure boundaries.
- Cons: software and managerial complexity.
9. Existing Cluster Solutions
- Static partitioning: assign data and functions to nodes statically.
- Management problems:
  - Manual data partitioning.
- Performance problems:
  - No dynamic load balancing.
- Availability problems:
  - Limited fault tolerance.
10. Presentation Outline
- Introduction.
- Functionally homogeneous clustering:
  - Key concepts.
  - Key techniques: recovery, replication, load balancing.
  - Basic operations and data structures.
- Challenges and solutions.
- Evaluation.
- Conclusion.
11. Functionally Homogeneous Clustering
- Clustering is the way to go.
- Static function and data partitioning leads to the problems above.
- So, make everything dynamic:
  - Any node can handle any task (client interaction, user management, etc.).
  - Any node can store any piece of data (email messages, user profiles).
12. Advantages
- Advantages:
  - Better load balance; hot-spot dispersion.
  - Support for heterogeneous clusters.
  - Automatic reconfiguration and task redistribution upon node failure/recovery.
  - Easy node addition/retirement.
- Results:
  - Better performance.
  - Better manageability.
  - Better availability.
13. Challenges
- Dynamic function distribution:
  - Solution: run every function on every node.
- Dynamic data distribution:
  - How are data named and located?
  - How are data placed?
  - How do data survive failures?
14. Key Techniques and Relationships
- [Diagram] Framework: functional homogeneity. Techniques: load balancing, name DB with reconfiguration, replication. Goals: manageability, performance, availability.
15. Porcupine Overview
- [Diagram] Per-node components: replication manager, mail map, email msgs, user profile.
16. Receiving Email in Porcupine
- Functional layers involved: protocol handling, user lookup, load balancing, data store (replication).
- [Diagram] Example flow across nodes A, B, C, D (front end chosen by DNS-RR selection; sketched in code below):
  1. Send mail to bob.
  2. Who manages bob? ⇒ A.
  3. Verify bob.
  4. OK, bob has msgs on C, D, E.
  5. Pick the best nodes to store the new msg ⇒ C, D.
  6. Store msg.
  7. Store msg.
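The following is a minimal, hedged sketch of the delivery path above, with each step stubbed out as a print statement; the function names, node labels, and return values are illustrative assumptions rather than Porcupine's actual interfaces.

```c
/* Hedged sketch of the delivery flow on the slide above.
 * Function names, node labels, and return values are illustrative. */
#include <stdio.h>

/* Step 2: consult the replicated user map to find the node that
 * manages this user's profile and mail map. */
static char lookup_manager(const char *user) {
    printf("who manages %s? -> A\n", user);
    return 'A';
}

/* Steps 3-4: the manager verifies the user and returns the mail map
 * (the set of nodes already holding the user's mailbox fragments). */
static int fetch_mail_map(char manager, const char *user, char *nodes) {
    printf("node %c: verified %s; msgs on C, D\n", manager, user);
    nodes[0] = 'C';
    nodes[1] = 'D';
    return 2;
}

/* Step 5: the load balancer picks the best node(s) for the new msg. */
static char pick_storage_node(const char *nodes, int n) {
    return n > 0 ? nodes[0] : '?';
}

int main(void) {
    const char *user = "bob";   /* step 1: "send mail to bob" arrives via SMTP */
    char nodes[8];
    char manager = lookup_manager(user);
    int n = fetch_mail_map(manager, user, nodes);
    char target = pick_storage_node(nodes, n);
    printf("steps 6-7: store msg on node %c (and its replicas)\n", target);
    return 0;
}
```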
17Basic Data Structures
bob
hash(bob) 2
User map
B
C
A
C
A
B
A
C
B
C
A
C
A
B
A
C
Mail map / user profile
bob A,C
suzy A,C
joe B
ann B
Bobs MSGs
Suzys MSGs
Bobs MSGs
Joes MSGs
Anns MSGs
Suzys MSGs
Mailbox storage
A
B
C
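A minimal sketch of the user-map lookup described above, assuming an 8-bucket map and a toy string hash (so bob may land in a different bucket than in the slide's example); neither is Porcupine's actual scheme.

```c
/* Sketch of the user-map / mail-map lookup described above.
 * Bucket count, node names, and the hash are illustrative. */
#include <stdio.h>

#define NUM_BUCKETS 8

/* User map: bucket -> manager node, replicated on every node. */
static const char user_map[NUM_BUCKETS] = {
    'B', 'C', 'A', 'C', 'A', 'B', 'A', 'C'
};

/* Toy string hash; any stable hash works as long as all nodes agree. */
static unsigned hash_user(const char *user) {
    unsigned h = 0;
    while (*user)
        h = h * 31 + (unsigned char)*user++;
    return h % NUM_BUCKETS;
}

int main(void) {
    const char *user = "bob";
    unsigned bucket = hash_user(user);
    /* The node found here manages bob's soft state: his profile and
     * mail map (the list of nodes holding his mailbox fragments). */
    printf("%s -> bucket %u -> manager node %c\n",
           user, bucket, user_map[bucket]);
    return 0;
}
```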
18. Presentation Outline
- Overview.
- Functionally homogeneous clustering.
- Challenges and solutions:
  - Scaling performance.
  - Reacting to failures and recoveries:
    - Recovering the name space.
    - Replicating on-disk data.
  - Load balancing.
- Evaluation.
- Conclusion.
19. Scaling Performance
- The user map distributes user-management responsibility evenly across nodes.
- Load balancing distributes data-storage responsibility evenly across nodes.
- The workload is very parallel.
- ⇒ Scalable performance.
20. Measurement Environment
- Porcupine email server:
  - Linux 2.2.7, glibc 2.1.1, ext2.
  - 50,000 lines of C code.
- 30-node cluster of not-quite-all-identical PCs.
- 100 Mb/s Ethernet, 1 Gb/s hubs.
- Performance is disk-bound.
- Homogeneous configuration.
- Synthetic load:
  - Modeled after the UW CSE server.
  - Mixture of SMTP and POP sessions.
21. Porcupine Performance
- [Graph] POP performance, no email replication; annotated data points: 68m/day and 25m/day.
22. Presentation Outline
- Overview.
- Functionally homogeneous clustering.
- Challenges and solutions:
  - Scaling performance.
  - Reacting to failures and recoveries:
    - Recovering the name space.
    - Replicating on-disk data.
  - Load balancing.
- Evaluation.
- Conclusion.
23. How Do Computers Fail?
- Large clusters are unreliable.
- Assumption: live nodes respond correctly in bounded time, most of the time.
- The network can partition.
- Nodes can become very slow temporarily.
- Nodes can fail (and may never recover).
- Byzantine failures are excluded.
24. Recovery Goals and Strategies
- Goals:
  - Maintain function after unusual failures.
  - React to changes quickly.
  - Graceful performance degradation / improvement.
- Strategy: two complementary mechanisms.
  - Make data soft as much as possible.
  - Hard state (email messages, user profiles) ⇒ optimistic fine-grain replication.
  - Soft state (user map, mail map) ⇒ reconstruction after a configuration change.
25Soft-state Recovery Overview
2. Distributed disk scan
1. Membership protocol Usermap recomputation
B
A
A
B
A
B
A
B
B
A
A
B
A
B
A
B
A
bob A,C
bob A,C
bob A,C
suzy A,B
suzy
B
A
A
B
A
B
A
B
B
A
A
B
A
B
A
B
B
joe C
joe C
joe C
ann B
ann
suzy A,B
C
suzy A,B
suzy A,B
ann B,C
ann B,C
ann B,C
Timeline
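A hedged sketch of the two recovery steps above: recomputing the user map for a new membership, then rebuilding mail-map entries from a local disk scan. The bucket count, assignment policy, and the hard-coded "scan" are illustrative assumptions, not Porcupine's actual code.

```c
/* Sketch of soft-state recovery: recompute the user map, then rebuild
 * mail-map entries by scanning local mailbox storage. */
#include <stdio.h>

#define NUM_BUCKETS 8
#define MAX_NODES   4

static const char *live_nodes[MAX_NODES];
static int num_live;
static const char *user_map[NUM_BUCKETS];   /* bucket -> manager node */

/* Step 1: after the membership protocol reports the live set,
 * deterministically reassign buckets to live nodes. */
static void recompute_user_map(void) {
    for (int b = 0; b < NUM_BUCKETS; b++)
        user_map[b] = live_nodes[b % num_live];
}

static unsigned hash_user(const char *u) {
    unsigned h = 0;
    while (*u) h = h * 31 + (unsigned char)*u++;
    return h % NUM_BUCKETS;
}

/* Step 2: each node scans its on-disk mailboxes and pushes one
 * mail-map entry per local mailbox fragment to the user's manager.
 * Here the "scan" is a hard-coded list and the "push" is a printf. */
static void disk_scan_and_report(const char *this_node) {
    const char *local_mailboxes[] = { "bob", "suzy" };  /* found on disk */
    for (int i = 0; i < 2; i++) {
        const char *user = local_mailboxes[i];
        const char *manager = user_map[hash_user(user)];
        printf("%s -> tell manager %s: %s has a fragment on %s\n",
               this_node, manager, user, this_node);
    }
}

int main(void) {
    /* Membership after node C fails: only A and B remain. */
    live_nodes[0] = "A"; live_nodes[1] = "B"; num_live = 2;
    recompute_user_map();
    disk_scan_and_report("A");
    return 0;
}
```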
26. Cost of Soft-state Recovery
- Data bucketing allows fast discovery.
- Cost of a bucket scan ⇒ O(U).
- Fraction of buckets scanned per change ⇒ O(1/N).
- Frequency of changes ⇒ O(N/MTBF).
- Total cost: O(U) x O(1/N) x O(N/MTBF) = O(U/MTBF).
- [Graph] U = 5 million per node.
27. How Does Porcupine React to Configuration Changes?
- [Graph] (See breakdown.)
28. Soft-state Recovery: Summary
- Scalable, reliable recovery.
- Quick, constant-cost recovery.
- Recovers soft state after any type/number of failures.
- No residual references to dead nodes.
- Proven correct: the soft state eventually and correctly reflects the contents on disk.
29. Replicating Hard State
- Goals:
  - Keep serving hard state (email msgs, user profiles) after unusual failures.
  - Per-object replica-site selection.
  - Space and computational efficiency.
  - Dynamic addition/removal of replicas.
- Strategy: exploit application semantics.
  - Be optimistic.
  - Whole-state transfer; Thomas write rule.
30-37. Example: Update Propagation (animation)
- [Diagram sequence] An update to an object replicated on {A, B, C}:
  - The initiating node timestamps the new contents (3:10pm) and creates an update record holding the timestamp, target set {A, B, C}, and ack set.
  - The new contents and timestamp propagate to the other replicas; each applies them and returns an ack (Ack 3:10pm).
  - Once acks from the whole target set are collected, the update is retired (Retire 3:10pm); the update records are discarded and only the new contents and replica set remain.
38. Replica Addition and Removal
- [Diagram] A issues an update to delete C: the new replica set {A, B} is propagated like any other update, with its own timestamp (3:10pm), target set, and ack set.
- Unified treatment of updates to contents and to the replica set.
39. What If Updates Conflict?
- Apply the Thomas write rule (sketched in code below):
  - The newest update always wins.
  - An older update is canceled by being overwritten with the newer update.
- The same rule applies to replica addition/deletion.
- But there are some subtleties...
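A small sketch of applying an update under the Thomas write rule as described above: each replica keeps only the newest timestamped state, and stale updates are simply dropped. The struct layout and the use of time_t timestamps are illustrative assumptions; Porcupine's actual timestamps and data structures may differ.

```c
/* Hedged sketch of Thomas-write-rule update application. */
#include <stdio.h>
#include <time.h>

struct replica_obj {
    time_t timestamp;          /* timestamp of the newest applied update */
    char   contents[256];      /* whole-object state (whole-state transfer) */
    char   replica_set[16];    /* e.g., "ABC"; updated by the same rule */
};

/* Apply an incoming update only if it is newer than what we hold;
 * older updates are discarded because the newer state already
 * supersedes them.  Returns 1 if applied, 0 if ignored. */
static int apply_update(struct replica_obj *obj, time_t ts,
                        const char *contents, const char *replica_set) {
    if (ts <= obj->timestamp)
        return 0;                       /* stale update: newest wins */
    obj->timestamp = ts;
    snprintf(obj->contents, sizeof obj->contents, "%s", contents);
    snprintf(obj->replica_set, sizeof obj->replica_set, "%s", replica_set);
    return 1;
}

int main(void) {
    struct replica_obj mailbox = { 0, "", "ABC" };
    /* Two updates arrive out of order: the later (3:10pm) one wins. */
    apply_update(&mailbox, 1000 /* "3:10pm" */, "msg v2", "ABC");
    apply_update(&mailbox,  900 /* "3:05pm" */, "msg v1", "ABC");
    printf("contents=%s replicas=%s\n", mailbox.contents, mailbox.replica_set);
    return 0;
}
```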
40. Node Discovery Protocol
- [Diagram sequence] Animation showing how pending updates (3:10pm, 3:20pm) are reconciled when a node such as D is discovered: the newer update is applied (Apply 3:20pm update) and missing replicas are re-added to the target set (Add targets: C), with the new replica set, target set, and ack set shown at each step.
41. Replication Space Overhead
- [Graph] Spool size = 2 GB; average email msg = 4.7 KB.
42. How Efficient is Replication?
43. Replication: Summary
- Flexibility:
  - Any object can be stored on any node.
  - Dynamic replica-set changes are supported.
- Simplicity and efficiency:
  - Two-phase propagation/retirement.
  - Unified contents and replica-set updates.
- Proven correct:
  - All live replicas agree on the newest contents, regardless of concurrent updates and failures, as long as the network does not partition for long periods.
44. Presentation Outline
- Overview.
- Functionally homogeneous clustering.
- Challenges and solutions:
  - Reacting to failures and recoveries:
    - Soft-state namespace recovery.
    - Replication.
  - Load balancing.
- Conclusion.
45. Distributing Incoming Workload
- Goals:
  - Minimize voodoo parameter tuning.
  - Handle skewed configurations.
  - Handle skewed workloads.
  - Lightweight.
  - Reconcile affinity and load balance.
- Strategy: local, spread-based load balancing (sketched in code below).
  - Spread: soft limit on the size of a user's mail map.
  - Load: measure of pending disk I/O requests.
- [Diagram] Given the user's mail map and cached per-node loads:
  1. Add nodes if |mail map| < spread.
  2. Pick the least-loaded node(s) from the set.
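A minimal sketch of the spread-based selection above: widen the candidate set while the user's mail map is below the spread limit, then pick the least-loaded candidate by pending disk I/O. The node count, load values, and data layout are illustrative assumptions.

```c
/* Hedged sketch of spread-based load balancing. */
#include <stdio.h>

#define NUM_NODES 4

/* Cached load per node: number of pending disk I/O requests. */
static int pending_io[NUM_NODES] = { 7, 2, 9, 4 };

/* Pick a storage node for a new message.
 * mailmap[] : nodes already holding fragments of this user's mailbox
 * map_size  : current mail-map size
 * spread    : soft limit on mail-map size */
static int pick_node(const int *mailmap, int map_size, int spread) {
    int candidates[NUM_NODES], n = 0;

    if (map_size < spread) {
        /* Step 1: the mail map is still small, so any node is a
         * candidate (choosing a new one may grow the map toward spread). */
        for (int i = 0; i < NUM_NODES; i++)
            candidates[n++] = i;
    } else {
        /* Mail map already at the spread limit: stay within it to keep
         * the number of files touched per mailbox small. */
        for (int i = 0; i < map_size; i++)
            candidates[n++] = mailmap[i];
    }

    /* Step 2: pick the least-loaded candidate. */
    int best = candidates[0];
    for (int i = 1; i < n; i++)
        if (pending_io[candidates[i]] < pending_io[best])
            best = candidates[i];
    return best;
}

int main(void) {
    int mailmap[] = { 0, 2 };      /* bob's msgs currently on nodes 0 and 2 */
    int node = pick_node(mailmap, 2, 2);
    printf("store new msg on node %d (load %d)\n", node, pending_io[node]);
    return 0;
}
```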
46. Choosing the Optimal Spread Limit
- Trade-off:
  - Larger spread ⇒ more nodes to choose from ⇒ better load balance.
  - Smaller spread ⇒ fewer files to access ⇒ better overall throughput.
- Spread = 2 is optimal for a uniform configuration.
- Spread > 2 (e.g., 4) for a heterogeneous configuration.
47. How Well Does Porcupine Support Heterogeneous Clusters?
- [Graph] Annotated data points: 16.8m/day (25) vs. 0.5m/day (0.8).
48. Presentation Outline
- Overview.
- Functionally homogeneous clustering.
- Challenges and solutions.
- Evaluation.
- Conclusion:
  - Summary.
  - Future directions.
49. Conclusions
- Cheap, fast, available, and manageable clusters can be built for data-intensive Internet services.
- The key ideas can be extended beyond mail:
  - Dynamic data and function distribution.
  - Automatic reconfiguration.
  - High-throughput, optimistic replication.
  - Load balancing.
  - Exploiting application semantics.
  - Use of soft state.
  - Optimism.
50. Future Directions
- Geographical distribution.
- Running multiple services.
- Software reuse.
51-54. Example: Replica Removal (backup, animation)
- [Diagram sequence] Removing C from the replica set {A, B, C}: the new replica set {A, B} is propagated as a timestamped update (3:10pm) with its own target set and ack set; once all acks arrive, the update is retired and C no longer appears in the replica set.
55-57. Example: Updating Contents, Propagation, and Retirement (backup, animation)
- [Diagram sequence] An update to an object with replica set {A, B, C}: the initiating node creates an update record (which exists only during update propagation) holding the new contents, timestamp (3:10pm), target set, and ack set; the contents propagate to each replica, acks are collected, and Retire 3:10pm messages remove the update records everywhere.
58. Example: Final State (backup)
- The algorithm is quiescent after update retirement:
  - New contents are absent from the update record.
  - Contents are read directly from the replica.
  - The update is stored only during propagation.
  - ⇒ Computational and space efficiency.
- [Diagram] Final state: A, B, and C all hold the new contents and replica set {A, B, C}, with no update records left.
59. Handling Long-term Failures
- The algorithm maintains consistency among the remaining replicas.
- But updates destined for dead nodes get stuck and clog nodes' disks.
- Solution: erase dead nodes' names from replica sets and update records after a grace period.
60. Replication Space Overhead
- [Graph] ≈ 6-17 MB for replica sets and update records, vs. ≈ 2000 MB for email msgs.
61. Scaling to a Large User Population
- A large user population increases the memory requirement.
- Recovery cost grows linearly with the per-node user population.
62. Rebalancing
- Load balancing may cause suboptimal data distribution after node addition/retirement.
- Resources are wasted at night (traffic drops to 1/2 to 1/5 of daytime levels).
- Rebalancer (sketched in code below):
  - Runs around midnight.
  - Adds replicas for under-replicated objects.
  - Removes replicas for over-replicated objects.
  - Deletes objects without owners.
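A hedged sketch of one rebalancer pass as described above: walk the objects, compare each replica count with a target, and schedule additions, removals, or orphan deletion. The object table, target replica count, and print-stub actions are illustrative assumptions.

```c
/* Sketch of a nightly rebalancer pass. */
#include <stdio.h>

#define TARGET_REPLICAS 2

struct object {
    const char *name;
    int replica_count;
    int has_owner;      /* 0 if the owning user no longer exists */
};

static void rebalance(struct object *objs, int n) {
    for (int i = 0; i < n; i++) {
        struct object *o = &objs[i];
        if (!o->has_owner) {
            printf("delete orphan %s\n", o->name);
        } else if (o->replica_count < TARGET_REPLICAS) {
            printf("add %d replica(s) of %s\n",
                   TARGET_REPLICAS - o->replica_count, o->name);
        } else if (o->replica_count > TARGET_REPLICAS) {
            printf("remove %d replica(s) of %s\n",
                   o->replica_count - TARGET_REPLICAS, o->name);
        }
    }
}

int main(void) {
    struct object objs[] = {
        { "bob/msg1",  1, 1 },   /* under-replicated */
        { "suzy/msg7", 3, 1 },   /* over-replicated  */
        { "joe/msg2",  2, 0 },   /* orphan           */
    };
    rebalance(objs, 3);
    return 0;
}
```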