Title: CS556: Distributed Systems
1. CS-556 Distributed Systems
Case Studies of Scalable Systems: Porcupine, Inktomi (and a taste of SABRE)
- Manolis Marazakis
- maraz_at_csd.uoc.gr
2. Porcupine
- A highly available cluster-based mail service
- Built out of commodity H/W
- Mail is important, hard, and easy ?
- Real demand
- AOL, HotMail: > 100 M messages/day
- Write-intensive, I/O bound with low locality
- Simple API, inherent parallelism, weak consistency
- Simpler than a DDBMS or a DFS
- Scalability goals
- Performance: scale linearly with cluster size
- Manageability: automatic reconfiguration
- Availability: gracefully survive multiple failures
3. Conventional mail services
- Static partitioning of mailboxes, on top of a FS or DBMS
- Performance problems: no dynamic load balancing
- Manageability problems: static, manual partitioning decisions
- Availability problems: if a server goes down, part of the user population cannot access their mailboxes
4. Functional Homogeneity
- Any node can perform any task
- Interaction with mail clients
- Mail storage
- Any piece of data can be managed at any node
- Techniques
- Replication & reconfiguration
- Gracefully survive failures
- Load balancing
- Masking of skews in workload & cluster configuration
- Dynamic task & data placement decisions
- Messages for a single user can be scattered across multiple nodes and collected only upon request
5. Architecture
- Protocol handling
- User lookup
- Load balancing
- Message store access
6. Operation
- Incoming request: send msg to userX
- DNS round-robin selects a node (A)
- Who manages userX? (B)
- A issues a request to B for user verification
- B knows where userX's messages are kept (C, D)
- A picks the best node for the new msg (D)
- D stores the new msg
Each user is managed by a node, and all nodes must agree on who is managed where.
The user population is partitioned across nodes using a hash function.
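Below is a minimal sketch of this hash-based user map, assuming a fixed number of buckets and a SHA-1 hash of the user name; the class and method names are illustrative, not Porcupine's actual data structures.

```python
# Minimal sketch of hash-partitioned user management (illustrative only).
import hashlib

class UserMap:
    """Maps each user to the node that manages their profile and mail map.

    Porcupine keeps this as replicated soft state; here we simply hash user
    names into buckets and assign each bucket to a live node.
    """

    def __init__(self, nodes, num_buckets=256):
        self.nodes = list(nodes)              # live nodes, e.g. ["node0", "node1", ...]
        self.num_buckets = num_buckets
        # bucket -> managing node; recomputed by the membership protocol on changes
        self.bucket_owner = {b: self.nodes[b % len(self.nodes)]
                             for b in range(num_buckets)}

    def bucket_of(self, user):
        digest = hashlib.sha1(user.encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.num_buckets

    def manager_of(self, user):
        return self.bucket_owner[self.bucket_of(user)]

# Any node receiving "send msg to userX" consults the same map to find B.
umap = UserMap(["node%d" % i for i in range(4)])
print(umap.manager_of("userX"))
```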
7. Strategy for scalable performance
- Avoid creating hot spots
- Partition data uniformly among nodes
- Fine-grain data partitioning
- Experimental results
- 30-node cluster (PCs with Linux)
- Synthetic load
- derived from University server logs
- Comparison with sendmail+popd
- Sustains 800 msgs/sec (68 M msgs/day)
- As compared to 250 msgs/sec (25 M msgs/day)
8. Strategy for reconfiguration (I)
- Hard state: messages, user profiles
- Fine-grain optimistic replication
- Soft state: user map, mail map
- Reconstructed after reconfiguration
- Membership protocol
- Initiated when a crash is detected
- Update of the user map data structures
- Broadcast of the updated user map
- Distributed disk scan
- Each node scans its local FS for msgs owned by moved users
The total amount of mail map information that needs to be recovered from disk equals what was stored on the crashed node -> independent of cluster size.
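A minimal sketch of the distributed disk scan, assuming a per-user directory layout in the local message store and user-map objects like the one sketched earlier; `report` is a hypothetical stand-in for the RPC that registers a mail-map entry at the new manager.

```python
# Minimal sketch of the post-reconfiguration disk scan (illustrative stand-ins,
# not Porcupine's actual storage layout or RPC interface).
import os

def rebuild_mail_map(local_node, store_dir, old_map, new_map, report):
    """Report locally stored message fragments whose managing node changed,
    so each new manager can rebuild its soft-state mail map."""
    for user in os.listdir(store_dir):          # assume one directory per user fragment
        if old_map.manager_of(user) != new_map.manager_of(user):
            # "node `local_node` stores messages for `user`"
            report(new_map.manager_of(user), user, local_node)
```

Each node only scans what it stores locally, which is why the amount of mail-map information to recover matches what was on the crashed node rather than growing with cluster size.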
9. Strategy for reconfiguration (II)
10. Hard-state replication (I)
- Internet semantics
- Optimistic, eventually consistent replication
- Per-message, per-user profile replication
- Small window of inconsistency
- A user may see stale data
- Efficient during normal operation
- For each request, a coordinator node pushes updates to the other replica nodes (see the sketch after this list)
- If a replica node crashes, the coordinator simply waits for its recovery to complete the update
- But does not block
- Coordinator crash
- A replica node will detect it and take over
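A minimal sketch of the non-blocking optimistic push, assuming idempotent updates; `send_update` and `ack_received` are hypothetical stand-ins for the coordinator's update log and RPCs.

```python
# Minimal sketch of optimistic, eventually consistent replication (illustrative).
import threading
import time

def replicate(update, replicas, send_update, ack_received, retry_interval=5.0):
    """The coordinator pushes `update` to every replica and keeps retrying the
    ones that have not acknowledged, without blocking the client request."""
    pending = set(replicas)

    def pusher():
        while pending:
            for node in list(pending):
                send_update(node, update)       # idempotent: replicas apply it at most once
                if ack_received(node, update):
                    pending.discard(node)       # durable on this replica
            time.sleep(retry_interval)          # crashed nodes are retried after they recover

    # Push in the background; the coordinator can answer the client immediately.
    threading.Thread(target=pusher, daemon=True).start()
```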
11. Hard-state replication (II)
Less than linear degradation -> disk logging overhead, which can be reduced by using a separate log disk (or NVRAM)
12. Strategy for load balancing
- Deciding where to store msgs
- Spread: soft limit on the number of nodes per mailbox
- This limit is violated when nodes crash
- Select a node from the spread candidates
- Small spread -> better data affinity
- Smaller mail map data structure
- More streamlined disk head movement
- Large spread -> better load balancing
- More choices for selection
- Load measure: pending I/O operations (see the sketch below)
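A minimal sketch of spread-limited placement with pending I/O as the load measure; the mail map and load table are simplified stand-ins for Porcupine's soft state.

```python
# Minimal sketch of spread-limited message placement (illustrative only).

def pick_storage_node(user, mail_map, pending_io, live_nodes, spread=4):
    """Choose where to store a new message for `user`.

    mail_map[user] : set of nodes already holding messages for this user
    pending_io[n]  : pending disk operations at node n (the load measure)
    """
    current = mail_map.get(user, set()) & set(live_nodes)
    if len(current) < spread:
        candidates = live_nodes        # allowed to widen the user's spread
    else:
        candidates = current           # stay within the existing spread
    best = min(candidates, key=lambda n: pending_io[n])
    mail_map.setdefault(user, set()).add(best)
    return best

# Example
load = {"n0": 12, "n1": 3, "n2": 7, "n3": 9}
print(pick_storage_node("userX", {}, load, ["n0", "n1", "n2", "n3"], spread=2))  # n1
```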
13. Handling heterogeneity
Better utilization of fast disks (3x the speed)
14. Inktomi
- Derivative of the NOW project at UCB
- Led to commercial services
- TranSend, HotBot
- The case against distributed systems?
- BASE semantics, instead of ACID
- Basically available
- Tolerate (occasionally) stale data
- Soft state
- Reconstructed during recovery
- Eventually consistent
- Responses can be approximate
- Centralized work queues
- Scalable!
15. Why not ACID?
- Much of the data in a network service can tolerate guarantees weaker than ACID
- ACID makes no guarantees about availability
- Indeed, it is preferable for an ACID service to be unavailable than to relax the ACID constraints
- ACID is well suited for
- Commerce transactions, billing users, maintenance of user profiles, ...
- For most Internet information services, users value availability more than strong consistency or durability
- Web servers, search/aggregation servers, caching/transformation proxies, ...
16. Cluster architecture (I)
- Front-ends
- Supervision of incoming requests
- Matching requests with profiles (customization DB)
- Enqueue requests for service by one or more workers
- Worker pool
- Caches, service-specific modules
- Customization DB
- Manager
- Balancing load across workers, spawning more workers as load fluctuates or faults occur
- System Area Network (SAN)
- Graphical Monitor
17. Cluster architecture (II)
[Figure: cluster architecture diagram; labels include the caches and the worker stub]
18. Networked, commodity workstations
- Incremental growth
- Automated, centralized administration and monitoring
- Boot image management
- Software load management
- Firewall settings
- Console access
- Application configuration
- Convenient for supporting a system component (consider the cost of decomposition versus the efficiencies of SMP)
- Partial failure
- Shared state
- Distributed shared memory abstraction
19. 3-layer decomposition
- Service
- User interface to control the service
- Device-specific presentation
- Allows workers to remain stateless
- TACC API
- Transformation
- Filtering, transcoding, re-rendering, encryption, compression
- Aggregation
- Composition (pipeline chaining) of stateless modules (see the sketch after this list)
- Caching
- Original, post-aggregation, post-transformation data
- Customization
- SNS (Scalable Network Service) support
- Worker load balancing, overflow management
- Fault tolerance
- System monitoring & logging
- Incremental scalability
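A minimal sketch of TACC-style composition, chaining stateless workers into a pipeline; the transformation and aggregation functions below are hypothetical examples, not the actual TACC API.

```python
# Minimal sketch of TACC-style composition of stateless modules (illustrative).

def transform_lowercase(data):
    """A 'transformation' worker: rewrites a single input."""
    return data.lower()

def aggregate_concat(results):
    """An 'aggregation' worker: combines results from several sources."""
    return " | ".join(results)

def pipeline(stages, data):
    """Chain stateless stages; any stage can be restarted or relocated
    because it carries no state between requests."""
    for stage in stages:
        data = stage(data)
    return data

# Example: aggregate three upstream results, then transform the combined output.
print(pipeline([aggregate_concat, transform_lowercase], ["Result-A", "Result-B", "Result-C"]))
```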
20. Load management
- Load balancing hints
- Computed by the Manager, based on load measurements from workers
- Periodically transmitted to front-ends
- Overflow pool
- Absorbs load bursts
- Relatively rare, but prolonged
- E.g. Pathfinder landing on Mars -> 220 M hits in 4 days
- Spare machines on which the Manager can spawn workers on demand
- Workers are interchangeable (see the sketch below)
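A minimal sketch of hint-driven dispatch with an overflow pool; the hint table, threshold, and `spawn_worker` call are hypothetical stand-ins for the Manager/front-end interaction.

```python
# Minimal sketch of load-balancing hints plus an overflow pool (illustrative).

OVERLOAD_THRESHOLD = 50      # queued tasks per worker before spilling over

def dispatch(task, hints, overflow_pool, spawn_worker):
    """Pick the least-loaded worker from the Manager's periodic hints; if all
    workers look saturated, spawn a new worker on an overflow machine.
    `task` would then be enqueued at the chosen worker."""
    worker, load = min(hints.items(), key=lambda kv: kv[1])   # hints: {worker: queue length}
    if load > OVERLOAD_THRESHOLD and overflow_pool:
        worker = spawn_worker(overflow_pool.pop())            # absorb the burst
        hints[worker] = 0
    hints[worker] += 1        # optimistic local update until the next hint arrives
    return worker

# Example
hints = {"w1": 60, "w2": 55}
print(dispatch("req-42", hints, ["spare-1"], spawn_worker=lambda host: "w@" + host))
```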
21. Fault tolerance & availability
- Construct robust entities by relying on cached soft state, refreshed by periodic messages
- Transient component failures are a fact of life
- Process-peer fault tolerance (see the sketch below)
- When a component fails, one of its peers restarts it (possibly on a different node)
- In the meantime, cached state (possibly stale) is still available to the surviving components
- A restarted component gradually reconstructs its soft state
- Typically by listening to multicasts from the others
- Not the same as process pairs
- Process pairs require hard state
- Timeouts are used to infer failures
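A minimal sketch of process-peer fault tolerance with timeouts; the beacon timeout and the `restart` callback are hypothetical stand-ins for the SNS mechanisms.

```python
# Minimal sketch of timeout-based peer monitoring (illustrative only).
import time

FAILURE_TIMEOUT = 5.0    # no beacon for this long => assume the peer has failed

def watch_peers(last_beacon, restart, now=time.time):
    """Run periodically by every component: any peer whose 'I am alive' beacon
    is too old is declared failed and restarted (possibly on another node)."""
    for peer, seen_at in list(last_beacon.items()):
        if now() - seen_at > FAILURE_TIMEOUT:
            restart(peer)                 # the peer will rebuild its soft state from multicasts
            last_beacon[peer] = now()     # fresh grace period for the restarted peer

# Example
beacons = {"front-end-1": time.time() - 10, "worker-3": time.time()}
watch_peers(beacons, restart=lambda p: print("restarting", p))
```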
22. The HotBot search engine
- SPARCstations, interconnected via Myrinet
- Front-ends
- 50-80 threads per node
- Dynamic HTML (TCL script tags)
- Load balancing
- Static partitioning of the search DB
- Every query goes to all workers in parallel
- Workers are not 100% interchangeable
- Each worker has a local disk
- Version 1: DB fragments are cross-mounted
- So that other nodes can reach the data, with graceful performance degradation
- Version 2: RAID
- With 26 nodes, the loss of 1 node resulted in the available DB dropping from 54 M documents to 51 M
- Informix DB for user profiles & ad revenue tracking
- Primary/backup failover
23. TranSend
- Caching transformation proxy
- SPARCstations interconnected via 10 Mb/s switched Ethernet, serving a dialup pool
- Thread per TCP connection
- Single front-end, with a total of 400 threads
- Pipelining of distillers
- Lossy-compression workers
- Centralized Manager
- Periodic IP multicast to announce its presence
- No static binding is required for workers
- Workers periodically report a load metric (see the sketch below)
- A distiller's queue length, weighted by a factor reflecting the expected execution cost
- Version 1: process-pair recovery
- Version 2: soft state with a watcher process & periodic beacon of state updates
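A minimal sketch of that weighted load metric; the cost factors below are made-up numbers, not TranSend's measured values.

```python
# Minimal sketch of a queue-length load metric weighted by expected cost (illustrative).

EXPECTED_COST = {"image-distiller": 3.0, "html-distiller": 1.0}   # hypothetical factors

def load_metric(distiller_type, queue_length):
    """Load reported by a worker: pending tasks weighted by how expensive
    one task of this distiller type is expected to be."""
    return queue_length * EXPECTED_COST[distiller_type]

# Example: 4 queued image jobs look 'heavier' than 10 queued HTML jobs.
print(load_metric("image-distiller", 4))   # 12.0
print(load_metric("html-distiller", 10))   # 10.0
```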
24. Internet service workloads
- Yahoo: 625 M page views/day
- HTML: 7 KB, images: 10 KB
- AOL's proxy: 5.2 B requests/day
- Response size: 5.5 KB
- Services often take 100s of milliseconds
- Responses take several seconds to flow back
- High task throughput & non-negligible latency
- A service may have to sustain 1000s of simultaneous tasks
- C10K problem
- Human users: 4 parallel HTTP GET requests spawned per page view
- A large fraction of service tasks are independent of each other
25. DDS (S.D. Gribble)
- Self-managing, cluster-based data repository
- Seen by services as a conventional data structure
- Log, tree, hash table
- High performance
- 60 K reads/sec, over 1.28 TB of data
- 128-node cluster
- The CAP principle
- A system can have at most two of the following properties
- Availability
- Tolerance to network Partitions
26. CAP trade-offs
27. SEDA (M. Welsh)
- Staged, event-driven architecture
- Service stages, linked via queues
- Thread pool per stage
- Massive concurrency
- Admission & priority control on each individual queue
- Adaptive load balancing
- Feedback loop
- No a-priori resource limits (see the sketch below)
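A minimal sketch of a SEDA-style stage, assuming a bounded queue for admission control and a fixed thread pool; this illustrates the idea, not Welsh's actual SEDA API.

```python
# Minimal sketch of a SEDA-style stage: a bounded event queue with admission
# control, drained by a small thread pool (illustrative only).
import queue
import threading
import time

class Stage:
    def __init__(self, name, handler, threads=4, max_queue=1000):
        self.name = name
        self.handler = handler                          # event-processing function
        self.events = queue.Queue(maxsize=max_queue)    # bounded queue = admission control
        for _ in range(threads):
            threading.Thread(target=self._worker, daemon=True).start()

    def enqueue(self, event):
        """Admission control: reject (rather than buffer) when overloaded."""
        try:
            self.events.put_nowait(event)
            return True
        except queue.Full:
            return False                                # caller can degrade or retry later

    def _worker(self):
        while True:
            event = self.events.get()
            self.handler(event)                         # output typically feeds the next stage

# Example: two stages chained via their queues.
render = Stage("render", handler=lambda e: print("rendered", e))
parse = Stage("parse", handler=lambda e: render.enqueue(e.upper()))
parse.enqueue("request-1")
time.sleep(0.2)                                         # let the daemon threads drain the queues
```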
28. Overhead of concurrency (I)
29. Overhead of concurrency (II)
30. SEDA architecture (I)
31. SEDA architecture (II)
32. SEDA architecture (III)
33. SABRE
- Reservations, inventory tracking
- Airlines, travel agencies,
- Descendant of CRS (Computerized Reservation System)
- Hosted in a number of secure data centers
- Connectivity with major reservation systems
- Amadeus, Apollo, Galileo, WorldSpan,
- Management of PNRs
- Passenger Name Records
- IATA/ARC standards for ticketing
- 2001
- 6 K employees, in 45 countries
- 59 K travel agencies
- 450 airlines, 53 K hotels
- 54 car rental companies, 8 cruise lines, 33 railroads
- 228 tour operators
34. History (I)
- 1964
- Location: New York
- Hosted on 2 IBM 7090 mainframes
- 84 K requests/day
- Development cost: 400 man-years, 40 M (USD)
- 1972
- Location: Tulsa, Oklahoma
- Hosted on IBM 360s
- The switch caused 15 minutes of service interruption
- 1976
- 130 travel agencies have terminals
- 1978
- Storage of 1 M fares
- 1984
- Bargain Finder service
35. History (II)
- 1985
- easySabre: PCs can connect to the system as terminals
- 1986
- Automated yield management system (dynamic pricing)
- 1988
- Storage of 36 M fares
- Can be combined into > 1 B fare options
- 1995
- Initiation of Y2K code inspection
- 200 M lines of code
- Interfaces with > 600 suppliers
- New software for > 40 K travel agents
- 1,200 H/W & S/W systems
- 1996
- Travelocity.com
- 1998
- Joint venture with ABACUS International
- 7,300 travel agencies, in 16 countries (Asia)
- 2000
36. Legacy connectivity (I)
- Connection-oriented comm. protocol (sessions)
- ALC: Airline Link Control protocol
- Packet-switching
- But not TCP/IP; usually X.25
- Requires special H/W (network card)
- Gradual upgrades to Frame Relay connectivity
- Structured message interfaces
- Emulation of a 3270 terminal
- Pre-defined form fields
- Integration with other systems?
- Screen-scraping code
- Message Processors: gateways that offer connectivity to clients that do not use supported terminals
- Encapsulation of ALC over TCP/IP
37. Legacy connectivity (II)
There is a large market for gateway/connectivity products
- E.g. www.datalex.com
38. References
- Y. Saito, B.N. Bershad and H.M. Levy, "Manageability, availability and performance in Porcupine: a highly scalable, cluster-based mail service", Proc. 17th ACM SOSP, 1999.
- S.D. Gribble, E.A. Brewer, J.M. Hellerstein, and D. Culler, "Scalable, distributed data structures for Internet service construction", Proc. 4th OSDI, 2000.
- A. Fox, S.D. Gribble, Y. Chawathe, E.A. Brewer, and P. Gauthier, "Cluster-based scalable network services", Proc. 16th ACM SOSP, 1997.
- http://www.sabre.com