Title: CS556: Distributed Systems
1. CS-556 Distributed Systems
Case Studies of Scalable Systems
- Manolis Marazakis
- maraz_at_csd.uoc.gr
2. SABRE
- Reservations, inventory tracking
- Airlines, travel agencies, ...
- Descendant of CRS (Computerized Reservation System)
- Hosted in a number of secure data centers
- Connectivity with major reservation systems
- Amadeus, Apollo, Galileo, WorldSpan, ...
- Management of PNRs
- Passenger Name Records
- IATA/ARC standards for ticketing
- 2001
- 6 K employees, in 45 countries
- 59 K travel agencies
- 450 airlines, 53 K hotels
- 54 car rental companies, 8 cruise lines, 33 railroads
- 228 tour operators
3. History (I)
- 1964
- Location: New York
- Hosted on 2 IBM 7090 mainframes
- 84 K requests/day
- Development cost: 400 man-years, 40 M (USD)
- 1972
- Location: Tulsa, Oklahoma
- Hosted on IBM 360s
- The switch caused 15 minutes of service interruption
- 1976
- 130 travel agencies have terminals
- 1978
- Storage of 1 M fares
- 1984
- Bargain Finder service
4. History (II)
- 1985
- easySabre: PCs can connect to the system as terminals
- 1986
- Automated yield management system (dynamic pricing)
- 1988
- Storage of 36 M fares
- Can be combined into > 1 B fare options
- 1995
- Initiation of Y2K code inspection
- 200 M lines of code
- Interfaces with > 600 suppliers
- New software for > 40 K travel agents
- 1,200 H/W & S/W systems
- 1996
- Travelocity.com
- 1998
- Joint venture with ABACUS International
- 7,300 travel agencies, in 16 countries (Asia)
- 2000
5. Legacy connectivity (I)
- Connection-oriented comm. protocol (sessions)
- ALC: Airline Link Control protocol
- Packet-switching
- but not TCP/IP; usually X.25
- Requires special H/W (network card)
- Gradual upgrades to Frame-Relay connectivity
- Structured message interfaces
- Emulation of 3270 terminals
- Pre-defined form fields
- Integration with other systems ?
- screen-scraping code
- Message Processors: gateways that offer connectivity to clients that do not use supported terminals
- Encapsulation of ALC over TCP/IP
6. Legacy connectivity (II)
There is a large market for gateway/connectivity products
- E.g. www.datalex.com
7. Porcupine
- A highly available cluster-based mail service
- Built out of commodity H/W
- Mail is important, hard, and easy
- Real demand
- AOL, HotMail: > 100 M messages/day
- Write-intensive, I/O bound, with low locality
- Simple API, inherent parallelism, weak consistency
- Simpler than a DDBMS or DFS
- Scalability goals
- Performance: linear scaling with cluster size
- Manageability: automatic reconfiguration
- Availability: gracefully survive multiple failures
8. Conventional mail services
- Static partitioning of mailboxes, on top of a FS or DBMS
- Performance problems: no dynamic load balancing
- Manageability problems: static, manual partitioning decisions
- Availability problems: if a server goes down, part of the user population cannot access their mailboxes
9. Functional Homogeneity
- Any node can perform any task
- Interaction with mail clients
- Mail storage
- Any piece of data can be managed at any node
- Techniques
- Replication & reconfiguration
- Gracefully survive failures
- Load balancing
- Masking of skews in workload & cluster configuration
- Dynamic task & data placement decisions
- Messages for a single user can be scattered on multiple nodes & collected only upon request
10. Architecture
Protocol handling
User lookup
Load balancing
Message store access
11. Operation
- Incoming request: "send msg to userX"
- DNS/RR selection of node (A)
- Who manages userX ? (B)
- A issues a request to B for user verification
- B knows where userX's messages are kept (C, D)
- A picks the best node for the new msg (D)
- D stores the new msg
Each user is managed by a node, and all nodes must agree on who is managed where.
Partitioning of the user population using a hash function (sketched below).
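The hash-based user map can be pictured with a small Python sketch (an illustration only, not Porcupine's actual code; the bucket count, hash choice, and node names are assumptions):

    import hashlib

    # Illustrative user map: bucket b is managed by user_map[b].
    # Porcupine keeps this as small, replicated soft state; the sketch only
    # shows the hashing idea, not the membership protocol.
    NUM_BUCKETS = 256

    def bucket_of(user: str) -> int:
        digest = hashlib.sha1(user.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

    def manager_of(user: str, user_map: list[str]) -> str:
        """user_map[b] names the node managing bucket b."""
        return user_map[bucket_of(user)]

    def rebuild_user_map(live_nodes: list[str]) -> list[str]:
        # Reassign buckets round-robin over the currently live nodes.
        return [live_nodes[b % len(live_nodes)] for b in range(NUM_BUCKETS)]

    live = ["node1", "node2", "node3"]
    user_map = rebuild_user_map(live)
    print(manager_of("userX", user_map))   # prints one of the live node names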
12. Strategy for scalable performance
- Avoid creating hot spots
- Partition data uniformly among nodes
- Fine-grain data partitioning
- Experimental results
- 30-node cluster (PCs with Linux)
- Synthetic load
- derived from University server logs
- Comparison with sendmail+popd
- Sustains 800 msgs/sec (68 M msgs/day)
- As compared to 250 msgs/sec (25 M msgs/day)
13. Strategy for reconfiguration (I)
- Hard state: messages, user profiles
- Fine-grain optimistic replication
- Soft state: user map, mail map
- Reconstructed after reconfiguration
- Membership protocol
- initiated when a crash is detected
- Update of user map data structures
- Broadcast of updated user map
- Distributed disk scan
- Each node scans its local FS for msgs owned by moved users
The total amount of mail map info that needs to be recovered from disks is equal to that stored on the crashed node -> independent of cluster size (sketched below).
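A hypothetical sketch of the rebuild step, with all function and parameter names invented for illustration: after a membership change, each node re-announces mail-map entries only for users whose manager moved, which is why the recovered volume tracks the crashed node rather than the cluster size.

    def on_membership_change(old_map, new_map, bucket_of, local_messages, send_entry):
        """old_map/new_map map bucket -> node name; local_messages yields
        (user, msg_location) pairs from this node's local store."""
        moved = {b for b in range(len(new_map)) if old_map[b] != new_map[b]}
        # Distributed disk scan: walk the local store, but only entries for
        # "moved" users are re-announced to their new manager.
        for user, msg_location in local_messages:
            b = bucket_of(user)
            if b in moved:
                send_entry(new_map[b], user, msg_location)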
14. Strategy for reconfiguration (II)
15. Hard-state replication (I)
- Internet semantics
- Optimistic, eventually consistent replication
- Per-message, per-user-profile replication
- Small window of inconsistency
- A user may see stale data
- Efficient during normal operation
- For each request, a coordinator node pushes updates to the other replica nodes (sketched below)
- If another node crashes, the coordinator simply waits for its recovery to complete the update
- but does not block
- Coordinator crash
- A replica node will detect it & take over
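A minimal sketch of the push-based update path described above (illustrative names, not Porcupine's interfaces): the coordinator applies the update locally, pushes it to the replicas, and queues pushes to unreachable replicas for retry instead of blocking the client.

    # Hypothetical coordinator-side update push with eventual consistency.
    def coordinate_update(update, replicas, apply_local, push, retry_queue):
        apply_local(update)                        # durable local write (e.g. log + disk)
        for node in replicas:
            try:
                push(node, update)                 # best-effort push to a live replica
            except ConnectionError:
                retry_queue.append((node, update)) # re-push once the node recovers
        return "ok"                                # reply without waiting for laggards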
16. Hard-state replication (II)
Less-than-linear degradation -> disk logging overhead, which can be reduced by using a separate disk (or NVRAM)
17. Strategy for load balancing
- Deciding where to store msgs
- Spread: soft limit on the number of nodes per mailbox
- This limit is violated when nodes crash
- Select a node from the spread candidates (sketched below)
- Small spread -> better data affinity
- Smaller mail map data structure
- More streamlined disk head movement
- High spread -> better load balancing
- More choices for selection
- Load measure: number of pending I/O operations
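A toy version of the selection policy, assuming a hash-derived candidate set and a per-node count of pending I/O operations (the SPREAD value and all names are illustrative):

    import hashlib

    SPREAD = 2   # soft limit on nodes per mailbox (illustrative value)

    def spread_candidates(user: str, nodes: list[str], spread: int = SPREAD) -> list[str]:
        h = int.from_bytes(hashlib.sha1(user.encode()).digest()[:4], "big")
        start = h % len(nodes)
        return [nodes[(start + i) % len(nodes)] for i in range(min(spread, len(nodes)))]

    def pick_store_node(user: str, nodes: list[str], pending_io: dict[str, int]) -> str:
        # Among the user's spread candidates, pick the least-loaded node.
        return min(spread_candidates(user, nodes), key=lambda n: pending_io.get(n, 0))

    nodes = ["node1", "node2", "node3", "node4"]
    print(pick_store_node("userX", nodes, {"node1": 12, "node2": 3, "node3": 7, "node4": 1}))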
18. Handling heterogeneity
Better utilization of fast disks (3x speed)
19. Inktomi
- Derivative of the NOW project at UCB
- Led to commercial services
- TranSend, HotBot
- The case against distributed systems ?
- BASE semantics, instead of ACID
- Basically available
- Tolerate (occasionally) stale data
- Soft state
- Reconstructed during recovery
- Eventually consistent
- Responses can be approximate
- Centralized work queues
- Scalable !
20. Why not ACID ?
- Much of the data in a network service can tolerate guarantees weaker than ACID
- ACID makes no guarantees about availability
- Indeed, it is preferable for an ACID service to be unavailable than to relax the ACID constraints
- ACID is well suited for
- Commerce transactions, billing users, maintenance of user profiles, ...
- For most Internet information services, users value availability more than strong consistency or durability
- Web servers, search/aggregation servers, caching/transformation proxies, ...
21. Cluster architecture (I)
- Front-ends
- Supervision of incoming requests
- Matching requests with profiles (customization DB)
- Enqueue requests for service by one or more workers
- Worker pool
- Caches & service-specific modules
- Customization DB
- Manager
- Balancing load across workers, spawning more workers as load fluctuates or faults occur
- System Area Network (SAN)
- Graphical Monitor
22. Cluster architecture (II)
(architecture figure: workers with worker stubs and caches)
23. Networked, commodity workstations
- Incremental growth
- Automated, centralized administration and monitoring
- Boot image management
- Software load management
- Firewall settings
- Console access
- Application configuration
- Convenient for supporting a system component; consider the cost of decomposition versus the efficiencies of SMP
- Partial failure
- Shared state
- Distributed shared memory abstraction
24. 3-layer decomposition
- Service
- User interface to control the service
- Device-specific presentation
- Allow workers to remain stateless
- TACC API
- Transformation
- Filtering, transcoding, re-rendering, encryption, compression
- Aggregation
- Composition (pipeline chaining) of stateless modules (sketched below)
- Caching
- Original, post-aggregation, post-transformation data
- Customization
- SNS: Scalable Network Service support
- Worker load balancing, overflow management
- Fault-tolerance
- System monitoring & logging
- Incremental scalability
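A toy illustration of pipeline chaining of stateless modules; the two workers shown (whitespace stripping, zlib compression) are stand-ins, not actual TACC modules:

    import zlib
    from typing import Callable, Iterable

    # Each worker is a stateless function bytes -> bytes, so any idle node can
    # run any stage; a request is served by chaining workers in order.
    Worker = Callable[[bytes], bytes]

    def strip_whitespace(data: bytes) -> bytes:
        return b" ".join(data.split())

    def compress(data: bytes) -> bytes:
        return zlib.compress(data)

    def run_pipeline(data: bytes, workers: Iterable[Worker]) -> bytes:
        for worker in workers:
            data = worker(data)
        return data

    result = run_pipeline(b"  hello   TACC   world  ", [strip_whitespace, compress])
    print(len(result), "bytes after transformation")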
25. Load management
- Load balancing hints (sketched below)
- Computed by the Manager, based on load measurements from workers
- Periodically transmitted to front-ends
- Overflow pool
- Absorb load bursts
- Relatively rare, but prolonged
- E.g. Pathfinder landing on Mars -> 220 M hits in 4 days
- Spare machines on which the Manager can spawn workers on demand
- Workers are interchangeable
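A hypothetical sketch of the Manager's role: turn per-worker load reports into hints for the front-ends and decide when to draw on the overflow pool (the threshold and report format are assumptions):

    # Load reports map worker name -> reported queue length.
    OVERFLOW_THRESHOLD = 50   # queue length beyond which extra capacity is spawned

    def compute_hints(reports: dict[str, int]) -> dict[str, float]:
        """Weight per worker, inversely proportional to its reported load."""
        return {w: 1.0 / (1 + qlen) for w, qlen in reports.items()}

    def need_overflow(reports: dict[str, int]) -> bool:
        # Even the least-loaded worker is saturated -> spawn an overflow worker.
        return bool(reports) and min(reports.values()) > OVERFLOW_THRESHOLD

    reports = {"w1": 12, "w2": 80, "w3": 65}
    print(compute_hints(reports))
    print("spawn overflow worker?", need_overflow(reports))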
26. Fault tolerance & availability
- Construct robust entities by relying on cached soft state, refreshed by periodic messages
- Transient component failures are a fact of life
- Process-peer fault tolerance
- When a component fails, one of its peers restarts it (possibly on a different node)
- In the meantime, cached state (possibly stale) is still available to the surviving components
- A restarted component gradually reconstructs its soft state
- Typically by listening to multicasts from others
- Not the same as process pairs
- Process pairs require hard state
- Timeouts to infer failures (sketched below)
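A minimal sketch of beacon-plus-timeout failure detection as assumed here (the timeout value and the restart_peer hook are illustrative):

    import time

    TIMEOUT = 5.0                     # seconds without a beacon before a peer is presumed dead
    last_beacon: dict[str, float] = {}

    def on_beacon(peer: str) -> None:
        last_beacon[peer] = time.monotonic()

    def check_peers(restart_peer) -> None:
        now = time.monotonic()
        for peer, seen in list(last_beacon.items()):
            if now - seen > TIMEOUT:
                restart_peer(peer)        # a surviving peer restarts the failed component
                last_beacon[peer] = now   # give the restart time before re-checking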
27. The HotBot search engine
- SPARCstations, interconnected via Myrinet
- Front-ends
- 50-80 threads per node
- Dynamic HTML (TCL script tags)
- Load balancing
- Static partitioning of the search DB
- Every query goes to all workers in parallel
- Workers are not 100% interchangeable
- Each worker has a local disk
- Version 1: DB fragments are cross-mounted
- so that other nodes can reach the data, with graceful performance degradation
- Version 2: RAID
- 26 nodes: loss of 1 node resulted in the available DB dropping from 54 M documents to 51 M
- Informix DB for user profiles & ad revenue tracking
- Primary/backup failover
28. TranSend
- Caching transformation proxy
- SPARCstations interconnected via 10 Mb/s switched Ethernet + dialup pool
- Thread per TCP connection
- Single front-end, with a total of 400 threads
- Pipelining of distillers
- Lossy-compression workers
- Centralized Manager
- Periodic IP multicast to announce its presence
- No static binding is required for workers
- Workers periodically report a load metric
- distiller queue length, weighted by a factor reflecting the expected execution cost
- Version 1: process-pair recovery
- Version 2: soft state with a watcher process + periodic beacon of state updates
29. Internet service workloads
- Yahoo: 625 M page views/day
- HTML: 7 KB, images: 10 KB
- AOL's proxy: 5.2 B requests/day (average rates worked out below)
- Response size: 5.5 KB
- Services often take 100s of milliseconds
- Responses take several seconds to flow back
- High task throughput & non-negligible latency
- A service may have to sustain 1000s of simultaneous tasks
- C10K problem
- Human users: 4 parallel HTTP/GET requests spawned per page view
- A large number of service tasks are independent of each other
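As a rough back-of-the-envelope check (my arithmetic, not from the slides), those daily totals correspond to sustained average rates of roughly

    \frac{625 \times 10^{6}\ \text{page views/day}}{86{,}400\ \text{s/day}} \approx 7{,}200\ \text{views/s},
    \qquad
    \frac{5.2 \times 10^{9}\ \text{requests/day}}{86{,}400\ \text{s/day}} \approx 60{,}000\ \text{requests/s}

with peaks several times higher (cf. the spike-absorption discussion later in the deck).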
30. DDS (S.D. Gribble)
- Self-managing, cluster-based data repository
- Seen by services as a conventional data structure (usage sketched below)
- Log, tree, hash table
- High performance
- 60 K reads/sec, over 1.28 TB of data
- 128-node cluster
- The CAP principle
- A system can have at most two of the following properties
- Consistency
- Availability
- Tolerance to network Partitions
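A hypothetical illustration of the programming model (not the actual DDS API): the service issues ordinary hash-table calls, while partitioning stays hidden behind the interface. In the real system each partition lives on cluster "bricks" and is replicated; here everything is in-process.

    import hashlib

    class ToyDistributedHashTable:
        def __init__(self, partitions: int = 4):
            # One dict per "brick"; a real DDS places these on cluster nodes
            # and replicates each partition within a replica group.
            self._bricks = [dict() for _ in range(partitions)]

        def _brick_for(self, key: str) -> dict:
            h = int.from_bytes(hashlib.sha1(key.encode()).digest()[:4], "big")
            return self._bricks[h % len(self._bricks)]

        def put(self, key: str, value: bytes) -> None:
            self._brick_for(key)[key] = value

        def get(self, key: str):
            return self._brick_for(key).get(key)

    table = ToyDistributedHashTable()
    table.put("user:42:inbox", b"...")
    print(table.get("user:42:inbox"))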
31. CAP trade-offs
32. SEDA (M. Welsh)
- Staged, event-driven architecture
- Service stages, linked via queues (sketched below)
- Thread pool per stage
- Massive concurrency
- Admission & priority control on each individual queue
- Adaptive load balancing
- Feedback loop
- No a-priori resource limits
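A minimal SEDA-flavoured sketch with illustrative stage names and handlers: each stage owns a bounded event queue and a small thread pool, and feeds the next stage's queue (admission control and the per-stage resource controllers are omitted):

    import queue
    import threading
    import time

    class Stage:
        def __init__(self, name, handler, next_stage=None, threads=2):
            self.name, self.handler, self.next_stage = name, handler, next_stage
            self.events = queue.Queue(maxsize=1000)   # bounded queue = backpressure point
            for _ in range(threads):
                threading.Thread(target=self._loop, daemon=True).start()

        def enqueue(self, event):
            self.events.put(event)                    # admission control would go here

        def _loop(self):
            while True:
                event = self.events.get()
                result = self.handler(event)
                if self.next_stage is not None:
                    self.next_stage.enqueue(result)

    reply = Stage("reply", lambda e: print("done:", e))
    parse = Stage("parse", lambda e: e.upper(), next_stage=reply)
    parse.enqueue("request-1")
    time.sleep(0.5)   # let the daemon worker threads drain the queues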
33. Overhead of concurrency (I)
34. Overhead of concurrency (II)
35. SEDA architecture (I)
36. SEDA architecture (II)
37. SEDA architecture (III)
38. Clusters for Internet Services
- Previous observation (TACC, Inktomi, NOW)
- Clusters of workstations are a natural platform for constructing Internet services
- Internet service properties
- support large, rapidly growing user populations
- must remain highly available, and cost-effective
- Clusters offer a tantalizing solution
- incremental scalability: cluster grows with the service
- natural parallelism: high-performance platform
- software and hardware redundancy: fault-tolerance
39. Software troubles
- Internet service construction on clusters is hard
- load balancing, process management, communication abstractions, I/O balancing, fail-over and restart, ...
- toolkits proposed to help (TACC, AS1, River, ...)
- Even harder if shared, persistent state is involved
- data partitioning, replication, and consistency; interacting with the storage subsystem, ...
- solutions not geared to clustered services
- use a (distributed) RDBMS: expensive; powerful semantic guarantees; generality at the cost of performance
- use a network/distributed FS: overly general, high overhead (e.g. double-buffering penalties); fault-tolerance?
- roll your own custom solution: not reusable, complex
40. Idea / Hypothesis
- It is possible to
- isolate clustered services from the vagaries of state mgmt.,
- do so with adequately general abstractions,
- build those abstractions in a layered fashion (reuse),
- and exploit clusters for performance and simplicity.
- Scalable Distributed Data Structures (SDDS)
- take a conventional data structure
- hash table, tree, log, ...
- partition it across nodes in a cluster
- parallel access, scalability, ...
- replicate partitions within replica groups in the cluster
- availability in the face of failures, further parallelism
- store replicas on disk
41. Why SDDS?
- Fundamental software engineering principle
- Separation of concerns
- decouple persistency/consistency logic from the rest of the service
- simpler (and cleaner!) service implementations
- Service authors understand data structures
- familiar behavior and interfaces from the single-node case
- should enable rapid development of new services
- Structure access patterns are self-evident
- access granularity is manifestly a structure element
- coincidence of logical and physical data units
- cf. file systems, SQL in RDBMS, VM pages in DSM
42. SDDS Challenges
- Overcoming the complexities of distributed systems
- data consistency, data distribution, request load balancing, hiding network latency and OS overhead, ...
- ace up the sleeve: a cluster is not the wide area
- single, controlled administrative domain
- engineer to (probabilistically) avoid network partitions
- use a low-latency, high-throughput SAN (5 µs, 40-120 MB/s)
- predictable behavior, controlled heterogeneity
- I/O is still a problem
- Plenty of work on fast network I/O
- some on fast disk I/O
- Less work on bridging network and disk I/O in a cluster environment
Segment-based cluster I/O layer: filtered streams between disks, network, and memory
43. Segment layer (motivation)
- It's all about disk bandwidth & avoiding seeks
- 8 ms random seek, 25-80 MB/s throughput
- must read about 320 KB per seek to break even (worked out below)
- Build a disk abstraction layer based on segments
- 1-2 MB regions on disk, read and written in their entirety
- force upper layers to design with this in mind
- small reads/writes treated as an uncommon failure case
- SAN throughput is comparable to disk throughput
- Stream from disk to network & saturate both channels
- stream through service-specific filter functions
- selection, transformation, ...
- Apply lessons from high-performance networks
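A quick sanity check of the 320 KB figure (my arithmetic, assuming the transfer should take roughly as long as the seek, at a nominal 40 MB/s):

    \text{break-even transfer} \approx t_{\text{seek}} \times BW_{\text{disk}} = 8\,\text{ms} \times 40\,\text{MB/s} = 320\,\text{KB}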
44. Taxonomy of Clustered Services
(table: service classes vs. parallelism, state mgmt. requirements, and examples)
- Stateless
- State mgmt. requirements: little or none
- Examples: TACC distillers, TACC aggregators, River modules
- Soft-state
- State mgmt. requirements: high availability; perhaps consistency; persistence is an optimization
- Examples: Inktomi search engine, Squid web cache, Video Gateway, AS1 servents or RMX
- Persistent state
- State mgmt. requirements: high availability and completeness; perhaps consistency; persistence necessary
- Examples: Scalable PIM apps, HINDE mint
45. Clustering
- Goal
- Take a cluster of commodity workstations and make them look like a supercomputer
- Problems
- Application structure
- Partial failure management
- Interconnect technology
- System administration
46. Cluster Prehistory: Tandem NonStop
- Early (1974) foray into transparent fault tolerance through redundancy
- Mirror everything (CPU, storage, power supplies)
- can tolerate any single fault (later: processor duplexing)
- Hot standby "process pair" approach
- What's the difference between high availability and fault tolerance?
- Noteworthy
- Shared nothing: why?
- Performance and efficiency costs?
- Later evolved into Tandem Himalaya
- used clustering for both higher performance and higher availability
47. Pre-NOW Clustering in the 90s
- IBM Parallel Sysplex and DEC OpenVMS Cluster
- Targeted at conservative (read: mainframe) customers
- Shared disks allowed under both (why?)
- All devices have cluster-wide names (shared everything?)
- 1,500 installations of Sysplex, 25,000 of OpenVMS Cluster
- Programming the clusters
- All System/390 and VAX VMS subsystems were rewritten to be cluster-aware
- OpenVMS cluster support exists even in the single-node OS!
- An advantage of locking into proprietary interfaces
- What about fault tolerance?
48. The Case For NOW: MPPs a Near Miss
- Uniprocessor performance improves by 50% / yr (4% / month)
- 1-year lag: WS = 1.50x MPP node perf.
- 2-year lag: WS = 2.25x MPP node perf. (arithmetic below)
- No economy of scale in 100s =>
- Software incompatibility (OS & apps) =>
- More efficient utilization of compute resources
- statistical multiplexing
- Scale makes availability affordable (Pfister)
- Which of these do commodity clusters actually solve?
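The lag multipliers follow directly from the 50%-per-year growth assumption (my arithmetic):

    1.5^{1} = 1.50 \ \text{(1-year lag)}, \qquad 1.5^{2} = 2.25 \ \text{(2-year lag)}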
49. Philosophy: Systems of Systems
- "Higher order" systems research
- aggressively use off-the-shelf hardware & OS software
- Advantages
- easier to track technological advances
- less development time
- easier to transfer technology (reduced lag)
- New challenges (the case against NOW)
- maintaining performance goals
- the system is changing underneath you
- the underlying system has other people's bugs
- the underlying system is poorly documented
50. Clusters: Enhanced Standard Litany
- Challenges
- Software engineering
- Partial failure management
- Incremental scalability
- System administration
- Heterogeneity
- Benefits
- Hardware redundancy
- Aggregate capacity
- Incremental scalability
- Absolute scalability
- Price/performance sweet spot
51. Clustering Internet Services
- Aggregate capacity
- TB of disk storage, THz of compute power
- If only we could harness it in parallel!
- Redundancy
- Partial failure behavior: only a small fractional degradation from the loss of one node
- Availability: the industry average across large sites during the 1998 holiday season was 97.2% availability (source: CyberAtlas)
- Compare: mission-critical systems have four nines (99.99%)
52. Spike Absorption
- Internet traffic is self-similar
- Bursty at all granularities less than about 24 hours
- What's bad about burstiness?
- Spike absorption
- Diurnal variation
- Peak vs. average demand: typically a factor of 3 or more
- Starr Report: CNN peaked at 20 M hits/hour (compared to a usual peak of 12 M hits/hour, i.e. about 66% higher)
- Really the holy grail: capacity on demand
- Is this realistic?
53. Diurnal Cycle (UCB dialups, Jan. 1997)
- 750 modems at UC Berkeley
- Instrumented early 1997
54. Clustering Internet Workloads
- Internet vs. traditional workloads
- e.g. database workloads (TPC benchmarks)
- e.g. traditional scientific codes (matrix multiply, simulated annealing and related simulations, etc.)
- Some characteristic differences
- Read-mostly
- Quality of service (best-effort vs. guarantees)
- Task granularity
- Embarrassingly parallel
- but are they balanced?
55. Meeting the Cluster Challenges
- Software programming models
- Partial failure and application semantics
- System administration
56. Software Challenges (I)
- Message passing: Active Messages
- Shared memory: Network RAM
- CC-NUMA, Software DSM
- MP vs. SM: a long-standing religious debate
- Arbitrary object migration (network transparency)
- What are the problems with this?
- Hints: RPC, checkpointing, residual state
57. Software Challenges (II)
- The real issue: we have to think differently about programming
- to harness clusters?
- to get decent failure semantics?
- to really exploit software modularity?
- Traditional uniprocessor programming idioms/models don't seem to scale up to clusters
- Question: is there a natural-to-use cluster model that scales down to uniprocessors?
- If so, is it general or application-specific?
- What would be the obstacles to adopting such a model?
58. Partial Failure Management
- What does partial failure mean for
- a transactional database?
- a read-only database striped across cluster nodes?
- a compute-intensive shared service?
- What are appropriate partial failure abstractions?
- Incomplete/imprecise results?
- Longer latency?
- What current programming idioms make partial failure hard?
59. Cluster System Administration (I)
- Total cost of ownership (TCO) is way too high for clusters
- Median sysadmin cost per machine per year (1996): $700
- Cost of a headless workstation today: $1,500
- Previous solutions
- Pay someone to watch
- Ignore, or wait for someone to complain
- Shell Scripts From Hell
- not general -> vast repeated work
- Need an extensible and scalable way to automate the gathering, analysis, and presentation of data
60. Cluster System Administration (II)
- Extensible & Scalable Monitoring For Clusters of Computers (Anderson & Patterson, UC Berkeley)
- Relational tables allow the properties & queries of interest to evolve as the cluster evolves
- Extensive visualization support allows humans to make sense of masses of data
- Multiple levels of caching decouple data collection from aggregation
- Data updates can be pulled on demand or triggered by push
61. References
- Y. Saito, B.N. Bershad, and H.M. Levy, "Manageability, availability and performance in Porcupine: a highly scalable, cluster-based mail service", Proc. 17th ACM SOSP, 1999.
- S.D. Gribble, E.A. Brewer, J.M. Hellerstein, and D. Culler, "Scalable, distributed data structures for Internet service construction", Proc. 4th OSDI, 2000.
- A. Fox, S.D. Gribble, Y. Chawathe, E.A. Brewer, and P. Gauthier, "Cluster-based scalable network services", Proc. 16th ACM SOSP, 1997.