Title: CS556: Distributed Systems
1. CS-556 Distributed Systems
Case Studies of Scalable Systems: Porcupine, Inktomi (and a taste of SABRE)
- Manolis Marazakis
- maraz_at_csd.uoc.gr
2. Porcupine
- A highly available cluster-based mail service
- Built out of commodity H/W
- Mail is important, hard, and easy ?
- Real demand
- AOL, HotMail: > 100 M messages/day
- Write-intensive, I/O bound with low locality
- Simple API, inherent parallelism, weak consistency
- Simpler than a DDBMS or a DFS
- Scalability goals
- Performance: scale linearly with cluster size
- Manageability: automatic reconfiguration
- Availability: gracefully survive multiple failures
3. Conventional mail services
- Static partitioning of mailboxes, on top of a FS or DBMS
- Performance problems: no dynamic load balancing
- Manageability problems: static, manual partitioning decisions
- Availability problems: if a server goes down, part of the user population cannot access their mailboxes
4. Functional Homogeneity
- Any node can perform any task
- Interaction with mail clients
- Mail storage
- Any piece of data can be managed at any node
- Techniques
- Replication & reconfiguration
- Gracefully survive failures
- Load balancing
- Masking of skews in workload & cluster configuration
- Dynamic task & data placement decisions
- Messages for a single user can be scattered across multiple nodes and collected only upon request
5. Architecture
- Protocol handling
- User lookup
- Load balancing
- Message store access
6. Operation
- Incoming request: send msg to userX
- DNS round-robin selects a node (A)
- Who manages userX? (B)
- A issues a request to B for user verification
- B knows where userX's messages are kept (C, D)
- A picks the best node for the new msg (D)
- D stores the new msg
Each user is managed by a node, and all nodes must agree on who is managed where.
The user population is partitioned across nodes using a hash function.
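Below is a minimal sketch of this hash-based user map, assuming a fixed number of buckets and a SHA-1 hash of the user name; the class and method names are illustrative, not Porcupine's actual data structures.

```python
# Minimal sketch of hash-partitioned user management (illustrative only).
import hashlib

class UserMap:
    """Maps each user to the node that manages their profile and mail map.

    Porcupine keeps this as replicated soft state; here we simply hash user
    names into buckets and assign each bucket to a live node.
    """

    def __init__(self, nodes, num_buckets=256):
        self.nodes = list(nodes)              # live nodes, e.g. ["node0", "node1", ...]
        self.num_buckets = num_buckets
        # bucket -> managing node; recomputed by the membership protocol on changes
        self.bucket_owner = {b: self.nodes[b % len(self.nodes)]
                             for b in range(num_buckets)}

    def bucket_of(self, user):
        digest = hashlib.sha1(user.encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.num_buckets

    def manager_of(self, user):
        return self.bucket_owner[self.bucket_of(user)]

# Any node receiving "send msg to userX" consults the same map to find B.
umap = UserMap(["node%d" % i for i in range(4)])
print(umap.manager_of("userX"))
```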
7. Strategy for scalable performance
- Avoid creating hot spots
- Partition data uniformly among nodes
- Fine-grain data partitioning
- Experimental results
- 30-node cluster (PCs with Linux)
- Synthetic load
- derived from University server logs
- Comparison with sendmail+popd
- Sustains 800 msgs/sec (68 M msgs/day)
- As compared to 250 msgs/sec (25 M msgs/day)
8. Strategy for reconfiguration (I)
- Hard state: messages, user profiles
- Fine-grain optimistic replication
- Soft state: user map, mail map
- Reconstructed after reconfiguration
- Membership protocol
- Initiated when a crash is detected
- Update of the user map data structures
- Broadcast of the updated user map
- Distributed disk scan
- Each node scans its local FS for msgs owned by moved users
The total amount of mail map information that needs to be recovered from disk equals what was stored on the crashed node -> independent of cluster size.
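A minimal sketch of the distributed disk scan, assuming a per-user directory layout in the local message store and user-map objects like the one sketched earlier; `report` is a hypothetical stand-in for the RPC that registers a mail-map entry at the new manager.

```python
# Minimal sketch of the post-reconfiguration disk scan (illustrative stand-ins,
# not Porcupine's actual storage layout or RPC interface).
import os

def rebuild_mail_map(local_node, store_dir, old_map, new_map, report):
    """Report locally stored message fragments whose managing node changed,
    so each new manager can rebuild its soft-state mail map."""
    for user in os.listdir(store_dir):          # assume one directory per user fragment
        if old_map.manager_of(user) != new_map.manager_of(user):
            # "node `local_node` stores messages for `user`"
            report(new_map.manager_of(user), user, local_node)
```

Each node only scans what it stores locally, which is why the amount of mail-map information to recover matches what was on the crashed node rather than growing with cluster size.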
9. Strategy for reconfiguration (II)
10. Hard-state replication (I)
- Internet semantics
- Optimistic, eventually consistent replication
- Per-message, per-user profile replication
- Small window of inconsistency
- A user may see stale data
- Efficient during normal operation
- For each request, a coordinator node pushes updates to the other replica nodes (see the sketch after this list)
- If a replica node crashes, the coordinator simply waits for its recovery to complete the update
- But does not block
- Coordinator crash
- A replica node will detect it and take over
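A minimal sketch of the non-blocking optimistic push, assuming idempotent updates; `send_update` and `ack_received` are hypothetical stand-ins for the coordinator's update log and RPCs.

```python
# Minimal sketch of optimistic, eventually consistent replication (illustrative).
import threading
import time

def replicate(update, replicas, send_update, ack_received, retry_interval=5.0):
    """The coordinator pushes `update` to every replica and keeps retrying the
    ones that have not acknowledged, without blocking the client request."""
    pending = set(replicas)

    def pusher():
        while pending:
            for node in list(pending):
                send_update(node, update)       # idempotent: replicas apply it at most once
                if ack_received(node, update):
                    pending.discard(node)       # durable on this replica
            time.sleep(retry_interval)          # crashed nodes are retried after they recover

    # Push in the background; the coordinator can answer the client immediately.
    threading.Thread(target=pusher, daemon=True).start()
```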
11. Hard-state replication (II)
Less than linear degradation -> disk logging overhead, which can be reduced by using a separate log disk (or NVRAM)
12. Strategy for load balancing
- Deciding where to store msgs
- Spread: soft limit on the number of nodes per mailbox
- This limit is violated when nodes crash
- Select a node from the spread candidates
- Small spread -> better data affinity
- Smaller mail map data structure
- More streamlined disk head movement
- Large spread -> better load balancing
- More choices for selection
- Load measure: pending I/O operations (see the sketch below)
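A minimal sketch of spread-limited placement with pending I/O as the load measure; the mail map and load table are simplified stand-ins for Porcupine's soft state.

```python
# Minimal sketch of spread-limited message placement (illustrative only).

def pick_storage_node(user, mail_map, pending_io, live_nodes, spread=4):
    """Choose where to store a new message for `user`.

    mail_map[user] : set of nodes already holding messages for this user
    pending_io[n]  : pending disk operations at node n (the load measure)
    """
    current = mail_map.get(user, set()) & set(live_nodes)
    if len(current) < spread:
        candidates = live_nodes        # allowed to widen the user's spread
    else:
        candidates = current           # stay within the existing spread
    best = min(candidates, key=lambda n: pending_io[n])
    mail_map.setdefault(user, set()).add(best)
    return best

# Example
load = {"n0": 12, "n1": 3, "n2": 7, "n3": 9}
print(pick_storage_node("userX", {}, load, ["n0", "n1", "n2", "n3"], spread=2))  # n1
```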
13. Handling heterogeneity
Better utilization of fast disks (3x the speed)
14. Inktomi
- Derivative of the NOW project at UCB
- Led to commercial services
- TranSend, HotBot
- The case against distributed systems?
- BASE semantics, instead of ACID
- Basically available
- Tolerate (occasionally) stale data
- Soft state
- Reconstructed during recovery
- Eventually consistent
- Responses can be approximate
- Centralized work queues
- Scalable!
15. Why not ACID?
- Much of the data in a network service can tolerate guarantees weaker than ACID
- ACID makes no guarantees about availability
- Indeed, it is preferable for an ACID service to be unavailable than to relax the ACID constraints
- ACID is well suited for
- Commerce transactions, billing users, maintenance of user profiles, ...
- For most Internet information services, users value availability more than strong consistency or durability
- Web servers, search/aggregation servers, caching/transformation proxies, ...
16. Cluster architecture (I)
- Front-ends
- Supervision of incoming requests
- Matching requests with profiles (customization DB)
- Enqueue requests for service by one or more workers
- Worker pool
- Caches, service-specific modules
- Customization DB
- Manager
- Balancing load across workers, spawning more workers as load fluctuates or faults occur
- System Area Network (SAN)
- Graphical Monitor
17. Cluster architecture (II)
[Figure: cluster architecture diagram; labels include the caches and the worker stub]
18. Networked, commodity workstations
- Incremental growth
- Automated, centralized administration and monitoring
- Boot image management
- Software load management
- Firewall settings
- Console access
- Application configuration
- Convenient for supporting a system component (consider the cost of decomposition versus the efficiencies of SMP)
- Partial failure
- Shared state
- Distributed shared memory abstraction
19. 3-layer decomposition
- Service
- User interface to control the service
- Device-specific presentation
- Allows workers to remain stateless
- TACC API
- Transformation
- Filtering, transcoding, re-rendering, encryption, compression
- Aggregation
- Composition (pipeline chaining) of stateless modules (see the sketch after this list)
- Caching
- Original, post-aggregation, post-transformation data
- Customization
- SNS (Scalable Network Service) support
- Worker load balancing, overflow management
- Fault tolerance
- System monitoring & logging
- Incremental scalability
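A minimal sketch of TACC-style composition, chaining stateless workers into a pipeline; the transformation and aggregation functions below are hypothetical examples, not the actual TACC API.

```python
# Minimal sketch of TACC-style composition of stateless modules (illustrative).

def transform_lowercase(data):
    """A 'transformation' worker: rewrites a single input."""
    return data.lower()

def aggregate_concat(results):
    """An 'aggregation' worker: combines results from several sources."""
    return " | ".join(results)

def pipeline(stages, data):
    """Chain stateless stages; any stage can be restarted or relocated
    because it carries no state between requests."""
    for stage in stages:
        data = stage(data)
    return data

# Example: aggregate three upstream results, then transform the combined output.
print(pipeline([aggregate_concat, transform_lowercase], ["Result-A", "Result-B", "Result-C"]))
```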
20. Load management
- Load balancing hints
- Computed by the Manager, based on load measurements from workers
- Periodically transmitted to front-ends
- Overflow pool
- Absorbs load bursts
- Relatively rare, but prolonged
- E.g. Pathfinder landing on Mars -> 220 M hits in 4 days
- Spare machines on which the Manager can spawn workers on demand
- Workers are interchangeable (see the sketch below)
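A minimal sketch of hint-driven dispatch with an overflow pool; the hint table, threshold, and `spawn_worker` call are hypothetical stand-ins for the Manager/front-end interaction.

```python
# Minimal sketch of load-balancing hints plus an overflow pool (illustrative).

OVERLOAD_THRESHOLD = 50      # queued tasks per worker before spilling over

def dispatch(task, hints, overflow_pool, spawn_worker):
    """Pick the least-loaded worker from the Manager's periodic hints; if all
    workers look saturated, spawn a new worker on an overflow machine.
    `task` would then be enqueued at the chosen worker."""
    worker, load = min(hints.items(), key=lambda kv: kv[1])   # hints: {worker: queue length}
    if load > OVERLOAD_THRESHOLD and overflow_pool:
        worker = spawn_worker(overflow_pool.pop())            # absorb the burst
        hints[worker] = 0
    hints[worker] += 1        # optimistic local update until the next hint arrives
    return worker

# Example
hints = {"w1": 60, "w2": 55}
print(dispatch("req-42", hints, ["spare-1"], spawn_worker=lambda host: "w@" + host))
```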
21. Fault tolerance & availability
- Construct robust entities by relying on cached soft state, refreshed by periodic messages
- Transient component failures are a fact of life
- Process-peer fault tolerance (see the sketch below)
- When a component fails, one of its peers restarts it (possibly on a different node)
- In the meantime, cached state (possibly stale) is still available to the surviving components
- A restarted component gradually reconstructs its soft state
- Typically by listening to multicasts from the others
- Not the same as process pairs
- Process pairs require hard state
- Timeouts are used to infer failures
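A minimal sketch of process-peer fault tolerance with timeouts; the beacon timeout and the `restart` callback are hypothetical stand-ins for the SNS mechanisms.

```python
# Minimal sketch of timeout-based peer monitoring (illustrative only).
import time

FAILURE_TIMEOUT = 5.0    # no beacon for this long => assume the peer has failed

def watch_peers(last_beacon, restart, now=time.time):
    """Run periodically by every component: any peer whose 'I am alive' beacon
    is too old is declared failed and restarted (possibly on another node)."""
    for peer, seen_at in list(last_beacon.items()):
        if now() - seen_at > FAILURE_TIMEOUT:
            restart(peer)                 # the peer will rebuild its soft state from multicasts
            last_beacon[peer] = now()     # fresh grace period for the restarted peer

# Example
beacons = {"front-end-1": time.time() - 10, "worker-3": time.time()}
watch_peers(beacons, restart=lambda p: print("restarting", p))
```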
22. The HotBot search engine
- SPARCstations, interconnected via Myrinet
- Front-ends
- 50-80 threads per node
- Dynamic HTML (TCL script tags)
- Load balancing
- Static partitioning of the search DB
- Every query goes to all workers in parallel
- Workers are not 100% interchangeable
- Each worker has a local disk
- Version 1: DB fragments are cross-mounted
- So that other nodes can reach the data, with graceful performance degradation
- Version 2: RAID
- With 26 nodes, the loss of 1 node resulted in the available DB dropping from 54 M documents to 51 M
- Informix DB for user profiles & ad revenue tracking
- Primary/backup failover
23. TranSend
- Caching transformation proxy
- SPARCstations interconnected via 10 Mb/s switched Ethernet, serving a dialup pool
- Thread per TCP connection
- Single front-end, with a total of 400 threads
- Pipelining of distillers
- Lossy-compression workers
- Centralized Manager
- Periodic IP multicast to announce its presence
- No static binding is required for workers
- Workers periodically report a load metric (see the sketch below)
- A distiller's queue length, weighted by a factor reflecting the expected execution cost
- Version 1: process-pair recovery
- Version 2: soft state with a watcher process & periodic beacon of state updates
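A minimal sketch of that weighted load metric; the cost factors below are made-up numbers, not TranSend's measured values.

```python
# Minimal sketch of a queue-length load metric weighted by expected cost (illustrative).

EXPECTED_COST = {"image-distiller": 3.0, "html-distiller": 1.0}   # hypothetical factors

def load_metric(distiller_type, queue_length):
    """Load reported by a worker: pending tasks weighted by how expensive
    one task of this distiller type is expected to be."""
    return queue_length * EXPECTED_COST[distiller_type]

# Example: 4 queued image jobs look 'heavier' than 10 queued HTML jobs.
print(load_metric("image-distiller", 4))   # 12.0
print(load_metric("html-distiller", 10))   # 10.0
```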
24. Internet service workloads
- Yahoo: 625 M page views/day
- HTML: 7 KB, images: 10 KB
- AOL's proxy: 5.2 B requests/day
- Response size: 5.5 KB
- Services often take 100s of milliseconds
- Responses take several seconds to flow back
- High task throughput & non-negligible latency
- A service may have to sustain 1000s of simultaneous tasks
- C10K problem
- Human users: 4 parallel HTTP GET requests spawned per page view
- A large fraction of service tasks are independent of each other
25. DDS (S.D. Gribble)
- Self-managing, cluster-based data repository
- Seen by services as a conventional data structure
- Log, tree, hash table
- High performance
- 60 K reads/sec, over 1.28 TB of data
- 128-node cluster
- The CAP principle
- A system can have at most two of the following properties
- Availability
- Tolerance to network Partitions
26. CAP trade-offs
27. SEDA (M. Welsh)
- Staged, event-driven architecture
- Service stages, linked via queues
- Thread pool per stage
- Massive concurrency
- Admission & priority control on each individual queue
- Adaptive load balancing
- Feedback loop
- No a-priori resource limits (see the sketch below)
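A minimal sketch of a SEDA-style stage, assuming a bounded queue for admission control and a fixed thread pool; this illustrates the idea, not Welsh's actual SEDA API.

```python
# Minimal sketch of a SEDA-style stage: a bounded event queue with admission
# control, drained by a small thread pool (illustrative only).
import queue
import threading
import time

class Stage:
    def __init__(self, name, handler, threads=4, max_queue=1000):
        self.name = name
        self.handler = handler                          # event-processing function
        self.events = queue.Queue(maxsize=max_queue)    # bounded queue = admission control
        for _ in range(threads):
            threading.Thread(target=self._worker, daemon=True).start()

    def enqueue(self, event):
        """Admission control: reject (rather than buffer) when overloaded."""
        try:
            self.events.put_nowait(event)
            return True
        except queue.Full:
            return False                                # caller can degrade or retry later

    def _worker(self):
        while True:
            event = self.events.get()
            self.handler(event)                         # output typically feeds the next stage

# Example: two stages chained via their queues.
render = Stage("render", handler=lambda e: print("rendered", e))
parse = Stage("parse", handler=lambda e: render.enqueue(e.upper()))
parse.enqueue("request-1")
time.sleep(0.2)                                         # let the daemon threads drain the queues
```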
28. Overhead of concurrency (I)
29. Overhead of concurrency (II)
30. SEDA architecture (I)
31. SEDA architecture (II)
32. SEDA architecture (III)
33. SABRE
- Reservations, inventory tracking
- Airlines, travel agencies,
- Descendant of CRS (Computerized Reservation System)
- Hosted in a number of secure data centers
- Connectivity with major reservation systems
- Amadeus, Apollo, Galileo, WorldSpan,
- Management of PNRs
- Passenger Name Records
- IATA/ARC standards for ticketing
- 2001
- 6 K employees, in 45 countries
- 59 K travel agencies
- 450 airlines, 53 K hotels
- 54 car rental companies, 8 cruise lines, 33 railroads
- 228 tour operators
34. History (I)
- 1964
- Location: New York
- Hosted on 2 IBM 7090 mainframes
- 84 K requests/day
- Development cost: 400 man-years, 40 M (USD)
- 1972
- Location: Tulsa, Oklahoma
- Hosted on IBM 360s
- The switch caused 15 minutes of service interruption
- 1976
- 130 travel agencies have terminals
- 1978
- Storage of 1 M fares
- 1984
- Bargain Finder service
35. History (II)
- 1985
- easySabre: PCs can connect to the system as terminals
- 1986
- Automated yield management system (dynamic pricing)
- 1988
- Storage of 36 M fares
- Can be combined into > 1 B fare options
- 1995
- Initiation of Y2K code inspection
- 200 M lines of code
- Interfaces with > 600 suppliers
- New software for > 40 K travel agents
- 1,200 H/W & S/W systems
- 1996
- Travelocity.com
- 1998
- Joint venture with ABACUS International
- 7,300 travel agencies, in 16 countries (Asia)
- 2000
36. Legacy connectivity (I)
- Connection-oriented comm. protocol (sessions)
- ALC: Airline Link Control protocol
- Packet-switching
- But not TCP/IP; usually X.25
- Requires special H/W (network card)
- Gradual upgrades to Frame Relay connectivity
- Structured message interfaces
- Emulation of a 3270 terminal
- Pre-defined form fields
- Integration with other systems?
- Screen-scraping code
- Message Processors: gateways that offer connectivity to clients that do not use supported terminals
- Encapsulation of ALC over TCP/IP
37. Legacy connectivity (II)
There is a large market for gateway/connectivity products
- E.g. www.datalex.com
38. References
- Y. Saito, B.N. Bershad and H.M. Levy, "Manageability, availability and performance in Porcupine: a highly scalable, cluster-based mail service", Proc. 17th ACM SOSP, 1999.
- S.D. Gribble, E.A. Brewer, J.M. Hellerstein, and D. Culler, "Scalable, distributed data structures for Internet service construction", Proc. 4th OSDI, 2000.
- A. Fox, S.D. Gribble, Y. Chawathe, E.A. Brewer, and P. Gauthier, "Cluster-based scalable network services", Proc. 16th ACM SOSP, 1997.
- http://www.sabre.com