Title: CS556: Distributed Systems
1. CS-556 Distributed Systems
Case Studies of Scalable Systems
- Manolis Marazakis
- maraz_at_csd.uoc.gr
2. SABRE
- Reservations, inventory tracking
- Airlines, travel agencies, ...
- Descendant of CRS (Computerized Reservation System)
- Hosted in a number of secure data centers
- Connectivity with major reservation systems
- Amadeus, Apollo, Galileo, WorldSpan, ...
- Management of PNRs
- Passenger Name Records
- IATA/ARC standards for ticketing
- 2001
- 6 K employees, in 45 countries
- 59 K travel agencies
- 450 airlines, 53 K hotels
- 54 car rental companies, 8 cruise lines, 33 railroads
- 228 tour operators
3. History (I)
- 1964
- Location: New York
- Hosted on 2 IBM 7090 mainframes
- 84 K requests/day
- Development cost: 400 man-years, 40 M (USD)
- 1972
- Location: Tulsa, Oklahoma
- Hosted on IBM 360s
- The switch caused 15 minutes of service interruption
- 1976
- 130 travel agencies have terminals
- 1978
- Storage of 1 M fares
- 1984
- Bargain Finder service
4. History (II)
- 1985
- easySabre: PCs can connect to the system as terminals
- 1986
- Automated yield management system (dynamic pricing)
- 1988
- Storage of 36 M fares
- Can be combined into > 1 B fare options
- 1995
- Initiation of Y2K code inspection
- 200 M lines of code
- Interfaces with > 600 suppliers
- New software for > 40 K travel agents
- 1,200 H/W & S/W systems
- 1996
- Travelocity.com
- 1998
- Joint venture with ABACUS International
- 7,300 travel agencies, in 16 countries (Asia)
- 2000
5. Legacy connectivity (I)
- Connection-oriented comm. protocol (sessions)
- ALC: Airline Link Control protocol
- Packet-switching
- but not TCP/IP; usually X.25
- Requires special H/W (network card)
- Gradual upgrades to Frame-Relay connectivity
- Structured message interfaces
- Emulation of 3270 terminals
- Pre-defined form fields
- Integration with other systems ?
- screen-scraping code
- Message Processors: gateways that offer connectivity to clients that do not use supported terminals
- Encapsulation of ALC over TCP/IP
6. Legacy connectivity (II)
There is a large market for gateway/connectivity products
- E.g. www.datalex.com
7. Porcupine
- A highly available cluster-based mail service
- Built out of commodity H/W
- Mail is important, hard, and easy
- Real demand
- AOL, HotMail: > 100 M messages/day
- Write-intensive, I/O bound, with low locality
- Simple API, inherent parallelism, weak consistency
- Simpler than a DDBMS or DFS
- Scalability goals
- Performance: linear scaling with cluster size
- Manageability: automatic reconfiguration
- Availability: gracefully survive multiple failures
8. Conventional mail services
- Static partitioning of mailboxes, on top of a FS or DBMS
- Performance problems: no dynamic load balancing
- Manageability problems: static, manual partitioning decisions
- Availability problems: if a server goes down, part of the user population cannot access their mailboxes
9. Functional Homogeneity
- Any node can perform any task
- Interaction with mail clients
- Mail storage
- Any piece of data can be managed at any node
- Techniques
- Replication & reconfiguration
- Gracefully survive failures
- Load balancing
- Masking of skews in workload & cluster configuration
- Dynamic task & data placement decisions
- Messages for a single user can be scattered on multiple nodes & collected only upon request
10. Architecture
Protocol handling
User lookup
Load balancing
Message store access
11. Operation
- Incoming request: "send msg to userX"
- DNS/RR selection of node (A)
- Who manages userX ? (B)
- A issues a request to B for user verification
- B knows where userX's messages are kept (C, D)
- A picks the best node for the new msg (D)
- D stores the new msg
Each user is managed by a node, and all nodes must agree on who is managed where.
Partitioning of the user population using a hash function (sketched below).
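The hash-based user map can be pictured with a small Python sketch (an illustration only, not Porcupine's actual code; the bucket count, hash choice, and node names are assumptions):

    import hashlib

    # Illustrative user map: bucket b is managed by user_map[b].
    # Porcupine keeps this as small, replicated soft state; the sketch only
    # shows the hashing idea, not the membership protocol.
    NUM_BUCKETS = 256

    def bucket_of(user: str) -> int:
        digest = hashlib.sha1(user.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

    def manager_of(user: str, user_map: list[str]) -> str:
        """user_map[b] names the node managing bucket b."""
        return user_map[bucket_of(user)]

    def rebuild_user_map(live_nodes: list[str]) -> list[str]:
        # Reassign buckets round-robin over the currently live nodes.
        return [live_nodes[b % len(live_nodes)] for b in range(NUM_BUCKETS)]

    live = ["node1", "node2", "node3"]
    user_map = rebuild_user_map(live)
    print(manager_of("userX", user_map))   # prints one of the live node names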
12. Strategy for scalable performance
- Avoid creating hot spots
- Partition data uniformly among nodes
- Fine-grain data partitioning
- Experimental results
- 30-node cluster (PCs with Linux)
- Synthetic load
- derived from University server logs
- Comparison with sendmail+popd
- Sustains 800 msgs/sec (68 M msgs/day)
- As compared to 250 msgs/sec (25 M msgs/day)
13. Strategy for reconfiguration (I)
- Hard state: messages, user profiles
- Fine-grain optimistic replication
- Soft state: user map, mail map
- Reconstructed after reconfiguration
- Membership protocol
- initiated when a crash is detected
- Update of user map data structures
- Broadcast of updated user map
- Distributed disk scan
- Each node scans its local FS for msgs owned by moved users
The total amount of mail map info that needs to be recovered from disks is equal to that stored on the crashed node -> independent of cluster size (sketched below).
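A hypothetical sketch of the rebuild step, with all function and parameter names invented for illustration: after a membership change, each node re-announces mail-map entries only for users whose manager moved, which is why the recovered volume tracks the crashed node rather than the cluster size.

    def on_membership_change(old_map, new_map, bucket_of, local_messages, send_entry):
        """old_map/new_map map bucket -> node name; local_messages yields
        (user, msg_location) pairs from this node's local store."""
        moved = {b for b in range(len(new_map)) if old_map[b] != new_map[b]}
        # Distributed disk scan: walk the local store, but only entries for
        # "moved" users are re-announced to their new manager.
        for user, msg_location in local_messages:
            b = bucket_of(user)
            if b in moved:
                send_entry(new_map[b], user, msg_location)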
14. Strategy for reconfiguration (II)
15. Hard-state replication (I)
- Internet semantics
- Optimistic, eventually consistent replication
- Per-message, per-user-profile replication
- Small window of inconsistency
- A user may see stale data
- Efficient during normal operation
- For each request, a coordinator node pushes updates to the other replica nodes (sketched below)
- If another node crashes, the coordinator simply waits for its recovery to complete the update
- but does not block
- Coordinator crash
- A replica node will detect it & take over
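A minimal sketch of the push-based update path described above (illustrative names, not Porcupine's interfaces): the coordinator applies the update locally, pushes it to the replicas, and queues pushes to unreachable replicas for retry instead of blocking the client.

    # Hypothetical coordinator-side update push with eventual consistency.
    def coordinate_update(update, replicas, apply_local, push, retry_queue):
        apply_local(update)                        # durable local write (e.g. log + disk)
        for node in replicas:
            try:
                push(node, update)                 # best-effort push to a live replica
            except ConnectionError:
                retry_queue.append((node, update)) # re-push once the node recovers
        return "ok"                                # reply without waiting for laggards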
16. Hard-state replication (II)
Less-than-linear degradation -> disk logging overhead, which can be reduced by using a separate disk (or NVRAM)
17. Strategy for load balancing
- Deciding where to store msgs
- Spread: soft limit on the number of nodes per mailbox
- This limit is violated when nodes crash
- Select a node from the spread candidates (sketched below)
- Small spread -> better data affinity
- Smaller mail map data structure
- More streamlined disk head movement
- High spread -> better load balancing
- More choices for selection
- Load measure: number of pending I/O operations
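A toy version of the selection policy, assuming a hash-derived candidate set and a per-node count of pending I/O operations (the SPREAD value and all names are illustrative):

    import hashlib

    SPREAD = 2   # soft limit on nodes per mailbox (illustrative value)

    def spread_candidates(user: str, nodes: list[str], spread: int = SPREAD) -> list[str]:
        h = int.from_bytes(hashlib.sha1(user.encode()).digest()[:4], "big")
        start = h % len(nodes)
        return [nodes[(start + i) % len(nodes)] for i in range(min(spread, len(nodes)))]

    def pick_store_node(user: str, nodes: list[str], pending_io: dict[str, int]) -> str:
        # Among the user's spread candidates, pick the least-loaded node.
        return min(spread_candidates(user, nodes), key=lambda n: pending_io.get(n, 0))

    nodes = ["node1", "node2", "node3", "node4"]
    print(pick_store_node("userX", nodes, {"node1": 12, "node2": 3, "node3": 7, "node4": 1}))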
18. Handling heterogeneity
Better utilization of fast disks (3x speed)
19. Inktomi
- Derivative of the NOW project at UCB
- Led to commercial services
- TranSend, HotBot
- The case against distributed systems ?
- BASE semantics, instead of ACID
- Basically available
- Tolerate (occasionally) stale data
- Soft state
- Reconstructed during recovery
- Eventually consistent
- Responses can be approximate
- Centralized work queues
- Scalable !
20. Why not ACID ?
- Much of the data in a network service can tolerate guarantees weaker than ACID
- ACID makes no guarantees about availability
- Indeed, it is preferable for an ACID service to be unavailable than to relax the ACID constraints
- ACID is well suited for
- Commerce transactions, billing users, maintenance of user profiles, ...
- For most Internet information services, users value availability more than strong consistency or durability
- Web servers, search/aggregation servers, caching/transformation proxies, ...
21. Cluster architecture (I)
- Front-ends
- Supervision of incoming requests
- Matching requests with profiles (customization DB)
- Enqueue requests for service by one or more workers
- Worker pool
- Caches & service-specific modules
- Customization DB
- Manager
- Balancing load across workers, spawning more workers as load fluctuates or faults occur
- System Area Network (SAN)
- Graphical Monitor
22. Cluster architecture (II)
(architecture figure: workers with worker stubs and caches)
23. Networked, commodity workstations
- Incremental growth
- Automated, centralized administration and monitoring
- Boot image management
- Software load management
- Firewall settings
- Console access
- Application configuration
- Convenient for supporting a system component; consider the cost of decomposition versus the efficiencies of SMP
- Partial failure
- Shared state
- Distributed shared memory abstraction
24. 3-layer decomposition
- Service
- User interface to control the service
- Device-specific presentation
- Allow workers to remain stateless
- TACC API
- Transformation
- Filtering, transcoding, re-rendering, encryption, compression
- Aggregation
- Composition (pipeline chaining) of stateless modules (sketched below)
- Caching
- Original, post-aggregation, post-transformation data
- Customization
- SNS: Scalable Network Service support
- Worker load balancing, overflow management
- Fault-tolerance
- System monitoring & logging
- Incremental scalability
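A toy illustration of pipeline chaining of stateless modules; the two workers shown (whitespace stripping, zlib compression) are stand-ins, not actual TACC modules:

    import zlib
    from typing import Callable, Iterable

    # Each worker is a stateless function bytes -> bytes, so any idle node can
    # run any stage; a request is served by chaining workers in order.
    Worker = Callable[[bytes], bytes]

    def strip_whitespace(data: bytes) -> bytes:
        return b" ".join(data.split())

    def compress(data: bytes) -> bytes:
        return zlib.compress(data)

    def run_pipeline(data: bytes, workers: Iterable[Worker]) -> bytes:
        for worker in workers:
            data = worker(data)
        return data

    result = run_pipeline(b"  hello   TACC   world  ", [strip_whitespace, compress])
    print(len(result), "bytes after transformation")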
25. Load management
- Load balancing hints (sketched below)
- Computed by the Manager, based on load measurements from workers
- Periodically transmitted to front-ends
- Overflow pool
- Absorb load bursts
- Relatively rare, but prolonged
- E.g. Pathfinder landing on Mars -> 220 M hits in 4 days
- Spare machines on which the Manager can spawn workers on demand
- Workers are interchangeable
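A hypothetical sketch of the Manager's role: turn per-worker load reports into hints for the front-ends and decide when to draw on the overflow pool (the threshold and report format are assumptions):

    # Load reports map worker name -> reported queue length.
    OVERFLOW_THRESHOLD = 50   # queue length beyond which extra capacity is spawned

    def compute_hints(reports: dict[str, int]) -> dict[str, float]:
        """Weight per worker, inversely proportional to its reported load."""
        return {w: 1.0 / (1 + qlen) for w, qlen in reports.items()}

    def need_overflow(reports: dict[str, int]) -> bool:
        # Even the least-loaded worker is saturated -> spawn an overflow worker.
        return bool(reports) and min(reports.values()) > OVERFLOW_THRESHOLD

    reports = {"w1": 12, "w2": 80, "w3": 65}
    print(compute_hints(reports))
    print("spawn overflow worker?", need_overflow(reports))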
26. Fault tolerance & availability
- Construct robust entities by relying on cached soft state, refreshed by periodic messages
- Transient component failures are a fact of life
- Process-peer fault tolerance
- When a component fails, one of its peers restarts it (possibly on a different node)
- In the meantime, cached state (possibly stale) is still available to the surviving components
- A restarted component gradually reconstructs its soft state
- Typically by listening to multicasts from others
- Not the same as process pairs
- Process pairs require hard state
- Timeouts to infer failures (sketched below)
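A minimal sketch of beacon-plus-timeout failure detection as assumed here (the timeout value and the restart_peer hook are illustrative):

    import time

    TIMEOUT = 5.0                     # seconds without a beacon before a peer is presumed dead
    last_beacon: dict[str, float] = {}

    def on_beacon(peer: str) -> None:
        last_beacon[peer] = time.monotonic()

    def check_peers(restart_peer) -> None:
        now = time.monotonic()
        for peer, seen in list(last_beacon.items()):
            if now - seen > TIMEOUT:
                restart_peer(peer)        # a surviving peer restarts the failed component
                last_beacon[peer] = now   # give the restart time before re-checking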
27. The HotBot search engine
- SPARCstations, interconnected via Myrinet
- Front-ends
- 50-80 threads per node
- Dynamic HTML (TCL script tags)
- Load balancing
- Static partitioning of the search DB
- Every query goes to all workers in parallel
- Workers are not 100% interchangeable
- Each worker has a local disk
- Version 1: DB fragments are cross-mounted
- so that other nodes can reach the data, with graceful performance degradation
- Version 2: RAID
- 26 nodes: loss of 1 node resulted in the available DB dropping from 54 M documents to 51 M
- Informix DB for user profiles & ad revenue tracking
- Primary/backup failover
28. TranSend
- Caching transformation proxy
- SPARCstations interconnected via 10 Mb/s switched Ethernet + dialup pool
- Thread per TCP connection
- Single front-end, with a total of 400 threads
- Pipelining of distillers
- Lossy-compression workers
- Centralized Manager
- Periodic IP multicast to announce its presence
- No static binding is required for workers
- Workers periodically report a load metric
- distiller queue length, weighted by a factor reflecting the expected execution cost
- Version 1: process-pair recovery
- Version 2: soft state with a watcher process + periodic beacon of state updates
29. Internet service workloads
- Yahoo: 625 M page views/day
- HTML: 7 KB, images: 10 KB
- AOL's proxy: 5.2 B requests/day (average rates worked out below)
- Response size: 5.5 KB
- Services often take 100s of milliseconds
- Responses take several seconds to flow back
- High task throughput & non-negligible latency
- A service may have to sustain 1000s of simultaneous tasks
- C10K problem
- Human users: 4 parallel HTTP/GET requests spawned per page view
- A large number of service tasks are independent of each other
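As a rough back-of-the-envelope check (my arithmetic, not from the slides), those daily totals correspond to sustained average rates of roughly

    \frac{625 \times 10^{6}\ \text{page views/day}}{86{,}400\ \text{s/day}} \approx 7{,}200\ \text{views/s},
    \qquad
    \frac{5.2 \times 10^{9}\ \text{requests/day}}{86{,}400\ \text{s/day}} \approx 60{,}000\ \text{requests/s}

with peaks several times higher (cf. the spike-absorption discussion later in the deck).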
30. DDS (S.D. Gribble)
- Self-managing, cluster-based data repository
- Seen by services as a conventional data structure (usage sketched below)
- Log, tree, hash table
- High performance
- 60 K reads/sec, over 1.28 TB of data
- 128-node cluster
- The CAP principle
- A system can have at most two of the following properties
- Consistency
- Availability
- Tolerance to network Partitions
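A hypothetical illustration of the programming model (not the actual DDS API): the service issues ordinary hash-table calls, while partitioning stays hidden behind the interface. In the real system each partition lives on cluster "bricks" and is replicated; here everything is in-process.

    import hashlib

    class ToyDistributedHashTable:
        def __init__(self, partitions: int = 4):
            # One dict per "brick"; a real DDS places these on cluster nodes
            # and replicates each partition within a replica group.
            self._bricks = [dict() for _ in range(partitions)]

        def _brick_for(self, key: str) -> dict:
            h = int.from_bytes(hashlib.sha1(key.encode()).digest()[:4], "big")
            return self._bricks[h % len(self._bricks)]

        def put(self, key: str, value: bytes) -> None:
            self._brick_for(key)[key] = value

        def get(self, key: str):
            return self._brick_for(key).get(key)

    table = ToyDistributedHashTable()
    table.put("user:42:inbox", b"...")
    print(table.get("user:42:inbox"))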
31. CAP trade-offs
32. SEDA (M. Welsh)
- Staged, event-driven architecture
- Service stages, linked via queues (sketched below)
- Thread pool per stage
- Massive concurrency
- Admission & priority control on each individual queue
- Adaptive load balancing
- Feedback loop
- No a-priori resource limits
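A minimal SEDA-flavoured sketch with illustrative stage names and handlers: each stage owns a bounded event queue and a small thread pool, and feeds the next stage's queue (admission control and the per-stage resource controllers are omitted):

    import queue
    import threading
    import time

    class Stage:
        def __init__(self, name, handler, next_stage=None, threads=2):
            self.name, self.handler, self.next_stage = name, handler, next_stage
            self.events = queue.Queue(maxsize=1000)   # bounded queue = backpressure point
            for _ in range(threads):
                threading.Thread(target=self._loop, daemon=True).start()

        def enqueue(self, event):
            self.events.put(event)                    # admission control would go here

        def _loop(self):
            while True:
                event = self.events.get()
                result = self.handler(event)
                if self.next_stage is not None:
                    self.next_stage.enqueue(result)

    reply = Stage("reply", lambda e: print("done:", e))
    parse = Stage("parse", lambda e: e.upper(), next_stage=reply)
    parse.enqueue("request-1")
    time.sleep(0.5)   # let the daemon worker threads drain the queues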
33. Overhead of concurrency (I)
34. Overhead of concurrency (II)
35. SEDA architecture (I)
36. SEDA architecture (II)
37. SEDA architecture (III)
38. Clusters for Internet Services
- Previous observation (TACC, Inktomi, NOW)
- Clusters of workstations are a natural platform for constructing Internet services
- Internet service properties
- support large, rapidly growing user populations
- must remain highly available, and cost-effective
- Clusters offer a tantalizing solution
- incremental scalability: cluster grows with the service
- natural parallelism: high-performance platform
- software and hardware redundancy: fault-tolerance
39. Software troubles
- Internet service construction on clusters is hard
- load balancing, process management, communication abstractions, I/O balancing, fail-over and restart, ...
- toolkits proposed to help (TACC, AS1, River, ...)
- Even harder if shared, persistent state is involved
- data partitioning, replication, and consistency; interacting with the storage subsystem, ...
- solutions not geared to clustered services
- use a (distributed) RDBMS: expensive; powerful semantic guarantees; generality at the cost of performance
- use a network/distributed FS: overly general, high overhead (e.g. double-buffering penalties); fault-tolerance?
- roll your own custom solution: not reusable, complex
40. Idea / Hypothesis
- It is possible to
- isolate clustered services from the vagaries of state mgmt.,
- do so with adequately general abstractions,
- build those abstractions in a layered fashion (reuse),
- and exploit clusters for performance and simplicity.
- Scalable Distributed Data Structures (SDDS)
- take a conventional data structure
- hash table, tree, log, ...
- partition it across nodes in a cluster
- parallel access, scalability, ...
- replicate partitions within replica groups in the cluster
- availability in the face of failures, further parallelism
- store replicas on disk
41. Why SDDS?
- Fundamental software engineering principle
- Separation of concerns
- decouple persistency/consistency logic from the rest of the service
- simpler (and cleaner!) service implementations
- Service authors understand data structures
- familiar behavior and interfaces from the single-node case
- should enable rapid development of new services
- Structure access patterns are self-evident
- access granularity is manifestly a structure element
- coincidence of logical and physical data units
- cf. file systems, SQL in RDBMS, VM pages in DSM
42. SDDS Challenges
- Overcoming the complexities of distributed systems
- data consistency, data distribution, request load balancing, hiding network latency and OS overhead, ...
- ace up the sleeve: a cluster is not the wide area
- single, controlled administrative domain
- engineer to (probabilistically) avoid network partitions
- use a low-latency, high-throughput SAN (5 µs, 40-120 MB/s)
- predictable behavior, controlled heterogeneity
- I/O is still a problem
- Plenty of work on fast network I/O
- some on fast disk I/O
- Less work on bridging network and disk I/O in a cluster environment
Segment-based cluster I/O layer: filtered streams between disks, network, and memory
43. Segment layer (motivation)
- It's all about disk bandwidth & avoiding seeks
- 8 ms random seek, 25-80 MB/s throughput
- must read about 320 KB per seek to break even (worked out below)
- Build a disk abstraction layer based on segments
- 1-2 MB regions on disk, read and written in their entirety
- force upper layers to design with this in mind
- small reads/writes treated as an uncommon failure case
- SAN throughput is comparable to disk throughput
- Stream from disk to network & saturate both channels
- stream through service-specific filter functions
- selection, transformation, ...
- Apply lessons from high-performance networks
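A quick sanity check of the 320 KB figure (my arithmetic, assuming the transfer should take roughly as long as the seek, at a nominal 40 MB/s):

    \text{break-even transfer} \approx t_{\text{seek}} \times BW_{\text{disk}} = 8\,\text{ms} \times 40\,\text{MB/s} = 320\,\text{KB}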
44. Taxonomy of Clustered Services
(table: service classes vs. parallelism, state mgmt. requirements, and examples)
- Stateless
- State mgmt. requirements: little or none
- Examples: TACC distillers, TACC aggregators, River modules
- Soft-state
- State mgmt. requirements: high availability; perhaps consistency; persistence is an optimization
- Examples: Inktomi search engine, Squid web cache, Video Gateway, AS1 servents or RMX
- Persistent state
- State mgmt. requirements: high availability and completeness; perhaps consistency; persistence necessary
- Examples: Scalable PIM apps, HINDE mint
45. Clustering
- Goal
- Take a cluster of commodity workstations and make them look like a supercomputer
- Problems
- Application structure
- Partial failure management
- Interconnect technology
- System administration
46. Cluster Prehistory: Tandem NonStop
- Early (1974) foray into transparent fault tolerance through redundancy
- Mirror everything (CPU, storage, power supplies)
- can tolerate any single fault (later: processor duplexing)
- Hot standby "process pair" approach
- What's the difference between high availability and fault tolerance?
- Noteworthy
- Shared nothing: why?
- Performance and efficiency costs?
- Later evolved into Tandem Himalaya
- used clustering for both higher performance and higher availability
47. Pre-NOW Clustering in the 90s
- IBM Parallel Sysplex and DEC OpenVMS Cluster
- Targeted at conservative (read: mainframe) customers
- Shared disks allowed under both (why?)
- All devices have cluster-wide names (shared everything?)
- 1,500 installations of Sysplex, 25,000 of OpenVMS Cluster
- Programming the clusters
- All System/390 and VAX VMS subsystems were rewritten to be cluster-aware
- OpenVMS cluster support exists even in the single-node OS!
- An advantage of locking into proprietary interfaces
- What about fault tolerance?
48. The Case For NOW: MPPs a Near Miss
- Uniprocessor performance improves by 50% / yr (4% / month)
- 1-year lag: WS = 1.50x MPP node perf.
- 2-year lag: WS = 2.25x MPP node perf. (arithmetic below)
- No economy of scale in 100s =>
- Software incompatibility (OS & apps) =>
- More efficient utilization of compute resources
- statistical multiplexing
- Scale makes availability affordable (Pfister)
- Which of these do commodity clusters actually solve?
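The lag multipliers follow directly from the 50%-per-year growth assumption (my arithmetic):

    1.5^{1} = 1.50 \ \text{(1-year lag)}, \qquad 1.5^{2} = 2.25 \ \text{(2-year lag)}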
49. Philosophy: Systems of Systems
- "Higher order" systems research
- aggressively use off-the-shelf hardware & OS software
- Advantages
- easier to track technological advances
- less development time
- easier to transfer technology (reduced lag)
- New challenges (the case against NOW)
- maintaining performance goals
- the system is changing underneath you
- the underlying system has other people's bugs
- the underlying system is poorly documented
50. Clusters: Enhanced Standard Litany
- Challenges
- Software engineering
- Partial failure management
- Incremental scalability
- System administration
- Heterogeneity
- Benefits
- Hardware redundancy
- Aggregate capacity
- Incremental scalability
- Absolute scalability
- Price/performance sweet spot
51. Clustering Internet Services
- Aggregate capacity
- TB of disk storage, THz of compute power
- If only we could harness it in parallel!
- Redundancy
- Partial failure behavior: only a small fractional degradation from the loss of one node
- Availability: the industry average across large sites during the 1998 holiday season was 97.2% availability (source: CyberAtlas)
- Compare: mission-critical systems have four nines (99.99%)
52. Spike Absorption
- Internet traffic is self-similar
- Bursty at all granularities less than about 24 hours
- What's bad about burstiness?
- Spike absorption
- Diurnal variation
- Peak vs. average demand: typically a factor of 3 or more
- Starr Report: CNN peaked at 20 M hits/hour (compared to a usual peak of 12 M hits/hour, i.e. about 66% higher)
- Really the holy grail: capacity on demand
- Is this realistic?
53. Diurnal Cycle (UCB dialups, Jan. 1997)
- 750 modems at UC Berkeley
- Instrumented early 1997
54. Clustering Internet Workloads
- Internet vs. traditional workloads
- e.g. database workloads (TPC benchmarks)
- e.g. traditional scientific codes (matrix multiply, simulated annealing and related simulations, etc.)
- Some characteristic differences
- Read-mostly
- Quality of service (best-effort vs. guarantees)
- Task granularity
- Embarrassingly parallel
- but are they balanced?
55. Meeting the Cluster Challenges
- Software programming models
- Partial failure and application semantics
- System administration
56. Software Challenges (I)
- Message passing: Active Messages
- Shared memory: Network RAM
- CC-NUMA, Software DSM
- MP vs. SM: a long-standing religious debate
- Arbitrary object migration (network transparency)
- What are the problems with this?
- Hints: RPC, checkpointing, residual state
57. Software Challenges (II)
- The real issue: we have to think differently about programming
- to harness clusters?
- to get decent failure semantics?
- to really exploit software modularity?
- Traditional uniprocessor programming idioms/models don't seem to scale up to clusters
- Question: is there a natural-to-use cluster model that scales down to uniprocessors?
- If so, is it general or application-specific?
- What would be the obstacles to adopting such a model?
58. Partial Failure Management
- What does partial failure mean for
- a transactional database?
- a read-only database striped across cluster nodes?
- a compute-intensive shared service?
- What are appropriate partial failure abstractions?
- Incomplete/imprecise results?
- Longer latency?
- What current programming idioms make partial failure hard?
59. Cluster System Administration (I)
- Total cost of ownership (TCO) is way too high for clusters
- Median sysadmin cost per machine per year (1996): $700
- Cost of a headless workstation today: $1,500
- Previous solutions
- Pay someone to watch
- Ignore, or wait for someone to complain
- Shell Scripts From Hell
- not general -> vast repeated work
- Need an extensible and scalable way to automate the gathering, analysis, and presentation of data
60. Cluster System Administration (II)
- Extensible & Scalable Monitoring For Clusters of Computers (Anderson & Patterson, UC Berkeley)
- Relational tables allow the properties & queries of interest to evolve as the cluster evolves
- Extensive visualization support allows humans to make sense of masses of data
- Multiple levels of caching decouple data collection from aggregation
- Data updates can be pulled on demand or triggered by push
61. References
- Y. Saito, B.N. Bershad, and H.M. Levy, "Manageability, availability and performance in Porcupine: a highly scalable, cluster-based mail service", Proc. 17th ACM SOSP, 1999.
- S.D. Gribble, E.A. Brewer, J.M. Hellerstein, and D. Culler, "Scalable, distributed data structures for Internet service construction", Proc. 4th OSDI, 2000.
- A. Fox, S.D. Gribble, Y. Chawathe, E.A. Brewer, and P. Gauthier, "Cluster-based scalable network services", Proc. 16th ACM SOSP, 1997.