Title: CS556: Distributed Systems
1 CS-556 Distributed Systems
Clusters for Internet Services
- Manolis Marazakis
- maraz_at_csd.uoc.gr
2 The HotBot search engine
- SPARCstations, interconnected via Myrinet
- Front-ends
- 50-80 threads per node
- dynamic HTML (TCL script tags)
- Load balancing
- Static partition of search DB
- Every query goes to all workers in parallel
- Workers are not 100% interchangeable
- Each worker has a local disk
- Version 1: DB fragments are cross-mounted
- so that other nodes can reach the data, with graceful performance degradation
- Version 2: RAID
- 26 nodes: loss of 1 node resulted in the available DB dropping from 54 M documents to 51 M
- Informix DB for user profiles & ad revenue tracking
- primary/backup failover
3 Internet service workloads
- Yahoo: 625 M page views / day
- HTML: 7 KB, images: 10 KB
- AOL's proxy: 5.2 B requests / day
- Response size: 5.5 KB
- Services often take 100s of millisecs
- Responses take several seconds to flow back
- High task throughput & non-negligible latency
- A service may have to sustain 1000s of simultaneous tasks (the C10K problem)
- Human users: 4 parallel HTTP GET requests spawned per page view
- A large fraction of service tasks are independent of each other
4 Clustering: Holy Grail
- Goal
- take a cluster of commodity workstations and make them look like a supercomputer
- Problems
- Application structure
- Partial failure management
- Interconnect technology
- System administration
5 Cluster Prehistory: Tandem NonStop
- Early (1974) foray into transparent fault tolerance through redundancy
- Mirror everything (CPU, storage, power supplies)
- can tolerate any single fault (later: processor duplexing)
- Hot standby process-pair approach
- What's the difference between high availability & fault tolerance?
- Noteworthy
- Shared nothing: why?
- Performance and efficiency costs?
- Later evolved into Tandem Himalaya
- used clustering for both higher performance & higher availability
6 Pre-NOW Clustering in the 90s
- IBM Parallel Sysplex and DEC OpenVMS
- Targeted at conservative (read: mainframe) customers
- Shared disks allowed under both (why?)
- All devices have cluster-wide names (shared everything?)
- 1500 installations of Sysplex, 25,000 of OpenVMS Cluster
- Programming the clusters
- All System/390 and/or VAX VMS subsystems were rewritten to be cluster-aware
- OpenVMS cluster support exists even in the single-node OS!
- An advantage of locking into proprietary interfaces
- What about fault tolerance?
7 The Case For NOW: MPPs a Near Miss
- Uniprocessor performance improves by 50% / yr (4% / month)
- 1-year lag: WS = 1.50x MPP node perf.
- 2-year lag: WS = 2.25x MPP node perf.
- No economy of scale in the 100s
- Software incompatibility (OS & apps)
- More efficient utilization of compute resources
- statistical multiplexing
- Scale makes availability affordable (Pfister)
- Which of these do commodity clusters actually solve?
8 Philosophy: Systems of Systems
- Higher-order systems research
- aggressively use off-the-shelf hardware, OS & software
- Advantages
- easier to track technological advances
- less development time
- easier to transfer technology (reduce lag)
- New challenges (the case against NOW)
- maintaining performance goals
- system is changing underneath you
- underlying system has other people's bugs
- underlying system is poorly documented
9 Clusters: Enhanced Standard Litany
- Software engineering
- Partial failure management
- Incremental scalability
- System administration
- Heterogeneity
- Hardware redundancy
- Aggregate capacity
- Incremental scalability
- Absolute scalability
- Price/performance sweet spot
10 Clustering Internet Services
- Aggregate capacity
- TB of disk storage, THz of compute power
- If only we could harness it in parallel!
- Redundancy
- Partial failure behavior: only small fractional degradation from loss of one node
- Availability: the industry average across large sites during the 1998 holiday season was 97.2% availability (source: CyberAtlas)
- Compare: mission-critical systems have four nines (99.99%)
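To put these availability figures in perspective, here is a minimal sketch (plain Java, using only the percentages quoted above) that converts availability into expected downtime per year:

```java
public class Downtime {
    // Convert an availability fraction into downtime hours per year.
    static double downtimeHoursPerYear(double availability) {
        return (1.0 - availability) * 365 * 24;
    }

    public static void main(String[] args) {
        // 97.2% (1998 holiday-season industry average, per the slide)
        System.out.printf("97.2%%  -> %.0f hours/year of downtime%n",
                downtimeHoursPerYear(0.972));       // ~245 hours/year
        // 99.99% ("four nines", mission-critical systems)
        System.out.printf("99.99%% -> %.1f minutes/year of downtime%n",
                downtimeHoursPerYear(0.9999) * 60); // ~53 minutes/year
    }
}
```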
11 Spike Absorption
- Internet traffic is self-similar
- bursty at all granularities less than about 24 hours
- What's bad about burstiness?
- Spike absorption
- diurnal variation
- peak vs. average demand: typically a factor of 3 or more
- Starr Report: CNN peaked at 20M hits/hour (compared to a usual peak of 12M hits/hour, i.e. 66% higher)
- Really the holy grail: capacity on demand
- Is this realistic?
12 Diurnal Cycle (UCB dialups, Jan. 1997)
- 750 modems at UC Berkeley
- Instrumented early 1997
13 Clustering Internet Workloads
- Internet vs. traditional workloads
- e.g. Database workloads (TPC benchmarks)
- e.g. traditional scientific codes (matrix multiply, simulated annealing and related simulations, etc.)
- Some characteristic differences
- Read mostly
- Quality of service (best-effort vs. guarantees)
- Task granularity
- Embarrassingly parallel
- but are they balanced? (we'll return to this later)
14 Meeting the Cluster Challenges
- Software programming models
- Partial failure and application semantics
- System administration
15 Software Challenges (I)
- Message passing: Active Messages
- Shared memory: Network RAM
- CC-NUMA, software DSM
- MP vs. SM: a long-standing religious debate
- Arbitrary object migration (network transparency)
- What are the problems with this?
- Hints: RPC, checkpointing, residual state
16 Partial Failure Management
- What does partial failure mean for
- a transactional database?
- a read-only database striped across cluster nodes?
- a compute-intensive shared service?
- What are appropriate partial-failure abstractions?
- Incomplete/imprecise results?
- Longer latency?
- What current programming idioms make partial failure hard?
17 Software Challenges (II)
- Real issue: we have to think differently about programming
- to harness clusters?
- to get decent failure semantics?
- to really exploit software modularity?
- Traditional uniprocessor programming idioms/models don't seem to scale up to clusters
- Question: is there a natural-to-use cluster model that scales down to uniprocessors?
- If so, is it general or application-specific?
- What would be the obstacles to adopting such a model?
18 Cluster System Administration (I)
- Total cost of ownership (TCO) is very high for clusters
- median sysadmin cost per machine per year (1996): $700
- cost of a headless workstation today: $1500
- Previous solutions
- pay someone to watch
- ignore, or wait for someone to complain
- Shell Scripts From Hell
- not general, so vast amounts of repeated work
- Need an extensible and scalable way to automate the gathering, analysis, and presentation of data
19 Cluster System Administration (II)
- Extensible, Scalable Monitoring for Clusters of Computers (Anderson & Patterson, UC Berkeley)
- relational tables allow the properties & queries of interest to evolve as the cluster evolves (see the sketch below)
- extensive visualization support allows humans to make sense of masses of data
- multiple levels of caching decouple data collection from aggregation
- data updates can be pulled on demand or triggered by push
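A minimal sketch of the relational-table idea, in Java. The MetricRow record, the node/property names, and the pull-vs-push split are illustrative assumptions, not the actual monitoring system:

```java
import java.util.*;
import java.util.function.Predicate;

// Hypothetical illustration: cluster metrics kept as rows of one relational-style
// table, so new properties and new queries can be added without schema changes.
record MetricRow(String node, String property, double value, long timestamp) {}

class MetricTable {
    private final List<MetricRow> rows = new ArrayList<>();

    // "Push": a node (or an intermediate cache) sends an update.
    synchronized void push(MetricRow row) { rows.add(row); }

    // "Pull": a query walks the table on demand; the predicate plays the role of a WHERE clause.
    synchronized List<MetricRow> query(Predicate<MetricRow> where) {
        return rows.stream().filter(where).toList();
    }
}

class MonitoringDemo {
    public static void main(String[] args) {
        MetricTable table = new MetricTable();
        table.push(new MetricRow("node01", "cpu_load", 0.72, System.currentTimeMillis()));
        table.push(new MetricRow("node02", "disk_free_gb", 3.5, System.currentTimeMillis()));

        // Example query: which nodes are running low on disk?
        var lowDisk = table.query(r -> r.property().equals("disk_free_gb") && r.value() < 5.0);
        lowDisk.forEach(r -> System.out.println(r.node() + " low on disk: " + r.value() + " GB"));
    }
}
```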
20 Visualizing Data: Example
- Display aggregates of various interesting machine properties on the NOWs
- Note the use of aggregation & color
21 SDDS (S.D. Gribble)
- Self-managing, cluster-based data repository
- Seen by services as a conventional data structure (interface sketched below)
- Log, tree, hash table
- High performance
- 60 K reads/sec, over 1.28 TB of data
- 128-node cluster
- The CAP principle
- A system can have at most two of the following properties
- Availability
- Tolerance to network Partitions
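A minimal sketch of what "seen by services as a conventional data structure" might look like for the hash-table variant. The interface below is illustrative (the operation names follow the put/get/remove set mentioned on a later slide), not the actual SDDS API:

```java
// Hypothetical client-side view of a distributed hash table SDDS:
// the service calls an ordinary-looking data structure, while the library
// hides partitioning, replication, and node failures behind it.
public interface DistributedHashTable {
    byte[] get(byte[] key);              // returns null if the key is absent
    void put(byte[] key, byte[] value);  // atomic per element; no cross-element transactions
    void remove(byte[] key);
}
```

The point of the abstraction is that service code never mentions nodes, replicas, or recovery.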
22 CAP trade-offs
23 Clusters for Internet Services
- Previous observation (TACC, Inktomi, NOW)
- Clusters of workstations are a natural platform for constructing Internet services
- Internet service properties
- support large, rapidly growing user populations
- must remain highly available and cost-effective
- Clusters offer a tantalizing solution
- incremental scalability: the cluster grows with the service
- natural parallelism: a high-performance platform
- software and hardware redundancy: fault tolerance
24 Software troubles
- Internet service construction on clusters is hard
- load balancing, process management, communications abstractions, I/O balancing, fail-over and restart, ...
- toolkits proposed to help (TACC, AS1, River, ...)
- Even harder if shared, persistent state is involved
- data partitioning, replication, and consistency; interacting with the storage subsystem, ...
- solutions not geared to clustered services
- use a (distributed) RDBMS: expensive, powerful semantic guarantees, generality at the cost of performance
- use a network/distributed FS: overly general, high overhead (e.g. double-buffering penalties). Fault tolerance?
- roll-your-own custom solution: not reusable, complex
25 Idea / Hypothesis
- It is possible
- to isolate clustered services from the vagaries of state mgmt.,
- to do so with adequately general abstractions,
- to build those abstractions in a layered fashion (reuse),
- and to exploit clusters for performance and simplicity.
- Scalable Distributed Data Structures (SDDS)
- take a conventional data structure
- hash table, tree, log, ...
- partition it across nodes in a cluster (see the sketch after this list)
- parallel access, scalability, ...
- replicate partitions within replica groups in the cluster
- availability in the face of failures, further parallelism
- store replicas on disk
26 Why SDDS?
- Fundamental software engineering principle
- separation of concerns
- decouple persistence/consistency logic from the rest of the service
- simpler (and cleaner!) service implementations
- Service authors understand data structures
- familiar behavior and interfaces from the single-node case
- should enable rapid development of new services
- Structure access patterns are self-evident
- access granularity is manifestly a structure element
- coincidence of logical and physical data units
- cf. file systems, SQL in RDBMS, VM pages in DSM
27 SDDS Challenges
- Overcoming complexities of distributed systems
- data consistency, data distribution, request load balancing, hiding network latency and OS overhead, ...
- ace up the sleeve: a cluster is not the wide area
- single, controlled administrative domain
- engineer to (probabilistically) avoid network partitions
- use a low-latency, high-throughput SAN (5 µs, 40-120 MB/s)
- predictable behavior, controlled heterogeneity
- I/O is still a problem
- plenty of work on fast network I/O
- some on fast disk I/O
- less work on bridging network and disk I/O in a cluster environment
- Segment-based cluster I/O layer: filtered streams between disks, network, and memory
28 Prototype hash table
- Storage bricks provide local, network-accessible hash tables
- Interaction with the distributed hash table is through abstraction libraries
- C and Java APIs available
- partitioning and mirrored-replication logic in each library
- Distributed table semantics
- handles node failures
- no consistency
- nor transactions, on-line recovery, etc.
29 Storage bricks
(Storage brick internals, from the figure: argument marshalling; worker pool, one thread dispatched per request; local hash table implementations; messaging & event queue; MMAP region management with alloc()/free(); transport-specific communication and naming.)
- Individual nodes are storage bricks
- consistent, atomic, network-accessible operations on a local hash table
- uses MMAP to handle data persistence
- no transaction support
- Clients communicate with the set of storage bricks using an RPC marshalling layer
(Client side, from the figure: service front-end; service application logic; virtual-to-physical node names and inter-node hashing.)
30 Parallelisms service
- Provides relevant site information given a URL
- an inversion of the Yahoo! directory
- Parallelisms builds an index of all URLs, and returns other URLs in the same topics (lookup logic sketched below)
- read-mostly traffic, nearly no consistency requirements
- large database of URLs
- 1 GB of space for 1.5 million URLs and 80,000 topics
- The service FE itself is very simple
- 400 semicolons of C
- 130 for app-specific logic
- 270 for threads, HTTP munging, ...
- hash table code: 4K semicolons of C
- http://ninja.cs.berkeley.edu/demos/paralllelisms/parallelisms.html
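A minimal sketch of the Parallelisms lookup logic described above: an index from URL to topics, and from topic back to its member URLs; querying a URL returns the other URLs that share one of its topics. The class and field names are illustrative assumptions (the real service runs against the distributed hash table, not in-memory maps):

```java
import java.util.*;

// Hypothetical in-memory sketch of the Parallelisms index.
class Parallelisms {
    private final Map<String, Set<String>> topicsOfUrl = new HashMap<>();
    private final Map<String, Set<String>> urlsOfTopic = new HashMap<>();

    // Index construction: record that a URL is listed under a topic.
    void add(String url, String topic) {
        topicsOfUrl.computeIfAbsent(url, k -> new HashSet<>()).add(topic);
        urlsOfTopic.computeIfAbsent(topic, k -> new HashSet<>()).add(url);
    }

    // Query: all other URLs appearing under any of this URL's topics.
    Set<String> relatedSites(String url) {
        Set<String> result = new TreeSet<>();
        for (String topic : topicsOfUrl.getOrDefault(url, Set.of())) {
            result.addAll(urlsOfTopic.getOrDefault(topic, Set.of()));
        }
        result.remove(url);
        return result;
    }

    public static void main(String[] args) {
        Parallelisms p = new Parallelisms();
        p.add("www.cnn.com", "News");
        p.add("www.nytimes.com", "News");
        p.add("www.espn.com", "Sports");
        System.out.println(p.relatedSites("www.cnn.com")); // [www.nytimes.com]
    }
}
```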
31 Lessons Learned (I)
- mmap() simplified the implementation, but at a price
- service working sets apply naturally
- no pointers: breaks the usual linked-list and hash-table libraries
- little control over the order of writes, so consistency cannot be guaranteed if crashes occur
- if a node goes down, it may incur a lengthy sync before restart
- Same for the abstraction libraries: simplicity with a cost
- each storage brick could be totally independent
- because policy is embedded in the abstraction libraries
- bad for administration & monitoring
- no place to hook in to get a view of the complete table
- each client makes isolated decisions
- load balancing and failure detection
32 Lessons Learned (II)
- The service-simplicity premise seems valid
- Parallelisms service code is devoid of persistence logic
- Parallelisms front-ends contain only session state
- no recovery necessary if they fail
- Interface selection is critical
- originally, just supported put(), get(), remove()
- wanted to support a java.util.Hashtable subclass (see the sketch below)
- needed enumerations, containsKey(), containsObject()
- significant re-plumbing required to support these efficiently!
- The thread subsystem was troublesome
- the JDK has its own, and it conflicted; had to remove threads from the client-side abstraction library.
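A minimal sketch of what the desired java.util.Hashtable subclass might look like, with the single-element operations delegated to the distributed table. DistributedHashTable is the illustrative interface sketched earlier, and the string/byte conversions are assumptions; the enumeration methods are exactly the part such a thin wrapper cannot supply without the extra support ("re-plumbing") noted above:

```java
import java.nio.charset.StandardCharsets;
import java.util.Hashtable;

// Hypothetical wrapper: presents the SDDS hash table through the familiar
// java.util.Hashtable interface expected by existing service code.
class SddsHashtable extends Hashtable<String, byte[]> {
    private final DistributedHashTable sdds;   // client-side abstraction library (sketched earlier)

    SddsHashtable(DistributedHashTable sdds) { this.sdds = sdds; }

    private static byte[] keyBytes(Object key) {
        return key.toString().getBytes(StandardCharsets.UTF_8);
    }

    @Override public synchronized byte[] put(String key, byte[] value) {
        byte[] old = sdds.get(keyBytes(key));
        sdds.put(keyBytes(key), value);
        return old;
    }

    @Override public synchronized byte[] get(Object key) {
        return sdds.get(keyBytes(key));
    }

    @Override public synchronized boolean containsKey(Object key) {
        return sdds.get(keyBytes(key)) != null;
    }

    // keys(), elements(), contains(Object value): these require enumerating every
    // brick's local table, which this thin wrapper cannot do on its own.
}
```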
33 SDDS goal: simplicity
- Hypothesis: simplify construction of services
- evidence: Parallelisms
- distributed hash table prototype: 3000 lines of C code
- service: 400 lines of C code, 1/3 of which is service-specific
- evidence: Keiretsu service
- instant messaging service between heterogeneous devices
- crux of the service is the sharing of binding/routing state
- original: 131 lines of Java; SDDS version: 80 lines of Java
- Management/operational aspects
- to be successful, authors must want to adopt SDDSs
- simple to incorporate and understand
- operational management must be nearly transparent
- node fail-over and recovery, logging, etc. behind the scenes
- plug-n-play extensibility to add capacity
34 SDDS goal: generality
- Potential criticism of SDDSs
- no matter which structures you provide, some services simply can't be built with only those primitives
- response: pick a basis that enables many interesting services
- log, hash table, and tree: our guess at a good basis
- a layered model will allow people to develop other SDDSs
- allow GiST-style specialization hooks?
35 SDDS Ideas on Consistency (I)
- Consistency / performance tradeoffs
- stricter consistency requirements imply worse performance
- we know some intended services have weaker requirements
- Rejected alternatives
- build strict consistency, and force people to use it
- investigate extended transaction models
- SDDS choice
- pick a small set of consistency guarantees
- level 0 (atomic but not isolated operations)
- level 3 (ACID)
36 SDDS Ideas on Consistency (II)
- Replica management
- what mechanism will we use between replicas?
- two-phase commit for distributed atomicity (sketched below)
- log-based on-line recovery
- Exploiting cluster properties
- low network latency enables fast two-phase commit
- especially relative to WAN latency for Internet services
- given a good UPS, node failures are independent
- commit to the memory of a peer in the group, not to disk
- (probabilistically) engineer away network partitions
- an unavailable node can therefore be treated as failed
- so a consensus algorithm is not needed
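A minimal sketch of two-phase commit across a replica group, in Java. The Replica interface and the single-coordinator structure are illustrative assumptions; the cluster argument above is simply that these prepare/commit round trips are cheap on a SAN, and that replicas may acknowledge from memory rather than disk:

```java
import java.util.List;

// What one replica must support for two-phase commit (illustrative interface).
interface Replica {
    boolean prepare(String txnId, byte[] key, byte[] value); // vote yes/no, hold the update
    void commit(String txnId);                               // apply the prepared update
    void abort(String txnId);                                // discard the prepared update
}

// Coordinator side: a write is applied atomically across the replica group.
class TwoPhaseCommit {
    private final List<Replica> group;

    TwoPhaseCommit(List<Replica> group) { this.group = group; }

    boolean write(String txnId, byte[] key, byte[] value) {
        // Phase 1: ask every replica to prepare; any "no" vote aborts the transaction.
        boolean allPrepared = true;
        for (Replica r : group) {
            if (!r.prepare(txnId, key, value)) { allPrepared = false; break; }
        }
        // Phase 2: commit everywhere, or abort everywhere.
        for (Replica r : group) {
            if (allPrepared) r.commit(txnId); else r.abort(txnId);
        }
        return allPrepared;
    }
}
```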
37 SDDS Ideas on load management
- Data distribution affects request distribution
- Start simple: static data distribution
- given a request, look up or hash to determine the partition
- Optimizations
- locality-aware request distribution (LARD) within replicas
- if there are no failures, replicas further partition the data in memory
- front ends are often colocated with storage nodes
- front-end selection based on data-distribution knowledge
- smart clients (Ninja redirector stubs?)
- Issues
- graceful degradation: RED/LRP techniques to drop requests (a RED-style sketch follows this list)
- given many simultaneous requests, what should the service ordering policy be?
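A minimal sketch of RED-style early dropping applied to a request queue, in Java. The thresholds and the linear drop-probability ramp are illustrative assumptions; the idea is simply that requests start being refused probabilistically before the queue is completely full, so the service degrades gracefully instead of collapsing under load:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.Random;

// Hypothetical RED-style admission control for a service request queue.
class RedQueue<T> {
    private final Queue<T> queue = new ArrayDeque<>();
    private final int minThreshold;   // below this: accept everything
    private final int maxThreshold;   // at or above this: drop everything
    private final Random random = new Random();

    RedQueue(int minThreshold, int maxThreshold) {
        this.minThreshold = minThreshold;
        this.maxThreshold = maxThreshold;
    }

    // Returns false if the request was dropped at admission time.
    synchronized boolean offer(T request) {
        int occupancy = queue.size();
        if (occupancy >= maxThreshold) return false;
        if (occupancy > minThreshold) {
            // Drop probability rises linearly from 0 to 1 between the two thresholds.
            double dropProb = (occupancy - minThreshold) / (double) (maxThreshold - minThreshold);
            if (random.nextDouble() < dropProb) return false;
        }
        return queue.add(request);
    }

    synchronized T poll() { return queue.poll(); }
}
```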
38 Incremental Scalability (I)
- Logs and trees have a natural solution
- pointers are ingrained in these structures
- use the pointers to (re)direct structures onto new nodes
39 Incremental Scalability (II)
- The hash table is the tricky one!
- Why? The mapping is done by client-side hash functions
- unless the table is chained, there are no pointers inside the hash structure
- need to change the client-side functions to scale the structure
- Litwin's linear hashing? (addressing sketched below)
- the client-side hash function evolves over time
- clients independently discover when to evolve their functions
- A directory-based map?
- move hashing into the infrastructure (inefficient)
- or have the infrastructure inform clients when to change functions
- AFS-style registration and callbacks?
40 Getting the Interfaces Right
- Upper interfaces: sufficient generality
- setting the bar for functionality (e.g. java.util.Hashtable)
- opportunity: reuse of existing software (e.g. Berkeley DB)
- Lower interfaces: use a segment-based I/O layer?
- log, tree: natural sequentiality, segments make sense
- the hash table is much more challenging
- aggregating small, random accesses into large, sequential ones
- rely on commits to other nodes' memory
- periodically dump deltas to disk, LFS-style
41 Evaluation: use real services
- Metrics for success
- 1) measurable reduction in complexity to author Internet services
- 2) widespread adoption of SDDS by Ninja researchers
- (1) Port/reimplement existing Ninja services
- Keiretsu, Ninja Jukebox, the MultiSpace Log service
- explicitly demonstrate code reduction & performance boon
- (2) Convince people to use SDDS for new services
- NinjaMail, Service Discovery Service, ICEBERG services
- Challenge: operational aspects of SDDS
- goal: using SDDS should be as simple as the single-node, non-persistent case
42 Segment layer (motivation)
- It's all about disk bandwidth & avoiding seeks
- 8 ms random seek, 25-80 MB/s throughput
- must read roughly 320 KB per seek to break even (arithmetic sketched below)
- Build a disk abstraction layer based on segments
- 1-2 MB regions on disk, read and written in their entirety
- force upper layers to design with this in mind
- small reads/writes treated as an uncommon failure case
- SAN throughput is comparable to disk throughput
- stream from disk to network: saturate both channels
- stream through service-specific filter functions
- selection, transformation, ...
- Apply lessons from high-performance networks
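The 320 KB break-even figure above is just seek time multiplied by transfer bandwidth: the transfer size at which the disk spends as long moving data as it spent seeking. A small sketch of that arithmetic, in plain Java, using the numbers quoted on the slide:

```java
public class SeekBreakEven {
    // Transfer size at which transfer time equals seek time:
    // bytes = seekTimeSeconds * bandwidthBytesPerSecond.
    static double breakEvenKB(double seekMs, double bandwidthMBps) {
        return (seekMs / 1000.0) * bandwidthMBps * 1000.0; // (s) * (MB/s) = MB, expressed in KB
    }

    public static void main(String[] args) {
        double seekMs = 8.0;
        for (double mbps : new double[] {25, 40, 80}) {
            System.out.printf("%.0f MB/s -> break-even at %.0f KB per seek%n",
                    mbps, breakEvenKB(seekMs, mbps));
        }
        // 25 MB/s -> 200 KB, 40 MB/s -> 320 KB, 80 MB/s -> 640 KB:
        // reads much smaller than a 1-2 MB segment waste most of the disk's time on seeks.
    }
}
```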
43 Segment layer challenges
- Thread & event model
- the lowest-level model dictates the entire application stack
- dependency on a particular thread subsystem is undesirable
- Asynchronous interfaces are essential
- especially for Internet services with thousands of connections
- potential model: VIA completion queues
- Reusability for many components
- toughest customer: the Telegraph DB
- must dictate write ordering, and be able to roll back modifications for aborts
- if content is paged, make sure we don't overwrite it on disk
- no mmap()!
44 Segment Implementation Plan
- Two versions planned
- One version using POSIX syscalls and a vanilla filesystem
- definitely won't perform well (copies to handle shadowing)
- portable to many platforms
- good for prototyping and getting the API right
- One version on Linux with kernel modules for specialization
- IO-Lite-style buffer unification
- use VIA or AM for network I/O
- modify the VM subsystem for copy-on-write segments, and/or paging dirty data to a separate region
45 Related work (I)
- (S)DSM
- a structural element is a better atomic unit than a page
- fault tolerance as a goal
- Distributed/networked FS: NFS, AFS, xFS, LFS, ...
- an FS is more general, so it has less chance to exploit structure
- often not in a clustered environment (except xFS, Frangipani)
- Litwin SDDS: LH, LH*, RP, RP*
- significant overlap in goals
- but little implementation experience
- little exploitation of cluster characteristics
- consistency model not clear
46 Related Work (II)
- Distributed & parallel databases: R*, Mariposa, Gamma, ...
- different goal (generality in structure/queries, xacts)
- stronger and richer semantics, but at a cost
- in both price and performance
- Fast I/O research: U-Net, AM, VIA, IO-Lite, fbufs, x-kernel, ...
- network and disk subsystems
- main results: get the OS out of the way, avoid (unnecessary) copies
- use these results in our fast I/O layer
- Cluster platforms: TACC, AS1, River, Beowulf, GLUnix, ...
- harvesting idle resources, process migration, single-system view
47 Taxonomy of Clustered Services
- Stateless: little or no state-management requirements
- Soft state: high availability; perhaps consistency; persistence is an optimization
- Persistent state: high availability and completeness; perhaps consistency; persistence is necessary
- Examples across this spectrum: TACC distillers, TACC aggregators, River modules, AS1 servents or RMX, Squid web cache, Video Gateway, Inktomi search engine, Parallelisms, Scalable PIM apps, HINDE mint
48 Performance
- Bulk-loading of the database is dominated by disk access time
- Can achieve 1500 inserts per second per node on a 100 Mb/s Ethernet cluster, if the hash table fits in memory (dominant cost is the messaging layer)
- Otherwise, degrades to about 30 inserts per second (dominant cost is disk write time)
- In steady state, all nodes operate primarily out of memory, as the working set is fully paged in
- similar in principle to the research Inktomi cluster
- handles hundreds of queries per second on a 4-node cluster with 2 FEs
49 SEDA (M. Welsh)
- Staged, event-driven architecture
- Service stages, linked via queues (see the sketch below)
- Thread pool per stage
- Massive concurrency
- Admission & priority control on each individual queue
- Adaptive load balancing
- Feedback loop
- No a-priori resource limits
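A minimal sketch of one SEDA-style stage in Java: events arrive on a bounded queue (which is where admission control can act), a small thread pool drains the queue and runs the stage's handler, and the handler may enqueue new events onto the next stage. Names and sizes are illustrative assumptions, not SEDA's actual API:

```java
import java.util.concurrent.*;

// Hypothetical sketch of a SEDA-style stage: bounded event queue + thread pool + handler.
class Stage<E> {
    interface Handler<E> { void handle(E event); }

    private final BlockingQueue<E> queue;
    private final ExecutorService workers;

    Stage(int queueCapacity, int threads, Handler<E> handler) {
        this.queue = new ArrayBlockingQueue<>(queueCapacity);
        this.workers = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            workers.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        handler.handle(queue.take());   // block until an event arrives
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
    }

    // Admission control hook: returns false instead of blocking when the stage is overloaded.
    boolean enqueue(E event) { return queue.offer(event); }

    void shutdown() { workers.shutdownNow(); }
}

class SedaDemo {
    public static void main(String[] args) throws InterruptedException {
        // Two stages chained by their queues: parse a request, then send a response.
        Stage<String> respond = new Stage<>(1024, 2,
                reply -> System.out.println("sending: " + reply));
        Stage<String> parse = new Stage<>(1024, 2,
                req -> respond.enqueue("HTTP/1.0 200 OK for " + req));

        parse.enqueue("GET /index.html");
        Thread.sleep(200);     // let the pipeline drain for the demo
        parse.shutdown();
        respond.shutdown();
    }
}
```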
50 Overhead of concurrency (I)
51 Overhead of concurrency (II)
52 SEDA architecture (I)
53 SEDA architecture (II)
54 SEDA architecture (III)
55 References
- S.D. Gribble, E.A. Brewer, J.M. Hellerstein, and D. Culler, "Scalable, Distributed Data Structures for Internet Service Construction", Proc. 4th OSDI, 2000.
- M. Welsh, D. Culler, and E. Brewer, "SEDA: An Architecture for Well-Conditioned, Scalable Internet Services", Proc. SOSP, 2001.