Title: CS556: Distributed Systems
1 CS-556 Distributed Systems
Clusters for Internet Services
- Manolis Marazakis
- maraz_at_csd.uoc.gr
2 The HotBot search engine
- SPARCstations, interconnected via Myrinet
- Front-ends
- 50-80 threads per node
- dynamic HTML (TCL script tags)
- Load balancing
- Static partition of search DB
- Every query goes to all workers in parallel
- Workers are not 100% interchangeable
- Each worker has a local disk
- Version 1: DB fragments are cross-mounted
- so that other nodes can reach the data, with graceful performance degradation
- Version 2: RAID
- 26 nodes: loss of 1 node resulted in the available DB dropping from 54 M documents to 51 M
- Informix DB for user profiles & ad revenue tracking
- primary/backup failover
3 Internet service workloads
- Yahoo: 625 M page views / day
- HTML: 7 KB, images: 10 KB
- AOL's proxy: 5.2 B requests / day
- Response size: 5.5 KB
- Services often take 100s of millisecs
- Responses take several seconds to flow back
- High task throughput & non-negligible latency
- A service may have to sustain 1000s of simultaneous tasks (the C10K problem)
- Human users: 4 parallel HTTP GET requests spawned per page view
- A large fraction of service tasks are independent of each other
4 Clustering: Holy Grail
- Goal
- take a cluster of commodity workstations and make them look like a supercomputer
- Problems
- Application structure
- Partial failure management
- Interconnect technology
- System administration
5 Cluster Prehistory: Tandem NonStop
- Early (1974) foray into transparent fault tolerance through redundancy
- Mirror everything (CPU, storage, power supplies)
- can tolerate any single fault (later: processor duplexing)
- Hot standby process-pair approach
- What's the difference between high availability & fault tolerance?
- Noteworthy
- Shared nothing: why?
- Performance and efficiency costs?
- Later evolved into Tandem Himalaya
- used clustering for both higher performance & higher availability
6 Pre-NOW Clustering in the 90s
- IBM Parallel Sysplex and DEC OpenVMS
- Targeted at conservative (read: mainframe) customers
- Shared disks allowed under both (why?)
- All devices have cluster-wide names (shared everything?)
- 1500 installations of Sysplex, 25,000 of OpenVMS Cluster
- Programming the clusters
- All System/390 and/or VAX VMS subsystems were rewritten to be cluster-aware
- OpenVMS cluster support exists even in the single-node OS!
- An advantage of locking into proprietary interfaces
- What about fault tolerance?
7 The Case For NOW: MPPs a Near Miss
- Uniprocessor performance improves by 50% / yr (4% / month)
- 1-year lag: WS = 1.50x MPP node perf.
- 2-year lag: WS = 2.25x MPP node perf.
- No economy of scale in the 100s
- Software incompatibility (OS & apps)
- More efficient utilization of compute resources
- statistical multiplexing
- Scale makes availability affordable (Pfister)
- Which of these do commodity clusters actually solve?
8 Philosophy: Systems of Systems
- Higher-order systems research
- aggressively use off-the-shelf hardware, OS & software
- Advantages
- easier to track technological advances
- less development time
- easier to transfer technology (reduce lag)
- New challenges (the case against NOW)
- maintaining performance goals
- system is changing underneath you
- underlying system has other people's bugs
- underlying system is poorly documented
9 Clusters: Enhanced Standard Litany
- Software engineering
- Partial failure management
- Incremental scalability
- System administration
- Heterogeneity
- Hardware redundancy
- Aggregate capacity
- Incremental scalability
- Absolute scalability
- Price/performance sweet spot
10 Clustering Internet Services
- Aggregate capacity
- TB of disk storage, THz of compute power
- If only we could harness it in parallel!
- Redundancy
- Partial failure behavior: only small fractional degradation from loss of one node
- Availability: the industry average across large sites during the 1998 holiday season was 97.2% availability (source: CyberAtlas)
- Compare: mission-critical systems have four nines (99.99%)
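To put these availability figures in perspective, here is a minimal sketch (plain Java, using only the percentages quoted above) that converts availability into expected downtime per year:

```java
public class Downtime {
    // Convert an availability fraction into downtime hours per year.
    static double downtimeHoursPerYear(double availability) {
        return (1.0 - availability) * 365 * 24;
    }

    public static void main(String[] args) {
        // 97.2% (1998 holiday-season industry average, per the slide)
        System.out.printf("97.2%%  -> %.0f hours/year of downtime%n",
                downtimeHoursPerYear(0.972));       // ~245 hours/year
        // 99.99% ("four nines", mission-critical systems)
        System.out.printf("99.99%% -> %.1f minutes/year of downtime%n",
                downtimeHoursPerYear(0.9999) * 60); // ~53 minutes/year
    }
}
```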
11 Spike Absorption
- Internet traffic is self-similar
- bursty at all granularities less than about 24 hours
- What's bad about burstiness?
- Spike absorption
- diurnal variation
- peak vs. average demand: typically a factor of 3 or more
- Starr Report: CNN peaked at 20M hits/hour (compared to a usual peak of 12M hits/hour, i.e. 66% higher)
- Really the holy grail: capacity on demand
- Is this realistic?
12 Diurnal Cycle (UCB dialups, Jan. 1997)
- 750 modems at UC Berkeley
- Instrumented early 1997
13 Clustering Internet Workloads
- Internet vs. traditional workloads
- e.g. Database workloads (TPC benchmarks)
- e.g. traditional scientific codes (matrix multiply, simulated annealing and related simulations, etc.)
- Some characteristic differences
- Read mostly
- Quality of service (best-effort vs. guarantees)
- Task granularity
- Embarrassingly parallel
- but are they balanced? (we'll return to this later)
14 Meeting the Cluster Challenges
- Software programming models
- Partial failure and application semantics
- System administration
15 Software Challenges (I)
- Message passing: Active Messages
- Shared memory: Network RAM
- CC-NUMA, software DSM
- MP vs. SM: a long-standing religious debate
- Arbitrary object migration (network transparency)
- What are the problems with this?
- Hints: RPC, checkpointing, residual state
16 Partial Failure Management
- What does partial failure mean for
- a transactional database?
- a read-only database striped across cluster nodes?
- a compute-intensive shared service?
- What are appropriate partial-failure abstractions?
- Incomplete/imprecise results?
- Longer latency?
- What current programming idioms make partial failure hard?
17 Software Challenges (II)
- Real issue: we have to think differently about programming
- to harness clusters?
- to get decent failure semantics?
- to really exploit software modularity?
- Traditional uniprocessor programming idioms/models don't seem to scale up to clusters
- Question: is there a natural-to-use cluster model that scales down to uniprocessors?
- If so, is it general or application-specific?
- What would be the obstacles to adopting such a model?
18 Cluster System Administration (I)
- Total cost of ownership (TCO) is very high for clusters
- median sysadmin cost per machine per year (1996): $700
- cost of a headless workstation today: $1500
- Previous solutions
- pay someone to watch
- ignore, or wait for someone to complain
- Shell Scripts From Hell
- not general, so vast amounts of repeated work
- Need an extensible and scalable way to automate the gathering, analysis, and presentation of data
19 Cluster System Administration (II)
- Extensible, Scalable Monitoring for Clusters of Computers (Anderson & Patterson, UC Berkeley)
- relational tables allow the properties & queries of interest to evolve as the cluster evolves (see the sketch below)
- extensive visualization support allows humans to make sense of masses of data
- multiple levels of caching decouple data collection from aggregation
- data updates can be pulled on demand or triggered by push
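A minimal sketch of the relational-table idea, in Java. The MetricRow record, the node/property names, and the pull-vs-push split are illustrative assumptions, not the actual monitoring system:

```java
import java.util.*;
import java.util.function.Predicate;

// Hypothetical illustration: cluster metrics kept as rows of one relational-style
// table, so new properties and new queries can be added without schema changes.
record MetricRow(String node, String property, double value, long timestamp) {}

class MetricTable {
    private final List<MetricRow> rows = new ArrayList<>();

    // "Push": a node (or an intermediate cache) sends an update.
    synchronized void push(MetricRow row) { rows.add(row); }

    // "Pull": a query walks the table on demand; the predicate plays the role of a WHERE clause.
    synchronized List<MetricRow> query(Predicate<MetricRow> where) {
        return rows.stream().filter(where).toList();
    }
}

class MonitoringDemo {
    public static void main(String[] args) {
        MetricTable table = new MetricTable();
        table.push(new MetricRow("node01", "cpu_load", 0.72, System.currentTimeMillis()));
        table.push(new MetricRow("node02", "disk_free_gb", 3.5, System.currentTimeMillis()));

        // Example query: which nodes are running low on disk?
        var lowDisk = table.query(r -> r.property().equals("disk_free_gb") && r.value() < 5.0);
        lowDisk.forEach(r -> System.out.println(r.node() + " low on disk: " + r.value() + " GB"));
    }
}
```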
20 Visualizing Data: Example
- Display aggregates of various interesting machine properties on the NOWs
- Note the use of aggregation & color
21 SDDS (S.D. Gribble)
- Self-managing, cluster-based data repository
- Seen by services as a conventional data structure (interface sketched below)
- Log, tree, hash table
- High performance
- 60 K reads/sec, over 1.28 TB of data
- 128-node cluster
- The CAP principle
- A system can have at most two of the following properties
- Availability
- Tolerance to network Partitions
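A minimal sketch of what "seen by services as a conventional data structure" might look like for the hash-table variant. The interface below is illustrative (the operation names follow the put/get/remove set mentioned on a later slide), not the actual SDDS API:

```java
// Hypothetical client-side view of a distributed hash table SDDS:
// the service calls an ordinary-looking data structure, while the library
// hides partitioning, replication, and node failures behind it.
public interface DistributedHashTable {
    byte[] get(byte[] key);              // returns null if the key is absent
    void put(byte[] key, byte[] value);  // atomic per element; no cross-element transactions
    void remove(byte[] key);
}
```

The point of the abstraction is that service code never mentions nodes, replicas, or recovery.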
22 CAP trade-offs
23 Clusters for Internet Services
- Previous observation (TACC, Inktomi, NOW)
- Clusters of workstations are a natural platform for constructing Internet services
- Internet service properties
- support large, rapidly growing user populations
- must remain highly available and cost-effective
- Clusters offer a tantalizing solution
- incremental scalability: the cluster grows with the service
- natural parallelism: a high-performance platform
- software and hardware redundancy: fault tolerance
24 Software troubles
- Internet service construction on clusters is hard
- load balancing, process management, communications abstractions, I/O balancing, fail-over and restart, ...
- toolkits proposed to help (TACC, AS1, River, ...)
- Even harder if shared, persistent state is involved
- data partitioning, replication, and consistency; interacting with the storage subsystem, ...
- solutions not geared to clustered services
- use a (distributed) RDBMS: expensive, powerful semantic guarantees, generality at the cost of performance
- use a network/distributed FS: overly general, high overhead (e.g. double-buffering penalties). Fault tolerance?
- roll-your-own custom solution: not reusable, complex
25 Idea / Hypothesis
- It is possible
- to isolate clustered services from the vagaries of state mgmt.,
- to do so with adequately general abstractions,
- to build those abstractions in a layered fashion (reuse),
- and to exploit clusters for performance and simplicity.
- Scalable Distributed Data Structures (SDDS)
- take a conventional data structure
- hash table, tree, log, ...
- partition it across nodes in a cluster (see the sketch after this list)
- parallel access, scalability, ...
- replicate partitions within replica groups in the cluster
- availability in the face of failures, further parallelism
- store replicas on disk
26 Why SDDS?
- Fundamental software engineering principle
- separation of concerns
- decouple persistence/consistency logic from the rest of the service
- simpler (and cleaner!) service implementations
- Service authors understand data structures
- familiar behavior and interfaces from the single-node case
- should enable rapid development of new services
- Structure access patterns are self-evident
- access granularity is manifestly a structure element
- coincidence of logical and physical data units
- cf. file systems, SQL in RDBMS, VM pages in DSM
27 SDDS Challenges
- Overcoming complexities of distributed systems
- data consistency, data distribution, request load balancing, hiding network latency and OS overhead, ...
- ace up the sleeve: a cluster is not the wide area
- single, controlled administrative domain
- engineer to (probabilistically) avoid network partitions
- use a low-latency, high-throughput SAN (5 µs, 40-120 MB/s)
- predictable behavior, controlled heterogeneity
- I/O is still a problem
- plenty of work on fast network I/O
- some on fast disk I/O
- less work on bridging network and disk I/O in a cluster environment
- Segment-based cluster I/O layer: filtered streams between disks, network, and memory
28 Prototype hash table
- Storage bricks provide local, network-accessible hash tables
- Interaction with the distributed hash table is through abstraction libraries
- C and Java APIs available
- partitioning and mirrored-replication logic in each library
- Distributed table semantics
- handles node failures
- no consistency
- nor transactions, on-line recovery, etc.
29 Storage bricks
(Storage brick internals, from the figure: argument marshalling; worker pool, one thread dispatched per request; local hash table implementations; messaging & event queue; MMAP region management with alloc()/free(); transport-specific communication and naming.)
- Individual nodes are storage bricks
- consistent, atomic, network-accessible operations on a local hash table
- uses MMAP to handle data persistence
- no transaction support
- Clients communicate with the set of storage bricks using an RPC marshalling layer
(Client side, from the figure: service front-end; service application logic; virtual-to-physical node names and inter-node hashing.)
30 Parallelisms service
- Provides relevant site information given a URL
- an inversion of the Yahoo! directory
- Parallelisms builds an index of all URLs, and returns other URLs in the same topics (lookup logic sketched below)
- read-mostly traffic, nearly no consistency requirements
- large database of URLs
- 1 GB of space for 1.5 million URLs and 80,000 topics
- The service FE itself is very simple
- 400 semicolons of C
- 130 for app-specific logic
- 270 for threads, HTTP munging, ...
- hash table code: 4K semicolons of C
- http://ninja.cs.berkeley.edu/demos/paralllelisms/parallelisms.html
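A minimal sketch of the Parallelisms lookup logic described above: an index from URL to topics, and from topic back to its member URLs; querying a URL returns the other URLs that share one of its topics. The class and field names are illustrative assumptions (the real service runs against the distributed hash table, not in-memory maps):

```java
import java.util.*;

// Hypothetical in-memory sketch of the Parallelisms index.
class Parallelisms {
    private final Map<String, Set<String>> topicsOfUrl = new HashMap<>();
    private final Map<String, Set<String>> urlsOfTopic = new HashMap<>();

    // Index construction: record that a URL is listed under a topic.
    void add(String url, String topic) {
        topicsOfUrl.computeIfAbsent(url, k -> new HashSet<>()).add(topic);
        urlsOfTopic.computeIfAbsent(topic, k -> new HashSet<>()).add(url);
    }

    // Query: all other URLs appearing under any of this URL's topics.
    Set<String> relatedSites(String url) {
        Set<String> result = new TreeSet<>();
        for (String topic : topicsOfUrl.getOrDefault(url, Set.of())) {
            result.addAll(urlsOfTopic.getOrDefault(topic, Set.of()));
        }
        result.remove(url);
        return result;
    }

    public static void main(String[] args) {
        Parallelisms p = new Parallelisms();
        p.add("www.cnn.com", "News");
        p.add("www.nytimes.com", "News");
        p.add("www.espn.com", "Sports");
        System.out.println(p.relatedSites("www.cnn.com")); // [www.nytimes.com]
    }
}
```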
31 Lessons Learned (I)
- mmap() simplified the implementation, but at a price
- service working sets apply naturally
- no pointers: breaks the usual linked-list and hash-table libraries
- little control over the order of writes, so consistency cannot be guaranteed if crashes occur
- if a node goes down, it may incur a lengthy sync before restart
- Same for the abstraction libraries: simplicity with a cost
- each storage brick could be totally independent
- because policy is embedded in the abstraction libraries
- bad for administration & monitoring
- no place to hook in to get a view of the complete table
- each client makes isolated decisions
- load balancing and failure detection
32 Lessons Learned (II)
- The service-simplicity premise seems valid
- Parallelisms service code is devoid of persistence logic
- Parallelisms front-ends contain only session state
- no recovery necessary if they fail
- Interface selection is critical
- originally, just supported put(), get(), remove()
- wanted to support a java.util.Hashtable subclass (see the sketch below)
- needed enumerations, containsKey(), containsObject()
- significant re-plumbing required to support these efficiently!
- The thread subsystem was troublesome
- the JDK has its own, and it conflicted; had to remove threads from the client-side abstraction library.
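A minimal sketch of what the desired java.util.Hashtable subclass might look like, with the single-element operations delegated to the distributed table. DistributedHashTable is the illustrative interface sketched earlier, and the string/byte conversions are assumptions; the enumeration methods are exactly the part such a thin wrapper cannot supply without the extra support ("re-plumbing") noted above:

```java
import java.nio.charset.StandardCharsets;
import java.util.Hashtable;

// Hypothetical wrapper: presents the SDDS hash table through the familiar
// java.util.Hashtable interface expected by existing service code.
class SddsHashtable extends Hashtable<String, byte[]> {
    private final DistributedHashTable sdds;   // client-side abstraction library (sketched earlier)

    SddsHashtable(DistributedHashTable sdds) { this.sdds = sdds; }

    private static byte[] keyBytes(Object key) {
        return key.toString().getBytes(StandardCharsets.UTF_8);
    }

    @Override public synchronized byte[] put(String key, byte[] value) {
        byte[] old = sdds.get(keyBytes(key));
        sdds.put(keyBytes(key), value);
        return old;
    }

    @Override public synchronized byte[] get(Object key) {
        return sdds.get(keyBytes(key));
    }

    @Override public synchronized boolean containsKey(Object key) {
        return sdds.get(keyBytes(key)) != null;
    }

    // keys(), elements(), contains(Object value): these require enumerating every
    // brick's local table, which this thin wrapper cannot do on its own.
}
```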
33 SDDS goal: simplicity
- Hypothesis: simplify construction of services
- evidence: Parallelisms
- distributed hash table prototype: 3000 lines of C code
- service: 400 lines of C code, 1/3 of which is service-specific
- evidence: Keiretsu service
- instant messaging service between heterogeneous devices
- crux of the service is the sharing of binding/routing state
- original: 131 lines of Java; SDDS version: 80 lines of Java
- Management/operational aspects
- to be successful, authors must want to adopt SDDSs
- simple to incorporate and understand
- operational management must be nearly transparent
- node fail-over and recovery, logging, etc. behind the scenes
- plug-n-play extensibility to add capacity
34 SDDS goal: generality
- Potential criticism of SDDSs
- no matter which structures you provide, some services simply can't be built with only those primitives
- response: pick a basis that enables many interesting services
- log, hash table, and tree: our guess at a good basis
- a layered model will allow people to develop other SDDSs
- allow GiST-style specialization hooks?
35 SDDS Ideas on Consistency (I)
- Consistency / performance tradeoffs
- stricter consistency requirements imply worse performance
- we know some intended services have weaker requirements
- Rejected alternatives
- build strict consistency, and force people to use it
- investigate extended transaction models
- SDDS choice
- pick a small set of consistency guarantees
- level 0 (atomic but not isolated operations)
- level 3 (ACID)
36 SDDS Ideas on Consistency (II)
- Replica management
- what mechanism will we use between replicas?
- two-phase commit for distributed atomicity (sketched below)
- log-based on-line recovery
- Exploiting cluster properties
- low network latency enables fast two-phase commit
- especially relative to WAN latency for Internet services
- given a good UPS, node failures are independent
- commit to the memory of a peer in the group, not to disk
- (probabilistically) engineer away network partitions
- an unavailable node can therefore be treated as failed
- so a consensus algorithm is not needed
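A minimal sketch of two-phase commit across a replica group, in Java. The Replica interface and the single-coordinator structure are illustrative assumptions; the cluster argument above is simply that these prepare/commit round trips are cheap on a SAN, and that replicas may acknowledge from memory rather than disk:

```java
import java.util.List;

// What one replica must support for two-phase commit (illustrative interface).
interface Replica {
    boolean prepare(String txnId, byte[] key, byte[] value); // vote yes/no, hold the update
    void commit(String txnId);                               // apply the prepared update
    void abort(String txnId);                                // discard the prepared update
}

// Coordinator side: a write is applied atomically across the replica group.
class TwoPhaseCommit {
    private final List<Replica> group;

    TwoPhaseCommit(List<Replica> group) { this.group = group; }

    boolean write(String txnId, byte[] key, byte[] value) {
        // Phase 1: ask every replica to prepare; any "no" vote aborts the transaction.
        boolean allPrepared = true;
        for (Replica r : group) {
            if (!r.prepare(txnId, key, value)) { allPrepared = false; break; }
        }
        // Phase 2: commit everywhere, or abort everywhere.
        for (Replica r : group) {
            if (allPrepared) r.commit(txnId); else r.abort(txnId);
        }
        return allPrepared;
    }
}
```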
37 SDDS Ideas on load management
- Data distribution affects request distribution
- Start simple: static data distribution
- given a request, look up or hash to determine the partition
- Optimizations
- locality-aware request distribution (LARD) within replicas
- if there are no failures, replicas further partition the data in memory
- front ends are often colocated with storage nodes
- front-end selection based on data-distribution knowledge
- smart clients (Ninja redirector stubs?)
- Issues
- graceful degradation: RED/LRP techniques to drop requests (a RED-style sketch follows this list)
- given many simultaneous requests, what should the service ordering policy be?
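A minimal sketch of RED-style early dropping applied to a request queue, in Java. The thresholds and the linear drop-probability ramp are illustrative assumptions; the idea is simply that requests start being refused probabilistically before the queue is completely full, so the service degrades gracefully instead of collapsing under load:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.Random;

// Hypothetical RED-style admission control for a service request queue.
class RedQueue<T> {
    private final Queue<T> queue = new ArrayDeque<>();
    private final int minThreshold;   // below this: accept everything
    private final int maxThreshold;   // at or above this: drop everything
    private final Random random = new Random();

    RedQueue(int minThreshold, int maxThreshold) {
        this.minThreshold = minThreshold;
        this.maxThreshold = maxThreshold;
    }

    // Returns false if the request was dropped at admission time.
    synchronized boolean offer(T request) {
        int occupancy = queue.size();
        if (occupancy >= maxThreshold) return false;
        if (occupancy > minThreshold) {
            // Drop probability rises linearly from 0 to 1 between the two thresholds.
            double dropProb = (occupancy - minThreshold) / (double) (maxThreshold - minThreshold);
            if (random.nextDouble() < dropProb) return false;
        }
        return queue.add(request);
    }

    synchronized T poll() { return queue.poll(); }
}
```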
38 Incremental Scalability (I)
- Logs and trees have a natural solution
- pointers are ingrained in these structures
- use the pointers to (re)direct structures onto new nodes
39 Incremental Scalability (II)
- The hash table is the tricky one!
- Why? The mapping is done by client-side hash functions
- unless the table is chained, there are no pointers inside the hash structure
- need to change the client-side functions to scale the structure
- Litwin's linear hashing? (addressing sketched below)
- the client-side hash function evolves over time
- clients independently discover when to evolve their functions
- A directory-based map?
- move hashing into the infrastructure (inefficient)
- or have the infrastructure inform clients when to change functions
- AFS-style registration and callbacks?
40 Getting the Interfaces Right
- Upper interfaces: sufficient generality
- setting the bar for functionality (e.g. java.util.Hashtable)
- opportunity: reuse of existing software (e.g. Berkeley DB)
- Lower interfaces: use a segment-based I/O layer?
- log, tree: natural sequentiality, segments make sense
- the hash table is much more challenging
- aggregating small, random accesses into large, sequential ones
- rely on commits to other nodes' memory
- periodically dump deltas to disk, LFS-style
41 Evaluation: use real services
- Metrics for success
- 1) measurable reduction in complexity to author Internet services
- 2) widespread adoption of SDDS by Ninja researchers
- (1) Port/reimplement existing Ninja services
- Keiretsu, Ninja Jukebox, the MultiSpace Log service
- explicitly demonstrate code reduction & performance boon
- (2) Convince people to use SDDS for new services
- NinjaMail, Service Discovery Service, ICEBERG services
- Challenge: operational aspects of SDDS
- goal: using SDDS should be as simple as the single-node, non-persistent case
42 Segment layer (motivation)
- It's all about disk bandwidth & avoiding seeks
- 8 ms random seek, 25-80 MB/s throughput
- must read roughly 320 KB per seek to break even (arithmetic sketched below)
- Build a disk abstraction layer based on segments
- 1-2 MB regions on disk, read and written in their entirety
- force upper layers to design with this in mind
- small reads/writes treated as an uncommon failure case
- SAN throughput is comparable to disk throughput
- stream from disk to network: saturate both channels
- stream through service-specific filter functions
- selection, transformation, ...
- Apply lessons from high-performance networks
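The 320 KB break-even figure above is just seek time multiplied by transfer bandwidth: the transfer size at which the disk spends as long moving data as it spent seeking. A small sketch of that arithmetic, in plain Java, using the numbers quoted on the slide:

```java
public class SeekBreakEven {
    // Transfer size at which transfer time equals seek time:
    // bytes = seekTimeSeconds * bandwidthBytesPerSecond.
    static double breakEvenKB(double seekMs, double bandwidthMBps) {
        return (seekMs / 1000.0) * bandwidthMBps * 1000.0; // (s) * (MB/s) = MB, expressed in KB
    }

    public static void main(String[] args) {
        double seekMs = 8.0;
        for (double mbps : new double[] {25, 40, 80}) {
            System.out.printf("%.0f MB/s -> break-even at %.0f KB per seek%n",
                    mbps, breakEvenKB(seekMs, mbps));
        }
        // 25 MB/s -> 200 KB, 40 MB/s -> 320 KB, 80 MB/s -> 640 KB:
        // reads much smaller than a 1-2 MB segment waste most of the disk's time on seeks.
    }
}
```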
43 Segment layer challenges
- Thread & event model
- the lowest-level model dictates the entire application stack
- dependency on a particular thread subsystem is undesirable
- Asynchronous interfaces are essential
- especially for Internet services with thousands of connections
- potential model: VIA completion queues
- Reusability for many components
- toughest customer: the Telegraph DB
- must dictate write ordering, and be able to roll back modifications for aborts
- if content is paged, make sure we don't overwrite it on disk
- no mmap()!
44 Segment Implementation Plan
- Two versions planned
- One version using POSIX syscalls and a vanilla filesystem
- definitely won't perform well (copies to handle shadowing)
- portable to many platforms
- good for prototyping and getting the API right
- One version on Linux with kernel modules for specialization
- IO-Lite-style buffer unification
- use VIA or AM for network I/O
- modify the VM subsystem for copy-on-write segments, and/or paging dirty data to a separate region
45 Related work (I)
- (S)DSM
- a structural element is a better atomic unit than a page
- fault tolerance as a goal
- Distributed/networked FS: NFS, AFS, xFS, LFS, ...
- an FS is more general, so it has less chance to exploit structure
- often not in a clustered environment (except xFS, Frangipani)
- Litwin SDDS: LH, LH*, RP, RP*
- significant overlap in goals
- but little implementation experience
- little exploitation of cluster characteristics
- consistency model not clear
46 Related Work (II)
- Distributed & parallel databases: R*, Mariposa, Gamma, ...
- different goal (generality in structure/queries, xacts)
- stronger and richer semantics, but at a cost
- in both price and performance
- Fast I/O research: U-Net, AM, VIA, IO-Lite, fbufs, x-kernel, ...
- network and disk subsystems
- main results: get the OS out of the way, avoid (unnecessary) copies
- use these results in our fast I/O layer
- Cluster platforms: TACC, AS1, River, Beowulf, GLUnix, ...
- harvesting idle resources, process migration, single-system view
47 Taxonomy of Clustered Services
- Stateless: little or no state-management requirements
- Soft state: high availability; perhaps consistency; persistence is an optimization
- Persistent state: high availability and completeness; perhaps consistency; persistence is necessary
- Examples across this spectrum: TACC distillers, TACC aggregators, River modules, AS1 servents or RMX, Squid web cache, Video Gateway, Inktomi search engine, Parallelisms, Scalable PIM apps, HINDE mint
48 Performance
- Bulk-loading of the database is dominated by disk access time
- Can achieve 1500 inserts per second per node on a 100 Mb/s Ethernet cluster, if the hash table fits in memory (dominant cost is the messaging layer)
- Otherwise, degrades to about 30 inserts per second (dominant cost is disk write time)
- In steady state, all nodes operate primarily out of memory, as the working set is fully paged in
- similar in principle to the research Inktomi cluster
- handles hundreds of queries per second on a 4-node cluster with 2 FEs
49 SEDA (M. Welsh)
- Staged, event-driven architecture
- Service stages, linked via queues (see the sketch below)
- Thread pool per stage
- Massive concurrency
- Admission & priority control on each individual queue
- Adaptive load balancing
- Feedback loop
- No a-priori resource limits
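A minimal sketch of one SEDA-style stage in Java: events arrive on a bounded queue (which is where admission control can act), a small thread pool drains the queue and runs the stage's handler, and the handler may enqueue new events onto the next stage. Names and sizes are illustrative assumptions, not SEDA's actual API:

```java
import java.util.concurrent.*;

// Hypothetical sketch of a SEDA-style stage: bounded event queue + thread pool + handler.
class Stage<E> {
    interface Handler<E> { void handle(E event); }

    private final BlockingQueue<E> queue;
    private final ExecutorService workers;

    Stage(int queueCapacity, int threads, Handler<E> handler) {
        this.queue = new ArrayBlockingQueue<>(queueCapacity);
        this.workers = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            workers.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        handler.handle(queue.take());   // block until an event arrives
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
    }

    // Admission control hook: returns false instead of blocking when the stage is overloaded.
    boolean enqueue(E event) { return queue.offer(event); }

    void shutdown() { workers.shutdownNow(); }
}

class SedaDemo {
    public static void main(String[] args) throws InterruptedException {
        // Two stages chained by their queues: parse a request, then send a response.
        Stage<String> respond = new Stage<>(1024, 2,
                reply -> System.out.println("sending: " + reply));
        Stage<String> parse = new Stage<>(1024, 2,
                req -> respond.enqueue("HTTP/1.0 200 OK for " + req));

        parse.enqueue("GET /index.html");
        Thread.sleep(200);     // let the pipeline drain for the demo
        parse.shutdown();
        respond.shutdown();
    }
}
```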
50 Overhead of concurrency (I)
51 Overhead of concurrency (II)
52 SEDA architecture (I)
53 SEDA architecture (II)
54 SEDA architecture (III)
55 References
- S.D. Gribble, E.A. Brewer, J.M. Hellerstein, and D. Culler, "Scalable, Distributed Data Structures for Internet Service Construction", Proc. 4th OSDI, 2000.
- M. Welsh, D. Culler, and E. Brewer, "SEDA: An Architecture for Well-Conditioned, Scalable Internet Services", Proc. SOSP, 2001.