Title: OnCall
1 OnCall
- Defeating Traffic Spikes with a Free-Market Application Cluster
James Norris, Keith Coleman, Armando Fox, George Candea
Stanford University
2 Motivation
3 CNN.com
- September 11, 2001
- 4x traffic in a single day; 8x traffic on the second day
- Offline for 2.5 hours, diminished service afterwards
- Forced to borrow servers from sister AOL-TW websites
- Page Views (chart values): 40 M, 162.4 M, 337.4 M
4 Slashdot, etc.
- Slashdot Effect
- Knocks out sites (often at the worst possible time)
- Variable Traffic
- Ticket Sales
- Contests
- Online Fashion Shows
- etc
5 What to do?
6 One Option: Overprovision
- Works for steady-state fluctuations (but is it optimal?)
- Too expensive for spike conditions (8x servers for CNN)
- Think about it: like having a fixed-size buffer
- Can only support 1000 entries → Lame
- Stanford Axess: "Sorry, 49 people already logged in"
- And in steady state there is so much waste
- So what do we do? Use dynamic allocation
7 What is OnCall?
- OnCall is
- a cluster management system designed to multiplex several (possibly competing) dynamic web applications onto a single cluster.
- Goal
- Make spike handling possible while providing useful resource guarantees to all apps
8 OnCall Overview
- Marketplace of Applications
- Applications rent and lend computing resources according to pre-defined market policies
- Generic Platform
- Based on VMs
- → application generic
- → fast app swapping
9 Marketplace
10 Market Rounds
- Offline
- Each application is assigned ownership of G computers at a fixed price (or rate)
- Online
- Determine the market equilibrium price, P, by querying each application
- Calculate new allocation sizes at price P
- Adjust allocations, moving computers from sellers to buyers
- Repeat every time quantum, t
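The "moving computers from sellers to buyers" step can be sketched as a pairing of apps whose allocation shrank with apps whose allocation grew. This is only an illustration under assumptions: the function name `plan_transfers`, the dictionary layout, and the example numbers are hypothetical and not taken from OnCall itself.

```python
def plan_transfers(previous, target):
    """Pair sellers with buyers.

    previous, target: dict mapping app name -> node count for the
    last round and the new round. Both must cover the same total.
    Returns a list of (seller, buyer, count) moves.
    """
    if sum(previous.values()) != sum(target.values()):
        raise ValueError("previous and target must cover the same number of nodes")

    # Apps giving up nodes this round, with how many they give up.
    sellers = [[app, previous[app] - target.get(app, 0)]
               for app in sorted(previous)
               if previous[app] > target.get(app, 0)]
    # Apps gaining nodes this round, with how many they gain.
    buyers = [[app, target[app] - previous.get(app, 0)]
              for app in sorted(target)
              if target[app] > previous.get(app, 0)]

    moves = []
    while sellers and buyers:
        seller, buyer = sellers[0], buyers[0]
        count = min(seller[1], buyer[1])
        moves.append((seller[0], buyer[0], count))
        seller[1] -= count
        buyer[1] -= count
        if seller[1] == 0:
            sellers.pop(0)
        if buyer[1] == 0:
            buyers.pop(0)
    return moves


# Hypothetical round: app A spikes, B and C give up nodes.
moves = plan_transfers({"A": 10, "B": 10, "C": 10},
                       {"A": 14, "B": 7, "C": 9})
print(moves)  # [('B', 'A', 3), ('C', 'A', 1)]
```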
11 Offline Market: G
- G
- Each app owns G nodes
- Resource guarantees
- Never have to sell: no matter what the price or what other apps demand, an app is guaranteed use of its G nodes
- Can lend by choice (if there are renters at the desired price)
- Can rent extra nodes (if it needs to and/or can afford to)
12 Online Market
- Diagram: the Marketplace queries three application Policies; 10 nodes in cluster
- 7 + 5 + 2 = 14, but I only have 10 nodes!
- 5 + 3 + 2 = 10. Perfect!
13 Online Market Policies
- Diagram: Price P → POLICY → Output: number of computers desired at price P
14 Example Market Policy
- For each round, application A computes the number of nodes, n, it needs to handle current traffic
- Ex: Application A has a price threshold of 6
- If (P < 6), A will ask for n nodes
- If (P >= 6), A will only ask for min(n, G) nodes; it can't afford to rent extras
- Figure labels: n < G (no spike); n > G (spike)
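The threshold policy on this slide can be sketched as a small function. This is a minimal illustration rather than OnCall's actual code: the name `threshold_policy` and its parameter names are assumptions made for the example, and the default threshold of 6 is the slide's example value.

```python
def threshold_policy(price, needed, guaranteed, threshold=6):
    """Return the number of nodes an app requests at the given price.

    Hypothetical sketch of the slide's example policy:
    - below the price threshold, ask for everything needed (n);
    - at or above it, ask for at most the guaranteed share, min(n, G).
    """
    if price < threshold:
        return needed
    return min(needed, guaranteed)


# No spike (n < G): the request is n regardless of price.
print(threshold_policy(price=8, needed=4, guaranteed=10))   # 4
# Spike (n > G): extras are rented only while the price is below 6.
print(threshold_policy(price=5, needed=14, guaranteed=10))  # 14
print(threshold_policy(price=6, needed=14, guaranteed=10))  # 10
```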
15 Finding the Equilibrium
- Sample points along the different policy functions
- Determine the price at which the total number of nodes desired by all apps equals the total number of nodes available on the cluster
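One way to picture the sampling step is a scan over candidate prices. The sketch below is an assumption-laden illustration, not the OnCall implementation: it assumes each policy is a function from price to nodes desired, that demand never rises with price, and it accepts the lowest sampled price at which total demand fits the cluster (step-shaped policies may never hit exact equality). The helper names, G values, and thresholds are hypothetical, chosen so the totals match the 14-versus-10 example from the Online Market slide.

```python
def find_equilibrium(policies, total_nodes, prices):
    """Scan sampled prices in ascending order and return (price, demands)
    for the first price at which total demand fits within the cluster.

    Returns None if no sampled price brings demand down far enough.
    """
    for price in sorted(prices):
        demands = [policy(price) for policy in policies]
        if sum(demands) <= total_nodes:
            return price, demands
    return None


def make_threshold_policy(needed, guaranteed, threshold):
    """Build a price -> demand function for one app (hypothetical helper)."""
    def policy(price):
        return needed if price < threshold else min(needed, guaranteed)
    return policy


policies = [
    make_threshold_policy(needed=7, guaranteed=5, threshold=9),
    make_threshold_policy(needed=5, guaranteed=3, threshold=4),
    make_threshold_policy(needed=2, guaranteed=2, threshold=4),
]

# At low prices the apps want 7 + 5 + 2 = 14 nodes; only 10 exist.
result = find_equilibrium(policies, total_nodes=10, prices=range(1, 11))
print(result)  # (9, [5, 3, 2])
```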
16 Notes and Assumptions
- Homogeneity Assumption
- Cluster is assumed to be homogeneous; all nodes are rented at the same price (for simplicity)
- Swapping Costs
- Time delay cost in start up / shut down of an app on a node.
- If a rental contract is renewed, the app runs on the same node.
- P Only for Extras
- Apps only pay price P for nodes above and beyond their own G
- Ex: Using 40, G = 30
- → 40 - 30 = 10 nodes at price P
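The "P only for extras" rule amounts to a one-line calculation. The sketch below assumes a hypothetical function name `rental_cost` and a made-up price of 2.5; only the node counts (40 used, G = 30) come from the slide.

```python
def rental_cost(nodes_used, guaranteed, price):
    """Cost for one round: only nodes beyond the app's own G are billed."""
    extras = max(0, nodes_used - guaranteed)
    return extras * price


# Slide example: using 40 nodes with G = 30 means 10 rented nodes.
print(rental_cost(nodes_used=40, guaranteed=30, price=2.5))  # 25.0
# Staying within G costs nothing at any price.
print(rental_cost(nodes_used=25, guaranteed=30, price=2.5))  # 0.0
```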
17 Platform
18 Platform Overview
19 Runtime Operation
- Runtime cycle repeats every t
- Marketplace calculates equilibrium price (and thus application allocations)
- Manager assigns apps to physical nodes (minimizing shutdowns and startups)
- Manager signals Responders to shut down and start new apps, as necessary
- At end of round, Manager gathers new usage stats and reports them to Market Policies
- Repeat
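The node-assignment step (keeping shutdowns and startups low) can be approximated with a simple greedy pass: every app first keeps the nodes it already runs on, up to its new allocation, and only the leftover nodes are swapped. This is a sketch under assumptions, not the Manager's actual algorithm; `assign_nodes` and the dictionary layout are hypothetical.

```python
def assign_nodes(current, allocation):
    """Map nodes to apps for the next round while reusing placements.

    current:    dict node id -> app name (None means idle)
    allocation: dict app name -> number of nodes for the next round
    Returns a dict node id -> app name (None means idle).
    """
    new_assignment = {}
    remaining = dict(allocation)
    free_nodes = []

    # Pass 1: an app keeps a node it already occupies if it still has quota.
    for node in sorted(current):
        app = current[node]
        if app is not None and remaining.get(app, 0) > 0:
            new_assignment[node] = app
            remaining[app] -= 1
        else:
            free_nodes.append(node)

    # Pass 2: unmet quota is filled from the freed or idle nodes.
    for app in sorted(remaining):
        for _ in range(remaining[app]):
            if not free_nodes:
                raise ValueError("allocation exceeds cluster size")
            new_assignment[free_nodes.pop(0)] = app

    # Anything left over stays idle.
    for node in free_nodes:
        new_assignment[node] = None
    return new_assignment


current = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: None}
new = assign_nodes(current, {"A": 2, "B": 3})
changed = [node for node in current if current[node] != new[node]]
print(changed)  # [2]
```

In this example only node 2 changes hands, so a single app shutdown and startup is needed.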
20 Does this work?
21 Simulation Testbed
- Three Simulations, Four Traits
- Spike handling under unconstrained resources
- Spike handling under constrained resources
- Resource guarantees
- Fast server activation
- U.C. Berkeley X Cluster
- 30 Nodes (double CNN.com)
- Dual 1 GHz PIII, 1.5 GB RAM
- VMware GSX Server on Linux
22 Sim 1 Spike Handling
- G = 10 for both apps
- App 1 handles spikes, App 2 makes
- Notice: lag time between node assigned → node active
23 Sim 2 Resource Constraints
- G1 = 12, G2 = 6, G3 = 12
- App 1 has a higher budget than App 2, but both spike
- App 1 handles spikes, App 2 sees guarantee, App 3 makes
- App 2 buys more when App 1's spike subsides
24 Sim 3 Fast Activation
Platform              Time until Active (s)
OnCall Optimal        5-10
OnCall Limited        50-120
Standard with OS      270-330
Standard w/out OS     710-750
- OnCall Optimal: Load VMs from suspended state
- OnCall Limited: Load VMs from shutdown state
- Standard with OS: OS already installed on node
- Standard without OS: Must install OS first
- Significance
- Worst case, > 2x improvement
- When a spike lasts only 30 minutes, this is significant
- If you can start up quickly, an accurate predictor is not critical
25 More on Markets
26 Marketplace Optimality
- What is optimal?
- Under resource constraints, those applications with the most utility to derive from the use of additional nodes are given those nodes
- Utility Curves
- Curve specifies the dollar value an application derives from possessing a certain number of nodes for a specific time quantum.
- Trivially: utility curves are always monotonically non-decreasing (i.e. it is never worse to own more nodes at a given total cost)
- To be optimal: marginal utility curves are always monotonically non-increasing (i.e. every additional node is worth the same as or less than the one before)
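The two curve conditions can be checked mechanically on a sampled utility curve. The sketch below is illustrative only: the function names and the sample dollar values are hypothetical, with `utility[k]` standing for the value of holding k nodes for one time quantum.

```python
def marginal_utilities(utility):
    """Value added by each extra node: utility[k + 1] - utility[k]."""
    return [utility[k + 1] - utility[k] for k in range(len(utility) - 1)]


def is_non_decreasing(values):
    return all(a <= b for a, b in zip(values, values[1:]))


def is_non_increasing(values):
    return all(a >= b for a, b in zip(values, values[1:]))


# Hypothetical utility curve for 0..5 nodes (dollars per quantum).
utility = [0, 10, 18, 24, 28, 30]
marginals = marginal_utilities(utility)

print(marginals)                     # [10, 8, 6, 4, 2]
print(is_non_decreasing(utility))    # True: more nodes are never worse
print(is_non_increasing(marginals))  # True: each extra node is worth less
```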
27 Marketplace Fairness
- Markets are optimal if
- they are free and fair
- Anti-competitive behavior
- Monopoly/Oligopoly
- Aggressive tactics
- Fairness through Regulation
- Ensure enough distinct owners → no monopoly
- Fine or ban an app that engages in overtly anti-competitive behavior
28 Competitive vs Cooperative
- Competitive Environments
- Ex: ASP, where app owners may be in competition
- Cooperative Environments
- Ex: Search engine, Yahoogle
- Quick Case Study
- App 1: Paid web search (very high value in low latency)
- App 2: Ad-supported web search (high value in low latency)
- App 3: Crawler (latency OK, starvation not)
- For each app, model the utility of running at a given time
- Benefit: If you add an app, you just need to model that app, not remodel the whole system
29 Profit Through Efficiency
- Shut Down App
- ASP shuts down servers when it can buy them for less than the cost of keeping them running (A/C, utilities, etc.)
- ASP can then add additional capacity and sell only when profitable
30 Future Work
31 Future Work
- VM caching
- Cache VMs to local disk (speculatively or as read from NAS)
- Fault tolerance
- Add master-backup fault tolerance to the OnCall Manager
- Performance statistics
- Provide market policies with additional statistics (e.g. end-to-end response time)
- Scalable data layer
- Add support for scalable persistent stores that would allow replication on the data tier.
- Multiplexing
- Study trade-offs of running several applications on one node
32 Questions?