OnCall - PowerPoint PPT Presentation

About This Presentation

Title:

OnCall

Description:

CNN used 15 4-proc Suns Needed 2 computers from Cartoon Network, ... cluster L7 Load Balancers Internet Network Attached Storage containing Application VM ... – PowerPoint PPT presentation

Slides: 33

Provided by: Keith393

Learn more at: https://cs.stanford.edu

Transcript and Presenter's Notes

Title: OnCall

1
OnCall

Defeating Traffic Spikes with a Free-Market
Application Cluster

James Norris, Keith Coleman, Armando Fox,
George Candea
Stanford University
2
Motivation
3
CNN.com

September 11, 2001
4x traffic in a single day
8x traffic on second day
Offline for 2.5 hours, diminished service
afterwards
Forced to borrow servers from sister AOL-TW
websites

[Chart: Page Views. Values shown: 40 M, 162.4 M, 337.4 M]
4
Slashdot, etc

Slashdot Effect
Knocks out sites (often at the worst possible
time)
Variable Traffic
Ticket Sales
Contests
Online Fashion Shows
etc

5
What to do?
6
One Option: Overprovision

Works for steady-state fluctuations (but is
it optimal?)
Too expensive for spike conditions (8x
servers for CNN)
Think about it: like having a fixed-size buffer
Can only support 1000 entries → Lame
Stanford Axess: "Sorry, 49 people already
logged in"
And in steady state there is so much waste
So what do we do? Use dynamic allocation

7
What is OnCall?

OnCall is
a cluster management system designed to
multiplex several (possibly competing) dynamic
web applications onto a single cluster.
Goal
Make spike handling possible while providing
useful resource guarantees to all apps

8
OnCall Overview

Marketplace of Applications
Applications rent and lend computing resources
according to pre-defined market policies
Generic Platform
Based on VMs
→ application generic
→ fast app swapping

9
Marketplace
10
Market Rounds

Offline
Each application assigned ownership of G
computers at a fixed price (or rate)
Online
Determine market equilibrium price, P, by
querying each application
Calculate new allocation sizes at price P
Adjust allocations, moving computers from sellers
to buyers
Repeat every time quantum, t

11
Offline Market: G

G
Each app owns G nodes
Resource guarantees
Never have to sell: no matter what the price or
what other apps demand, an app is guaranteed
use of its G nodes
Can lend by choice (if there are renters at
desired price)
Can rent extra nodes (if it needs to and/or can
afford to)

12
Online Market
7 + 5 + 2 = 14, but I only have 10 nodes!
5 + 3 + 2 = 10. Perfect!
[Diagram: Marketplace querying three application Policies; 10 nodes in cluster]
13
Online Market Policies

[Diagram: POLICY box. Input: price P. Output: number of computers desired at price P]
14
Example Market Policy

For each round, application A computes the number
of nodes, n, it needs to handle current traffic
Ex: Application A has a price threshold of 6
If (P < 6), A will ask for n nodes
If (P ≥ 6), A will only ask for min(n, G) nodes;
it can't afford to rent extras

[Plots: policy output when n < G (no spike) and when n > G (spike)]
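
The example policy on this slide can be sketched as a small function. This is only an illustration, not OnCall's actual code: the name `threshold_policy` and its parameters are hypothetical, and treating a price exactly at the threshold as "too expensive" is an assumption.

```python
def threshold_policy(price, needed, guaranteed, threshold=6.0):
    """Return how many nodes an app asks for at the given price.

    needed     -- n, nodes required to handle current traffic
    guaranteed -- G, nodes the app owns outright
    threshold  -- highest price the app is NOT willing to pay (assumed boundary)
    """
    if price < threshold:
        # Cheap enough: ask for everything the traffic requires.
        return needed
    # Too expensive to rent extras: never ask for more than G.
    return min(needed, guaranteed)


# Spike (n > G): demand drops back to G once the price reaches the threshold.
print(threshold_policy(price=5, needed=14, guaranteed=10))  # 14
print(threshold_policy(price=6, needed=14, guaranteed=10))  # 10
# No spike (n < G): the price does not matter.
print(threshold_policy(price=8, needed=4, guaranteed=10))   # 4
```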
15
Finding the Equilibrium

Sample points along the different policy
functions
Determine the price at which the total number of
nodes desired by all apps equals the total number
of nodes available on the cluster
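
A minimal sketch of the sampling idea, under stated assumptions: prices are sampled from a fixed list, policies are plain functions of price, and because step-shaped policies may never sum to exactly the cluster size, the sketch takes the lowest sampled price at which total demand fits. The names `find_equilibrium` and `make_threshold_policy`, and all thresholds and G values, are hypothetical. The demand numbers reuse the 7 + 5 + 2 and 5 + 3 + 2 example from the Online Market slide.

```python
def make_threshold_policy(needed, guaranteed, threshold):
    """Build a policy: price -> number of nodes desired (hypothetical helper)."""
    def policy(price):
        return needed if price < threshold else min(needed, guaranteed)
    return policy


def find_equilibrium(policies, capacity, prices):
    """Return (price, demands) for the lowest sampled price whose total
    demand fits in the cluster, or None if no sampled price works."""
    for price in sorted(prices):
        demands = [policy(price) for policy in policies]
        if sum(demands) <= capacity:
            return price, demands
    return None


policies = [
    make_threshold_policy(needed=7, guaranteed=5, threshold=4),
    make_threshold_policy(needed=5, guaranteed=3, threshold=4),
    make_threshold_policy(needed=2, guaranteed=2, threshold=4),
]
result = find_equilibrium(policies, capacity=10, prices=range(1, 11))
print(result)  # (4, [5, 3, 2])
```

At prices 1 to 3 the apps ask for 14 nodes in total; at price 4 the two spiking apps fall back to their guarantees and demand matches the 10 available nodes.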

16
Notes and Assumptions

Homogeneity Assumption
Cluster is assumed to be homogeneous: all nodes
rented at same price (for simplicity)
Swapping Costs
Time delay cost in start up / shut down of an
app on a node.
If a rental contract is renewed, app runs on
same node.
P Only for Extras
Apps only pay price P for nodes above and beyond
their own G
Ex: Using 40, G = 30
→ 40 - 30 = 10 nodes at price P
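
The "P only for extras" rule amounts to a one-line calculation. The sketch below is illustrative; `rental_cost` is a hypothetical name, and the price of 5 is an arbitrary value chosen for the example (the slide does not give one).

```python
def rental_cost(nodes_used, guaranteed, price):
    """Cost for one time quantum: only nodes beyond G are charged at price P."""
    extras = max(0, nodes_used - guaranteed)
    return extras * price


# Slide example: using 40 nodes with G = 30 means 10 rented nodes.
print(rental_cost(nodes_used=40, guaranteed=30, price=5))  # 50
# An app staying within its guarantee pays nothing.
print(rental_cost(nodes_used=20, guaranteed=30, price=5))  # 0
```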

17
Platform
18
Platform Overview
19
Runtime Operation

Runtime cycle repeats every t
Marketplace calculates equilibrium price (and
thus application allocations)
Manager assigns apps to physical nodes
(minimizing shutdowns and startups)
Manager signals Responders to shut down and start
apps, as necessary
At end of round, Manager gathers new usage stats
and reports them to Market Policies
Repeat
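
One way to picture the "minimizing shutdowns and startups" step is a two-pass assignment: first keep every app on nodes it already occupies, then hand leftover nodes to apps that still need more. This is a sketch under assumptions, not the OnCall Manager's actual algorithm; `assign_nodes` and the node and app names are hypothetical.

```python
def assign_nodes(current, target):
    """current: dict node -> app (None if idle); target: dict app -> node count.
    Returns a new dict node -> app that keeps apps in place where possible."""
    remaining = dict(target)
    new_assignment = {}
    free_nodes = []
    # Pass 1: a node keeps its app if that app still needs nodes.
    for node in sorted(current):
        app = current[node]
        if app is not None and remaining.get(app, 0) > 0:
            new_assignment[node] = app
            remaining[app] -= 1
        else:
            free_nodes.append(node)
    # Pass 2: unmet demand is placed on the freed nodes (these are the swaps).
    for app in sorted(remaining):
        for _ in range(remaining[app]):
            if not free_nodes:
                raise ValueError("target allocation exceeds cluster size")
            new_assignment[free_nodes.pop(0)] = app
    # Anything still free stays idle.
    for node in free_nodes:
        new_assignment[node] = None
    return new_assignment


current = {"n1": "A", "n2": "A", "n3": "B", "n4": "B"}
new_assignment = assign_nodes(current, {"A": 3, "B": 1})
swaps = [n for n in current if current[n] != new_assignment[n]]
print(new_assignment)  # {'n1': 'A', 'n2': 'A', 'n3': 'B', 'n4': 'A'}
print(swaps)           # ['n4']
```

Only one node changes apps here, which is the minimum possible when B gives up one node to A.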

20
Does this work?
21
Simulation Testbed

Three Simulations, Four Traits
Spike handling under unconstrained resources
Spike handling under constrained resources
Resource guarantees
Fast server activation
U.C. Berkeley X Cluster
30 Nodes (double CNN.com)
Dual 1 GHz PIII, 1.5 GB RAM
VMware GSX Server on Linux

22
Sim 1: Spike Handling

G = 10 for both apps
App 1 handles spikes, App 2 makes money
Notice: lag time between node assigned → node
active

23
Sim 2: Resource Constraints

G1 = 12, G2 = 6, G3 = 12
App 1 has higher budget than App 2, but both
spike
App 1 handles spikes, App 2 sees its guarantee, App 3
makes money
App 2 buys more when App 1's spike subsides

24
Sim 3: Fast Activation

Platform             Time until Active (s)
OnCall Optimal       5-10
OnCall Limited       50-120
Standard with OS     270-330
Standard w/out OS    710-750

OnCall Optimal: Load VMs from suspended state
OnCall Limited: Load VMs from shutdown state
Standard with OS: OS already installed on node
Standard without OS: Must install OS first
Significance
Worst case, > 2x improvement
When a spike lasts only 30 minutes, this is
significant
If you can start up quickly, an accurate predictor is
not critical
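
The significance claims can be checked with a little arithmetic on the activation times reported for Sim 3. One assumption is made here: "worst case" is read as the slowest OnCall configuration (120 s) against the fastest standard one (270 s). The variable and function names are illustrative.

```python
SPIKE_SECONDS = 30 * 60  # the 30-minute spike mentioned on the slide

# (min, max) seconds until active, taken from the Sim 3 table.
activation_seconds = {
    "OnCall Optimal": (5, 10),
    "OnCall Limited": (50, 120),
    "Standard with OS": (270, 330),
    "Standard w/out OS": (710, 750),
}


def fraction_of_spike(seconds, spike_seconds=SPIKE_SECONDS):
    """Share of the spike that passes before a new node is serving traffic."""
    return seconds / spike_seconds


# Slowest OnCall case versus fastest standard case (assumed reading of "worst case").
worst_case_speedup = (
    activation_seconds["Standard with OS"][0]
    / activation_seconds["OnCall Limited"][1]
)
print(worst_case_speedup)  # 2.25

for platform, (low, high) in activation_seconds.items():
    print(f"{platform}: {fraction_of_spike(low):.1%} to {fraction_of_spike(high):.1%} of the spike")
```

Under this reading, a standard node without an OS misses roughly 40% of a 30-minute spike before it becomes active, while an OnCall node restored from a suspended VM misses well under 1%.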

25
More on Markets
26
Marketplace Optimality

What is optimal?
Under resource constraints, those applications
with the most utility to derive from the use of
additional nodes are given those nodes
Utility Curves
Curve specifies dollar value an application
derives from possessing a certain number of nodes
for a specific time quantum.

Trivially: utility curves are always
monotonically non-decreasing (i.e. it is never
worse to own more nodes at a given total cost)
To be optimal: marginal utility curves are
always monotonically non-increasing (i.e. every
additional node is worth the same or less than the one
before)
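
When marginal utilities are non-increasing, handing out nodes one at a time to whichever app values its next node most yields the allocation with the highest total utility. The sketch below illustrates that greedy rule; it is not taken from OnCall itself, and the function name, app names, and dollar values are all hypothetical.

```python
import heapq


def allocate_by_marginal_utility(marginal_utilities, total_nodes):
    """marginal_utilities: dict app -> list, where entry k is the value of the
    app's (k+1)-th node. Each list is assumed to be non-increasing."""
    allocation = {app: 0 for app in marginal_utilities}
    # Max-heap (via negated values) of each app's next-node value.
    heap = [(-values[0], app) for app, values in marginal_utilities.items() if values]
    heapq.heapify(heap)
    for _ in range(total_nodes):
        if not heap:
            break  # no app wants another node
        _, app = heapq.heappop(heap)
        allocation[app] += 1
        values = marginal_utilities[app]
        if allocation[app] < len(values):
            heapq.heappush(heap, (-values[allocation[app]], app))
    return allocation


curves = {"A": [9, 7, 4, 1], "B": [8, 5, 2], "C": [3, 3, 1]}
allocation = allocate_by_marginal_utility(curves, total_nodes=5)
print(allocation)  # {'A': 3, 'B': 2, 'C': 0}
```

With five nodes available, the nodes go to the values 9, 8, 7, 5, and 4 in that order, so the lowest-value app receives none.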
27
Marketplace Fairness

Markets are optimal if
they are free and fair
Anti-competitive behavior
Monopoly/Oligopoly
Aggressive tactics
Fairness through Regulation
Ensure enough distinct owners → no monopoly
Fine or ban any app that engages in overtly
anti-competitive behavior

28
Competitive vs Cooperative

Competitive Environments
Ex: ASP, where app owners may be in competition
Cooperative Environments
Ex: Search engine, Yahoogle
Quick Case Study
App 1: Paid web search (very high value in low
latency)
App 2: Ad-supported web search (high value in
low latency)
App 3: Crawler (latency OK, starvation not)
For each app, model utility of running at a
given time
Benefit: If you add an app, just need to model
that app, not remodel whole system

29
Profit Through Efficiency

Shut Down App
ASP shuts down servers when it can buy them for
less than the cost of keeping them running (A/C,
utilities, etc)
ASP can then add additional capacity and sell
only when profitable

30
Future Work
31
Future Work

VM caching
Cache VMs to local disk (speculatively or as
read from NAS)
Fault tolerance
Add master-backup fault tolerance to the OnCall
Manager
Performance statistics
Provide market policies with additional
statistics (e.g. end-to-end response time)
Scalable data layer
Add support for scalable persistent stores that
would allow replication on the data tier.
Multiplexing
Study trade-offs of running several applications
on one node