OnCall - PowerPoint PPT Presentation

About This Presentation
Title:

OnCall

Description:

CNN used 15 4-proc Suns Needed 2 computers from Cartoon Network, ... cluster L7 Load Balancers Internet Network Attached Storage containing Application VM ... – PowerPoint PPT presentation

Provided by: Keith393
Learn more at: https://cs.stanford.edu

Transcript and Presenter's Notes

Title: OnCall


1
OnCall
  • Defeating Traffic Spikes with a Free-Market
    Application Cluster

James Norris Keith Coleman Armando Fox
George Candea Stanford University
2
Motivation
3
CNN.com
  • September 11, 2001
  • 4x traffic in a single day; 8x traffic on the
    second day
  • Offline for 2.5 hours, diminished service
    afterwards
  • Forced to borrow servers from sister AOL-TW
    websites

[Chart: page views, with values 40 M, 162.4 M, and 337.4 M]
4
Slashdot, etc
  • Slashdot Effect
  • Knocks out sites (often at the worst possible
    time)
  • Variable Traffic
  • Ticket Sales
  • Contests
  • Online Fashion Shows
  • etc

5
What to do?
6
One Option: Overprovision
  • Works for steady-state fluctuations (but is
    it optimal?)
  • Too expensive for spike conditions (8x
    servers for CNN)
  • Think about it: like having a fixed-size buffer
  • Can only support 1000 entries → lame
  • Stanford Axess: "Sorry, 49 people already
    logged in"
  • And in steady state there is so much waste
  • So what do we do? Use dynamic allocation

7
What is OnCall?
  • OnCall is
  • a cluster management system designed to
    multiplex several (possibly competing) dynamic
    web applications onto a single cluster.
  • Goal
  • Make spike handling possible while providing
    useful resource guarantees to all apps

8
OnCall Overview
  • Marketplace of Applications
  • Applications rent and lend computing resources
    according to pre-defined market policies
  • Generic Platform
  • Based on VMs
  • → application generic
  • → fast app swapping

9
Marketplace
10
Market Rounds
  • Offline
  • Each application is assigned ownership of G
    computers at a fixed price (or rate)
  • Online
  • Determine market equilibrium price, P, by
    querying each application
  • Calculate new allocation sizes at price P
  • Adjust allocations, moving computers from sellers
    to buyers
  • Repeat every time quantum, t
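
The "adjust allocations" step can be pictured as turning the old allocation into the new one by pairing sellers with buyers. The following is a minimal sketch only, not the OnCall implementation: the function name `plan_transfers` and the dictionary layout (app name mapped to node count) are assumptions made for illustration.

```python
def plan_transfers(current, target):
    """Return (seller, buyer, count) moves that turn `current` into `target`.

    Both arguments map an app name to a node count (hypothetical layout)
    and must cover the same total number of nodes.
    """
    if sum(current.values()) != sum(target.values()):
        raise ValueError("allocations must cover the same number of nodes")

    apps = sorted(set(current) | set(target))
    delta = {a: target.get(a, 0) - current.get(a, 0) for a in apps}
    # Apps that shrink are sellers; apps that grow are buyers.
    sellers = [[a, -d] for a, d in delta.items() if d < 0]
    buyers = [[a, d] for a, d in delta.items() if d > 0]

    moves = []
    i = j = 0
    while i < len(sellers) and j < len(buyers):
        count = min(sellers[i][1], buyers[j][1])
        moves.append((sellers[i][0], buyers[j][0], count))
        sellers[i][1] -= count
        buyers[j][1] -= count
        if sellers[i][1] == 0:
            i += 1
        if buyers[j][1] == 0:
            j += 1
    return moves


if __name__ == "__main__":
    print(plan_transfers({"A": 5, "B": 3, "C": 2}, {"A": 7, "B": 2, "C": 1}))
```

In this example app A grows by two nodes, taking one each from B and C.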

11
Offline Market: G
  • G
  • Each app owns G nodes
  • Resource guarantees
  • Never have to sell: no matter what the price or
    what other apps demand, an app is guaranteed
    use of its G nodes
  • Can lend by choice (if there are renters at
    desired price)
  • Can rent extra nodes (if it needs to and/or can
    afford to)

12
Online Market
7 + 5 + 2 = 14, but I only have 10 nodes!
5 + 3 + 2 = 10. Perfect!
[Diagram: the Marketplace, with 10 nodes in the cluster, querying three Policies]
13
Online Market Policies
  • Input: price P
  • Output: number of computers desired at price P

[Diagram: Price P → POLICY → number of computers desired]
14
Example Market Policy
n < G (no spike)
  • For each round, application A computes the number
    of nodes, n, it needs to handle current traffic
  • Ex: Application A has a price threshold of 6
  • If (P < 6), A will ask for n nodes
  • If (P ≥ 6), A will only ask for min(n, G) nodes;
    it can't afford to rent extras

n > G (spike)
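
The example policy above can be written as a small function. This is a sketch under stated assumptions: the name `nodes_requested` and its parameters are hypothetical, and the behavior at exactly the threshold price (treated here as "cannot afford extras") is an assumption, since the slide's comparison symbol did not survive transcription.

```python
def nodes_requested(price, needed, guaranteed, threshold=6):
    """Threshold policy: how many nodes an app asks for at a given price.

    price      -- market price P offered this round
    needed     -- n, nodes required to handle current traffic
    guaranteed -- G, nodes the app owns outright
    threshold  -- price at or above which the app stops renting extras
                  (assumed boundary; the slide only shows the value 6)
    """
    if price < threshold:
        return needed
    # Too expensive: fall back to what the app owns, or less if it needs less.
    return min(needed, guaranteed)


if __name__ == "__main__":
    print(nodes_requested(price=5, needed=14, guaranteed=10))  # spike, cheap
    print(nodes_requested(price=8, needed=14, guaranteed=10))  # spike, expensive
    print(nodes_requested(price=8, needed=4, guaranteed=10))   # no spike
```

With no spike (n < G) the request is n at any price; only during a spike does the price decide whether the app rents beyond G.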
15
Finding the Equilibrium
  • Sample points along the different policy
    functions
  • Determine the price at which the total number of
    nodes desired by all apps equals the total number
    of nodes available on the cluster
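
One way to sample the policy functions is to walk candidate prices from low to high and stop at the first price where total demand fits the cluster. This is only a sketch: `find_equilibrium`, `threshold_policy`, and the particular thresholds are hypothetical, and stopping at "demand ≤ nodes available" rather than exact equality is an assumption made because step-shaped policies may never hit equality exactly. The demands 7, 5, 2 and 5, 3, 2 echo the Online Market slide.

```python
def threshold_policy(needed, guaranteed, threshold):
    """Build a policy: price -> nodes requested (hypothetical helper)."""
    def policy(price):
        return needed if price < threshold else min(needed, guaranteed)
    return policy


def find_equilibrium(policies, total_nodes, prices):
    """Return (price, demands) for the lowest sampled price that fits.

    Returns None if no sampled price brings demand within the cluster.
    """
    for price in sorted(prices):
        demands = [policy(price) for policy in policies]
        if sum(demands) <= total_nodes:
            return price, demands
    return None


if __name__ == "__main__":
    policies = [
        threshold_policy(needed=7, guaranteed=5, threshold=4),
        threshold_policy(needed=5, guaranteed=3, threshold=6),
        threshold_policy(needed=2, guaranteed=2, threshold=6),
    ]
    # At price 1 the apps want 7 + 5 + 2 = 14 nodes; the cluster has 10.
    print(find_equilibrium(policies, total_nodes=10, prices=range(1, 11)))
```

With these made-up thresholds, demand drops to 12 at price 4 and to 10 at price 6, so the search settles on price 6 with demands 5, 3, and 2.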

16
Notes and Assumptions
  • Homogeneity Assumption
  • Cluster is assumed to be homogeneous: all nodes
    rented at same price (for simplicity)
  • Swapping Costs
  • Time delay cost in start up / shut down of an
    app on a node.
  • If a rental contract is renewed, app runs on
    same node.
  • P Only for Extras
  • Apps only pay price P for nodes above and beyond
    their own G
  • Ex: using 40 nodes, G = 30
  • → 40 − 30 = 10 nodes at price P
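
The "P only for extras" rule amounts to a one-line calculation. A minimal sketch, with the function name `rental_cost` and the sample price of 2 chosen purely for illustration:

```python
def rental_cost(nodes_used, guaranteed, price):
    """Cost for one time quantum: only nodes beyond G are charged at P."""
    extras = max(0, nodes_used - guaranteed)
    return extras * price


if __name__ == "__main__":
    # Slide example: using 40 nodes with G = 30 means 10 rented nodes.
    print(rental_cost(nodes_used=40, guaranteed=30, price=2))
    # An app at or below its guarantee rents nothing.
    print(rental_cost(nodes_used=25, guaranteed=30, price=2))
```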

17
Platform
18
Platform Overview
19
Runtime Operation
  • Runtime cycle repeats every t
  • Marketplace calculates equilibrium price (and
    thus application allocations)
  • Manager assigns apps to physical nodes
    (minimizing shutdowns and startups)
  • Manager signals Responders to shut down and start
    new apps, as necessary
  • At end of round, Manager gathers new usage stats
    and reports them to Market Policies
  • Repeat
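
The node-assignment step (minimizing shutdowns and startups) can be sketched as: keep every node that already runs an app which still needs it, then hand the leftover nodes to apps that grew. This is an illustration under assumptions, not the Manager's actual algorithm; `assign_nodes` and the node/app naming are hypothetical.

```python
def assign_nodes(current, target):
    """Map each node to an app while restarting as few nodes as possible.

    current -- dict node -> app currently running there (None if idle)
    target  -- dict app -> number of nodes it should hold next round
    Returns a new dict node -> app (None for nodes left idle).
    """
    remaining = dict(target)
    assignment = {}
    free = []
    # Pass 1: a node keeps its app while that app still needs nodes.
    for node in sorted(current):
        app = current[node]
        if app is not None and remaining.get(app, 0) > 0:
            assignment[node] = app
            remaining[app] -= 1
        else:
            free.append(node)
    # Pass 2: freed nodes go to apps whose allocation grew.
    for app in sorted(remaining):
        for _ in range(remaining[app]):
            if not free:
                raise ValueError("target needs more nodes than the cluster has")
            assignment[free.pop(0)] = app
    for node in free:
        assignment[node] = None
    return assignment


if __name__ == "__main__":
    before = {"n1": "A", "n2": "A", "n3": "B", "n4": "B"}
    after = assign_nodes(before, {"A": 3, "B": 1})
    restarted = [n for n in before if before[n] != after[n]]
    print(after, restarted)
```

In the example only one node changes apps, which matches the note that a renewed rental keeps the app on the same node.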

20
Does this work?
21
Simulation Testbed
  • Three Simulations, Four Traits
  • Spike handling under unconstrained resources
  • Spike handling under constrained resources
  • Resource guarantees
  • Fast server activation
  • U.C. Berkeley X Cluster
  • 30 Nodes (double CNN.com)
  • Dual 1 GHz PIII, 1.5 GB RAM
  • VMware GSX Server on Linux

22
Sim 1: Spike Handling
  • G = 10 for both apps
  • App 1 handles spikes, App 2 makes money
  • Notice: lag time between node assigned → node
    active

23
Sim 2: Resource Constraints
  • G1 = 12, G2 = 6, G3 = 12
  • App 1 has higher budget than App 2, but both
    spike
  • App 1 handles spikes, App 2 sees guarantee, App 3
    makes money
  • App 2 buys more when App 1's spike subsides

24
Sim 3: Fast Activation
Platform             Time until Active (s)
OnCall Optimal       5-10
OnCall Limited       50-120
Standard with OS     270-330
Standard w/out OS    710-750
  • OnCall Optimal: load VMs from suspended state
  • OnCall Limited: load VMs from shutdown state
  • Standard with OS: OS already installed on node
  • Standard without OS: must install OS first
  • Significance
  • Worst case, > 2x improvement
  • When spike lasts only 30 minutes, this is
    significant
  • If you can start up quickly, an accurate predictor
    is not critical
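
To see why activation time matters for a 30-minute spike, the worst-case times from the table can be expressed as a share of the spike that passes before a new node is active. This is a back-of-the-envelope sketch; the function name and the idea of measuring "share of spike missed" are illustrative assumptions, while the times are the upper bounds quoted above.

```python
SPIKE_SECONDS = 30 * 60  # the 30-minute spike mentioned on the slide

# Worst-case (upper-bound) activation times in seconds, from the table.
WORST_CASE_SECONDS = {
    "OnCall Optimal": 10,
    "OnCall Limited": 120,
    "Standard with OS": 330,
    "Standard w/out OS": 750,
}


def fraction_of_spike_missed(activation_seconds, spike_seconds=SPIKE_SECONDS):
    """Share of the spike that elapses before a new node becomes active."""
    return min(1.0, activation_seconds / spike_seconds)


if __name__ == "__main__":
    for platform, seconds in WORST_CASE_SECONDS.items():
        print(f"{platform}: {fraction_of_spike_missed(seconds):.1%}")
```

Under these numbers a node started from a suspended VM misses under 1% of the spike, while a node that needs an OS install can miss roughly 40% of it. The "> 2x" claim also checks out against the table: the slowest OnCall Limited start (120 s) is still more than twice as fast as the quickest Standard with OS start (270 s).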

25
More on Markets
26
Marketplace Optimality
  • What is optimal?
  • Under resource constraints, those applications
    with the most utility to derive from the use of
    additional nodes are given those nodes
  • Utility Curves
  • Curve specifies dollar value an application
    derives from possessing a certain number of nodes
    for a specific time quantum.

Trivially: utility curves are always
monotonically non-decreasing (i.e. it is never
worse to own more nodes at a given total cost)

To be optimal: marginal utility curves are
always monotonically non-increasing (i.e. every
additional node is worth the same as or less than
the one before)
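
Both conditions can be checked mechanically if a utility curve is stored as a list, where entry k is the dollar value of holding k nodes for one time quantum. This representation and the function names are assumptions made for the sketch.

```python
def marginals(utility):
    """Value added by each additional node: utility[k+1] - utility[k]."""
    return [b - a for a, b in zip(utility, utility[1:])]


def is_non_decreasing(utility):
    """True if owning more nodes is never worth less."""
    return all(m >= 0 for m in marginals(utility))


def has_diminishing_returns(utility):
    """True if each extra node is worth the same as or less than the last."""
    m = marginals(utility)
    return all(later <= earlier for earlier, later in zip(m, m[1:]))


if __name__ == "__main__":
    well_behaved = [0, 10, 18, 24, 28, 30]   # marginals 10, 8, 6, 4, 2
    lumpy = [0, 1, 5, 6]                     # marginals 1, 4, 1
    print(is_non_decreasing(well_behaved), has_diminishing_returns(well_behaved))
    print(is_non_decreasing(lumpy), has_diminishing_returns(lumpy))
```

The second curve is non-decreasing but its second node is worth more than its first, so it fails the condition stated above for optimality.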
27
Marketplace Fairness
  • Markets are optimal if
  • they are free and fair
  • Anti-competitive behavior
  • Monopoly/Oligopoly
  • Aggressive tactics
  • Fairness through Regulation
  • Ensure enough distinct owners → no monopoly
  • Fine or ban app that engages in overtly
    anti-competitive behavior

28
Competitive vs Cooperative
  • Competitive Environments
  • Ex: ASP, where app owners may be in competition
  • Cooperative Environments
  • Ex: search engine, Yahoogle
  • Quick Case Study
  • App 1: Paid web search (very high value in low
    latency)
  • App 2: Ad-supported web search (high value in
    low latency)
  • App 3: Crawler (latency OK, starvation not)
  • For each app, model utility of running at a
    given time
  • Benefit: if you add an app, you just need to model
    that app, not remodel the whole system
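
The case study can be sketched as a greedy allocation: each node goes to whichever app gains the most from one more node. This is an illustration only; the utility numbers below are invented, `allocate` is a hypothetical name, and the greedy rule is only guaranteed to maximize total utility when every app's marginal utility is non-increasing.

```python
import heapq


def allocate(utilities, total_nodes):
    """Give nodes one at a time to the app with the largest marginal gain.

    utilities -- dict app -> list where entry k is the value of holding
                 k nodes (hypothetical representation)
    """
    alloc = {app: 0 for app in utilities}
    heap = []
    for app, curve in utilities.items():
        if len(curve) > 1:
            heapq.heappush(heap, (-(curve[1] - curve[0]), app))
    for _ in range(total_nodes):
        if not heap:
            break
        neg_gain, app = heapq.heappop(heap)
        if -neg_gain <= 0:
            break  # nobody benefits from another node
        alloc[app] += 1
        k = alloc[app]
        curve = utilities[app]
        if k + 1 < len(curve):
            heapq.heappush(heap, (-(curve[k + 1] - curve[k]), app))
    return alloc


if __name__ == "__main__":
    utilities = {
        "paid_search": [0, 9, 16, 21],  # very high value in low latency
        "ad_search": [0, 6, 11, 15],    # high value in low latency
        "crawler": [0, 8, 9, 10],       # first node matters (no starvation)
    }
    print(allocate(utilities, total_nodes=4))
```

Adding a fourth app only requires adding its curve to the dictionary, which mirrors the benefit noted above: the rest of the system does not need remodeling.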

29
Profit Through Efficiency
  • Shut Down App
  • ASP shuts down servers when it can buy them for
    less than the cost of keeping them running (A/C,
    utilities, etc)
  • ASP can then add additional capacity and sell
    only when profitable

30
Future Work
31
Future Work
  • VM caching
  • Cache VMs to local disk (speculatively or as
    read from NAS)
  • Fault tolerance
  • Add master-backup fault tolerance to the OnCall
    Manager
  • Performance statistics
  • Provide market policies with additional
    statistics (e.g. end-to-end response time)
  • Scalable data layer
  • Add support for scalable persistent stores that
    would allow replication on the data tier.
  • Multiplexing
  • Study trade-offs of running several applications
    on one node

32
Questions?