Title: Adaptive Overload Control for Busy Internet Servers

1. Adaptive Overload Control for Busy Internet Servers
- Matt Welsh and David Culler
- USENIX Symposium on Internet Technologies and Systems (USITS), 2003
- Alex Cheung
- Nov 13, 2006
- ECE1747
2. Outline
- Motivation
- Goal
- Methodology
- Detection
- Overload control
- Experiments
- Comments
3. Motivation
- Internet services are becoming important to our daily lives
  - Email
  - News
  - Trading
- Services are becoming more complex
  - Large dynamic content
  - Requires heavy computation and I/O
- Hard to predict the load requirements of requests
- Must withstand peak load that is 1000x the norm without over-provisioning
  - Solve CNN's problem on 9/11
4. Goal
- An adaptive overload control scheme at the node level, maintaining
  - Response time
  - Throughput
  - QoS (availability)
5. Methodology: Detection
- Look at the 90th-percentile response time
- Compare it with a threshold and decide what to do
- Weaker alternatives:
  - The 100th percentile does not capture the shape of the response time curve
  - Throughput does not capture user-perceived performance of the system
- I ask:
  - What makes the 90th percentile so great?
  - Why not the 95th? 80th? 70th?
  - No supporting micro-experiment
[Figure: 10 requests served; examine the 90th-percentile response time]
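The detection scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the window size and target are assumed parameters.

```python
# Sketch of 90th-percentile overload detection: keep a sliding window of
# recent response times and flag overload when the 90th percentile exceeds
# a target. Window size and target value are illustrative assumptions.
from collections import deque

class PercentileDetector:
    def __init__(self, window=100, target_ms=1000.0, percentile=0.9):
        self.samples = deque(maxlen=window)  # sliding window of response times
        self.target_ms = target_ms
        self.percentile = percentile

    def record(self, response_ms):
        self.samples.append(response_ms)

    def overloaded(self):
        if not self.samples:
            return False
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, int(self.percentile * len(ordered)))
        return ordered[idx] > self.target_ms
```

Note that a single slow request among ten is enough to trip this detector, which is exactly the sensitivity the 90th-percentile choice trades against noise.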
6. Methodology: Overload Control
- If the response time is higher than the threshold:
  - Limit the service rate by rejecting selected requests
  - Extension: differentiate requests with classes/priority levels and reject lower-class/priority requests first
  - Quality/service degradation
- Back pressure
  - Causes queue explosion at the 1st stage (they say)
  - Solved by rejecting requests at the 1st stage
  - Breaks the loose-coupling modular design of SEDA with an out-of-band notification scheme (I say)
7. Methodology: Overload Control
- Forward rejected requests to another, more available server
  - A "more available" server is one with the most of a particular resource: CPU, network, I/O, hard disk
  - Make the decision using a centralized or distributed algorithm
  - Requires reliable state migration, possibly transactional
- My take:
  - More complex, more interesting, and it actually solves CNN's problem with a cluster of servers!
8Rate Limit
SMOOTHED
Multiplicative decrease Additive increase Just
like TCP!
- 10 fine-tuned parameters per stage.
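The AIMD adjustment mentioned above can be sketched as follows. The constants are illustrative assumptions; the paper tunes roughly ten such parameters per stage.

```python
# Minimal AIMD sketch of a per-stage admission-rate controller: additive
# increase while the observed 90th-percentile response time is under target,
# multiplicative decrease when it exceeds it. All constants are assumptions.
class AIMDRateController:
    def __init__(self, rate=100.0, add_step=2.0, mult_factor=0.5,
                 min_rate=1.0, max_rate=10000.0):
        self.rate = rate                 # admitted requests/sec
        self.add_step = add_step         # additive increase per adjustment
        self.mult_factor = mult_factor   # multiplicative decrease factor
        self.min_rate = min_rate
        self.max_rate = max_rate

    def adjust(self, observed_90th_ms, target_ms):
        if observed_90th_ms > target_ms:
            # Overloaded: back off multiplicatively, like TCP on loss
            self.rate = max(self.min_rate, self.rate * self.mult_factor)
        else:
            # Healthy: probe for more capacity additively
            self.rate = min(self.max_rate, self.rate + self.add_step)
        return self.rate
```

The TCP analogy is exact in shape: fast backoff when the signal (response time here, packet loss in TCP) indicates trouble, slow recovery otherwise.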
9. Rate Limit With Class/Priority
- Class/priority assignment based on:
  - IP address, header information, HTTP cookies
- I ask:
  - Where is the priority assignment module implemented?
  - Should priority assignment be a stage of its own?
  - Is it not shown because it complicates the diagram and makes the stage design less clean?
  - How do you classify which requests are potentially bottleneck requests? Is it application dependent?
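One way the assignment could look is sketched below. Every rule and field name here is a hypothetical assumption; the slide's open question is precisely that the paper does not say where or how this module is implemented.

```python
# Hypothetical priority classifier using the attributes the slide lists
# (IP address, headers, cookies). The rules are illustrative only.
def classify(request):
    """Return 'high' or 'low' priority for a request dict with optional
    'ip', 'headers', and 'cookies' keys."""
    cookies = request.get("cookies", {})
    if cookies.get("session_tier") == "premium":   # assumed cookie name
        return "high"
    if request.get("ip", "").startswith("10."):    # e.g., internal network
        return "high"
    return "low"
```

Whether this runs inline in the admission controller or as its own SEDA stage changes where the classification cost is paid, which is the design question the slide raises.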
10. Quality/Service Degradation
- Notify the application via a signal to perform service degradation
  - The application does the service degradation, not SEDA
- Questions:
  - How is the signaling implemented?
  - Out of band?
  - Is it possible to signal previous stages in the pipeline? Would this break SEDA's loose-coupling design?
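One plausible answer to the signaling question is a callback the application registers with the stage, which the overload controller invokes. This is an assumed design, not the paper's actual mechanism.

```python
# Assumed sketch of the degradation signal: the application registers a
# handler, and the stage's overload controller calls it with the current
# load level. The application, not SEDA, decides how to degrade.
class Stage:
    def __init__(self, name):
        self.name = name
        self.degrade_cb = None

    def on_degrade(self, callback):
        self.degrade_cb = callback      # application-supplied handler

    def notify_overload(self, level):
        if self.degrade_cb:
            return self.degrade_cb(level)   # application chooses the degradation
        return False                        # application did not opt in
```

A callback keeps the stages loosely coupled; signaling *previous* stages, as the slide asks, would instead require an out-of-band channel across stage boundaries.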
11. [Figure]
12. Experiments: Setup
- Arashi email server (realistic experiment)
  - Real access workload
  - Real email content
  - Admission control
- Web server benchmark
  - Service degradation (1-class admission control)
13. Experiments: Admission Rate [Figure]
14. Experiments: Response Time [Figure]
15. Experiments: Massive Load Spike
- Not fair! SEDA's parameters were fine-tuned. Apache could be tuned to stay flat too.
16. Experiments: Service Degradation [Figure]
17. Experiments: Service Differentiation
- Average reject rates (%):

  Priority      | Without differentiation | With differentiation | Change
  --------------|-------------------------|----------------------|-------
  Low-priority  | 55.5                    | 87.9                 | +32.4
  High-priority | 57.6                    | 48.8                 | -8.8

- Question:
  - Why is the drop rate for high-priority requests reduced so little with service differentiation? Is it workload dependent?
18. [Figure]
19. Comments
- No idea what the controller's overhead is
- Overload control at the node level is not good
  - The node level is inefficient
    - Late rejection
  - The node level is not user friendly
    - All session state is gone if you get a reject out of the blue, i.e., without warning
  - Need a global-level overload control scheme
- The idea/concept is explained in only 2.5 pages
20. Comments
- Rejected requests
  - Instead of a TCP timeout, send a static page
  - (The paper says) this is better
  - (I say) this is worse because it leads to an out-of-memory crash down the road
    - Saturated output bandwidth
    - Unbounded queue at the reject handler
- Parameters
  - How to tune them? How difficult is tuning?
  - Tuning each stage manually may be tedious
  - Given a 1M-stage application, must all 1M stage thresholds be configured manually?
  - Automated tuning with control theory?
- The methodology of adding extensions is not shown in any figures
21. Comments
- The experiment is not entirely realistic
  - Inter-request think time is 20 ms
    - Realistic?
  - Rejected users have to re-login after 5 min
    - All state information is gone
    - Frustrated users
- Two drawbacks of using response time for load detection
22. Comments
- No idea which resource is the bottleneck: CPU? I/O? Network? Memory?
- SEDA can only either
  - Do admission control
    - Reduces throughput
  - Tell the application to degrade overall service
23. Comments
[Figure: default admission control. Resource utilization (CPU, I/O, network, memory) is compared against a threshold; when OVERLOADED, requests to the "Attach image" / "Send response" stages are rejected]
24. Comments
[Figure: service degradation WITH bottleneck intelligence. Resource utilization (CPU, I/O, network, memory) vs. threshold]
- The network is the bottleneck, so expend some CPU and memory to reduce the fidelity and size of images, cutting bandwidth consumption WITHOUT reducing the admission rate.
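The bottleneck-aware policy proposed above can be sketched as a simple decision rule. The threshold, resource names, and actions are illustrative assumptions, not anything SEDA implements.

```python
# Sketch of bottleneck-aware overload handling: inspect per-resource
# utilization; if the network is the bottleneck and CPU has headroom,
# degrade image fidelity instead of rejecting requests. Threshold and
# action names are assumptions.
def choose_action(util, threshold=0.9):
    """util: dict mapping resource name -> utilization in [0, 1]."""
    hot = {r: u for r, u in util.items() if u > threshold}
    if not hot:
        return "admit"                 # nothing is saturated
    bottleneck = max(hot, key=hot.get) # most saturated resource
    if bottleneck == "network" and util.get("cpu", 1.0) < threshold:
        return "degrade_images"        # spend CPU to recompress, save bandwidth
    return "reject"                    # no cheaper resource to trade
```

This is the key contrast with the default controller on the previous slide, which rejects regardless of *which* resource is saturated.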
25. Comments
- The response-time index lags by at least the magnitude of the response time
  - Example: 50 requests come in all at once
    - nreq = 100
    - timeout = 10 s
    - target = 20 s
    - Processing time per request = 1 s
    - Overload is detected only after 30 s
- Solution:
  - Compare the enqueue rate vs. the dequeue rate
  - Overload occurs when the enqueue rate > the dequeue rate
  - Detects overload after 10 s
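The proposed alternative detector can be sketched as follows; the per-interval counters and reset behavior are illustrative assumptions.

```python
# Sketch of the enqueue-vs-dequeue detector proposed above: count arrivals
# and completions over an interval; sustained enqueue > dequeue means the
# queue is growing, so the stage is overloaded. Interval handling is assumed.
class RateDetector:
    def __init__(self):
        self.enqueued = 0
        self.dequeued = 0

    def on_enqueue(self):
        self.enqueued += 1

    def on_dequeue(self):
        self.dequeued += 1

    def overloaded(self):
        # True when more work arrived than completed this interval
        result = self.enqueued > self.dequeued
        self.enqueued = self.dequeued = 0   # reset for the next interval
        return result
```

Because queue growth is visible the moment arrivals outpace completions, this detector does not have to wait for slow requests to finish before reacting, which is exactly the lag the response-time index suffers from.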
26. Questions?