Title: Latency as a Performability Metric: Experimental Results
1. Latency as a Performability Metric: Experimental Results
- Pete Broadwell
- pbwell@cs.berkeley.edu
2. Outline
- Motivation and background
- Performability overview
- Project summary
- Test setup
- PRESS web server
- Mendosus fault injection system
- Experimental results and analysis
- How to represent latency
- Questions for future research
3. Performability overview
- Goal of ROC project: develop metrics to evaluate new recovery techniques
- Performability: a class of metrics that describe how a system performs in the presence of faults
- First used in the fault-tolerant computing field [1]
- Now being applied to online services
[1] J. F. Meyer, "Performability Evaluation: Where It Is and What Lies Ahead," 1994.
4. Example microbenchmark: RAID disk failure
5. Project motivation
- Rutgers study: performability analysis of a web server, using throughput
- Other studies (esp. from HP Labs Storage group) also use response time as a metric
- Assertion: latency and data quality are better than throughput for describing user experience
- How best to represent latency in performability reports?
6. Project overview
- Goals
  - Replicate the PRESS/Mendosus study with response time measurements
  - Discuss how to incorporate latency into performability statistics
- Contributions
  - Provide a latency-based analysis of a web server's performability (currently rare)
  - Further the development of more comprehensive dependability benchmarks
7. Experiment components
- The Mendosus fault injection system
  - From Rutgers (Rich Martin)
  - Goal: low-overhead emulation of a cluster of workstations, with injection of likely faults
- The PRESS web server
  - Cluster-based, uses cooperative caching; designed by Carreira et al. (Rutgers)
  - Perf-PRESS: basic version
  - HA-PRESS: incorporates heartbeats and a master node for automated cluster management
- Client simulators
  - Submit a set number of requests/sec, based on real traces
8. Mendosus design
[Diagram: Mendosus architecture. A global controller (Java) reads fault, LAN-emulation, and application config files and coordinates a user-level daemon (Java) on each workstation (real or VM); the daemon manages the apps, a modified NIC driver, a SCSI module, and a proc module, with the workstations connected by an emulated LAN.]
9. Experimental setup
10. Fault types
Category    | Fault                 | Possible root cause
Node        | Node crash            | Operator error, OS bug, hardware component failure, power outage
Node        | Node freeze           | OS or kernel module bug
Application | App crash             | Application bug or resource unavailability
Application | App hang              | Application bug or resource contention with other processes
Network     | Link down or flaky    | Broken, damaged or misattached cable
Network     | Switch down or flaky  | Damaged or misconfigured switch, power outage
11. Test case timeline
- Warm-up time: 30-60 seconds
- Time to repair: up to 90 seconds
12. Simplifying assumptions
- Operator repairs any non-transient failure after 90 seconds
- Web page size is constant
- Faults are independent
- Each client request is independent of all others (no sessions!)
- Request arrival times are determined by a Poisson process (not self-similar); see the workload sketch after this list
- Simulated clients abandon a connection attempt after 2 secs and give up on a page load after 8 secs
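
As a rough illustration of this client model, the sketch below draws interarrival times from an exponential distribution (i.e., a Poisson arrival process) and carries the two timeouts above as constants; the function and parameter names are hypothetical and not part of the original simulator.

    import random

    CONNECT_TIMEOUT_S = 2.0    # client abandons the connection attempt after 2 s
    PAGE_LOAD_TIMEOUT_S = 8.0  # client gives up on the page load after 8 s

    def poisson_arrivals(rate_per_sec, duration_s, seed=0):
        """Yield request arrival times; interarrival times of a Poisson
        process are exponentially distributed with mean 1/rate."""
        rng = random.Random(seed)
        t = 0.0
        while True:
            t += rng.expovariate(rate_per_sec)
            if t > duration_s:
                return
            yield t

    # Example: independent, read-only requests at ~100/sec for a 120 s test case.
    arrivals = list(poisson_arrivals(rate_per_sec=100, duration_s=120))
    print(len(arrivals), "requests generated")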
13. Sample result: app crash
[Figure: throughput and latency over time for Perf-PRESS and HA-PRESS under an injected application crash]
14. Sample result: node hang
[Figure: throughput and latency over time for Perf-PRESS and HA-PRESS under an injected node hang]
15. Representing latency
- Total seconds of wait time
  - Not good for comparing cases with different workloads
- Average (mean) wait time per request
  - OK, but requires that the expected (normal) response time be given separately
- Variance of wait time
  - Not very intuitive to describe; also, a read-only workload means that all variance is toward longer wait times anyway
16. Representing latency (2)
- Consider goodput-based availability = total responses served / total requests
- Idea: latency-based punctuality = ideal total latency / actual total latency (see the sketch after this list)
- Like goodput, its maximum value is 1
- Ideal total latency = average latency for non-fault cases x total requests (shouldn't be 0)
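
A minimal sketch of both ratios computed from per-request records; the record format and function names are assumptions for illustration, not the instrumentation used in the study.

    # Each record is (served, latency_s) for one client request; names are hypothetical.

    def availability(records):
        """Goodput-based availability = total responses served / total requests."""
        served = sum(1 for ok, _ in records if ok)
        return served / len(records)

    def punctuality(records, ideal_mean_latency_s):
        """Latency-based punctuality = ideal total latency / actual total latency.

        Ideal total latency = average latency for non-fault cases x total
        requests (must be nonzero); the maximum value is 1, like goodput."""
        ideal_total = ideal_mean_latency_s * len(records)
        actual_total = sum(latency for _, latency in records)
        return ideal_total / actual_total

    records = [(True, 0.05), (True, 0.06), (False, 2.0), (True, 7.5)]
    print(availability(records), punctuality(records, ideal_mean_latency_s=0.05))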
17. Representing latency (3)
- Aggregate punctuality ignores brief, severe spikes in wait time (bad for user experience)
- Can capture these in a separate statistic (e.g., 1 of 100k responses took >8 sec); a companion sketch follows this list
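
A companion to the punctuality sketch above (same assumed record format), reporting the fraction of responses that exceeded the user-timeout threshold:

    def tail_fraction(records, threshold_s=8.0):
        """Fraction of responses whose wait time exceeded the threshold."""
        slow = sum(1 for _, latency in records if latency > threshold_s)
        return slow / len(records)

    # e.g. "1 of 100k responses took >8 sec" corresponds to tail_fraction(...) == 1e-5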
18. Availability and punctuality
19. Other metrics
- Data quality, latency and throughput are interrelated
  - Is a 5-second wait for a response worse than waiting 1 second to get a "try back later" message?
- To combine data quality, latency and throughput, can use a demerit system (proposed by Keynote) [1]
  - These can be very arbitrary, so it's important that the demerit formula be straightforward and publicly available
[1] Zona Research and Keynote Systems, "The Need for Speed II," 2001.
20. Sample demerit system
- Rules (scored as in the sketch after this list)
  - Each aborted connection (>2 s): 2 demerits
  - Each connection error: 1 demerit
  - Each user timeout (>8 s): 8 demerits
  - Each second of total latency above the ideal level: (1 demerit / total requests) x scaling factor
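
A minimal sketch of this scoring rule, with hypothetical argument names; the scaling factor is a free parameter, which is one reason the previous slide insists the formula be published.

    def demerits(aborted_conns, conn_errors, user_timeouts,
                 total_latency_s, ideal_total_latency_s,
                 total_requests, scaling_factor=1.0):
        """Combine connection failures and excess latency into one demerit score."""
        score = 2.0 * aborted_conns        # aborted (>2 s) connections
        score += 1.0 * conn_errors         # connection errors
        score += 8.0 * user_timeouts       # user timeouts (>8 s)
        excess_s = max(0.0, total_latency_s - ideal_total_latency_s)
        score += excess_s * (1.0 / total_requests) * scaling_factor
        return score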
21. Online service optimization
- Performance metrics: throughput, latency, data quality
- Environment: workload, faults
22. Conclusions
- Latency-based punctuality and throughput-based availability give similar results for a read-only web workload
- Applied workload is very important
  - Reliability metrics do not (and should not) reflect maximum performance/workload!
- Latency did not degrade gracefully in proportion to workload
  - At high loads, PRESS oscillates between full service and 100% load shedding
23. Further work
- Combine test results with predicted component failure rates to get long-term performability estimates (are these useful?)
- Further study will benefit from more sophisticated client workload simulators
- Services that generate dynamic content should lead to more interesting data (e.g., RUBiS)
24. Latency as a Performability Metric: Experimental Results
- Pete Broadwell
- pbwell@cs.berkeley.edu
25. Example long-term model
- Discrete-time Markov chain (DTMC) model of a RAID-5 disk array [1]
- p_i(t): probability that the system is in state i at time t
- D: number of data disks
- w_i(t): reward (disk I/O operations/sec)
- μ: disk repair rate
- λ: failure rate of a single disk drive
[1] Hannu H. Kari, Ph.D. thesis, Helsinki University of Technology, 1997.
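
A minimal numerical sketch of how such a model could be evaluated, with illustrative transition probabilities and rewards standing in for λ, μ, D, and w_i; none of the numbers below come from the cited thesis.

    import numpy as np

    # Illustrative DTMC for a RAID-5 array: states = [healthy, degraded, data loss].
    # Per-step transition probabilities are placeholders for D*lambda (first disk
    # failure), lambda (second failure before repair), and mu (repair).
    P = np.array([
        [0.995, 0.005, 0.000],   # healthy  -> degraded
        [0.100, 0.890, 0.010],   # degraded -> repaired or data loss
        [0.000, 0.000, 1.000],   # data loss is absorbing
    ])
    w = np.array([1000.0, 600.0, 0.0])   # reward w_i: disk I/O operations/sec per state

    p = np.array([1.0, 0.0, 0.0])        # p(0): start with all disks up
    for _ in range(8760):                # e.g. one step per hour for a year
        p = p @ P                        # p(t+1) = p(t) P
    print(p, float(p @ w))               # state probabilities and expected reward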