Title: Latency as a Performability Metric: Experimental Results
1. Latency as a Performability Metric: Experimental Results
- Pete Broadwell
- pbwell@cs.berkeley.edu
2. Outline
- Motivation and background
- Performability overview
- Project summary
- Test setup
- PRESS web server
- Mendosus fault injection system
- Experimental results and analysis
- How to represent latency
- Questions for future research
3. Performability overview
- Goal of ROC project: develop metrics to evaluate new recovery techniques
- Performability: a class of metrics that describe how a system performs in the presence of faults
- First used in the fault-tolerant computing field [1]
- Now being applied to online services
[1] J. F. Meyer, "Performability Evaluation: Where It Is and What Lies Ahead," 1994.
4. Example microbenchmark: RAID disk failure
5. Project motivation
- Rutgers study: performability analysis of a web server, using throughput
- Other studies (esp. from HP Labs Storage group) also use response time as a metric
- Assertion: latency and data quality are better than throughput for describing user experience
- How best to represent latency in performability reports?
6. Project overview
- Goals
  - Replicate the PRESS/Mendosus study with response time measurements
  - Discuss how to incorporate latency into performability statistics
- Contributions
  - Provide a latency-based analysis of a web server's performability (currently rare)
  - Further the development of more comprehensive dependability benchmarks
7. Experiment components
- The Mendosus fault injection system
  - From Rutgers (Rich Martin)
  - Goal: low-overhead emulation of a cluster of workstations, with injection of likely faults
- The PRESS web server
  - Cluster-based, uses cooperative caching; designed by Carreira et al. (Rutgers)
  - Perf-PRESS: basic version
  - HA-PRESS: incorporates heartbeats and a master node for automated cluster management
- Client simulators
  - Submit a set number of requests/sec, based on real traces
8. Mendosus design
[Diagram: Mendosus architecture. A global controller (Java) reads fault, LAN-emulation, and application config files and coordinates a user-level daemon (Java) on each workstation (real or VM); the daemon manages the apps, a modified NIC driver, a SCSI module, and a proc module, with the workstations connected by an emulated LAN.]
9. Experimental setup
10. Fault types
Category    | Fault                 | Possible root cause
Node        | Node crash            | Operator error, OS bug, hardware component failure, power outage
Node        | Node freeze           | OS or kernel module bug
Application | App crash             | Application bug or resource unavailability
Application | App hang              | Application bug or resource contention with other processes
Network     | Link down or flaky    | Broken, damaged or misattached cable
Network     | Switch down or flaky  | Damaged or misconfigured switch, power outage
11. Test case timeline
- Warm-up time: 30-60 seconds
- Time to repair: up to 90 seconds
12. Simplifying assumptions
- Operator repairs any non-transient failure after 90 seconds
- Web page size is constant
- Faults are independent
- Each client request is independent of all others (no sessions!)
- Request arrival times are determined by a Poisson process (not self-similar); see the workload sketch after this list
- Simulated clients abandon a connection attempt after 2 secs and give up on a page load after 8 secs
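
As a rough illustration of this client model, the sketch below draws interarrival times from an exponential distribution (i.e., a Poisson arrival process) and carries the two timeouts above as constants; the function and parameter names are hypothetical and not part of the original simulator.

    import random

    CONNECT_TIMEOUT_S = 2.0    # client abandons the connection attempt after 2 s
    PAGE_LOAD_TIMEOUT_S = 8.0  # client gives up on the page load after 8 s

    def poisson_arrivals(rate_per_sec, duration_s, seed=0):
        """Yield request arrival times; interarrival times of a Poisson
        process are exponentially distributed with mean 1/rate."""
        rng = random.Random(seed)
        t = 0.0
        while True:
            t += rng.expovariate(rate_per_sec)
            if t > duration_s:
                return
            yield t

    # Example: independent, read-only requests at ~100/sec for a 120 s test case.
    arrivals = list(poisson_arrivals(rate_per_sec=100, duration_s=120))
    print(len(arrivals), "requests generated")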
13. Sample result: app crash
[Figure: throughput and latency over time for Perf-PRESS and HA-PRESS under an injected application crash]
14. Sample result: node hang
[Figure: throughput and latency over time for Perf-PRESS and HA-PRESS under an injected node hang]
15. Representing latency
- Total seconds of wait time
  - Not good for comparing cases with different workloads
- Average (mean) wait time per request
  - OK, but requires that the expected (normal) response time be given separately
- Variance of wait time
  - Not very intuitive to describe; also, a read-only workload means that all variance is toward longer wait times anyway
16. Representing latency (2)
- Consider goodput-based availability = total responses served / total requests
- Idea: latency-based punctuality = ideal total latency / actual total latency (see the sketch after this list)
- Like goodput, its maximum value is 1
- Ideal total latency = average latency for non-fault cases x total requests (shouldn't be 0)
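
A minimal sketch of both ratios computed from per-request records; the record format and function names are assumptions for illustration, not the instrumentation used in the study.

    # Each record is (served, latency_s) for one client request; names are hypothetical.

    def availability(records):
        """Goodput-based availability = total responses served / total requests."""
        served = sum(1 for ok, _ in records if ok)
        return served / len(records)

    def punctuality(records, ideal_mean_latency_s):
        """Latency-based punctuality = ideal total latency / actual total latency.

        Ideal total latency = average latency for non-fault cases x total
        requests (must be nonzero); the maximum value is 1, like goodput."""
        ideal_total = ideal_mean_latency_s * len(records)
        actual_total = sum(latency for _, latency in records)
        return ideal_total / actual_total

    records = [(True, 0.05), (True, 0.06), (False, 2.0), (True, 7.5)]
    print(availability(records), punctuality(records, ideal_mean_latency_s=0.05))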
17. Representing latency (3)
- Aggregate punctuality ignores brief, severe spikes in wait time (bad for user experience)
- Can capture these in a separate statistic (e.g., 1 of 100k responses took >8 sec); a companion sketch follows this list
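
A companion to the punctuality sketch above (same assumed record format), reporting the fraction of responses that exceeded the user-timeout threshold:

    def tail_fraction(records, threshold_s=8.0):
        """Fraction of responses whose wait time exceeded the threshold."""
        slow = sum(1 for _, latency in records if latency > threshold_s)
        return slow / len(records)

    # e.g. "1 of 100k responses took >8 sec" corresponds to tail_fraction(...) == 1e-5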
18. Availability and punctuality
19. Other metrics
- Data quality, latency and throughput are interrelated
  - Is a 5-second wait for a response worse than waiting 1 second to get a "try back later" message?
- To combine data quality, latency and throughput, can use a demerit system (proposed by Keynote) [1]
  - These can be very arbitrary, so it's important that the demerit formula be straightforward and publicly available
[1] Zona Research and Keynote Systems, "The Need for Speed II," 2001.
20. Sample demerit system
- Rules (scored as in the sketch after this list)
  - Each aborted connection (>2 s): 2 demerits
  - Each connection error: 1 demerit
  - Each user timeout (>8 s): 8 demerits
  - Each second of total latency above the ideal level: (1 demerit / total requests) x scaling factor
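
A minimal sketch of this scoring rule, with hypothetical argument names; the scaling factor is a free parameter, which is one reason the previous slide insists the formula be published.

    def demerits(aborted_conns, conn_errors, user_timeouts,
                 total_latency_s, ideal_total_latency_s,
                 total_requests, scaling_factor=1.0):
        """Combine connection failures and excess latency into one demerit score."""
        score = 2.0 * aborted_conns        # aborted (>2 s) connections
        score += 1.0 * conn_errors         # connection errors
        score += 8.0 * user_timeouts       # user timeouts (>8 s)
        excess_s = max(0.0, total_latency_s - ideal_total_latency_s)
        score += excess_s * (1.0 / total_requests) * scaling_factor
        return score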
21. Online service optimization
- Performance metrics: throughput, latency, data quality
- Environment: workload, faults
22. Conclusions
- Latency-based punctuality and throughput-based availability give similar results for a read-only web workload
- Applied workload is very important
  - Reliability metrics do not (and should not) reflect maximum performance/workload!
- Latency did not degrade gracefully in proportion to workload
  - At high loads, PRESS oscillates between full service and 100% load shedding
23. Further work
- Combine test results with predicted component failure rates to get long-term performability estimates (are these useful?)
- Further study will benefit from more sophisticated client workload simulators
- Services that generate dynamic content should lead to more interesting data (e.g., RUBiS)
24. Latency as a Performability Metric: Experimental Results
- Pete Broadwell
- pbwell@cs.berkeley.edu
25. Example long-term model
- Discrete-time Markov chain (DTMC) model of a RAID-5 disk array [1]
- p_i(t): probability that the system is in state i at time t
- D: number of data disks
- w_i(t): reward (disk I/O operations/sec)
- μ: disk repair rate
- λ: failure rate of a single disk drive
[1] Hannu H. Kari, Ph.D. thesis, Helsinki University of Technology, 1997.
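
A minimal numerical sketch of how such a model could be evaluated, with illustrative transition probabilities and rewards standing in for λ, μ, D, and w_i; none of the numbers below come from the cited thesis.

    import numpy as np

    # Illustrative DTMC for a RAID-5 array: states = [healthy, degraded, data loss].
    # Per-step transition probabilities are placeholders for D*lambda (first disk
    # failure), lambda (second failure before repair), and mu (repair).
    P = np.array([
        [0.995, 0.005, 0.000],   # healthy  -> degraded
        [0.100, 0.890, 0.010],   # degraded -> repaired or data loss
        [0.000, 0.000, 1.000],   # data loss is absorbing
    ])
    w = np.array([1000.0, 600.0, 0.0])   # reward w_i: disk I/O operations/sec per state

    p = np.array([1.0, 0.0, 0.0])        # p(0): start with all disks up
    for _ in range(8760):                # e.g. one step per hour for a year
        p = p @ P                        # p(t+1) = p(t) P
    print(p, float(p @ w))               # state probabilities and expected reward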