Title: Initial Availability Benchmarking of a Database System
1. Initial Availability Benchmarking of a Database System
- Aaron Brown
- abrown_at_cs.berkeley.edu
- DBLunch Seminar, 1/23/01
2. Motivation
- Availability is a key metric for modern apps
  - e-commerce, enterprise apps, online services, ISPs
- Database availability is particularly important
  - databases hold the critical hard state for most enterprise and e-business applications
  - the most important system component to keep available
- We trust databases to be highly dependable. Should we?
  - how do DBMSs react to hardware faults/failures?
  - what is the user-visible impact of such failures?
3. Overview of approach
- Use availability benchmarking to evaluate database dependability
  - an empirical technique based on simulated faults
- Study a 3-tier OLTP workload
  - back-end: commercial database
  - middleware: transaction monitor and business logic
  - front-end: web-based form interface
- Focus on storage system faults/failures
- Measure availability in terms of performance
  - also possible to look at consistency of data
4. Outline
- Availability benchmarking methodology
- Adapting methodology for OLTP databases
- Case study of Microsoft SQL Server 2000
- Discussion and future directions
5. Availability benchmarking
- A general methodology for defining and measuring availability
  - focused toward research, not marketing
  - empirically demonstrated with software RAID systems [USENIX '00]
- 3 components:
  1) metrics
  2) benchmarking techniques
  3) representation of results
6. Part 1: Availability metrics
- Traditionally, the percentage of time the system is up
  - time-averaged, binary view of system state (up/down)
- This metric is inflexible
  - doesn't capture degraded states: a non-binary spectrum between up and down
  - time-averaging discards important temporal behavior
  - compare two systems with 96.7% traditional availability (worked out in the sketch below):
    - system A is down for 2 seconds per minute
    - system B is down for 1 day per month
- Our solution: measure variation in system quality-of-service metrics over time
  - performance, fault-tolerance, completeness, accuracy
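Worked out, the two "equally available" systems above produce the same traditional availability number despite very different user-visible behavior (a minimal sketch; the 30-day month is an assumption):

```python
# Traditional availability = uptime / total time, time-averaged.
avail_A = (60 - 2) / 60        # system A: down 2 seconds every minute
avail_B = (30 - 1) / 30        # system B: down 1 day in a 30-day month
print(f"system A: {avail_A:.1%}")   # -> 96.7%
print(f"system B: {avail_B:.1%}")   # -> 96.7%
```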
7. Part 2: Measurement techniques
- Goal: quantify variation in QoS metrics as system availability is compromised
- Leverage existing performance benchmarks
  - to measure/trace quality-of-service metrics
  - to generate fair workloads
- Use fault injection to compromise the system
  - hardware and software faults
  - maintenance events (repairs, SW/HW upgrades)
- Examine single-fault and multi-fault workloads
  - the availability analogues of performance micro- and macro-benchmarks
  - (a single-fault run is sketched below)
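One such single-fault micro-benchmark run could be driven by a loop like the following sketch; `inject_fault` and `sample_qos` are hypothetical caller-supplied hooks into the fault-injection harness and the workload driver, not the actual harness API:

```python
import time

def single_fault_run(inject_fault, sample_qos, warmup_s=300, observe_s=600,
                     sample_s=10):
    """Trace a QoS metric over time while injecting one fault partway through."""
    trace = []
    fault_time = time.time() + warmup_s
    end_time = fault_time + observe_s
    injected = False
    while time.time() < end_time:
        if not injected and time.time() >= fault_time:
            inject_fault()                          # e.g., sticky uncorrectable read
            injected = True
        trace.append((time.time(), sample_qos()))   # e.g., committed txns/min
        time.sleep(sample_s)
    return trace
```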
8. Part 3: Representing results
- Results are most accessible graphically
  - plot change in QoS metrics over time
  - compare to normal behavior
    - 99% confidence intervals calculated from no-fault runs (as sketched below)
- Graphs can be distilled into numbers
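The no-fault confidence band could be computed roughly as follows (a sketch, assuming several no-fault runs sampled on the same grid; the study's exact interval construction may differ):

```python
import statistics

def no_fault_band(no_fault_runs, z=2.576):
    """Per-sample 99% confidence band from several aligned no-fault QoS traces.
    z = 2.576 is the two-sided 99% normal quantile."""
    band = []
    for samples in zip(*no_fault_runs):     # QoS samples at the same time offset
        mean = statistics.mean(samples)
        sem = statistics.stdev(samples) / len(samples) ** 0.5
        band.append((mean - z * sem, mean + z * sem))
    return band
```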
9. Outline
- Availability benchmarking methodology
- Adapting the methodology for OLTP databases
  - metrics
  - workload and fault injection
- Case study of Microsoft SQL Server 2000
- Discussion and future directions
10. Availability metrics for databases
- Possible OLTP quality-of-service metrics:
  - transaction throughput
  - transaction response time
  - percentage of transactions longer than a fixed cutoff
  - rate of transactions aborted due to errors
  - consistency of database
  - fraction of database content available
- Our experiments focused on throughput
  - rates of normal and failed transactions (computed as in the sketch below)
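From a per-transaction trace, the candidate metrics above could be computed roughly like this (a sketch; the (committed, response-time) record layout and the 5-second cutoff are illustrative assumptions):

```python
def oltp_qos_metrics(txns, window_min, cutoff_s=5.0):
    """Compute candidate QoS metrics from (committed, response_time_s) pairs
    collected over a window of `window_min` minutes."""
    ok_times = [t for committed, t in txns if committed]
    aborted = sum(1 for committed, _ in txns if not committed)
    n_ok = max(len(ok_times), 1)                      # avoid division by zero
    return {
        "throughput_tpm": len(ok_times) / window_min,     # normal txns/min
        "abort_rate_tpm": aborted / window_min,           # failed txns/min
        "mean_response_s": sum(ok_times) / n_ok,
        "slow_fraction": sum(t > cutoff_s for t in ok_times) / n_ok,
    }
```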
11. Workload and fault injection
- Performance workload
  - easy: TPC-C
- Fault workload: disk subsystem (sketched below)
  - realistic fault set based on the Tertiary Disk study
    - correctable and uncorrectable media errors, hardware errors, power failures, disk hangs/timeouts
    - both transient and sticky faults
  - injected via an emulated SCSI disk (0.5 ms overhead)
  - faults injected in one of two partitions
    - database data partition
    - database's write-ahead log partition
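The fault workload could be described by a small specification like the following (a sketch; the field names and values are illustrative assumptions, not the emulated SCSI disk's actual interface):

```python
from dataclasses import dataclass

@dataclass
class DiskFault:
    kind: str     # e.g., "correctable_read", "uncorrectable_write",
                  #       "hw_error", "power_failure", "scsi_hang"
    sticky: bool  # sticky (persists on retry) vs. transient (one-shot)
    target: str   # "data" or "log" partition

FAULT_WORKLOAD = [
    DiskFault("uncorrectable_read", sticky=True, target="data"),
    DiskFault("uncorrectable_write", sticky=True, target="log"),
    DiskFault("power_failure", sticky=True, target="log"),
    # ... one entry per fault type / partition combination (14 fault types)
]
```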
12. Outline
- Availability benchmarking methodology
- Adapting methodology for OLTP databases
- Case study of Microsoft SQL Server 2000
- Discussion and future directions
13. Experimental setup
- Database
  - Microsoft SQL Server 2000, default configuration
- Middleware/front-end software
  - Microsoft COM transaction monitor/coordinator
  - IIS 5.0 web server with Microsoft's tpcc.dll HTML terminal interface and business logic
  - Microsoft BenchCraft remote terminal emulator
- TPC-C-like OLTP order-entry workload
  - 10 warehouses, 100 active users, 860 MB database
- Measured metrics
  - throughput of correct NewOrder transactions (txn/min)
  - rate of aborted NewOrder transactions (txn/min)
14. Experimental setup (2)
[System diagram]
- DB server: Intel P-III/450, 256 MB DRAM, Windows 2000 AS; SQL Server 2000; Adaptec 3940 SCSI adapter; IBM 18 GB 10k RPM DB data/log disks
- Front end: AMD K6-2/333, 128 MB DRAM, Windows 2000 AS; MS BenchCraft RTE, IIS, MS tpcc.dll, MS COM
- Machines connected by 100 Mb Ethernet
- Database installed in one of two configurations
  - data on emulated disk, log on real (IBM) disk
  - data on real (IBM) disk, log on emulated disk
15. Results
- All results are from single-fault micro-benchmarks
  - 14 different fault types
  - injected once for each of the data and log partitions
- 4 categories of behavior detected (a rough classification sketch follows below):
  1) normal
  2) transient glitch
  3) degraded
  4) failed
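Mechanically, the four categories could be separated from a throughput trace and the no-fault confidence band along these lines (the thresholds and rules are illustrative assumptions, not the criteria used in the study):

```python
def classify_run(trace, band, zero_tpm=1.0):
    """Classify one run given per-sample throughput and the no-fault band."""
    below = [i for i, (qos, (lo, _hi)) in enumerate(zip(trace, band)) if qos < lo]
    if not below:
        return "normal"                  # never leaves the no-fault band
    if all(qos < zero_tpm for qos in trace[below[0]:]):
        return "failed"                  # throughput collapses and stays down
    if len(below) <= 2:
        return "transient glitch"        # brief dip, then back inside the band
    return "degraded"                    # sustained but nonzero throughput loss
```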
16. Type 1: normal behavior
- System tolerates the fault
- Demonstrated for all sector-level faults except:
  - sticky uncorrectable read, data partition
  - sticky uncorrectable write, log partition
17. Type 2: transient glitch
- One transaction is affected and aborts with an error
- Subsequent transactions using the same data would fail
- Demonstrated for one fault only
  - sticky uncorrectable read, data partition
18. Type 3: degraded behavior
- DBMS survives the error after running log recovery
- Middleware partially fails, resulting in degraded performance
- Demonstrated for one fault only
  - sticky uncorrectable write, log partition
19. Type 4: failure
- Example behaviors (10 distinct variants observed)
  - disk hang during a write to the data disk
  - simulated log disk power failure
- DBMS hangs or aborts all transactions
- Middleware behaves erratically, sometimes crashing
- Demonstrated for all fatal disk-level faults
  - SCSI hangs, disk power failures
20. Results summary
- DBMS was robust to a wide range of faults
  - tolerated all transient and recoverable errors
  - tolerated some unrecoverable faults
    - transparently (e.g., uncorrectable data writes)
    - or by reflecting the fault back via transaction abort
    - these were not tolerated by the SW RAID systems
- Overall, the DBMS is significantly more robust to disk faults than software RAID on the same OS!
21. Outline
- Availability benchmarking methodology
- Adapting methodology for OLTP databases
- Case study of Microsoft SQL Server 2000
- Discussion and future directions
22. Results discussion
- DBMS's extra robustness comes from:
  - redundant data representation in the form of the log
  - transactions
    - standard mechanism for reporting errors (txn abort)
    - encapsulate a meaningful unit of work, providing consistent rollback upon failure
    - compare RAID: blocks don't let you do this
- But middleware was not robust, compromising overall system availability
  - crashed or behaved erratically when the DBMS recovered or returned errors (more robust handling is sketched below)
  - user cannot distinguish DBMS and middleware failure
  - the system is only as robust as its weakest component!
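A transaction abort only helps end-to-end availability if the layer above handles it; a minimal sketch of what that handling could look like in the middleware (`run_txn` and the exception type are hypothetical stand-ins, not the actual COM/IIS code):

```python
class TransactionAborted(Exception):
    """Stand-in for the DBMS reporting an error by aborting the transaction."""

def submit_with_retry(run_txn, retries=3):
    """Turn a DBMS-reported abort into a bounded retry or a clean user-visible
    error, instead of crashing or behaving erratically."""
    last_error = None
    for _ in range(retries):
        try:
            return run_txn()                 # DBMS has rolled back on abort
        except TransactionAborted as err:
            last_error = err                 # state is consistent; safe to retry
    return f"transaction failed after {retries} attempts: {last_error}"
```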
23. Discussion of methodology
- The general availability benchmarking methodology does work on more than just RAID systems
- Issues in adapting the methodology:
  - defining appropriate metrics
  - measuring non-performance availability metrics
  - understanding layered (multi-tier) systems with only end-to-end instrumentation
24. Discussion of methodology
DO NOT PROJECT THIS SLIDE!
- The general availability benchmarking methodology does work on more than just RAID systems
- Issues in adapting the methodology:
  - defining appropriate metrics
    - metrics to capture database ACID properties
    - adapting binary metrics such as data consistency
  - measuring non-performance availability metrics
    - existing benchmarks (like TPC-C) may not do this
  - understanding layered (multi-tier) systems with only end-to-end instrumentation
    - teasing apart the availability impact of the different layers
25. Future directions
- Direct extensions of this work:
  - expand metrics, including tests of ACID properties
  - consider other fault injection points besides disks
  - investigate clustered database designs
  - study issues in benchmarking layered systems
26. Future directions (2)
- Availability/maintainability extensions to TPC
  - proposed by James Hamilton at the ISTORE retreat
  - an optional maintainability test after the regular run
    - sponsor supplies its N best administrators
    - TPC benchmark run repeated with realistic fault injection and a set of maintenance tasks to perform
    - measure availability, performance, admin. time, ...
  - requires:
    - characterization of typical failure modes and admin. tasks
    - a scalable, easy-to-deploy fault-injection harness
- This work is a (small) step toward that goal
  - and hints at the poor state of the art in TPC-C benchmark middleware fault handling
27. Thanks!
- Microsoft SQL Server group
  - for generously providing access to SQL Server 2000 and the Microsoft TPC-C Benchmark Kit
  - James Hamilton
  - Jamie Redding and Charles Levine
28. Backup slides
29. Example results: failing data disk
[QoS-over-time graphs for four fault injections into the data disk]
- Transient, correctable read fault (system tolerates fault)
- Sticky, uncorrectable read fault (transaction is aborted with error)
- Disk hang between SCSI commands (DBMS hangs, middleware returns errors)
- Disk hang during a data write (DBMS hangs, middleware crashes)
30. Example results: failing log disk
[QoS-over-time graphs for four fault injections into the log disk]
- Transient, correctable write fault (system tolerates fault)
- Sticky, uncorrectable write fault (DBMS recovers, middleware degrades)
- Simulated disk power failure (DBMS aborts all txns with errors)
- Disk hang between SCSI commands (DBMS hangs, middleware hangs)