Title: Initial Availability Benchmarking of a Database System
1. Initial Availability Benchmarking of a Database System
- Aaron Brown
- abrown_at_cs.berkeley.edu
- DBLunch Seminar, 1/23/01
2. Motivation
- Availability is a key metric for modern apps
  - e-commerce, enterprise apps, online services, ISPs
- Database availability is particularly important
  - databases hold the critical hard state for most enterprise and e-business applications
  - the most important system component to keep available
- We trust databases to be highly dependable. Should we?
  - how do DBMSs react to hardware faults/failures?
  - what is the user-visible impact of such failures?
3. Overview of approach
- Use availability benchmarking to evaluate database dependability
  - an empirical technique based on simulated faults
- Study a 3-tier OLTP workload
  - back-end: commercial database
  - middleware: transaction monitor and business logic
  - front-end: web-based form interface
- Focus on storage system faults/failures
- Measure availability in terms of performance
  - also possible to look at consistency of data
4. Outline
- Availability benchmarking methodology
- Adapting methodology for OLTP databases
- Case study of Microsoft SQL Server 2000
- Discussion and future directions
5. Availability benchmarking
- A general methodology for defining and measuring availability
  - focused toward research, not marketing
  - empirically demonstrated with software RAID systems [USENIX '00]
- 3 components:
  1) metrics
  2) benchmarking techniques
  3) representation of results
6. Part 1: Availability metrics
- Traditionally, the percentage of time the system is up
  - time-averaged, binary view of system state (up/down)
- This metric is inflexible
  - doesn't capture degraded states: a non-binary spectrum between up and down
  - time-averaging discards important temporal behavior
  - compare two systems with 96.7% traditional availability (worked out in the sketch below):
    - system A is down for 2 seconds per minute
    - system B is down for 1 day per month
- Our solution: measure variation in system quality-of-service metrics over time
  - performance, fault-tolerance, completeness, accuracy
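Worked out, the two "equally available" systems above produce the same traditional availability number despite very different user-visible behavior (a minimal sketch; the 30-day month is an assumption):

```python
# Traditional availability = uptime / total time, time-averaged.
avail_A = (60 - 2) / 60        # system A: down 2 seconds every minute
avail_B = (30 - 1) / 30        # system B: down 1 day in a 30-day month
print(f"system A: {avail_A:.1%}")   # -> 96.7%
print(f"system B: {avail_B:.1%}")   # -> 96.7%
```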
7. Part 2: Measurement techniques
- Goal: quantify variation in QoS metrics as system availability is compromised
- Leverage existing performance benchmarks
  - to measure/trace quality-of-service metrics
  - to generate fair workloads
- Use fault injection to compromise the system
  - hardware and software faults
  - maintenance events (repairs, SW/HW upgrades)
- Examine single-fault and multi-fault workloads
  - the availability analogues of performance micro- and macro-benchmarks
  - (a single-fault run is sketched below)
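One such single-fault micro-benchmark run could be driven by a loop like the following sketch; `inject_fault` and `sample_qos` are hypothetical caller-supplied hooks into the fault-injection harness and the workload driver, not the actual harness API:

```python
import time

def single_fault_run(inject_fault, sample_qos, warmup_s=300, observe_s=600,
                     sample_s=10):
    """Trace a QoS metric over time while injecting one fault partway through."""
    trace = []
    fault_time = time.time() + warmup_s
    end_time = fault_time + observe_s
    injected = False
    while time.time() < end_time:
        if not injected and time.time() >= fault_time:
            inject_fault()                          # e.g., sticky uncorrectable read
            injected = True
        trace.append((time.time(), sample_qos()))   # e.g., committed txns/min
        time.sleep(sample_s)
    return trace
```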
8. Part 3: Representing results
- Results are most accessible graphically
  - plot change in QoS metrics over time
  - compare to normal behavior
    - 99% confidence intervals calculated from no-fault runs (as sketched below)
- Graphs can be distilled into numbers
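The no-fault confidence band could be computed roughly as follows (a sketch, assuming several no-fault runs sampled on the same grid; the study's exact interval construction may differ):

```python
import statistics

def no_fault_band(no_fault_runs, z=2.576):
    """Per-sample 99% confidence band from several aligned no-fault QoS traces.
    z = 2.576 is the two-sided 99% normal quantile."""
    band = []
    for samples in zip(*no_fault_runs):     # QoS samples at the same time offset
        mean = statistics.mean(samples)
        sem = statistics.stdev(samples) / len(samples) ** 0.5
        band.append((mean - z * sem, mean + z * sem))
    return band
```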
9. Outline
- Availability benchmarking methodology
- Adapting the methodology for OLTP databases
  - metrics
  - workload and fault injection
- Case study of Microsoft SQL Server 2000
- Discussion and future directions
10. Availability metrics for databases
- Possible OLTP quality-of-service metrics:
  - transaction throughput
  - transaction response time
  - percentage of transactions longer than a fixed cutoff
  - rate of transactions aborted due to errors
  - consistency of database
  - fraction of database content available
- Our experiments focused on throughput
  - rates of normal and failed transactions (computed as in the sketch below)
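From a per-transaction trace, the candidate metrics above could be computed roughly like this (a sketch; the (committed, response-time) record layout and the 5-second cutoff are illustrative assumptions):

```python
def oltp_qos_metrics(txns, window_min, cutoff_s=5.0):
    """Compute candidate QoS metrics from (committed, response_time_s) pairs
    collected over a window of `window_min` minutes."""
    ok_times = [t for committed, t in txns if committed]
    aborted = sum(1 for committed, _ in txns if not committed)
    n_ok = max(len(ok_times), 1)                      # avoid division by zero
    return {
        "throughput_tpm": len(ok_times) / window_min,     # normal txns/min
        "abort_rate_tpm": aborted / window_min,           # failed txns/min
        "mean_response_s": sum(ok_times) / n_ok,
        "slow_fraction": sum(t > cutoff_s for t in ok_times) / n_ok,
    }
```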
11. Workload and fault injection
- Performance workload
  - easy: TPC-C
- Fault workload: disk subsystem (sketched below)
  - realistic fault set based on the Tertiary Disk study
    - correctable and uncorrectable media errors, hardware errors, power failures, disk hangs/timeouts
    - both transient and sticky faults
  - injected via an emulated SCSI disk (0.5 ms overhead)
  - faults injected in one of two partitions
    - database data partition
    - database's write-ahead log partition
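The fault workload could be described by a small specification like the following (a sketch; the field names and values are illustrative assumptions, not the emulated SCSI disk's actual interface):

```python
from dataclasses import dataclass

@dataclass
class DiskFault:
    kind: str     # e.g., "correctable_read", "uncorrectable_write",
                  #       "hw_error", "power_failure", "scsi_hang"
    sticky: bool  # sticky (persists on retry) vs. transient (one-shot)
    target: str   # "data" or "log" partition

FAULT_WORKLOAD = [
    DiskFault("uncorrectable_read", sticky=True, target="data"),
    DiskFault("uncorrectable_write", sticky=True, target="log"),
    DiskFault("power_failure", sticky=True, target="log"),
    # ... one entry per fault type / partition combination (14 fault types)
]
```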
12. Outline
- Availability benchmarking methodology
- Adapting methodology for OLTP databases
- Case study of Microsoft SQL Server 2000
- Discussion and future directions
13. Experimental setup
- Database
  - Microsoft SQL Server 2000, default configuration
- Middleware/front-end software
  - Microsoft COM transaction monitor/coordinator
  - IIS 5.0 web server with Microsoft's tpcc.dll HTML terminal interface and business logic
  - Microsoft BenchCraft remote terminal emulator
- TPC-C-like OLTP order-entry workload
  - 10 warehouses, 100 active users, 860 MB database
- Measured metrics
  - throughput of correct NewOrder transactions (txn/min)
  - rate of aborted NewOrder transactions (txn/min)
14. Experimental setup (2)
[System diagram]
- DB server: Intel P-III/450, 256 MB DRAM, Windows 2000 AS; SQL Server 2000; Adaptec 3940 SCSI adapter; IBM 18 GB 10k RPM DB data/log disks
- Front end: AMD K6-2/333, 128 MB DRAM, Windows 2000 AS; MS BenchCraft RTE, IIS, MS tpcc.dll, MS COM
- Machines connected by 100 Mb Ethernet
- Database installed in one of two configurations
  - data on emulated disk, log on real (IBM) disk
  - data on real (IBM) disk, log on emulated disk
15. Results
- All results are from single-fault micro-benchmarks
  - 14 different fault types
  - injected once for each of the data and log partitions
- 4 categories of behavior detected (a rough classification sketch follows below):
  1) normal
  2) transient glitch
  3) degraded
  4) failed
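Mechanically, the four categories could be separated from a throughput trace and the no-fault confidence band along these lines (the thresholds and rules are illustrative assumptions, not the criteria used in the study):

```python
def classify_run(trace, band, zero_tpm=1.0):
    """Classify one run given per-sample throughput and the no-fault band."""
    below = [i for i, (qos, (lo, _hi)) in enumerate(zip(trace, band)) if qos < lo]
    if not below:
        return "normal"                  # never leaves the no-fault band
    if all(qos < zero_tpm for qos in trace[below[0]:]):
        return "failed"                  # throughput collapses and stays down
    if len(below) <= 2:
        return "transient glitch"        # brief dip, then back inside the band
    return "degraded"                    # sustained but nonzero throughput loss
```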
16. Type 1: normal behavior
- System tolerates the fault
- Demonstrated for all sector-level faults except:
  - sticky uncorrectable read, data partition
  - sticky uncorrectable write, log partition
17. Type 2: transient glitch
- One transaction is affected and aborts with an error
- Subsequent transactions using the same data would fail
- Demonstrated for one fault only
  - sticky uncorrectable read, data partition
18. Type 3: degraded behavior
- DBMS survives the error after running log recovery
- Middleware partially fails, resulting in degraded performance
- Demonstrated for one fault only
  - sticky uncorrectable write, log partition
19. Type 4: failure
- Example behaviors (10 distinct variants observed)
  - disk hang during a write to the data disk
  - simulated log disk power failure
- DBMS hangs or aborts all transactions
- Middleware behaves erratically, sometimes crashing
- Demonstrated for all fatal disk-level faults
  - SCSI hangs, disk power failures
20. Results summary
- DBMS was robust to a wide range of faults
  - tolerated all transient and recoverable errors
  - tolerated some unrecoverable faults
    - transparently (e.g., uncorrectable data writes)
    - or by reflecting the fault back via transaction abort
    - these were not tolerated by the SW RAID systems
- Overall, the DBMS is significantly more robust to disk faults than software RAID on the same OS!
21. Outline
- Availability benchmarking methodology
- Adapting methodology for OLTP databases
- Case study of Microsoft SQL Server 2000
- Discussion and future directions
22. Results discussion
- DBMS's extra robustness comes from:
  - redundant data representation in the form of the log
  - transactions
    - standard mechanism for reporting errors (txn abort)
    - encapsulate a meaningful unit of work, providing consistent rollback upon failure
    - compare RAID: blocks don't let you do this
- But middleware was not robust, compromising overall system availability
  - crashed or behaved erratically when the DBMS recovered or returned errors (more robust handling is sketched below)
  - user cannot distinguish DBMS and middleware failure
  - the system is only as robust as its weakest component!
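A transaction abort only helps end-to-end availability if the layer above handles it; a minimal sketch of what that handling could look like in the middleware (`run_txn` and the exception type are hypothetical stand-ins, not the actual COM/IIS code):

```python
class TransactionAborted(Exception):
    """Stand-in for the DBMS reporting an error by aborting the transaction."""

def submit_with_retry(run_txn, retries=3):
    """Turn a DBMS-reported abort into a bounded retry or a clean user-visible
    error, instead of crashing or behaving erratically."""
    last_error = None
    for _ in range(retries):
        try:
            return run_txn()                 # DBMS has rolled back on abort
        except TransactionAborted as err:
            last_error = err                 # state is consistent; safe to retry
    return f"transaction failed after {retries} attempts: {last_error}"
```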
23. Discussion of methodology
- The general availability benchmarking methodology does work on more than just RAID systems
- Issues in adapting the methodology:
  - defining appropriate metrics
  - measuring non-performance availability metrics
  - understanding layered (multi-tier) systems with only end-to-end instrumentation
24. Discussion of methodology
DO NOT PROJECT THIS SLIDE!
- The general availability benchmarking methodology does work on more than just RAID systems
- Issues in adapting the methodology:
  - defining appropriate metrics
    - metrics to capture database ACID properties
    - adapting binary metrics such as data consistency
  - measuring non-performance availability metrics
    - existing benchmarks (like TPC-C) may not do this
  - understanding layered (multi-tier) systems with only end-to-end instrumentation
    - teasing apart the availability impact of the different layers
25. Future directions
- Direct extensions of this work:
  - expand metrics, including tests of ACID properties
  - consider other fault injection points besides disks
  - investigate clustered database designs
  - study issues in benchmarking layered systems
26. Future directions (2)
- Availability/maintainability extensions to TPC
  - proposed by James Hamilton at the ISTORE retreat
  - an optional maintainability test after the regular run
    - sponsor supplies its N best administrators
    - TPC benchmark run repeated with realistic fault injection and a set of maintenance tasks to perform
    - measure availability, performance, admin. time, ...
  - requires:
    - characterization of typical failure modes and admin. tasks
    - a scalable, easy-to-deploy fault-injection harness
- This work is a (small) step toward that goal
  - and hints at the poor state of the art in TPC-C benchmark middleware fault handling
27. Thanks!
- Microsoft SQL Server group
  - for generously providing access to SQL Server 2000 and the Microsoft TPC-C Benchmark Kit
  - James Hamilton
  - Jamie Redding and Charles Levine
28. Backup slides
29. Example results: failing data disk
[QoS-over-time graphs for four fault injections into the data disk]
- Transient, correctable read fault (system tolerates fault)
- Sticky, uncorrectable read fault (transaction is aborted with error)
- Disk hang between SCSI commands (DBMS hangs, middleware returns errors)
- Disk hang during a data write (DBMS hangs, middleware crashes)
30. Example results: failing log disk
[QoS-over-time graphs for four fault injections into the log disk]
- Transient, correctable write fault (system tolerates fault)
- Sticky, uncorrectable write fault (DBMS recovers, middleware degrades)
- Simulated disk power failure (DBMS aborts all txns with errors)
- Disk hang between SCSI commands (DBMS hangs, middleware hangs)