1
Initial Availability Benchmarking of a Database System
  • Aaron Brown
  • abrown@cs.berkeley.edu
  • DBLunch Seminar, 1/23/01

2
Motivation
  • Availability is a key metric for modern apps.
  • e-commerce, enterprise apps, online services,
    ISPs
  • Database availability is particularly important
  • databases hold the critical hard state for most
    enterprise and e-business applications
  • the most important system component to keep
    available
  • we trust databases to be highly dependable.
    Should we?
  • how do DBMSs react to hardware faults/failures?
  • what is the user-visible impact of such failures?

3
Overview of approach
  • Use availability benchmarking to evaluate
    database dependability
  • an empirical technique based on simulated faults
  • study 3-tier OLTP workload
  • back-end commercial database
  • middleware: transaction monitor and business logic
  • front-end web-based form interface
  • focus on storage system faults/failures
  • measure availability in terms of performance
  • also possible to look at consistency of data

4
Outline
  • Availability benchmarking methodology
  • Adapting methodology for OLTP databases
  • Case study of Microsoft SQL Server 2000
  • Discussion and future directions

5
Availability benchmarking
  • A general methodology for defining and measuring
    availability
  • focused toward research, not marketing
  • empirically demonstrated with software RAID
    systems Usenix00
  • 3 components
  • 1) metrics
  • 2) benchmarking techniques
  • 3) representation of results

6
Part 1: Availability metrics
  • Traditionally, percentage of time system is up
  • time-averaged, binary view of system state
    (up/down)
  • This metric is inflexible
  • doesn't capture degraded states
  • a non-binary spectrum between up and down
  • time-averaging discards important temporal
    behavior
  • compare 2 systems with 96.7% traditional
    availability (see the arithmetic sketch below)
  • system A is down for 2 seconds per minute
  • system B is down for 1 day per month
  • Our solution: measure variation in system quality
    of service metrics over time
  • performance, fault-tolerance, completeness,
    accuracy
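A quick arithmetic sketch (plain Python; the 30-day month is an assumption) shows why the time-averaged metric cannot tell the two systems apart:

    # Both failure patterns average out to the same availability,
    # even though their user-visible behavior differs enormously.
    def availability(up, period):
        """Traditional metric: fraction of a period the system is up."""
        return up / period

    system_a = availability(58, 60)   # down 2 seconds per minute
    system_b = availability(29, 30)   # down 1 day per 30-day month
    print(f"A: {system_a:.1%}  B: {system_b:.1%}")   # A: 96.7%  B: 96.7%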

7
Part 2: Measurement techniques
  • Goal: quantify variation in QoS metrics as system
    availability is compromised
  • Leverage existing performance benchmarks
  • to measure and trace quality of service metrics
  • to generate fair workloads
  • Use fault injection to compromise system
  • hardware and software faults
  • maintenance events (repairs, SW/HW upgrades)
  • Examine single-fault and multi-fault workloads
  • the availability analogues of performance micro-
    and macro-benchmarks
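As a sketch of what one single-fault micro-benchmark run looks like in practice (all hook names here are hypothetical stand-ins, not the real harness's API): start the performance workload, inject exactly one fault partway through, and record the QoS trace for later plotting.

    import time

    def run_single_fault_benchmark(start_workload, inject_fault, sample_qos,
                                   fault_at_s=300, total_s=900):
        """Drive one single-fault micro-benchmark and return the QoS trace."""
        workload = start_workload()        # e.g., a TPC-C load generator
        trace, start, injected = [], time.time(), False
        while (elapsed := time.time() - start) < total_s:
            if not injected and elapsed >= fault_at_s:
                inject_fault()             # e.g., a sticky uncorrectable read
                injected = True
            trace.append((elapsed, sample_qos(workload)))
            time.sleep(1)                  # sample QoS once per second
        return trace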

8
Part 3: Representing results
  • Results are most accessible graphically
  • plot change in QoS metrics over time
  • compare to normal behavior
  • 99% confidence intervals calculated from no-fault
    runs
  • Graphs can be distilled into numbers
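For example, the normal-behavior envelope can be derived from repeated fault-free runs; a minimal sketch using only the Python standard library (the throughput numbers are invented):

    from statistics import NormalDist, mean, stdev

    # Per-minute throughput from several no-fault runs (invented numbers).
    no_fault = [820, 834, 811, 828, 819, 825, 831, 816]

    m, s = mean(no_fault), stdev(no_fault)
    z = NormalDist().inv_cdf(0.995)              # two-sided 99% interval
    half = z * s / len(no_fault) ** 0.5
    print(f"normal behavior: {m:.0f} +/- {half:.0f} txns/min")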

9
Outline
  • Availability benchmarking methodology
  • Adapting methodology for OLTP databases
  • metrics
  • workload and fault injection
  • Case study of Microsoft SQL Server 2000
  • Discussion and future directions

10
Availability metrics for databases
  • Possible OLTP quality of service metrics
  • transaction throughput
  • transaction response time
  • percentage of transactions longer than a fixed
    cutoff
  • rate of transactions aborted due to errors
  • consistency of database
  • fraction of database content available
  • Our experiments focused on throughput
  • rates of normal and failed transactions
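A sketch of how the two rates could be distilled from a raw transaction trace (the record format is an assumption, not the benchmark kit's):

    from collections import Counter

    def per_minute_rates(trace):
        """trace: iterable of (completion_time_s, status) records.
        Returns [(minute, committed_count, failed_count), ...]."""
        ok, bad = Counter(), Counter()
        for t, status in trace:
            (ok if status == "committed" else bad)[int(t // 60)] += 1
        last = max(list(ok) + list(bad), default=-1)
        return [(m, ok[m], bad[m]) for m in range(last + 1)]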

11
Workload and fault injection
  • Performance workload
  • easy: TPC-C
  • Fault workload: disk subsystem
  • realistic fault set based on Tertiary Disk study
  • correctable and uncorrectable media errors,
    hardware errors, power failures, disk
    hangs/timeouts
  • both transient and sticky faults
  • injected via an emulated SCSI disk (0.5ms
    overhead)
  • faults injected in one of two partitions
  • database data partition
  • database's write-ahead log partition
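One way to spell out such a fault matrix as data; a sketch with assumed names (abridged: the study injected 14 fault types):

    from dataclasses import dataclass
    from enum import Enum
    from itertools import product

    class Fault(Enum):                     # abridged; the study used 14 types
        CORRECTABLE_READ = "correctable media error (read)"
        UNCORRECTABLE_READ = "uncorrectable media error (read)"
        UNCORRECTABLE_WRITE = "uncorrectable media error (write)"
        POWER_FAILURE = "disk power failure"
        SCSI_HANG = "disk hang/timeout"

    @dataclass(frozen=True)
    class Injection:
        fault: Fault
        sticky: bool                       # sticky (persistent) vs. transient
        partition: str                     # "data" or "log"

    # One single-fault run per matrix entry.
    runs = [Injection(f, s, p)
            for f, s, p in product(Fault, (True, False), ("data", "log"))]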

12
Outline
  • Availability benchmarking methodology
  • Adapting methodology for OLTP databases
  • Case study of Microsoft SQL Server 2000
  • Discussion and future directions

13
Experimental setup
  • Database
  • Microsoft SQL Server 2000, default configuration
  • Middleware/front-end software
  • Microsoft COM transaction monitor/coordinator
  • IIS 5.0 web server with Microsoft's tpcc.dll HTML
    terminal interface and business logic
  • Microsoft BenchCraft remote terminal emulator
  • TPC-C-like OLTP order-entry workload
  • 10 warehouses, 100 active users, 860 MB database
  • Measured metrics
  • throughput of correct NewOrder transactions/min
  • rate of aborted NewOrder transactions (txn/min)

14
Experimental setup (2)
[Diagram: testbed configuration]
  • DB server: Intel P-III/450, 256 MB DRAM, Windows
    2000 AS, running SQL Server 2000; Adaptec 3940
    SCSI controller; IBM 18 GB 10k RPM data/log disks
  • Front end: AMD K6-2/333, 128 MB DRAM, Windows
    2000 AS, running MS BenchCraft RTE, IIS, MS
    tpcc.dll, and MS COM
  • machines connected via 100 Mb Ethernet
  • Database installed in one of two configurations
  • data on emulated disk, log on real (IBM) disk
  • data on real (IBM) disk, log on emulated disk

15
Results
  • All results are from single-fault
    micro-benchmarks
  • 14 different fault types
  • injected once for each of data and log partitions
  • 4 categories of behavior detected
  • 1) normal
  • 2) transient glitch
  • 3) degraded
  • 4) failed
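The categories can be read off a run's throughput trace mechanically; a sketch whose thresholds are illustrative assumptions, not the study's definitions:

    def classify(trace, lo):
        """trace: per-minute throughput after the fault; lo: lower edge
        of the 99% no-fault confidence interval."""
        dips = [x for x in trace if x < lo]
        if not dips:
            return "normal"                # never leaves the envelope
        if all(x == 0 for x in trace[-5:]):
            return "failed"                # throughput collapses for good
        if len(dips) <= 2:
            return "transient glitch"      # brief dip, quick recovery
        return "degraded"                  # sustained sub-normal throughput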

16
Type 1: normal behavior
  • System tolerates fault
  • Demonstrated for all sector-level faults except
  • sticky uncorrectable read, data partition
  • sticky uncorrectable write, log partition

17
Type 2: transient glitch
  • One transaction is affected, aborts with error
  • Subsequent transactions using the same data would
    fail
  • Demonstrated for one fault only
  • sticky uncorrectable read, data partition

18
Type 3: degraded behavior
  • DBMS survives the error after running log recovery
  • Middleware partially fails, resulting in degraded
    performance
  • Demonstrated for one fault only
  • sticky uncorrectable write, log partition

19
Type 4: failure
  • Example behaviors (10 distinct variants observed)

[Graphs: disk hang during write to data disk;
simulated log disk power failure]
  • DBMS hangs or aborts all transactions
  • Middleware behaves erratically, sometimes
    crashing
  • Demonstrated for all fatal disk-level faults
  • SCSI hangs, disk power failures

20
Results summary
  • DBMS was robust to a wide range of faults
  • tolerated all transient and recoverable errors
  • tolerated some unrecoverable faults
  • transparently (e.g., uncorrectable data writes)
  • or by reflecting fault back via transaction abort
  • these were not tolerated by the SW RAID systems
  • Overall, the DBMS is significantly more robust to
    disk faults than software RAID on the same OS!

21
Outline
  • Availability benchmarking methodology
  • Adapting methodology for OLTP databases
  • Case study of Microsoft SQL Server 2000
  • Discussion and future directions

22
Results discussion
  • DBMS's extra robustness comes from
  • redundant data representation in form of log
  • transactions
  • standard mechanism for reporting errors (txn
    abort)
  • encapsulate meaningful unit of work, providing
    consistent rollback upon failure
  • But, middleware was not robust, compromising
    overall system availability
  • crashed or behaved erratically when DBMS
    recovered or returned errors
  • user cannot distinguish DBMS and middleware
    failure
  • system is only as robust as its weakest
    component!

Presenter's note: compare RAID; blocks don't let you do this.
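A minimal sketch of the abort-and-rollback mechanism credited above, with sqlite3 standing in for the commercial DBMS (schema and simulated fault are invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, qty INTEGER)")

    def place_order(order_id, qty):
        """A meaningful unit of work: it either commits whole or rolls
        back whole, so a fault surfaces as a clean, reportable abort."""
        try:
            with conn:                     # BEGIN ... COMMIT / ROLLBACK
                conn.execute("INSERT INTO orders VALUES (?, ?)",
                             (order_id, qty))
                if qty <= 0:
                    raise ValueError("bad quantity")   # simulated fault
        except Exception as err:
            return f"transaction aborted: {err}"
        return "committed"

    print(place_order(1, 5))   # committed
    print(place_order(2, 0))   # transaction aborted: bad quantity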
23
Discussion of methodology
  • General availability benchmarking methodology
    does work on more than just RAID systems
  • Issues in adapting the methodology
  • defining appropriate metrics
  • measuring non-performance availability metrics
  • understanding layered (multi-tier) systems with
    only end-to-end instrumentation

24
Discussion of methodology
DO NOT PROJECT THIS SLIDE!
  • General availability benchmarking methodology
    does work on more than just RAID systems
  • Issues in adapting the methodology
  • defining appropriate metrics
  • metrics to capture database ACID properties
  • adapting binary metrics such as data
    consistency
  • measuring non-performance availability metrics
  • existing benchmarks (like TPC-C) may not do this
  • understanding layered (multi-tier) systems with
    only end-to-end instrumentation
  • teasing apart availability impact of different
    layers

25
Future directions
  • Direct extensions of this work
  • expand metrics, including tests of ACID
    properties
  • consider other fault injection points besides
    disks
  • investigate clustered database designs
  • study issues in benchmarking layered systems

26
Future directions (2)
  • Availability/maintainability extensions to TPC
  • proposed by James Hamilton at ISTORE retreat
  • an optional maintainability test after regular
    run
  • sponsor supplies N best administrators
  • TPC benchmark run repeated with realistic fault
    injection and a set of maintenance tasks to
    perform
  • measure availability, performance, admin. time,
    ...
  • requires
  • characterization of typical failure modes, admin.
    tasks
  • scalable, easy-to-deploy fault-injection harness
  • This work is a (small) step toward that goal
  • and hints at the poor state of the art in TPC-C
    benchmark middleware fault handling

27
Thanks!
  • Microsoft SQL Server group
  • for generously providing access to SQL Server
    2000 and the Microsoft TPC-C Benchmark Kit
  • James Hamilton
  • Jamie Redding and Charles Levine

28
Backup slides
29
Example results: failing data disk
[Throughput graphs for four injected faults:]
  • Transient, correctable read fault (system
    tolerates fault)
  • Sticky, uncorrectable read fault (transaction is
    aborted with error)
  • Disk hang between SCSI commands (DBMS hangs,
    middleware returns errors)
  • Disk hang during a data write (DBMS hangs,
    middleware crashes)
30
Example results: failing log disk
[Throughput graphs for four injected faults:]
  • Transient, correctable write fault (system
    tolerates fault)
  • Sticky, uncorrectable write fault (DBMS recovers,
    middleware degrades)
  • Simulated disk power failure (DBMS aborts all
    txns with errors)
  • Disk hang between SCSI commands (DBMS hangs,
    middleware hangs)