Synthesizing Representative IO Workloads for TPCH

Slides: 33

Provided by: anandsivas

Transcript and Presenter's Notes

1
Synthesizing Representative I/O Workloads for
TPC-H

J. Zhang, A. Sivasubramaniam,
H. Franke, N. Gautam, Y. Zhang, S. Nagar
Pennsylvania State University
IBM T.J. Watson
Rutgers University

2
Outline

Motivation
Related Work
Methodology
Arrival Time
Access Pattern
Request Sizes
Accuracy of synthetic traces
Concluding Remarks

3
Motivation

I/O subsystems are critical for commercial
services and in production environments.
Real applications are essential for system design
and evaluation.
TPC-H is a decision-support workload for business
enterprises.

4
Disadvantages of Traces

Not easily obtainable
Can be very large
Difficult to get statistical confidence
Very difficult to change workload behavior
Do not isolate the influence of one parameter
On the other hand, a deeper understanding of the
workload can:
Help generate a synthetic workload
Help in system design itself

5
What do we need to synthesize?

Inter-arrival times (temporal behavior) of disk
block requests.
Access pattern (spatial behavior) of blocks being
referenced
Size (volume) of each I/O request.

6
Related work

Scientific Application I/O behavior
Time-series models for arrivals
Sequentiality/Markov models for access pattern
Commercial/production workloads
Self-similar arrival patterns
Sequentiality in TPC-H/TPC-D
No prior complete synthesis of all three
attributes for TPC-H

7
Our TPC-H Workload

Trace Collection Platform
IBM Netfinity 8-way SMP with 2.5GB memory and 15
disks
Linux 2.4.17
DB2 UDB EE V7.2
TPC-H Configuration
Power Run of 22 queries
Partitioning tables across the disks
30 GB dataset

8
Validation
Original I/O traces
Identify characteristics
Generate synthetic traces
Disksim 2.0
Metrics

RMS: root-mean-square error of the differences
between two CDF curves
nRMS: RMS/m, where m is the average response time
for the original trace
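
The two metrics above can be sketched as follows. The slide does not say along which axis the two CDF curves are compared; this sketch assumes the difference is taken between response times at matching cumulative probabilities, which seems consistent with normalizing by the mean response time m. The function names and the number of comparison levels are hypothetical.

```python
import math

def quantile(sorted_xs, p):
    """Linearly interpolated quantile of a sorted list, for 0 <= p <= 1."""
    pos = p * (len(sorted_xs) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_xs) - 1)
    frac = pos - lo
    return sorted_xs[lo] * (1 - frac) + sorted_xs[hi] * frac

def rms_between_cdfs(original, synthetic, levels=100):
    """RMS of response-time differences at matching CDF levels (assumed axis)."""
    a, b = sorted(original), sorted(synthetic)
    diffs = [quantile(a, k / levels) - quantile(b, k / levels)
             for k in range(levels + 1)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def nrms(original, synthetic, levels=100):
    """nRMS = RMS / m, where m is the mean response time of the original trace."""
    m = sum(original) / len(original)
    return rms_between_cdfs(original, synthetic, levels) / m
```

With response times taken from a simulator run on each trace, a smaller nRMS means the synthetic trace reproduces the response-time distribution more closely.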

9
Overall Methodology

Arrival pattern characteristics
Investigate correlations
Time series
Self-similar
iid distributions
Access pattern characteristics
Sequentiality/pseudo sequentiality/randomness
Size characteristics
Investigating correlations between time, space
and volume to get final synthesis

10
Arrival pattern

Statistical analysis
Auto-correlation function (ACF) plots
Shows the correlation between current
inter-arrival time and one that is x-steps away

Correlations seem very weak (< 0.15 for 12
queries, and < 0.30 for the rest)
Errors with time series models
(AR/MA/ARIMA/ARFIMA) are high
No indication of self-similarity either
Perhaps iid (independent and identically
distributed) is not a bad assumption.
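
As an illustration of the ACF check above, here is a minimal sketch of the standard sample autocorrelation estimator applied to a list of inter-arrival times. It is not the authors' tool; the function name and the toy input are hypothetical.

```python
def acf(xs, lag):
    """Sample autocorrelation of xs at the given lag (0 <= lag < len(xs))."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0  # a constant series has no measurable correlation
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return cov / var

# Toy inter-arrival times that alternate, so neighbours are anti-correlated.
inter_arrivals = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]
lag1 = acf(inter_arrivals, 1)
lag2 = acf(inter_arrivals, 2)
```

For a trace whose values stay close to zero at every lag, as reported above, treating the inter-arrival times as iid is a reasonable simplification.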

Fitting distributions
Tried hyper-exponential/normal/pareto
Used Maximum Likelihood Estimator (normal/pareto)
and Expectation Maximization (hyper-exponential)
to estimate distribution parameters
Use K-S test to measure goodness-of-fit
Maximum distance between fitted distribution and
original CDF was ensured to be less than 0.1
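
The fitting step above can be sketched for the Pareto case. The maximum likelihood estimates used here are the textbook ones (scale = sample minimum, shape = n / sum of log ratios), and the K-S distance is the largest gap between the empirical and fitted CDFs. The function names and the deterministic test sample are hypothetical; the hyper-exponential fit via Expectation Maximization is not shown.

```python
import math

def fit_pareto_mle(xs):
    """Maximum likelihood estimates (xm, alpha) for a Pareto distribution.
    Assumes the sample is not constant (otherwise the log sum is zero)."""
    xm = min(xs)
    alpha = len(xs) / sum(math.log(x / xm) for x in xs)
    return xm, alpha

def pareto_cdf(x, xm, alpha):
    return 0.0 if x < xm else 1.0 - (xm / x) ** alpha

def ks_distance(xs, cdf):
    """Kolmogorov-Smirnov statistic: max gap between empirical CDF and cdf."""
    s = sorted(xs)
    n = len(s)
    d = 0.0
    for i, x in enumerate(s, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Deterministic sample: evenly spaced quantiles of Pareto(xm=1, alpha=2).
n = 1000
sample = [(1.0 - (i - 0.5) / n) ** (-1.0 / 2.0) for i in range(1, n + 1)]
xm_hat, alpha_hat = fit_pareto_mle(sample)
ks = ks_distance(sample, lambda x: pareto_cdf(x, xm_hat, alpha_hat))
accept = ks < 0.1  # threshold quoted on the slide
```

The same `ks_distance` function can be reused with any candidate CDF, so each fitted distribution can be checked against the 0.1 threshold.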

13
Comparing CDF of fitted distribution and data
14
Access Pattern (Location & Size)

Most studies use sequentiality to describe TPC-H
However, this is not always the case.

[Figure: three panels, each with axes Location and Arrival Time]
Cat1: Q10, Q4, Q14
Cat2: Q12, Q1, Q3, Q5, Q7, Q8, Q15, Q18, Q19, Q21
Cat3: Q20, Q9, Q17
15
Category 1 Intermingling sequential streams

Consider the following
Run: a strictly sequential set of I/O requests
Stream: a pseudo-sequential set of I/O requests
that could be interrupted by another stream,
i.e., a stream could have several runs that are
interrupted by runs of other streams.

16
Run and Stream
An example run of 5 requests
A stream (pseudo-sequential) of 4 requests
An example trace
17
Secondary Attributes

Run Length: number of requests in a run
Run Start Location: start sector of a run
Stream Length: number of requests in a stream
Inter-stream Jump Distance: spatial separation
between start of run and previous request
Intra-stream Jump Distance: spatial separation
between successive requests within a stream
Number of active streams (at any instant)
Interference Distance: number of requests between
2 successive requests in a stream
Derive empirical distributions for these from the
trace
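
As a minimal sketch of deriving two of these attributes from a trace: the code below splits a trace into runs and collects the run lengths and run start locations. It assumes a request continues the current run only when it starts exactly where the previous request ended, and that locations and sizes are in the same unit; both are assumptions for illustration, as are the function name and the toy trace.

```python
from collections import Counter

def split_into_runs(trace):
    """trace: list of (start, size) pairs in arrival order.
    Returns one dict per run with its start location and request count."""
    runs = []
    for start, size in trace:
        if runs and runs[-1]["next"] == start:
            # Request begins where the previous one ended: same run.
            runs[-1]["length"] += 1
            runs[-1]["next"] = start + size
        else:
            runs.append({"start": start, "length": 1, "next": start + size})
    return runs

# Toy trace: a run of 3, an interrupting run of 2, then a single request
# that resumes where the first run stopped (same stream, new run).
trace = [(0, 64), (64, 64), (128, 128), (5000, 64), (5064, 64), (256, 64)]
runs = split_into_runs(trace)
run_length_counts = Counter(r["length"] for r in runs)
run_starts = [r["start"] for r in runs]
```

The stream-level attributes (stream length, jump distances, interference distance) would need an extra step that links each new run back to the stream it resumes; that step is not shown here.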

18
Location Synthesis - Q10 (Time and size from
trace)

LocIID: locations are i.i.d.
LocRUN: incorporate run length distribution and
run start location distribution.
LocSTREAM: combine all stream and run statistics.
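
A LocRUN-style generator can be sketched as follows. It draws each run's start location and length from the values observed in the trace, which amounts to sampling the empirical distributions, and takes request sizes from the trace as the slide title indicates. The function name, argument layout, and the rule that the next request in a run starts where the previous one ended are assumptions for illustration.

```python
import random

def synthesize_locrun(run_lengths, run_starts, sizes, rng):
    """Generate one location per entry in sizes.
    run_lengths, run_starts: values observed in the original trace.
    sizes: request sizes taken from the trace, in arrival order."""
    locations = []
    i = 0
    while i < len(sizes):
        loc = rng.choice(run_starts)               # where the run begins
        for _ in range(rng.choice(run_lengths)):   # how many requests it has
            if i == len(sizes):
                break
            locations.append(loc)
            loc += sizes[i]                        # next request is contiguous
            i += 1
    return locations

rng = random.Random(0)
locations = synthesize_locrun([3], [1000], [64] * 6, rng)
```

LocIID corresponds to the degenerate case where every run has length 1 and start locations are drawn independently.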

19
Request Size

Request sizes are one of:
64, 128, 192, 256, 320, 384, 448, 512 blocks
But the attributes (location, size, time) are not
independent!

20
Correlations between size and location
[Figure; axis label: Fraction of requests]
21
Correlations between size and time
22
Correlations between location and time
23
Final Synthesis Methodology (Category 1)

Location: use LocSTREAM to generate start
locations. Two kinds of requests: a run start
request or a request within a run.
Time: use Pr(inter-arrival time | run start
requests) and Pr(inter-arrival time | within a
run requests) to generate times.
Size:
For run start requests, use Pr(size |
inter-arrival times of run start requests) to
generate sizes.
For within a run requests, use Pr(size | within a
run requests) to generate sizes.

Can be easily adapted for Category 2 (strictly
sequential) and Category 3 (random) queries.
Validation: compare the response time
characteristics of the synthesized and real traces.
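
The conditional distributions used in this methodology can be sketched as lookup tables built from the trace. The sketch below groups observed values by a condition and samples from the matching group. The function names and the toy observations are hypothetical, and a continuous condition such as inter-arrival time would first have to be binned, which is an assumption not described on the slides.

```python
import random
from collections import defaultdict

def build_conditional(observations):
    """observations: iterable of (condition, value) pairs from the trace.
    Returns a dict mapping each condition to its observed values."""
    table = defaultdict(list)
    for condition, value in observations:
        table[condition].append(value)
    return dict(table)

def sample_given(table, condition, rng):
    """Draw a value from the empirical distribution Pr(value | condition)."""
    return rng.choice(table[condition])

# Hypothetical observations: (request kind, size in blocks).
observed_sizes = [
    ("run_start", 64), ("run_start", 64), ("run_start", 128),
    ("within_run", 512), ("within_run", 512),
]
size_table = build_conditional(observed_sizes)
rng = random.Random(0)
within_run_size = sample_given(size_table, "within_run", rng)
run_start_size = sample_given(size_table, "run_start", rng)
```

The same table structure can hold inter-arrival times keyed by request kind, so time and size are generated consistently with the kind of request that LocSTREAM produced.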

25
Validation of CDF of response times (Category 1)
26
Validation of CDF of response times (Category 2)
27
Validation of CDF of response times (Category 3)
28
Storage Requirements
[Figure: two panels; axis labels: Storage Fraction (x0.001) and nRMS]
29
Contributions

A synthesis methodology to capture
Inter-mingling streams of requests
Exploiting correlations between request
attributes
An application of this methodology to TPC-H
Along the way (for TPC-H),
iid can capture arrival time characteristics
Strict sequentiality is not always the case

30
Backup slides
31
Validating arrival time synthesis
32
LocSTREAM

Use Pr(stream length) to generate stream lengths.
Use Pr(run length | stream length) to generate
run lengths for each stream length.
Generate the start location for each run:
Use Pr(inter-stream jump distance) to generate
the start location of the first run in the
stream.
Use Pr(intra-stream jump distance | this
stream) to generate the start locations of the
other runs in this stream.
Use Pr(interference distance) to interleave all
streams.