Title: TPCHDS
1TPC-H/DS
- TPC-DS Subcommittee Reports
- Meikel Poess
2TPC-DS
- Brief overview of the highlights of TPC-DS
- Schema
- Dataset
- Queries
- Data Maintenance
- Execution
- Positioning of TPC-DS to TPC-H
3(No Transcript)
4Requirements
- Skewed Data
- Many queries in SQL standard
- Allowance for auxiliary data structures
- Server centric ETL process
- Robust benchmark
5Business Model
- Retailer with
- Store,Catalog and Web sales
- Inventory
- Detailed Customer Information
- Customer Demographics
- Household Demographics
- Address etc.
- Detailed Item Information
6Schema TPC-DS Fact Tables
Store Returns
Store Sales
- 3 sales channels Catalog - Web - Store
- 7 fact tables
- 2 fact tables for each sales channel
- 24 tables total
- Basic (TPC-H like) auxiliary data structure are
allowed on all tables - Aggregation and fancy auxiliary data structures
are only allowed on Catalog Sales and Catalog
Returns
7Schema Store Channel w/ Dimensions
store_sales
8Data Set
- Size is defined in scale-factors
- Scale factor indicates raw data size in GB
- 1, 100, 300, 1000, 3000, 10000, 30000, 100000
- Row counts for tables scale realistically
- Fact tables grow linearly
- Dimension tables grow sub-linearly
9Data Set
- Some data has real world content
- Names, address, etc.
- Data is skewed
- E.g. more sales in December than in June, or,
more green items than red - ? imposes challenges on parameter substitution in
queries
10Table Sizes at 100GB
11SS_SOLD_DATE Distribution
14 of all sales happen between January and July
58 of all sales happen in November and December
Group 3
28 of all sales happen between August and
October
Group 2
Group 1
12I_COLOR
8 of all colors are in Group 1
24 of all colors are in group 2
68 of all colors are in Group 3
Group 3
Group 2
Group 1
13Data Maintenance Model
- (Extraction) - Transformation - Load
Normalized Data
Star Schema
TransformationSlowly Changing Dimensions
flatfile 1
L
DW
T
flatfile 2
...
RDBMS
flatfile n
14Query Model
- Query Language SQL99 OLAP extensions
- variants are allowed, but need to be approved
- Four types of queries
- reporting
- ad hoc
- Iterative (drill down, up etc.)
- data mining queries (large amount of output)
- 99 different queries per user
- Large number of queries are challenging for
running a benchmark because - System is harder to optimize
- It reduces over-optimization (auxiliary data
structures) - Tests more features of todays RDBMS
- ? Extends Benchmark life time
15Benchmark Execution
Setup and initialization of the disk
subsystem Creation of Test Data
(dbgen) Initialization of DBMS
Bulk load of test data Index creation Constraint
validation Statistic gathering
- 5 Phases (timed and untimed)
- Sequence is
DBMS, Disk Setup flat-file creation
TPC-DS Creation
Query Run 1
Data Maintenance
Query Run 2
untimed
Load Test
Query/Refresh Test
16Queries Per Hour
- Queries per Hour
- Where
- TQueryRun1 total elapsed time to complete query
run 1. - TQueryRun2 total elapsed time to complete query
run 2. - TDataMaintenance total elapsed time to complete
the data maintenance run. - TLoad total elapsed time to complete the
database load test. - S Number of streams
- SF Scale Factor
198
17Main Metrics
- Price Performance
- Where
- P price of the Priced Configuration as defined
in the specification - Qph-DS_at_SF reported performance metric as
reported
18Main Metrics
19Secondary Metrics
- Load time _at_ SF, as defined in Clause
- Refresh time _at_ SF, as defined in Clause
- Query Elapsed time _at_ SF, as defined in Clause
20Schema
21Data Set
SF300004.5 Billion customers 6 Billion parts
SF3000080 Million customers 217000 parts
22Queries
Achieved by limiting database technology
Reporting, Ad-Hoc, iterative and data mining
23Data Maintenance
24Execution
25Miscellaneous