TPCHDS - PowerPoint PPT Presentation

1 / 25
About This Presentation
Title:

TPCHDS

Description:

TPC-meeting Dana Point, 2005. 2. TPC-DS. Brief overview of the highlights of ... TPC-meeting Dana Point, 2005. 8. Data Set. Size is defined in scale-factors ... – PowerPoint PPT presentation

Number of Views:40
Avg rating:3.0/5.0
Slides: 26
Provided by: meikel
Category:
Tags: tpchds | dana

less

Transcript and Presenter's Notes

Title: TPCHDS


1
TPC-H/DS
  • TPC-DS Subcommittee Reports
  • Meikel Poess

2
TPC-DS
  • Brief overview of the highlights of TPC-DS
  • Schema
  • Dataset
  • Queries
  • Data Maintenance
  • Execution
  • Positioning of TPC-DS to TPC-H

3
(No Transcript)
4
Requirements
  • Skewed Data
  • Many queries in SQL standard
  • Allowance for auxiliary data structures
  • Server centric ETL process
  • Robust benchmark

5
Business Model
  • Retailer with
  • Store,Catalog and Web sales
  • Inventory
  • Detailed Customer Information
  • Customer Demographics
  • Household Demographics
  • Address etc.
  • Detailed Item Information

6
Schema TPC-DS Fact Tables
Store Returns
Store Sales
  • 3 sales channels Catalog - Web - Store
  • 7 fact tables
  • 2 fact tables for each sales channel
  • 24 tables total
  • Basic (TPC-H like) auxiliary data structure are
    allowed on all tables
  • Aggregation and fancy auxiliary data structures
    are only allowed on Catalog Sales and Catalog
    Returns

7
Schema Store Channel w/ Dimensions
store_sales
8
Data Set
  • Size is defined in scale-factors
  • Scale factor indicates raw data size in GB
  • 1, 100, 300, 1000, 3000, 10000, 30000, 100000
  • Row counts for tables scale realistically
  • Fact tables grow linearly
  • Dimension tables grow sub-linearly

9
Data Set
  • Some data has real world content
  • Names, address, etc.
  • Data is skewed
  • E.g. more sales in December than in June, or,
    more green items than red
  • ? imposes challenges on parameter substitution in
    queries

10
Table Sizes at 100GB
11
SS_SOLD_DATE Distribution
14 of all sales happen between January and July
58 of all sales happen in November and December
Group 3
28 of all sales happen between August and
October
Group 2
Group 1
12
I_COLOR
8 of all colors are in Group 1
24 of all colors are in group 2
68 of all colors are in Group 3
Group 3
Group 2
Group 1
13
Data Maintenance Model
  • (Extraction) - Transformation - Load

Normalized Data
Star Schema
TransformationSlowly Changing Dimensions
flatfile 1
L
DW
T
flatfile 2
...
RDBMS
flatfile n
14
Query Model
  • Query Language SQL99 OLAP extensions
  • variants are allowed, but need to be approved
  • Four types of queries
  • reporting
  • ad hoc
  • Iterative (drill down, up etc.)
  • data mining queries (large amount of output)
  • 99 different queries per user
  • Large number of queries are challenging for
    running a benchmark because
  • System is harder to optimize
  • It reduces over-optimization (auxiliary data
    structures)
  • Tests more features of todays RDBMS
  • ? Extends Benchmark life time

15
Benchmark Execution
Setup and initialization of the disk
subsystem Creation of Test Data
(dbgen) Initialization of DBMS
Bulk load of test data Index creation Constraint
validation Statistic gathering
  • 5 Phases (timed and untimed)
  • Sequence is

DBMS, Disk Setup flat-file creation
TPC-DS Creation
Query Run 1
Data Maintenance
Query Run 2
untimed
Load Test
Query/Refresh Test
16
Queries Per Hour
  • Queries per Hour
  • Where
  • TQueryRun1 total elapsed time to complete query
    run 1.
  • TQueryRun2 total elapsed time to complete query
    run 2.
  • TDataMaintenance total elapsed time to complete
    the data maintenance run.
  • TLoad total elapsed time to complete the
    database load test.
  • S Number of streams
  • SF Scale Factor

198
17
Main Metrics
  • Price Performance
  • Where
  • P price of the Priced Configuration as defined
    in the specification
  • Qph-DS_at_SF reported performance metric as
    reported

18
Main Metrics
  • System Availability Date

19
Secondary Metrics
  • Load time _at_ SF, as defined in Clause
  • Refresh time _at_ SF, as defined in Clause
  • Query Elapsed time _at_ SF, as defined in Clause

20
Schema
21
Data Set
SF300004.5 Billion customers 6 Billion parts
SF3000080 Million customers 217000 parts
22
Queries
Achieved by limiting database technology
Reporting, Ad-Hoc, iterative and data mining
23
Data Maintenance
24
Execution
25
Miscellaneous
Write a Comment
User Comments (0)
About PowerShow.com