Title: The DataIndex
1The DataIndex
- A novel paradigm for data storage and retrieval
- Both a storage and an access structure
- Indexing comes for free
- Based on, and extends the notions of vertical
partitioning and transposed files - Two kinds presented here
- Basic DataIndex (BDI)
- Join DataIndex (JDI)
2Related Work
- Variant Indexes ONeil Quass, 97
- B-tree (all systems)
- RID-list for each search-key value
- Bitmapped (almost all systems)
- Bit-vectors (usually compressed) instead of
RID-lists - Projection (Sybase IQ)
- Mirror copy of column
- Bit-sliced (Sybase IQ)
- bit-level projection index
- Join Indexes Valduriez et. al., 86
- Bitmapped-Join Indexes ONeil Graefe 95
(Informix)
- Limitations
- These structures are maintained in addition to
the base table. - Query response times are unacceptable in
interactive contexts.
3The Basic DataIndex (BDI)
- Projection-like Index with matching column
removed from the table - Can have multiple columns (e.g., no TPC-D query
asks for ExtPrice or Discount alone) - For this presentation, we assume single-column
BDIs
Base Table
CustKey
Qty
Discount
ExtPrice
CK1
Q1
D1
E1
CK2
Q2
D2
E2
CK3
Q3
D3
E3
CK4
Q4
D4
E4
BDI
BDI
BDI
Discount
ExtPrice
Qty
D1
E1
Q1
D2
E2
Q2
D3
E3
Q3
D4
E4
Q4
4The Join DataIndex (JDI)
- JDI is BDI of RIDs to foreign table.
Base Fact Table
Base Dimension Table
Tax
Name
Address
Discount
ExtPrice
CustKey
CustKey
T1
N1
A1
D1
E1
CK1
CK1
T2
N2
A2
D2
E2
CK2
CK2
T3
N3
A3
D3
E3
CK3
CK3
T4
D4
E4
CK3
JDI
BDI
BDI
BDI
BDI
BDI
Tax
Name
Address
Discount
ExtPrice
CustKey
T1
N1
A1
D1
E1
CK1
T2
N2
A2
D2
E2
CK2
T3
N3
A3
D3
E3
CK3
T4
D4
E4
- Joins can be processed efficiently
5Maintaining Logical Records
- Order of records is conserved in DataIndexes
- A simple arithmetic mapping is used to associate
fields of a record - Records in each vertical partition can easily be
mapped to blocks and vice-versa - RID(Block ID, Slot Number within that Block)
- (Block ID, Slot Number) to Position
- Position to (Block ID, Slot Number)
6Query Processing with DataIndexes
- Two common classes of queries in data
warehousing - Range queries
- Star join queries
- Example range query
- SELECT CustKey FROM SALES
- WHERE Qtygt10
- Apply restrictions to form rowset(s)
- Load display BDI(s) into memory
7Star Join Queries
- A fact table is joined with a set of dimension
tables - SELECT Column-list FROM FactTable,
DimensionTables WHERE SelectionPredicates AND
JoinPredicates - JoinPredicates Fact.Attr1 Dimension.Attr2
- General Technique Used to Evaluate
- 1. Apply SelectionPredicates on individual
tables. - 2. Perform Join on restricted set of rows or
rowsets.
8Evaluating Star Joins Using DataIndexes
- Propose 2 efficient algorithms
- 1. Star Join with Large memory (SJL)
- 2. Star Join with Small memory (SJS)
- Has negligible memory requirements
- Less efficient than SJL
9The SJL Algorithm
- Input
- set of dimension tables participating in join
- set of dimension table display columns
- set of fact table display columns
- set of rowsets, one for each dimension table and
one for fact table (RF) - Steps
- Load all dimension display column BDIs into
memory - Scan RF
- For each JDI
- If bit not set in corresponding element of
dimension rowset - Read next row of RF
- Else create output
- Use JDI to access dimension display columns
- Use ordinal position to access fact table display
columns
10Example Star Schema
- Based on TPC-D(Scale Factor 1)
- 4 Dimension Tables
- PART
- SUPPLIER
- CUSTOMER
- TIME
- 1 Fact Table
- SALES
PART
CUSTOMER
PartKey 4 Name 55 Mfgr 25 Brand 10 Type
25 Size 4 Others... 41 164
CustKey 4 Name 25 Address
40 Nation 25 Region 25 Phone
15 AcctBal 8 MktSegment 10 Comment 117
269
SALES
PartKey 4 SuppKey 4 CustKey
4 Quantity 8 ExtPrice 8 Discount
8 Tax 8 RetFlag 1 Status
1 ShipDate 2 CommitDate 2 ReceiptDate
2 ShipInstruct 25 ShipMode 10 Comment
44 137
200,000
150,000
SUPPLIER
SuppKey 4 Name 25 Address 40 Nation
25 Region 25 Phone 15 AcctBal 8 Comment
101 243
TIME
TimeKey 2 Alpha 10 Year 4 Month 4 Week
4 Day 4 28
6,000,000
2,557
10,000
11SJL Algorithm Example
Sample Query SELECT Mfgr, AcctBal, Quantity,
ExtPriceFROM SALES S, PART P, SUPPLIER U WHERE
S.PartKeyP.PartKey AND U.SuppKeyP.SuppKey AND
Sizelt100 AND RetFlag1 AND NationUnited
States D PART, SUPPLIER CD Mfgr,
AcctBal CF Quantity, ExtPrice
Step 0 Perform all selections on single
tables (Sizelt100 AND RetFlag0 ) Create
corresponding Rowsets R RPART, RSUPP, RSALES
12SJL Algorithm Example (2)
Step 1 (1-2) Load Mfgr AcctBal BDIs into
Memory.
Step 2a (3-6) Scan RSALES For each
record Check PartKey JDI against RPART
Check SuppKey JDI against RSUPP
13SJL Algorithm Example (3)
Step 2b (7-8) Access in-memory BDIs for each
matching record
Step 2c (9-10) Access Fact Table BDIs from
disk for each matching record
Step 3 Output each record
Output
14About SJL
- Advantages
- Accesses each fact table block only once, one
block at a time. - Accesses each dimension table block only once.
- Accesses only relevant columns (and JDIs).
- Memory requirements dependent only on size of
displayed dimension BDIs - independent of fact table size
- Time complexity O (F)
- Disadvantage
- May still require significant amounts of memory
in some cases (extremely large dimension tables). - We thus propose SJS
15The SJS Algorithm
- Input same as SJL
- 4 Phases
- 1. RF restriction Restricts RF to rows
appearing in join result. - Scan RF
- For each JDI
- If bit not set in corresponding element of
dimension rowset - Clear bit in RF
- 2. JDI restriction Restricts JDIs to rows
appearing in join result. - Scan RF
- For each JDI
- If bit set in corresponding RF row
- Write JDI element to restricted JDI (JDIR ) on
disk
16The SJS Algorithm (2)
- 3. Output BDI Creation Creates output BDI for
dimension display columns. - For each dimension display column BDI
- Load a portion of BDI into memory (as much as can
fit) - Scan JDIR
- Write matching entries to output BDI (in JDIR
order) - Repeat until entire BDI processed
- 4. Final Output Merge Merges dimension and fact
table display columns. - Scan RF
- Use ordinal position to access dimension display
columns from output BDI - Use ordinal position to access fact table display
columns
17About SJS
- Used when dimensional BDIs do not fit in memory
- JDI scanned multiple times, but (large) BDI
scanned only once. - Time complexity O(D F) (D size of BDI)
- Smaller than O(F2) that can occur with hashing.
- Most often affects only one or a few columns.
18Comparative Analysis
- Analysis of star-join query cost for
bitmapped-join index (BJI) and DataIndex (SJL
SJS) approaches - Comparison of star-join performance for
- best case performance of BJI
- worst case performance of SJL SJS
- Metric number of disk accesses
- Query
SELECT U.Name, S.ExtPrice FROM SALES S, TIME T,
CUSTOMER C, SUPPLIER U WHERE T.Year BETWEEN 1996
AND 1998 AND U.NationUnited States AND
C.NationUnited States AND S.ShipDateT.TimeKey
AND S.CustKeyC.CustKey AND S.SuppKeyU.SuppKey
19Selected Baseline Parameter Settings
- Selectivity on fact table is 1
- Selectivity on each dimension table is 5
- Number of distinct search key values in a range
selection is 2 - Compression level is 20
- Size of warehouse varies from 86 MB to 860 GB
- Size of
- Data Block 8,192 bytes
- RID 6 bytes
- Pointer to data block 4 bytes
20Baseline Performance
Query Evaluation Cost, N
Scale Factor
21Memory Requirements for SJL BJI
Memory Requirements (MB)
Scale Factor
22DataIndex Implementation
- We have implemented the DataIndex strategy
- Written in C
- Platforms supported Solaris, Linux, HP-UX, DEC,
NT - Performance evaluation on NT platform
- Comparison with Oracle, Red Brick, and DB2 in
terms of query processing, storage, and loading
costs - Minimal indexing scheme used for commercial
systems - Platform Windows NT, 300 MHz Pentium, 64 MB RAM
- Much larger tests run on various platforms
23Schema Used in Analysis
Table of Records PURCHASE 5M 14M
22M CUSTOMER 10K 20K 40K PRODUCT 100K 200K 400K T
IME 2.5K 5K 10K
24Query Processing Tests
Characteristics
Query
Find products having high sales volumes.
2-way join, 2 restrictions
Find elderly customers who purchased large
quantities of a given range of products and the
month of purchase.
4-way join, 2 restrictions
Find elderly customers who purchased large
quantities of a given range of products. List the
total quantity purchased by customer, product,
and month.
4-way join, 3 restrictions, aggregation with 3
GROUP BY columns
25Query Performance 2-Way Join
Response Time (seconds)
Raw Data Size (GB)
26Query Performance 4-Way Join, Aggregation
Response Time (seconds)
Raw Data Size (GB)
27Storage Requirements
Indexed Data Size (GB)
Raw Data Size (GB)
28Loading Times
Load Time (seconds)
Raw Data Size (GB)
29Other Advantages of DataIndexes
- Compression
- Small range of values yields high compressibility
- Algorithms exist for scanning compressed data
- Bulk Update (Warehouse Loads)
- No need to update indexes
- Buffer Utilization
- Columns that are accessed frequently may be
pinned in memory