Title: Inspector Joins
1 Inspector Joins
- By Shimin Chen, Anastassia Ailamaki, Phillip B. Gibbons, and Todd C. Mowry - VLDB 2005
- Presented by Rammohan Narendula
2 Introduction
- Query execution is traditionally I/O bound, so most research concentrates on main memory
  - Goal: reduce the number of page faults and thus the number of disk I/Os
- Hash join is a special case: it becomes CPU bound given sufficient I/O bandwidth and advanced I/O techniques (I/O prefetching)
  - Goal: reduce the number of cache misses
3Exploiting Information about Data
- Ability to improve query depends on information
quality - General stats on relations are inadequate
- May lead to incorrect decisions for specific
queries - Especially true for join queries
- Previous approaches exploiting dynamic
information - Collecting information from previous queries
- Multi-query optimization Sellis88
- Materialized views Blakeley et al. 86
- Join indices Valduriez87
- Dynamic re-optimization of query plans
KabraDeWitt98 Markl et al. 04 - ?This study exploits the inner structure of hash
joins
4 Exploiting the Multi-Pass Structure of Hash Joins
- Idea
  - Examine the actual data during the I/O partitioning phase (inspection)
  - Extract useful information to improve the join phase
5 Using Extracted Information
- Enable a new join phase algorithm
  - Reduces the primary performance bottleneck in hash joins, i.e., poor CPU cache performance
  - Optimized for multi-processor systems
- Choose the most suitable join phase algorithm for special input cases
[Diagram: I/O Partitioning + Inspection produces Extracted Information, which feeds a Join Phase chosen among Simple Hash Join, Cache Partitioning, Cache Prefetching, and the New Algorithm]
6 Outline
- Motivation
- Previous hash join algorithms
- Hash join performance on SMP systems
- Inspector join
- Experimental results
- Conclusions
7 GRACE Hash Join
- I/O Partitioning Phase
  - Divide the input relations into partitions with a hash function
[Diagram: build and probe relations hashed into partitions]
- Join Phase (simple hash join)
  - Build a hash table on the build partition, then probe it with the probe partition
[Diagram: build partition's hash table probed by probe tuples]
- Random memory accesses cause poor CPU cache performance
- Over 70% of execution time is stalled on cache misses!
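A minimal C++ sketch of the two-phase structure above (our illustration, not the paper's code; the hash function, the Tuple layout, and the in-memory vectors are simplifying assumptions). The probe loop over each per-partition hash table is where the random accesses, and hence the cache stalls, arise:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tuple { uint32_t key; /* fixed-length payload omitted */ };

// GRACE-style two-phase hash join sketch.
// Phase 1: hash-partition both inputs so each build partition fits in memory.
// Phase 2: per partition, build a hash table and probe it.
void grace_hash_join(const std::vector<Tuple>& build,
                     const std::vector<Tuple>& probe,
                     size_t num_partitions,
                     std::vector<std::pair<Tuple, Tuple>>& out) {
    std::vector<std::vector<Tuple>> bpart(num_partitions), ppart(num_partitions);
    for (const Tuple& t : build) bpart[t.key % num_partitions].push_back(t);
    for (const Tuple& t : probe) ppart[t.key % num_partitions].push_back(t);

    for (size_t p = 0; p < num_partitions; p++) {
        std::unordered_multimap<uint32_t, Tuple> ht;   // build the hash table
        for (const Tuple& t : bpart[p]) ht.emplace(t.key, t);
        for (const Tuple& t : ppart[p]) {              // random accesses here
            auto range = ht.equal_range(t.key);        // cause the cache misses
            for (auto it = range.first; it != range.second; ++it)
                out.emplace_back(it->second, t);
        }
    }
}
```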
8 Cache Partitioning
- Recursively produce cache-sized partitions after I/O partitioning
- Avoids cache misses when joining the cache-sized partitions
- Overhead: the cost of re-partitioning the memory-sized partitions
[Diagram: memory-sized build and probe partitions re-partitioned into cache-sized pieces]
9 Cache Prefetching
- Reduce the impact of cache misses
  - Exploit available memory bandwidth
  - Overlap cache misses with computation
- Insert cache prefetch instructions into the code (see the sketch below)
- Still incurs the same number of cache misses
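A hedged sketch of the idea, using the GCC/Clang __builtin_prefetch intrinsic (the paper builds on the more elaborate group and software-pipelined prefetching schemes of [Chen et al. 04]; DIST and the bucket layout here are our assumptions). The cache line needed a few iterations ahead is requested early, so its miss overlaps with the current tuple's work; the total miss count is unchanged, only the stall time shrinks:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Probe loop with software prefetching. DIST is the prefetch distance:
// how many iterations ahead we issue the future bucket's cache miss.
uint64_t probe_with_prefetch(const std::vector<uint32_t>& probe_keys,
                             const std::vector<uint64_t>& buckets) {
    const size_t DIST = 8;                  // tuning knob, chosen for illustration
    const size_t n = buckets.size();
    uint64_t matches = 0;
    for (size_t i = 0; i < probe_keys.size(); i++) {
        if (i + DIST < probe_keys.size())   // issue the future miss early
            __builtin_prefetch(&buckets[probe_keys[i + DIST] % n]);
        matches += buckets[probe_keys[i] % n];  // current probe's join work
    }
    return matches;
}
```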
10 Outline
- Motivation
- Previous hash join algorithms
- Hash join performance on SMP systems
- Inspector join
- Experimental results
- Conclusions
11 Hash Joins on SMP Systems
- Previous studies mainly focus on uniprocessors
- Memory bandwidth is precious
  - It becomes the bottleneck in cache-prefetching techniques
- In the join phase, each processor joins a pair of partitions
12 Inspector Joins
- Extracted information: a summary of the matching relationships
  - Every K contiguous pages in a build partition form a sub-partition
  - The summary tells which sub-partition(s) every probe tuple matches
[Diagram: summary of matching relationships between a probe partition and the sub-partitions of a build partition]
13 Cache-Stationary Join Phase
- Recall the re-partitioning cost of cache partitioning: both build and probe tuples are copied into cache-sized partitions
[Diagram: copying cost for both the build partition and the probe partition]
- We want to achieve zero copying
14 Cache-Stationary Join Phase
- Join each sub-partition with its matching probe tuples (see the sketch below)
  - A sub-partition is small enough to fit in the CPU cache
- Cache prefetching covers the remaining cache misses
- Zero copying: no recursive cache-sized partitions are generated
[Diagram: build partition divided into sub-partitions 0-2, each joined with the probe tuples that match it]
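A sketch of the join loop this slide describes, under our own simplifications: std::unordered_multimap stands in for a cache-resident hash table, and probe_idx is the per-sub-partition list of matching probe tuple positions that inspection produces (built by the counting-sort pass sketched under slide 19). No tuple is ever re-copied into new cache-sized partitions:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tuple { uint32_t key; };   // as in the earlier sketch; payload omitted

// Cache-stationary join: each cache-sized sub-partition is joined in place
// with only the probe tuples that inspection mapped to it.
void cache_stationary_join(const std::vector<std::vector<Tuple>>& subparts,
                           const std::vector<Tuple>& probe_part,
                           const std::vector<std::vector<uint32_t>>& probe_idx,
                           std::vector<std::pair<Tuple, Tuple>>& out) {
    for (size_t s = 0; s < subparts.size(); s++) {
        std::unordered_multimap<uint32_t, Tuple> ht;   // fits in the CPU cache
        for (const Tuple& t : subparts[s]) ht.emplace(t.key, t);
        for (uint32_t i : probe_idx[s]) {              // matching probes only
            auto r = ht.equal_range(probe_part[i].key);
            for (auto it = r.first; it != r.second; ++it)
                out.emplace_back(it->second, probe_part[i]);
        }
    }
}
```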
15 Filters in I/O Partitioning
- How can the summary be extracted efficiently?
- Extend the filter scheme used in commercial hash joins
- Conventional single-filter scheme
  - Represents all build join keys
  - Filters out probe tuples that have no matches
[Diagram: the filter is constructed while partitioning the build relation into memory-sized partitions, and tested while partitioning the probe relation]
16 Background: Bloom Filter
- A bit vector
- A key is hashed d times (e.g., d = 3) and represented by d bits
- Construct: for every build join key, set its 3 bits in the vector
- Test: given a probe join key, check whether all its 3 bits are 1
  - Discard the tuple if some bits are 0
  - May have false positives
[Diagram: Bit0 = H0(key), Bit1 = H1(key), Bit2 = H2(key) in the filter]
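A minimal Bloom filter sketch matching this description, with d = 3 (the hash functions and the bit-vector size are illustrative assumptions, not the paper's choices):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct BloomFilter {
    size_t nbits;                 // number of bit positions in the vector
    std::vector<uint8_t> bits;    // the bit vector, packed 8 bits per byte

    explicit BloomFilter(size_t n) : nbits(n), bits((n + 7) / 8, 0) {}

    // Cheap multiplicative hash; real systems use stronger functions.
    size_t h(uint32_t key, uint32_t seed) const {
        return (key * 2654435761u ^ seed * 40503u) % nbits;
    }

    void construct(uint32_t key) {          // set the key's 3 bits
        for (uint32_t i = 0; i < 3; i++) {
            size_t p = h(key, i);
            bits[p / 8] |= uint8_t(1u << (p % 8));
        }
    }

    bool test(uint32_t key) const {         // are all 3 bits set?
        for (uint32_t i = 0; i < 3; i++) {
            size_t p = h(key, i);
            if (!(bits[p / 8] & (1u << (p % 8))))
                return false;               // a 0 bit: definitely no match
        }
        return true;                        // all 1s: match (maybe a false positive)
    }
};
```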
17 Multi-Filter Scheme
- Single filter a probe tuple ? entire build
relation - Our goal a probe tuple ? sub-partitions
- Construct a filter for every sub-partition
- Replace a single large filter with multiple small
filters
Single Filter
Multi-Filter
Partition 0
Partition 1
Build Relation
Partition 2
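The sketch referenced above, reusing the BloomFilter and Tuple types from the earlier sketches (tuples_per_subpart, i.e., K pages' worth of tuples, and the per-filter size are hypothetical parameters): while build tuples are appended to a partition, every run of K pages starts a new sub-partition with its own small filter.

```cpp
#include <cstddef>
#include <vector>

// Build one small Bloom filter per sub-partition while writing build tuples
// into a partition. Every tuples_per_subpart tuples open a new sub-partition.
std::vector<BloomFilter> build_multi_filters(const std::vector<Tuple>& part,
                                             size_t tuples_per_subpart,
                                             size_t bits_per_filter) {
    std::vector<BloomFilter> filters;       // one per sub-partition
    for (size_t i = 0; i < part.size(); i++) {
        if (i % tuples_per_subpart == 0)            // start a new sub-partition
            filters.emplace_back(bits_per_filter);  // with a fresh small filter
        filters.back().construct(part[i].key);      // key goes into its filter
    }
    return filters;
}
```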
18 Testing Multi-Filters
- When partitioning the probe relation
  - Test a probe tuple against all the filters of a partition
  - This tells which sub-partition(s) the tuple may match
- Store the summary of matching relationships in the intermediate partitions
- This information is later used to extract the probe tuples in sub-partition-ID order; a special array is constructed with the counting sort technique for this purpose
[Diagram: a probe tuple tested against the multi-filters of partitions 0-2]
19 Testing Multi-Filters (contd.)
- Extract the probe tuple information for every sub-partition using counting sort (sketched below)
  - One array per sub-partition; its size is the number of matching probe tuples for that sub-partition
  - The tuples themselves are never visited or copied during the counting sort
- Then join each pair of build and probe sub-partitions
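A sketch of that counting-sort pass (our reconstruction; for simplicity it assumes each probe tuple is mapped to a single sub-partition, whereas the paper's summary can map a tuple to several). Two passes over the per-tuple sub-partition IDs first size and then fill one index array per sub-partition, without ever touching the tuples themselves:

```cpp
#include <cstdint>
#include <vector>

// Given each probe tuple's matching sub-partition id, build one array per
// sub-partition holding the positions of its matching probe tuples.
// Only indices move; the tuples are never visited or copied.
std::vector<std::vector<uint32_t>>
group_probes(const std::vector<uint16_t>& subpart_of_probe, size_t S) {
    std::vector<uint32_t> count(S, 0);
    for (uint16_t s : subpart_of_probe) count[s]++;        // pass 1: count
    std::vector<std::vector<uint32_t>> idx(S);
    for (size_t s = 0; s < S; s++) idx[s].reserve(count[s]);
    for (uint32_t i = 0; i < subpart_of_probe.size(); i++) // pass 2: scatter
        idx[subpart_of_probe[i]].push_back(i);             // indices only
    return idx;
}
```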
20 Minimizing Cache Misses for Testing Filters
- Single-filter scheme
  - Compute 3 bit positions
  - Test 3 bits
- Multi-filter scheme, with S sub-partitions in a partition
  - Compute 3 bit positions
  - Test the same 3 bit positions in every filter, altogether 3S bits
  - May cause 3S cache misses!
21 Vertical Filters for Testing
- Bits at the same position are contiguous in memory
- 3 cache misses instead of 3S cache misses! (see the sketch below)
[Diagram: the S filters of each partition stored vertically, so the tested bit position is contiguous in memory across partitions 0-2]
- Horizontal-to-vertical conversion happens after partitioning the build relation
  - Very small overhead in practice
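A sketch of the vertical, i.e., bit-sliced, layout (assuming S <= 64 so that one bit position across all sub-partition filters fits in a single uint64_t; that packing is our simplification): slice[p] holds bit p of every filter side by side, so testing a probe key's 3 positions reads 3 contiguous words instead of 3 bits in each of S scattered filters.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct VerticalFilters {
    std::vector<uint64_t> slice;      // slice[p]: bit p of filters 0..S-1

    explicit VerticalFilters(size_t bits_per_filter)
        : slice(bits_per_filter, 0) {}

    void construct(size_t pos, unsigned subpart) {   // set one filter's bit
        slice[pos] |= (uint64_t(1) << subpart);
    }

    // Bitmask of sub-partitions whose 3 bits are all set: the candidate
    // sub-partitions this probe tuple may match.
    uint64_t test(size_t p0, size_t p1, size_t p2) const {
        return slice[p0] & slice[p1] & slice[p2];    // 3 contiguous loads
    }
};
```

The returned bitmask directly encodes the candidate sub-partitions, which is exactly the summary information the inspection phase stores with the probe partition.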
22 Outline
- Motivation
- Previous hash join algorithms
- Hash join performance on SMP systems
- Inspector join
- Experimental results
- Conclusions
23 Experimental Setup
- Relation schema: a 4-byte join attribute and a fixed-length payload
- No selection, no projection
- 50MB of memory per CPU available for the join phase
- The same join algorithm runs on every CPU, joining different partitions
- Detailed cycle-by-cycle simulations
  - A shared-bus SMP system with 1.5GHz processors
  - Memory hierarchy modeled on the Itanium 2 processor
24 Partition Phase Wall-Clock Time
- Setup: 500MB joins 2GB; 100-byte tuples with 4-byte keys; 50% of probe tuples have no matches; each build tuple matches 2 probe tuples; x-axis is the number of CPUs used
- I/O partitioning can take advantage of multiple CPUs
  - Cut the input relations into equal-sized chunks
  - Partition one chunk on every CPU
  - Concatenate the outputs from all CPUs
- Enhanced cache partitioning = cache partitioning + advanced prefetching
- Inspection incurs very small overhead
  - Execution-time ratio relative to the best algorithm: 0.88 to 0.94
  - Mainly the computation cost of converting horizontal filters to vertical ones and of testing them
25 Join Phase Aggregate Time
- Setup: 500MB joins 2GB; 100-byte tuples with 4-byte keys; 50% of probe tuples have no matches; each build tuple matches 2 probe tuples; x-axis is the number of CPUs used
- Inspector join achieves significantly better performance when 8 or more CPUs are used
  - Because of the locality optimization plus cache prefetching
  - 1.7-2.1X speedups over cache prefetching
    - Memory bandwidth becomes the bottleneck as more processors are used
  - 1.6-2.0X speedups over enhanced cache partitioning
26 Results on Choosing a Suitable Join Phase
[Diagram: I/O Partitioning + Inspection produces Extracted Info, which feeds a Join Phase chosen among Simple Hash Join, Cache Partitioning, Cache Prefetching, and Cache-Stationary]
- Case 1: a large number of duplicate build join keys
  - Choose enhanced cache partitioning when a probe tuple on average matches 4 or more sub-partitions
- Case 2: nearly sorted input relations
  - Surprisingly, cache-stationary join is very good
27 Conclusions
- Exploit the multi-pass structure for higher-quality information about the data
- Achieve significantly better cache performance
  - 1.6X speedups over previous cache-friendly algorithms when 8 or more CPUs are used
- Choose the most suitable algorithm for special input cases
- The idea may be applicable to other multi-pass algorithms
29 Previous Algorithms on SMP Systems
[Charts: join phase wall-clock time; aggregate time on all CPUs]
- Join phase performance of joining a 500MB relation with a 2GB relation (details later in the talk)
- Aggregate performance degrades dramatically beyond 4 CPUs
- Conclusion: reduce data movement (memory to memory, memory to cache)
30 More Details in the Paper
- Moderate memory space requirement for filters
- Summary information representation in intermediate partitions
- Preprocessing for the cache-stationary join phase
- Prefetching for improving efficiency and robustness
33 CPU-Cache-Friendly Hash Joins
- Recent studies focus on CPU cache performance
  - I/O partitioning gives good I/O performance
  - But random memory accesses cause poor CPU cache performance
- Cache partitioning [Shatdal et al. 94] [Boncz et al. 99] [Manegold et al. 00]
  - Recursively produce cache-sized partitions from memory-sized partitions
  - Avoid cache misses during the join phase
  - Pay the re-partitioning cost
- Cache prefetching [Chen et al. 04]
  - Exploit memory system parallelism
  - Use prefetches to overlap multiple cache misses with computation
34 Example Special Input Cases
- Example case 1: a large number of duplicate build join keys
  - A probe tuple must be checked against all possible sub-partitions
  - Count the average number of sub-partitions a probe tuple matches
  - If it is too large, cache-stationary join works poorly
- Example case 2: nearly sorted input relations
  - A merge-based join phase might be better?
[Diagram: a probe tuple matching several sub-partitions (0-2) of a build partition]
35 Varying the Number of Duplicates per Build Join Key
- Join phase aggregate performance
- Choose enhanced cache partitioning when a probe tuple on average matches 4 or more sub-partitions
36 Nearly Sorted Cases
- Sort both input relations, then randomly move 0-5% of the tuples
- Join phase aggregate performance
- Surprisingly, cache-stationary join is very good
  - Even better than merge join when over 1% of the tuples are out of order
37 Analyzing the Nearly Sorted Case
- The partitions are also nearly sorted
- The probe tuples matching a sub-partition are almost contiguous
  - Similar memory behavior to merge join
  - No cost for sorting the out-of-order tuples
[Diagram: a nearly sorted probe partition whose tuples match the nearly sorted sub-partitions 0-2 of the build partition]