Title: Inspector Joins
1 Inspector Joins
- By Shimin Chen, Anastassia Ailamaki, Phillip B. Gibbons, and Todd C. Mowry - VLDB 2005
- Presented by Rammohan Narendula
2 Introduction
- Query execution is traditionally I/O bound, so most research concentrates on main memory
  - Goal: reduce the number of page faults and thus the number of disk I/Os
- Hash join is a special case: it becomes CPU bound given sufficient I/O bandwidth and advanced I/O techniques (I/O prefetching)
  - Goal: reduce the number of cache misses
3Exploiting Information about Data
- Ability to improve query depends on information
quality - General stats on relations are inadequate
- May lead to incorrect decisions for specific
queries - Especially true for join queries
- Previous approaches exploiting dynamic
information - Collecting information from previous queries
- Multi-query optimization Sellis88
- Materialized views Blakeley et al. 86
- Join indices Valduriez87
- Dynamic re-optimization of query plans
KabraDeWitt98 Markl et al. 04 - ?This study exploits the inner structure of hash
joins
4 Exploiting the Multi-Pass Structure of Hash Joins
- Idea
  - Examine the actual data during the I/O partitioning phase (inspection)
  - Extract useful information to improve the join phase
5 Using Extracted Information
- Enable a new join phase algorithm
  - Reduces the primary performance bottleneck in hash joins, i.e., poor CPU cache performance
  - Optimized for multi-processor systems
- Choose the most suitable join phase algorithm for special input cases
[Diagram: I/O Partitioning + Inspection produces Extracted Information, which feeds a Join Phase chosen among Simple Hash Join, Cache Partitioning, Cache Prefetching, and the New Algorithm]
6 Outline
- Motivation
- Previous hash join algorithms
- Hash join performance on SMP systems
- Inspector join
- Experimental results
- Conclusions
7 GRACE Hash Join
- I/O Partitioning Phase
  - Divide the input relations into partitions with a hash function
[Diagram: build and probe relations hashed into partitions]
- Join Phase (simple hash join)
  - Build a hash table on the build partition, then probe it with the probe partition
[Diagram: build partition's hash table probed by probe tuples]
- Random memory accesses cause poor CPU cache performance
- Over 70% of execution time is stalled on cache misses!
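A minimal C++ sketch of the two-phase structure above (our illustration, not the paper's code; the hash function, the Tuple layout, and the in-memory vectors are simplifying assumptions). The probe loop over each per-partition hash table is where the random accesses, and hence the cache stalls, arise:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tuple { uint32_t key; /* fixed-length payload omitted */ };

// GRACE-style two-phase hash join sketch.
// Phase 1: hash-partition both inputs so each build partition fits in memory.
// Phase 2: per partition, build a hash table and probe it.
void grace_hash_join(const std::vector<Tuple>& build,
                     const std::vector<Tuple>& probe,
                     size_t num_partitions,
                     std::vector<std::pair<Tuple, Tuple>>& out) {
    std::vector<std::vector<Tuple>> bpart(num_partitions), ppart(num_partitions);
    for (const Tuple& t : build) bpart[t.key % num_partitions].push_back(t);
    for (const Tuple& t : probe) ppart[t.key % num_partitions].push_back(t);

    for (size_t p = 0; p < num_partitions; p++) {
        std::unordered_multimap<uint32_t, Tuple> ht;   // build the hash table
        for (const Tuple& t : bpart[p]) ht.emplace(t.key, t);
        for (const Tuple& t : ppart[p]) {              // random accesses here
            auto range = ht.equal_range(t.key);        // cause the cache misses
            for (auto it = range.first; it != range.second; ++it)
                out.emplace_back(it->second, t);
        }
    }
}
```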
8 Cache Partitioning
- Recursively produce cache-sized partitions after I/O partitioning
- Avoids cache misses when joining the cache-sized partitions
- Overhead: the cost of re-partitioning the memory-sized partitions
[Diagram: memory-sized build and probe partitions re-partitioned into cache-sized pieces]
9 Cache Prefetching
- Reduce the impact of cache misses
  - Exploit available memory bandwidth
  - Overlap cache misses with computation
- Insert cache prefetch instructions into the code (see the sketch below)
- Still incurs the same number of cache misses
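A hedged sketch of the idea, using the GCC/Clang __builtin_prefetch intrinsic (the paper builds on the more elaborate group and software-pipelined prefetching schemes of [Chen et al. 04]; DIST and the bucket layout here are our assumptions). The cache line needed a few iterations ahead is requested early, so its miss overlaps with the current tuple's work; the total miss count is unchanged, only the stall time shrinks:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Probe loop with software prefetching. DIST is the prefetch distance:
// how many iterations ahead we issue the future bucket's cache miss.
uint64_t probe_with_prefetch(const std::vector<uint32_t>& probe_keys,
                             const std::vector<uint64_t>& buckets) {
    const size_t DIST = 8;                  // tuning knob, chosen for illustration
    const size_t n = buckets.size();
    uint64_t matches = 0;
    for (size_t i = 0; i < probe_keys.size(); i++) {
        if (i + DIST < probe_keys.size())   // issue the future miss early
            __builtin_prefetch(&buckets[probe_keys[i + DIST] % n]);
        matches += buckets[probe_keys[i] % n];  // current probe's join work
    }
    return matches;
}
```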
10 Outline
- Motivation
- Previous hash join algorithms
- Hash join performance on SMP systems
- Inspector join
- Experimental results
- Conclusions
11 Hash Joins on SMP Systems
- Previous studies mainly focus on uniprocessors
- Memory bandwidth is precious
  - It becomes the bottleneck in cache-prefetching techniques
- In the join phase, each processor joins a pair of partitions
12 Inspector Joins
- Extracted information: a summary of the matching relationships
  - Every K contiguous pages in a build partition form a sub-partition
  - The summary tells which sub-partition(s) every probe tuple matches
[Diagram: summary of matching relationships between a probe partition and the sub-partitions of a build partition]
13 Cache-Stationary Join Phase
- Recall the re-partitioning cost of cache partitioning: both build and probe tuples are copied into cache-sized partitions
[Diagram: copying cost for both the build partition and the probe partition]
- We want to achieve zero copying
14 Cache-Stationary Join Phase
- Join each sub-partition with its matching probe tuples (see the sketch below)
  - A sub-partition is small enough to fit in the CPU cache
- Cache prefetching covers the remaining cache misses
- Zero copying: no recursive cache-sized partitions are generated
[Diagram: build partition divided into sub-partitions 0-2, each joined with the probe tuples that match it]
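A sketch of the join loop this slide describes, under our own simplifications: std::unordered_multimap stands in for a cache-resident hash table, and probe_idx is the per-sub-partition list of matching probe tuple positions that inspection produces (built by the counting-sort pass sketched under slide 19). No tuple is ever re-copied into new cache-sized partitions:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tuple { uint32_t key; };   // as in the earlier sketch; payload omitted

// Cache-stationary join: each cache-sized sub-partition is joined in place
// with only the probe tuples that inspection mapped to it.
void cache_stationary_join(const std::vector<std::vector<Tuple>>& subparts,
                           const std::vector<Tuple>& probe_part,
                           const std::vector<std::vector<uint32_t>>& probe_idx,
                           std::vector<std::pair<Tuple, Tuple>>& out) {
    for (size_t s = 0; s < subparts.size(); s++) {
        std::unordered_multimap<uint32_t, Tuple> ht;   // fits in the CPU cache
        for (const Tuple& t : subparts[s]) ht.emplace(t.key, t);
        for (uint32_t i : probe_idx[s]) {              // matching probes only
            auto r = ht.equal_range(probe_part[i].key);
            for (auto it = r.first; it != r.second; ++it)
                out.emplace_back(it->second, probe_part[i]);
        }
    }
}
```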
15 Filters in I/O Partitioning
- How can the summary be extracted efficiently?
- Extend the filter scheme used in commercial hash joins
- Conventional single-filter scheme
  - Represents all build join keys
  - Filters out probe tuples that have no matches
[Diagram: the filter is constructed while partitioning the build relation into memory-sized partitions, and tested while partitioning the probe relation]
16 Background: Bloom Filter
- A bit vector
- A key is hashed d times (e.g., d = 3) and represented by d bits
- Construct: for every build join key, set its 3 bits in the vector
- Test: given a probe join key, check whether all its 3 bits are 1
  - Discard the tuple if some bits are 0
  - May have false positives
[Diagram: Bit0 = H0(key), Bit1 = H1(key), Bit2 = H2(key) in the filter]
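A minimal Bloom filter sketch matching this description, with d = 3 (the hash functions and the bit-vector size are illustrative assumptions, not the paper's choices):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct BloomFilter {
    size_t nbits;                 // number of bit positions in the vector
    std::vector<uint8_t> bits;    // the bit vector, packed 8 bits per byte

    explicit BloomFilter(size_t n) : nbits(n), bits((n + 7) / 8, 0) {}

    // Cheap multiplicative hash; real systems use stronger functions.
    size_t h(uint32_t key, uint32_t seed) const {
        return (key * 2654435761u ^ seed * 40503u) % nbits;
    }

    void construct(uint32_t key) {          // set the key's 3 bits
        for (uint32_t i = 0; i < 3; i++) {
            size_t p = h(key, i);
            bits[p / 8] |= uint8_t(1u << (p % 8));
        }
    }

    bool test(uint32_t key) const {         // are all 3 bits set?
        for (uint32_t i = 0; i < 3; i++) {
            size_t p = h(key, i);
            if (!(bits[p / 8] & (1u << (p % 8))))
                return false;               // a 0 bit: definitely no match
        }
        return true;                        // all 1s: match (maybe a false positive)
    }
};
```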
17 Multi-Filter Scheme
- Single filter a probe tuple ? entire build
relation - Our goal a probe tuple ? sub-partitions
- Construct a filter for every sub-partition
- Replace a single large filter with multiple small
filters
Single Filter
Multi-Filter
Partition 0
Partition 1
Build Relation
Partition 2
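The sketch referenced above, reusing the BloomFilter and Tuple types from the earlier sketches (tuples_per_subpart, i.e., K pages' worth of tuples, and the per-filter size are hypothetical parameters): while build tuples are appended to a partition, every run of K pages starts a new sub-partition with its own small filter.

```cpp
#include <cstddef>
#include <vector>

// Build one small Bloom filter per sub-partition while writing build tuples
// into a partition. Every tuples_per_subpart tuples open a new sub-partition.
std::vector<BloomFilter> build_multi_filters(const std::vector<Tuple>& part,
                                             size_t tuples_per_subpart,
                                             size_t bits_per_filter) {
    std::vector<BloomFilter> filters;       // one per sub-partition
    for (size_t i = 0; i < part.size(); i++) {
        if (i % tuples_per_subpart == 0)            // start a new sub-partition
            filters.emplace_back(bits_per_filter);  // with a fresh small filter
        filters.back().construct(part[i].key);      // key goes into its filter
    }
    return filters;
}
```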
18 Testing Multi-Filters
- When partitioning the probe relation
  - Test a probe tuple against all the filters of a partition
  - This tells which sub-partition(s) the tuple may match
- Store the summary of matching relationships in the intermediate partitions
- This information is later used to extract the probe tuples in sub-partition-ID order; a special array is constructed with the counting sort technique for this purpose
[Diagram: a probe tuple tested against the multi-filters of partitions 0-2]
19 Testing Multi-Filters (contd.)
- Extract the probe tuple information for every sub-partition using counting sort (sketched below)
  - One array per sub-partition; its size is the number of matching probe tuples for that sub-partition
  - The tuples themselves are never visited or copied during the counting sort
- Then join each pair of build and probe sub-partitions
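A sketch of that counting-sort pass (our reconstruction; for simplicity it assumes each probe tuple is mapped to a single sub-partition, whereas the paper's summary can map a tuple to several). Two passes over the per-tuple sub-partition IDs first size and then fill one index array per sub-partition, without ever touching the tuples themselves:

```cpp
#include <cstdint>
#include <vector>

// Given each probe tuple's matching sub-partition id, build one array per
// sub-partition holding the positions of its matching probe tuples.
// Only indices move; the tuples are never visited or copied.
std::vector<std::vector<uint32_t>>
group_probes(const std::vector<uint16_t>& subpart_of_probe, size_t S) {
    std::vector<uint32_t> count(S, 0);
    for (uint16_t s : subpart_of_probe) count[s]++;        // pass 1: count
    std::vector<std::vector<uint32_t>> idx(S);
    for (size_t s = 0; s < S; s++) idx[s].reserve(count[s]);
    for (uint32_t i = 0; i < subpart_of_probe.size(); i++) // pass 2: scatter
        idx[subpart_of_probe[i]].push_back(i);             // indices only
    return idx;
}
```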
20 Minimizing Cache Misses for Testing Filters
- Single-filter scheme
  - Compute 3 bit positions
  - Test 3 bits
- Multi-filter scheme, with S sub-partitions in a partition
  - Compute 3 bit positions
  - Test the same 3 bit positions in every filter, altogether 3S bits
  - May cause 3S cache misses!
21 Vertical Filters for Testing
- Bits at the same position are contiguous in memory
- 3 cache misses instead of 3S cache misses! (see the sketch below)
[Diagram: the S filters of each partition stored vertically, so the tested bit position is contiguous in memory across partitions 0-2]
- Horizontal-to-vertical conversion happens after partitioning the build relation
  - Very small overhead in practice
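A sketch of the vertical, i.e., bit-sliced, layout (assuming S <= 64 so that one bit position across all sub-partition filters fits in a single uint64_t; that packing is our simplification): slice[p] holds bit p of every filter side by side, so testing a probe key's 3 positions reads 3 contiguous words instead of 3 bits in each of S scattered filters.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct VerticalFilters {
    std::vector<uint64_t> slice;      // slice[p]: bit p of filters 0..S-1

    explicit VerticalFilters(size_t bits_per_filter)
        : slice(bits_per_filter, 0) {}

    void construct(size_t pos, unsigned subpart) {   // set one filter's bit
        slice[pos] |= (uint64_t(1) << subpart);
    }

    // Bitmask of sub-partitions whose 3 bits are all set: the candidate
    // sub-partitions this probe tuple may match.
    uint64_t test(size_t p0, size_t p1, size_t p2) const {
        return slice[p0] & slice[p1] & slice[p2];    // 3 contiguous loads
    }
};
```

The returned bitmask directly encodes the candidate sub-partitions, which is exactly the summary information the inspection phase stores with the probe partition.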
22 Outline
- Motivation
- Previous hash join algorithms
- Hash join performance on SMP systems
- Inspector join
- Experimental results
- Conclusions
23 Experimental Setup
- Relation schema: a 4-byte join attribute and a fixed-length payload
- No selection, no projection
- 50MB of memory per CPU available for the join phase
- The same join algorithm runs on every CPU, joining different partitions
- Detailed cycle-by-cycle simulations
  - A shared-bus SMP system with 1.5GHz processors
  - Memory hierarchy modeled on the Itanium 2 processor
24 Partition Phase Wall-Clock Time
- Setup: 500MB joins 2GB; 100-byte tuples with 4-byte keys; 50% of probe tuples have no matches; each build tuple matches 2 probe tuples; x-axis is the number of CPUs used
- I/O partitioning can take advantage of multiple CPUs
  - Cut the input relations into equal-sized chunks
  - Partition one chunk on every CPU
  - Concatenate the outputs from all CPUs
- Enhanced cache partitioning = cache partitioning + advanced prefetching
- Inspection incurs very small overhead
  - Execution-time ratio relative to the best algorithm: 0.88 to 0.94
  - Mainly the computation cost of converting horizontal filters to vertical ones and of testing them
25 Join Phase Aggregate Time
- Setup: 500MB joins 2GB; 100-byte tuples with 4-byte keys; 50% of probe tuples have no matches; each build tuple matches 2 probe tuples; x-axis is the number of CPUs used
- Inspector join achieves significantly better performance when 8 or more CPUs are used
  - Because of the locality optimization plus cache prefetching
  - 1.7-2.1X speedups over cache prefetching
    - Memory bandwidth becomes the bottleneck as more processors are used
  - 1.6-2.0X speedups over enhanced cache partitioning
26 Results on Choosing a Suitable Join Phase
[Diagram: I/O Partitioning + Inspection produces Extracted Info, which feeds a Join Phase chosen among Simple Hash Join, Cache Partitioning, Cache Prefetching, and Cache-Stationary]
- Case 1: a large number of duplicate build join keys
  - Choose enhanced cache partitioning when a probe tuple on average matches 4 or more sub-partitions
- Case 2: nearly sorted input relations
  - Surprisingly, cache-stationary join is very good
27 Conclusions
- Exploit the multi-pass structure for higher-quality information about the data
- Achieve significantly better cache performance
  - 1.6X speedups over previous cache-friendly algorithms when 8 or more CPUs are used
- Choose the most suitable algorithm for special input cases
- The idea may be applicable to other multi-pass algorithms
29 Previous Algorithms on SMP Systems
[Charts: join phase wall-clock time; aggregate time on all CPUs]
- Join phase performance of joining a 500MB relation with a 2GB relation (details later in the talk)
- Aggregate performance degrades dramatically beyond 4 CPUs
- Conclusion: reduce data movement (memory to memory, memory to cache)
30 More Details in the Paper
- Moderate memory space requirement for filters
- Summary information representation in intermediate partitions
- Preprocessing for the cache-stationary join phase
- Prefetching for improving efficiency and robustness
33 CPU-Cache-Friendly Hash Joins
- Recent studies focus on CPU cache performance
  - I/O partitioning gives good I/O performance
  - But random memory accesses cause poor CPU cache performance
- Cache partitioning [Shatdal et al. 94] [Boncz et al. 99] [Manegold et al. 00]
  - Recursively produce cache-sized partitions from memory-sized partitions
  - Avoid cache misses during the join phase
  - Pay the re-partitioning cost
- Cache prefetching [Chen et al. 04]
  - Exploit memory system parallelism
  - Use prefetches to overlap multiple cache misses with computation
34 Example Special Input Cases
- Example case 1: a large number of duplicate build join keys
  - A probe tuple must be checked against all possible sub-partitions
  - Count the average number of sub-partitions a probe tuple matches
  - If it is too large, cache-stationary join works poorly
- Example case 2: nearly sorted input relations
  - A merge-based join phase might be better?
[Diagram: a probe tuple matching several sub-partitions (0-2) of a build partition]
35 Varying the Number of Duplicates per Build Join Key
- Join phase aggregate performance
- Choose enhanced cache partitioning when a probe tuple on average matches 4 or more sub-partitions
36 Nearly Sorted Cases
- Sort both input relations, then randomly move 0-5% of the tuples
- Join phase aggregate performance
- Surprisingly, cache-stationary join is very good
  - Even better than merge join when over 1% of the tuples are out of order
37 Analyzing the Nearly Sorted Case
- The partitions are also nearly sorted
- The probe tuples matching a sub-partition are almost contiguous
  - Similar memory behavior to merge join
  - No cost for sorting the out-of-order tuples
[Diagram: a nearly sorted probe partition whose tuples match the nearly sorted sub-partitions 0-2 of the build partition]