Transcript and Presenter's Notes

Title: Inspector Joins


1
Inspector Joins
  • By Shimin Chen, Anastassia Ailamaki, Phillip B.
    Gibbons, and Todd C. Mowry
  • VLDB 2005

Rammohan Narendula
2
Introduction
  • Traditionally, query execution is I/O bound, so most
    research concentrates on main memory
  • Goal: reduce the number of page faults and thus the
    number of disk I/Os
  • However, hash join becomes CPU bound given sufficient
    I/O bandwidth and advanced I/O techniques (I/O
    prefetching)
  • Goal: reduce the number of cache misses
3
Exploiting Information about Data
  • The ability to improve a query depends on the quality
    of available information
  • General stats on relations are inadequate
  • May lead to incorrect decisions for specific
    queries
  • Especially true for join queries
  • Previous approaches exploiting dynamic
    information
  • Collecting information from previous queries
  • Multi-query optimization [Sellis 88]
  • Materialized views [Blakeley et al. 86]
  • Join indices [Valduriez 87]
  • Dynamic re-optimization of query plans
    [Kabra and DeWitt 98] [Markl et al. 04]
  • → This study exploits the inner structure of hash
    joins

4
Exploiting Multi-Pass Structure of Hash Joins
  • Idea
  • Examine the actual data in I/O partitioning phase
  • Extract useful information to improve join phase

Inspection
5
Using Extracted Information
  • Enable a new join phase algorithm
  • Reduce the primary performance bottleneck in hash
    joins
  • i.e., poor CPU cache performance
  • Optimized for multi-processor systems
  • Choose the most suitable join phase algorithm for
    special input cases

(Diagram: I/O Partitioning with Inspection produces
Extracted Information, which the Join Phase uses to choose
among Simple Hash Join, Cache Partitioning, Cache
Prefetching, and the New Algorithm)
6
Outline
  • Motivation
  • Previous hash join algorithms
  • Hash join performance on SMP systems
  • Inspector join
  • Experimental results
  • Conclusions

7
GRACE Hash Join
  • I/O Partitioning Phase
  • Divide input relations into partitions with a
    hash function

Probe
Build
Over 70% of execution time stalled on cache misses!
  • Join Phase (simple hash join)
  • Build hash table, then probe hash table

Probe
Build
  • Random memory accesses cause poor CPU cache
    performance
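
To make the two phases concrete, here is a minimal, runnable Python sketch of the GRACE structure described above: an I/O partitioning pass that hashes both relations into partitions, followed by a simple build/probe hash join per partition pair. The (key, payload) tuple layout and the partition count are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of GRACE hash join: partition both inputs, then join
# each partition pair with a simple build/probe hash join.
# Tuples are (join_key, payload) pairs; num_partitions is an illustrative
# choice so that each partition fits in memory.

def partition(relation, num_partitions):
    """I/O partitioning phase: split a relation by hashing the join key."""
    parts = [[] for _ in range(num_partitions)]
    for key, payload in relation:
        parts[hash(key) % num_partitions].append((key, payload))
    return parts

def simple_hash_join(build_part, probe_part):
    """Join phase: build a hash table on one partition, probe with the other."""
    table = {}
    for key, payload in build_part:
        table.setdefault(key, []).append(payload)
    results = []
    for key, probe_payload in probe_part:
        # Random accesses into the hash table are what cause the cache misses.
        for build_payload in table.get(key, []):
            results.append((key, build_payload, probe_payload))
    return results

def grace_hash_join(build_rel, probe_rel, num_partitions=4):
    build_parts = partition(build_rel, num_partitions)
    probe_parts = partition(probe_rel, num_partitions)
    out = []
    for b, p in zip(build_parts, probe_parts):
        out.extend(simple_hash_join(b, p))
    return out
```

Because both relations are split with the same hash function, matching keys always land in the same partition pair, which is what lets the join phase work one memory-sized partition at a time.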

8
Cache Partitioning
  • Recursively produce cache-sized partitions after
    I/O partitioning
  • Avoid cache misses when joining cache-sized
    partitions
  • Overhead of re-partitioning

Memory-sized Partitions
Probe
Build
9
Cache Prefetching
  • Reduce impact of cache misses
  • Exploit available memory bandwidth
  • Overlap cache misses and computations
  • Insert cache prefetch instructions into code
  • Still incurs the same number of cache misses

10
Outline
  • Motivation
  • Previous hash join algorithms
  • Hash join performance on SMP systems
  • Inspector join
  • Experimental results
  • Conclusions

11
Hash Joins on SMP Systems
  • Previous studies mainly focus on uni-processors
  • Memory bandwidth is precious
  • It becomes the bottleneck in cache-prefetching
    techniques
  • Each processor joins a pair of partitions in join
    phase

12
Inspector Joins
  • Extracted information: a summary of matching
    relationships
  • Every K contiguous pages in a build partition
    forms a sub-partition
  • Tells which sub-partition(s) every probe tuple
    matches

Probe Partition
Build Partition
Summary of Matching Relationship
13
Cache-Stationary Join Phase
  • Recall the re-partitioning cost of cache partitioning

(Diagram: copying cost of re-partitioning both the build
and the probe partition)
  • We want to achieve zero copying

14
Cache-Stationary Join Phase
  • Joins a sub-partition and its matching probe
    tuples
  • Sub-partition is small enough to fit in CPU cache
  • Cache prefetching for the remaining cache misses
  • Zero copying for generating recursive cache-sized
    partitions

(Diagram: a build partition divided into sub-partitions
0-2, each joined with its matching probe tuples)
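
A sketch of the cache-stationary idea in the same Python style: each build sub-partition is joined only with the probe tuples that the inspection summary associates with it, so the hash table being probed always fits in the cache. The container names (`sub_partitions`, `probe_tuples_by_sub`) are hypothetical stand-ins for the structures produced during I/O partitioning.

```python
# Sketch of the cache-stationary join phase: every build sub-partition is
# small enough to fit in the CPU cache, and only the probe tuples whose
# summary says they may match that sub-partition are used to probe it.
# sub_partitions[i] is a list of (key, payload) build tuples;
# probe_tuples_by_sub[i] is the list of probe tuples routed to sub-partition i.

def cache_stationary_join(sub_partitions, probe_tuples_by_sub):
    results = []
    for sub_id, build_tuples in enumerate(sub_partitions):
        # Build a hash table over one cache-sized sub-partition only.
        table = {}
        for key, payload in build_tuples:
            table.setdefault(key, []).append(payload)
        # Probe with just the tuples known (via the filters) to possibly match.
        for key, probe_payload in probe_tuples_by_sub[sub_id]:
            for build_payload in table.get(key, []):
                results.append((key, build_payload, probe_payload))
    return results
```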
15
Filters in I/O Partitioning
  • How to extract the summary efficiently?
  • Extend filter scheme in commercial hash joins
  • Conventional single-filter scheme
  • Represent all build join keys
  • Filter out probe tuples having no matches

(Diagram: a filter constructed over the memory-sized build
partitions and tested while partitioning the probe
relation)
16
Background Bloom Filter
  • A bit vector
  • A key is hashed d (e.g. d = 3) times and
    represented by d bits
  • Construct: for every build join key, set its 3
    bits in the vector
  • Test: given a probe join key, check if all its 3
    bits are 1
  • Discard the tuple if some bits are 0
  • May have false positives

Bit0 = H0(key)
Bit1 = H1(key)
Bit2 = H2(key)
Filter
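
A minimal Bloom filter sketch in Python matching the slide's description (d = 3 hash functions, construct sets the bits, test checks them). The salted built-in `hash` stands in for the hash functions H0..H2; it is an illustrative assumption, not the paper's choice.

```python
# Minimal Bloom filter sketch: a bit vector in which every key sets/tests
# d bit positions. May return false positives, never false negatives.

class BloomFilter:
    def __init__(self, num_bits, d=3):
        self.num_bits = num_bits
        self.bits = [0] * num_bits
        self.d = d

    def _positions(self, key):
        # One position per hash function; (seed, key) salts the built-in hash.
        return [hash((seed, key)) % self.num_bits for seed in range(self.d)]

    def construct(self, key):
        """Called for every build join key: set its d bits."""
        for pos in self._positions(key):
            self.bits[pos] = 1

    def test(self, key):
        """Called for a probe join key: True only if all d bits are set."""
        return all(self.bits[pos] for pos in self._positions(key))
```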
17
Multi-Filter Scheme
  • Single filter: a probe tuple → the entire build
    relation
  • Our goal: a probe tuple → sub-partitions
  • Construct a filter for every sub-partition
  • Replace a single large filter with multiple small
    filters

(Diagram: a single filter over the whole build relation
vs. multiple small filters, one per sub-partition within
each partition)
18
Testing Multi-Filters
  • When partitioning the probe relation
  • Test a probe tuple against all the filters of a
    partition
  • Tells which sub-partition(s) the tuple may have
    matches in
  • Store the summary of matching relationships in the
    probe partitions
  • This information is later used to extract probe
    tuples in the order of sub-partition IDs; a special
    array is built with a counting sort for this purpose

(Diagram: probe tuples tested against the multi-filters of
their partition while the probe relation is partitioned)
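
A sketch of the multi-filter inspection described on the last two slides, reusing the BloomFilter sketch from above: one small filter is built per build sub-partition, and testing a probe key against all of them yields the set of sub-partitions it may match, i.e. that tuple's summary. The filter size is an illustrative assumption.

```python
# Sketch of the multi-filter scheme: one small Bloom filter per build
# sub-partition instead of one large filter for the whole relation.

def build_multi_filters(build_sub_partitions, bits_per_filter=1024):
    """Construct one filter per sub-partition from its build join keys."""
    filters = []
    for sub in build_sub_partitions:
        f = BloomFilter(bits_per_filter)
        for key, _payload in sub:
            f.construct(key)
        filters.append(f)
    return filters

def candidate_sub_partitions(filters, probe_key):
    """Return the IDs of sub-partitions the probe key may match."""
    return [sub_id for sub_id, f in enumerate(filters) if f.test(probe_key)]
```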
19
Contd.
  • Extract the probe tuple information for every
    sub-partition using counting sort
  • One array per sub-partition; its size is the number
    of matching probe tuples for that sub-partition
  • The tuples themselves are never visited or copied
    during the counting sort
  • Then join each build sub-partition with its matching
    probe tuples
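
A sketch of the counting-sort grouping just described: from the per-tuple summaries it first counts the matching probe tuples per sub-partition, then fills one array of tuple references per sub-partition, never touching the tuples themselves. `summaries` is assumed to hold, for each probe tuple, the candidate sub-partition IDs returned by the filter test.

```python
# Counting-sort step: group probe tuples by candidate sub-partition without
# copying them; only counts and arrays of tuple indices are built.

def group_probe_refs_by_sub(summaries, num_subs):
    """summaries[i] lists the candidate sub-partition IDs of probe tuple i."""
    # Pass 1: count matching probe tuples per sub-partition.
    counts = [0] * num_subs
    for subs in summaries:
        for s in subs:
            counts[s] += 1
    # Pass 2: allocate one exactly-sized array per sub-partition and fill it.
    refs = [[None] * counts[s] for s in range(num_subs)]
    fill = [0] * num_subs
    for i, subs in enumerate(summaries):
        for s in subs:
            refs[s][fill[s]] = i   # store a reference (index), never the tuple
            fill[s] += 1
    return refs
```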

20
Minimizing Cache Misses for Testing Filters
  • Single filter scheme
  • Compute 3 bit positions
  • Test 3 bits
  • Multi-filter scheme: if there are S
    sub-partitions in a partition
  • Compute 3 bit positions
  • Test the same 3 bits for every filter, altogether
    3S bits
  • May cause 3S cache misses!

Test
21
Vertical Filters for Testing
  • Bits at the same position are contiguous in
    memory
  • 3 cache misses instead of 3S cache misses!

(Diagram: the S filters of a partition stored so that bits
at the same position are contiguous in memory, tested while
partitioning the probe relation)
  • Horizontal → vertical conversion after
    partitioning the build relation
  • Very small overhead in practice
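
A sketch of the vertical layout, again assuming the BloomFilter sketch from earlier: the S per-sub-partition bit vectors are transposed so that all S bits for one bit position sit next to each other, and a probe test reads d such rows instead of touching d bits in S separate filters.

```python
# Vertical filter layout: bit-position-major storage, so testing one probe
# key touches d contiguous rows (about d cache misses) rather than d bits
# in each of S separate filters (up to d*S cache misses).

def to_vertical(filters):
    """Transpose S horizontal BloomFilter bit vectors: vertical[pos][sub_id]."""
    num_bits = filters[0].num_bits
    return [[f.bits[pos] for f in filters] for pos in range(num_bits)]

def test_vertical(vertical, d, num_bits, probe_key):
    """Return the sub-partition IDs whose filters pass for probe_key."""
    num_subs = len(vertical[0])
    candidates = set(range(num_subs))
    for seed in range(d):
        pos = hash((seed, probe_key)) % num_bits   # same positions as construct
        row = vertical[pos]                        # one contiguous row per position
        candidates = {s for s in candidates if row[s]}
    return sorted(candidates)
```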

22
Outline
  • Motivation
  • Previous hash join algorithms
  • Hash join performance on SMP systems
  • Inspector join
  • Experimental results
  • Conclusions

23
Experimental Setup
  • Relation schema: a 4-byte join attribute plus a
    fixed-length payload
  • No selection, no projection
  • 50MB memory per CPU available for the join phase
  • Same join algorithm run on every CPU joining
    different partitions
  • Detailed cycle-by-cycle simulations
  • A shared-bus SMP system with 1.5GHz processors
  • Memory hierarchy is based on Itanium 2 processor

24
Partition Phase Wall-Clock Time
  • A 500MB relation joins a 2GB relation
  • 100B tuples, 4B keys
  • 50% of probe tuples have no matches
  • A build tuple matches 2 probe tuples

Number of CPUs used
  • I/O partitioning can take advantage of multiple
    CPUs
  • Cut input relations into equal-sized chunks
  • Partition one chunk on every CPU
  • Concatenate outputs from all CPUs
  • Enhanced cache partitioning = cache partitioning +
    advanced prefetching
  • Inspection incurs very small overhead
  • Ratio of execution time relative to the best
    algorithm: 0.88 to 0.94
  • Mainly the computation cost of converting horizontal
    filters to vertical ones and of testing them
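
A sketch of the parallel partitioning just described (cut the input into equal-sized chunks, partition one chunk per CPU, concatenate the outputs), reusing the `partition` function from the GRACE sketch above; the process pool is an illustrative stand-in for running one chunk on every CPU.

```python
# Parallel I/O partitioning sketch: each worker partitions one chunk of the
# relation, then the per-chunk outputs are concatenated partition by partition.
from multiprocessing import Pool

def parallel_partition(relation, num_partitions, num_cpus=4):
    # Cut the input relation into equal-sized chunks, one per CPU.
    chunk_size = (len(relation) + num_cpus - 1) // num_cpus
    chunks = [relation[i:i + chunk_size]
              for i in range(0, len(relation), chunk_size)]
    # Partition every chunk independently, one chunk per worker process.
    with Pool(num_cpus) as pool:
        per_chunk = pool.starmap(partition,
                                 [(chunk, num_partitions) for chunk in chunks])
    # Concatenate outputs from all workers, partition by partition.
    return [sum((chunk_parts[p] for chunk_parts in per_chunk), [])
            for p in range(num_partitions)]
```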

25
Join Phase Aggregate Time
  • A 500MB relation joins a 2GB relation
  • 100B tuples, 4B keys
  • 50% of probe tuples have no matches
  • A build tuple matches 2 probe tuples

Number of CPUs used
  • Inspector join achieves significantly better
    performance when 8 or more CPUs are used
  • Because of its locality optimization and cache
    prefetching
  • 1.7-2.1X speedups over cache prefetching
  • Memory bandwidth becomes the bottleneck as more
    processors are used
  • 1.6-2.0X speedups over enhanced cache partitioning

26
Results on Choosing Suitable Join Phase
(Diagram: I/O Partitioning with Inspection produces
Extracted Info, which the Join Phase uses to choose among
Simple Hash Join, Cache Partitioning, Cache Prefetching,
and Cache Stationary)
  • Case 1: a large number of duplicate build join
    keys
  • Choose enhanced cache partitioning
  • When a probe tuple on average matches 4 or more
    sub-partitions
  • Case 2: nearly sorted input relations
  • Surprisingly, cache-stationary join is very good
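
A small sketch of the choice described here, under the assumption that inspection has produced per-probe-tuple candidate lists: if a probe tuple matches roughly 4 or more sub-partitions on average, fall back to enhanced cache partitioning, otherwise use the cache-stationary join. The function name is illustrative; only the 4-sub-partition threshold comes from the slides.

```python
# Decision rule sketch for picking the join phase algorithm from the
# inspection summaries gathered during I/O partitioning.

def choose_join_phase(summaries, threshold=4.0):
    """summaries[i] lists the candidate sub-partition IDs of probe tuple i."""
    if not summaries:
        return "cache_stationary"
    avg_matches = sum(len(subs) for subs in summaries) / len(summaries)
    return ("enhanced_cache_partitioning" if avg_matches >= threshold
            else "cache_stationary")
```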

27
Conclusions
  • Exploit multi-pass structure for higher quality
    info about data
  • Achieve significantly better cache performance
  • 1.6X speedups over previous cache-friendly
    algorithms
  • When 8 or more CPUs are used
  • Choose most suitable algorithms for special input
    cases
  • Idea may be applicable to other multi-pass
    algorithms

28
  • Thank You !

29
Previous Algorithms on SMP Systems
Wall clock time
Aggregate time on all CPUs
  • Join phase performance of joining a 500MB and a
    2GB relation (details later in the talk)
  • Aggregate performance degrades dramatically beyond
    4 CPUs
  • → Reduce data movement (memory to memory, memory
    to cache)

30
More Details in Paper
  • Moderate memory space requirement for filters
  • Summary information representation in
    intermediate partitions
  • Preprocessing for cache-stationary join phase
  • Prefetching for improving efficiency and
    robustness

31
Partition Phase Wall-Clock Time
  • A 500MB relation joins a 2GB relation
  • 100B tuples, 4B keys
  • 50% of probe tuples have no matches
  • A build tuple matches 2 probe tuples

Number of CPUs used
  • I/O partitioning can take advantage of multiple
    CPUs
  • Cut input relations into equal-sized chunks
  • Partition one chunk on every CPU
  • Concatenate outputs from all CPUs
  • Inspection incurs very small overhead

32
Join Phase Aggregate Time
  • A 500MB relation joins a 2GB relation
  • 100B tuples, 4B keys
  • 50% of probe tuples have no matches
  • A build tuple matches 2 probe tuples

Number of CPUs used
  • Inspector join achieves significantly better
    performance when 8 or more CPUs are used
  • 1.7-2.1X speedups over cache prefetching
  • 1.6-2.0X speedups over enhanced cache partitioning

33
CPU-Cache-Friendly Hash Joins
  • Recent studies focus on CPU cache performance
  • I/O partitioning gives good I/O performance
  • Random memory accesses cause poor CPU cache
    performance
  • Cache Partitioning [Shatdal et al. 94] [Boncz et
    al. 99] [Manegold et al. 00]
  • Recursively produce cache-sized partitions from
    memory-sized partitions
  • Avoid cache misses during join phase
  • Pay re-partitioning cost
  • Cache Prefetching [Chen et al. 04]
  • Exploit memory system parallelism
  • Use prefetches to overlap multiple cache misses
    and computations

34
Example Special Input Cases
  • Example case 1: a large number of duplicate
    build join keys
  • Count the average number of sub-partitions a
    probe tuple matches
  • Must check the tuple against all possible
    sub-partitions
  • If too large, cache-stationary join works poorly
  • Example case 2: nearly sorted input relations
  • A merge-based join phase might be better?

(Diagram: a probe tuple checked against sub-partitions 0-2
of the build partition)
35
Varying Number of Duplicates per Build Join Key
  • Join phase aggregate performance
  • Choose enhanced cache partitioning
  • When a probe tuple on average matches 4 or more
    sub-partitions

36
Nearly Sorted Cases
  • Sort both input relations, then randomly move
    0-5% of the tuples
  • Join phase aggregate performance
  • Surprisingly, cache-stationary join is very good
  • Even better than merge join when over 1% of the
    tuples are out of order

37
Analyzing Nearly Sorted Case
  • Partitions are also nearly sorted
  • Probe tuples matching a sub-partition are almost
    contiguous
  • Similar memory behavior as merge join
  • No cost for sorting out-of-order tuples

(Diagram: nearly sorted build and probe partitions; the
probe tuples matching each sub-partition 0-2 are almost
contiguous)