Title: Joining Interval Data in Relational Databases
1Joining Interval Data in Relational Databases
- By
- Jost Enderle
- Matthias Hampel
- Thomas Seidl
2Outline
- Problem at hand
- Current solutions
- Important concepts
- Solution to problem at hand
- Some numbers
- Final words
3Problem at hand
- Joining interval data in relational databases
- Interval is duration of period of time
- It is a scalar
- E.g. 24 hours
- Chronological period is a pair of date times
- It is a compound object
- E.g. 2nd January 2004 10th November 2004
- Has definition for Overlaps operator
- Above period Overlaps with 2nd November 2004
3rd December 2004
4Problem at hand
- Overlaps operator formulation
- Where R.period overlaps S.period
- Where R.period.lower lt S.period.upper
- and R.period.upper gt S.period.lower
- P1 2nd January 2004 10th November 2004
- P2 2nd November 2004 3rd December 2004
- 2nd January 2004 lt 3rd December 2004
- And 10th November 2004 gt 2nd November 2004
- R and S can have 1000s of rows and the overlaps
operator needs to be executed 1,000,000 of times
5Problem at hand
- Relational databases support equi-joins
efficiently not Multiple Interdependent Join
predicates - Sort order needed on interval values for sort
merge joins - Here there are 2 such values
- Index can help but it is still costly
- It can provide advantage if Index is designed for
interval data
6Current solutions
- Nested loop algorithms
- Quadratic cost
- E.g. TJ-2
- Assumes inner join relation to be sorted
- Other approaches
- Assume inner join relation can be stored in memory
7Current solutions
- Sort-Merge algorithms
- Need to find sort order on period data
- E.g. TJ-1, multi-predicate merge join
- Relations need to be sorted on lower and upper
points of interval - E.g. TJ-3
- For append only databases
- E.g. Algorithm by Pfoser et al
- For append only databases
- E.g. Algorithm by Zhang et al, Multi predicate
merge join - Does not support an overlaps join predicate
8Current solutions
- Partition Based join
- Partition input relations on join attribute
values and compare tuples in relevant partitions - Something like using MBRs and R trees
- E.g. Algorithms by Snodgrass and Jensen
- Perform poorly for lots of tuples having long
intervals - E.g. Algorithms by Sitzmann and Stuckey
- Optimize performance of above algorithm using
histograms - Difficult to integrate with existing databases
- E.g. Algorithm by lu, Ooi and Tan
- Convert temporal data to spatial data and perform
spatial join - Not efficient and difficult to integrate to
existing databases
9Current solutions
- Index Based joins
- Use indices defined on join attributes for
efficient retrieval - Need to check if the index can be integrated in
the existing databases - E.g. Algorithm by Zhang, Tsotras and Seeger
- Efficient implementation by MVBT (Multi version
B tree) index - Difficult to integrate with existing databases
- Complex algorithms!!
10Important concepts
- Relational Interval Tree
- Binary tree of height h covers range 1, 2h -1
of interval bounds. - Tree is virtual, traversal of tree is purely
arithmetic - Each node represents an instant of time
- Each non-leaf node represents an interval for the
entire sub-tree rooted at it
11Important concepts
- Relational Interval (RI) tree
12Important concepts
13Important concepts
- Node 4 represents the instant 4, 21 represents
instant 21
14Important concepts
- Node 2 represents interval 1, 3
15Important concepts
- Node 12 represents interval 9,15,10,14,11,14
etc
16Important concepts
- Node 16 represents interval 1, 31, etc
17Important concepts
- Mapping of an interval to a node. E.g. 2, 13
- It is the root of the smallest sub-tree including
the nodes 2 and 13
18Important concepts
- It is the root of the smallest sub-tree including
the nodes 2 and 13 i.e. Node 8
19Important concepts
- Database indices of RI tree
- Two Indices one on the upper and the other on the
lower interval values - Values in the Lower index stores has the
following structure - Structure is indexed by the lower interval value
- Node is the interval Node computed as previously
shown 8 for 2, 13 - Lower is the lower bound of the interval i.e. 2
- ID is the primary key of the record in the table
20Important concepts
- Database indices of RI tree
- Similarly for Upper index
- Node is the interval Node computed as previously
shown 8 for 2, 13 - Upper is the upper bound of the interval i.e. 13
- ID is the primary key of the record in the table
21Important concepts
- Using Relational Interval (RI) tree
- Record with ID Mary has interval 2, 13
- Record with ID John has interval 4, 23
- Record with ID Bob has interval 10, 21
- Record with ID Ann has interval 21, 30
22Important Concepts
- Left queries, Right queries and Inner Queries
- For any Range lower, upper , for all the nodes
on the path from root to lower and upper - Nodes registered to the left of lower may
intersect the range - left queries - Nodes registered to the right of upper may
intersect the range right queries - Nodes registered in lower, upper DEFINITELY
intersect the range inner queries
23Important Concepts
- For any Range 11, 13 , for all the nodes on
the path from root to lower and upper i.e. 16 to
11 and 16 to 13
24Important Concepts
- Nodes registered to the left of lower may
intersect the range - left queries here 8, 10 - Note here that range 1, 9 and 1, 13 are
both registered at 8, however only 1, 13
intersects with 11, 13
25Important Concepts
- Nodes registered to the right of upper may
intersect the range right queries here 14, 16
26Important Concepts
- Nodes registered in lower, upper DEFINITELY
intersect the range inner queries here 11, 13
27Solution to problem at hand
- Determine the IDs of all the tuples that
intersect with a given range lower, upper - Determine left queries
- Get the IDs of the tuples from left queries that
intersect with range lower, upper - Determine right queries
- Get the IDs of the tuples from right queries that
intersect with range lower, upper - Determine inner queries
- Get IDs of all the tuples from inner queries
- Return union of all the above
28Solution to problem at hand
- Determine the IDs of all the tuples that
intersect with a given range lower, upper - Determine left queries
- Get the IDs of the tuples from left queries that
intersect with range lower, upper - Determine right queries
- Get the IDs of the tuples from right queries that
intersect with range lower, upper - Determine inner queries
- Get IDs of all the tuples from inner queries
- Return union of all the above
29Solution to problem at hand
- The final intersection query
30Solution to problem at hand
- Note here that range 1, 9 and 1, 13 are
both registered at 8, however only 1, 13
intersects with 11, 13
31Solution to problem at hand
- Both 1, 9 and 1, 13 are picked on condition 1
- But only 1, 13 passes condition 2
- Since upper of 1, 13 gt lower of query 11, 13
- i.e. 13 gt 11
32Solution to problem at hand
- Condition 1 does filtering and condition 2 does
range comparison - Less tuples have to be compared for range
intersection here due to filtering lowering time
for query evaluation
33Solution to problem at hand
- Index based loop join
- Use the OVERLAPS algorithm defined previously to
basic loop join algorithm - For each tuple in the outer relation, project the
all tuples that OVERLAP with the inner relation - Approach is naïve, does not take partitioning
into consideration
34Solution to problem at hand
- Partition based join
- In an RI tree each interval in a tuple is
assigned to exactly one node of the tree - Some observations can be made regarding which
nodes of the second relation join with a given
node of the first relation
35Solution to problem at hand
- For all tuples of relation R registered at node
nR (node value n at relation R) , the tuples
registered at the following nodes in S may join
with them - Nodes on the path from the root of the tree to
the node nS (called UP nodes) - Node nS ( definite overlap)
- All nodes in the sub tree of nS (called down
nodes)
36Solution to problem at hand
- Here nR ( nS) 12
- The shaded region in S represents node that may
join with nR
37Solution to problem at hand
- UP nodes 8, 16
- DOWN nodes 9, 10, 11, 13, 14, 15
38Solution to problem at hand
- Depending on the manner in which the two
relations are partitioned, we have three
approaches - RI-Tree Up-Down join
- RI-Tree Down-Down join
- RI-Tree Up-Up join
- We will discuss the Up-Down approach in greater
detail
39Solution to problem at hand
- Steps for RI-Tree Up-Down join approach
- Determine all nodes in relation R
- Determine the max-range for all the nodes in
relation R - Alternatively you can also find the exact-range
for the nodes - Determine all join partners in UP of nS
- Determine all join partners in DOWN of nS
- Generate the query performing the join
40Solution to problem at hand
- Determine all nodes in relation R
- Note that the Index structure is actually
materialized in the database so you get all the
nodes that are actually present in the database
41Solution to problem at hand
- Determine the max-range for all the nodes in
relation R - max-range is fn_lower, fn_upper
- fn_lower nR step 1
- fn_upper nR step 1
- step abs( nR parent( nR ))
- For nR 12
- step abs( 12 8 ) 4
- fn_lower 12 4 1 9
- fn_upper 12 4 1 15
- max-range 9, 15
42Solution to problem at hand
- Alternatively you can also find the exact-range
for the nodes
- This will be a closer bound for range covered by
node R since it is dynamically determined based
on values in the database - Even if node 12 in theory may cover the range 9,
15 the table R may only have tuples in range
11, 15 registered at node 12. - Here you will get 11, 15, in the previous case
you will get 9, 15
43Solution to problem at hand
- Determine all join partners in UP of nS
- For all nodes in UP( nS )
- If the node is less than ns add it to left
queries - If the node is greater than ns add it to right
queries
44Solution to problem at hand
- Determine all join partners in DOWN of nS
- Nodes on the path of nS to fn_lower are stored as
left queries - Nodes on the path of nS to fn_upper are stored as
right queries - Range between fn_lower, ns -1 is stored as
left query - Range between ns 1, fn_upper is stored as
right query
45Solution to problem at hand
- Left query schema
- (forknode, from, to) from and to define the
range - Similarly, Right query schema
- (forknode, from, to) from and to define the
range
46Solution to problem at hand
- (12, 8, 8)
- (12, 10, 10)
- (12, 9, 11)
- The following left queries are generated for nS
12
47Solution to problem at hand
- Generate the query performing the join
- Project ID pairs (R.ID, S.ID) for all the tuples
where nR nS - Project ID pairs for all left queries of nS where
upper portion of interval S.period intersects
with lower portion of interval R.period - Project ID pairs for all right queries of nS
where lower portion of interval S.period
intersects with upper portion of interval R.period
48Solution to problem at hand
- Tuples registered at the same node in R and S
definitely overlap and so project them
unconditionally
- Intersection guaranteed because range of tuples
registered at node n intersects with instant n - I.E. instant n in period fn_lower, fn_upper
49Solution to problem at hand
- Project ID pairs for all left queries of nS where
upper portion of interval S.period intersects
with lower portion of interval R.period
- Intervals registered at Nodes corresponding to
Left queries of S does not include instant nR.
However, the interval may match the interval
registered at nR.
- Similarly for right queries for nS
50Solution to problem at hand
- The finally generated query
51Solution to problem at hand
- The Down-Down strategy, briefly
52Solution to problem at hand
- The Up-Up strategy, briefly
53Some numbers
- Experimental setup
- Integration with Oracle Server Release 9.2.0
- Pentium 4 with 2.4 GHZ
- 512 MB of RAM
- EIDE hard drive
- 200 database blocks cache
- Block size 2kB
54Some numbers
55Some numbers
56Some numbers
57Some numbers
58Some numbers
59Some numbers
60Some numbers
61Final words
- Number indicate
- Approach can be used on an existing database
- RI-tree UP-DOWN method outperforms standard
B-tree methods - RI-tree UP-DOWN method scales well and works
equally well on real data
62Final words
- Future work
- Cost models for prediction of performance based
on data characteristics
63References
64References
- Bercken J. v. d., Seeger B. Query Processing
Techniques for Multiversion Access Methods. VLDB
1996, 168-179. - Sitzmann I., Stuckey P.J. Improving Temporal
Joins Using Histograms. DEXA 2000, 488-498.
65Questions
- Is a RI tree dynamically extensible
- YES. Since it grows at the roots
- How much memory will a RI tree with 123562 nodes
require - 0 Bytes. Since, it is a virtual tree.
- Why are all you guys still here!!!