Sample Database - PowerPoint PPT Presentation

About This Presentation

Title:

Sample Database

Description:

Query: Find hotels that have a good balance of stars, distance, price. Dominated Tuples ... Keep repeating BEST till all remaining tuples have been exhausted ...

Slides: 29

Provided by: romil5

Transcript and Presenter's Notes

1
Introduction to Skyline Query Processing
- Romil Jain
2
Introduction
Query: Find hotels that have a good balance of
stars, distance, and price.
3
Formal Definition
In simple words, a skyline is the set of all
non-dominated tuples.
T: a set of n tuples, with k dimensions (or
columns or attributes) for each tuple.
The skyline S of T is
$S = \{\, s \in T \mid \neg\exists\, t \in T : (\forall i \in 1..k,\ t_i \ge s_i) \wedge (\exists i \in 1..k,\ t_i > s_i) \,\}$
where $t_i$ and $s_i$ are the values of the ith column of tuples
t and s respectively.
4
Example
How many of you compared every tuple with every
other tuple?

Database:
T1 = (1,2,3,4), T2 = (1,2,3,3), T3 = (1,2,3,4), T4 = (2,3,3,1)

T1 dominates T2.
T1 does not dominate T3.
T1 and T4 are incomparable (remember this for later!).

What is the skyline?
Answer: T1 = (1,2,3,4), T3 = (1,2,3,4), T4 = (2,3,3,1)
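
The dominance checks on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming larger values are better in every column (which is what the example implies); the function name `dominates` is ours and does not come from the slides.

```python
def dominates(t, s):
    """Return True if t dominates s: t is at least as good as s in every
    column and strictly better in at least one (assumption: larger is better)."""
    return all(a >= b for a, b in zip(t, s)) and any(a > b for a, b in zip(t, s))


T1, T2, T3, T4 = (1, 2, 3, 4), (1, 2, 3, 3), (1, 2, 3, 4), (2, 3, 3, 1)

print(dominates(T1, T2))  # True: equal on the first three columns, better on the last
print(dominates(T1, T3))  # False: identical tuples never dominate each other
print(dominates(T1, T4), dominates(T4, T1))  # False False: incomparable
```

Note that two equal tuples do not dominate each other, which is why both T1 and T3 appear in the answer.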
5
Brute Force Algorithm

Compare every tuple with all other tuples on all
k columns.
Return the tuples that are not dominated by any tuple.
Clearly $O(n^2)$ tuple comparisons.

Can we do better (at least in a few cases)?
One obvious way is to stop early, i.e. stop
comparing as soon as the current tuple gets
dominated. The worst case is still $O(n^2)$.
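
The brute-force idea with early stopping might look as follows. This is an illustrative sketch, not code from the talk: it assumes larger values are better, and the names `dominates` and `brute_force_skyline` as well as the comparison counter are ours.

```python
def dominates(t, s):
    # Assumption: larger is better in every column.
    return all(a >= b for a, b in zip(t, s)) and any(a > b for a, b in zip(t, s))


def brute_force_skyline(tuples):
    """Compare each tuple against all others; stop as soon as it is dominated.
    Returns the skyline and the number of tuple-to-tuple comparisons made."""
    skyline, comparisons = [], 0
    for i, s in enumerate(tuples):
        dominated = False
        for j, t in enumerate(tuples):
            if i == j:
                continue
            comparisons += 1
            if dominates(t, s):
                dominated = True
                break  # early stop: s cannot be in the skyline
        if not dominated:
            skyline.append(s)
    return skyline, comparisons


db = [(1, 2, 3, 4), (1, 2, 3, 3), (1, 2, 3, 4), (2, 3, 3, 1)]
sky, count = brute_force_skyline(db)
print(sky)    # [(1, 2, 3, 4), (1, 2, 3, 4), (2, 3, 3, 1)]
print(count)  # 10, versus n(n-1) = 12 without early stopping
```

Early stopping only helps for dominated tuples; every skyline member is still compared against all n-1 other tuples.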
6
Make Life Easier
$T_1 \succ T_2$ means T1 dominates T2.
$T_1 \sim T_2$ means T1 and T2 are incomparable.

Transitivity: if $T_1 \succ T_2$ and $T_2 \succ T_3$, then $T_1 \succ T_3$.
Use this property to reduce comparisons.
7
BEST Algorithm (Torlone et al. [1])
6) Report Tj as a skyline member
8
BEST Algorithm (cont)

After first round we have fewer tuples left
Keep repeating BEST till all remaining tuples
have been exhausted

We reduce comparisons by eliminating tuples
Running time:
Best case (only 1 maximal): $O(kn)$
Worst case (all tuples are equal): $O(kn^2)$
Average case: $O(kmn)$,
where m is the number of maximals.
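
The transcript preserves only the last step of BEST, so the following is a hedged sketch of the general idea described on these two slides rather than the algorithm from [1] itself: each round finds one maximal among the remaining tuples, reports it, and eliminates everything it dominates. Larger values are assumed to be better; all function names are ours.

```python
def dominates(t, s):
    # Assumption: larger is better in every column.
    return all(a >= b for a, b in zip(t, s)) and any(a > b for a, b in zip(t, s))


def best_style_skyline(tuples):
    """Repeat: find a maximal, report it, drop every tuple it dominates."""
    remaining = list(tuples)
    skyline = []
    while remaining:
        # One scan finds a maximal: whenever a tuple dominates the current
        # candidate, it becomes the new candidate. By transitivity, nothing
        # scanned earlier can dominate the final candidate.
        cand = 0
        for j in range(1, len(remaining)):
            if dominates(remaining[j], remaining[cand]):
                cand = j
        best = remaining[cand]
        skyline.append(best)
        # A second scan eliminates the reported tuple and all tuples it dominates.
        remaining = [t for i, t in enumerate(remaining)
                     if i != cand and not dominates(best, t)]
    return skyline


db = [(1, 2, 3, 4), (1, 2, 3, 3), (1, 2, 3, 4), (2, 3, 3, 1)]
print(best_style_skyline(db))  # [(1, 2, 3, 4), (1, 2, 3, 4), (2, 3, 3, 1)]
```

Each round costs two scans of the remaining tuples, and there is one round per maximal. That matches the bounds on the slide: a single round when there is only one maximal, and n rounds when all tuples are equal and nothing is ever eliminated.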
9
What is the Number of Maximals?
We estimate this value by assuming certain nice
properties on the input:

Sparseness: no duplicate values over columns.
Independence: tuples are neither correlated nor
anti-correlated.

10
What is the Number of Maximals (cont)?
The estimated value $s_{k,n}$, for $k > 1$ and $n > 0$,
obeys the following recurrence relation:
$s_{k,n} = \frac{1}{n}\, s_{k-1,n} + s_{k,n-1}$
11
What is the Number of Maximals (cont)?
$s_{k,n} = \frac{1}{n}\, s_{k-1,n} + s_{k,n-1}$

The harmonic of n, for $n > 0$: $H_n = \sum_{i=1}^{n} \frac{1}{i}$

12
What is the Number of Maximals (cont)?
$s_{k,n} = \frac{1}{n}\, s_{k-1,n} + s_{k,n-1}$
It can be shown that $s_{k,n} = H_{k-1,n}$.
In [2], Godfrey has done a comprehensive study on
maximals.
13
Double Divide & Conquer (DDC) Algorithm
3) Maximals in B can never dominate maximals in
A, because the tuples were already sorted along
d0.
4) Call DDC on maximals(A) ∪ maximals(B) over the
reduced dimensions (d1 - dk).
14
Double Divide & Conquer (DDC) Algorithm
Running time:
Best case (only 1 maximal): $O(kn \log n)$
Worst case (e.g. all tuples are equal): $O(n \log^{k-2} n)$
15
Worst Cases

1) All tuples are equal. Number of maximals = n.

2) Anti-correlated data. Number of maximals = n.
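
Both worst cases are easy to reproduce on toy data. The example values below are made up for illustration (the slide's own examples were not preserved in the transcript), and larger values are assumed to be better.

```python
def dominates(t, s):
    # Assumption: larger is better in every column.
    return all(a >= b for a, b in zip(t, s)) and any(a > b for a, b in zip(t, s))


def skyline(tuples):
    """Plain quadratic skyline, enough for a small demonstration."""
    return [s for s in tuples if not any(dominates(t, s) for t in tuples)]


n = 6
all_equal = [(3, 3)] * n                              # every tuple identical
anti_correlated = [(i, n - 1 - i) for i in range(n)]  # one column rises as the other falls

print(len(skyline(all_equal)))        # 6: equal tuples never dominate each other
print(len(skyline(anti_correlated)))  # 6: gaining on one column always loses on the other
```

In both data sets no tuple is ever eliminated, so any elimination-based algorithm falls back to comparing every pair.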
16
Lessons Learnt

1) From BEST: eliminating tuples helps to reduce
comparisons (in non-worst cases).

2) From DDC: sorting and reducing the
dimensionality helps in reducing the number of
comparisons.
Further optimization: eliminate tuples as early
as possible.
17
Can we sort cleverly?
Assume normalized values (i.e. in 0..1) for all
columns. The entropy of tuple t is
$E(t) = \prod_{i=1}^{k} t_i$
i.e. the entropy of tuple t is simply the product of
all its k normalized column values.
18
Can we sort cleverly (contd)?
Sort by entropy (decreasing).

Observations:
1) A tuple with higher entropy can never be
dominated by one with lower entropy.

2) Tuples with higher entropies have a greater
chance to eliminate tuples with lower entropies.
19
Sort Filter Skyline (SFS) Algorithm (Godfrey et al. [4])
Window in main memory
Write overflows to a file
Entropy-sorted tuple stream
Drop any dominated tuples
After the 1st pass, use the overflow file as the input to
SFS.
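
The filter pass sketched on this slide can be simulated in memory. This is a simplified illustration, not the implementation from the cited papers: the overflow "file" is a Python list, the window size is a made-up parameter, and larger values are assumed to be better. The sort key uses the product-form entropy from the earlier slide, with the column sum added as a tie-breaker (our addition) so that a dominating tuple always precedes the tuples it dominates even when a column value is 0.

```python
from math import prod


def dominates(t, s):
    # Assumption: larger is better in every column.
    return all(a >= b for a, b in zip(t, s)) and any(a > b for a, b in zip(t, s))


def sfs_skyline(tuples, window_size=2):
    """Sort by decreasing entropy, then filter through a bounded window."""
    stream = sorted(tuples, key=lambda t: (prod(t), sum(t)), reverse=True)
    skyline = []
    while stream:
        window, overflow = [], []
        for t in stream:
            if any(dominates(w, t) for w in window):
                continue                # drop any dominated tuple
            if len(window) < window_size:
                window.append(t)        # nothing later in the stream can dominate t
            else:
                overflow.append(t)      # window full: defer t to the next pass
        skyline.extend(window)          # every window tuple is a skyline member
        stream = overflow               # the overflow is the input of the next pass
    return skyline


hotels = [(0.1, 0.1), (0.9, 0.2), (0.4, 0.4), (0.5, 0.5), (0.2, 0.9), (0.9, 0.1)]
print(sfs_skyline(hotels))
```

Because of the sort order, a tuple that enters the window is never removed from it, which is the main difference from a window-based scan over unsorted input.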
20
SFS Time Complexity
Best case (only 1 maximal): $O(n \log n + kn)$
Worst case (e.g. all tuples are equal): $O(n^2)$

Average case: $O(n \log n + kn)$
How?
Sorting: $O(n \log n)$ ✓
Maximal-to-maximal comparisons: $m(m-1)/2$, i.e.
$o(n)$ ✗
Non-maximal-to-maximal comparisons: $O(kn)$ ✓

Can we do better than this (at least in non-worst
cases)?
21
LESS Algorithm (Godfrey et al. [4])
The key idea is to eliminate while sorting. In the
first pass:
Window in main memory
Write sorted runs to files
EF Window
Unsorted tuple stream
Buffers for Quick Sort
Drop any dominated tuples
After the 1st pass, merge these sorted runs.
22
LESS Algorithm (contd)
In the final pass
Window in main memory
Write overflows to files
SF Window
Buffer Pool for sorting
Drop any dominated tuples
If there are overflows, the next step is the
same as in SFS.
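
The two phases of LESS can be mimicked in memory as a rough sketch. Everything here is a simplification under stated assumptions: the external sort with runs and buffer pool is replaced by a single in-memory sort, the window sizes are unbounded or made up, larger values are assumed to be better, and the policy of keeping the highest-entropy tuples in the EF window is our guess at a reasonable choice, not a detail taken from the slides.

```python
from math import prod


def dominates(t, s):
    # Assumption: larger is better in every column.
    return all(a >= b for a, b in zip(t, s)) and any(a > b for a, b in zip(t, s))


def sort_key(t):
    # Product-form entropy, with the column sum as a tie-breaker (our addition).
    return (prod(t), sum(t))


def less_style_skyline(tuples, ef_size=2):
    """Eliminate with a small EF window while reading, then sort and filter.
    Assumes ef_size >= 1."""
    ef_window, survivors = [], []
    for t in tuples:                      # first pass over the unsorted stream
        if any(dominates(e, t) for e in ef_window):
            continue                      # dropped before it is ever sorted
        survivors.append(t)
        if len(ef_window) < ef_size:
            ef_window.append(t)
        else:
            weakest = min(range(ef_size), key=lambda i: sort_key(ef_window[i]))
            if sort_key(t) > sort_key(ef_window[weakest]):
                ef_window[weakest] = t    # keep the strongest eliminators
    survivors.sort(key=sort_key, reverse=True)   # stands in for merging sorted runs
    skyline = []                          # SF window, unbounded in this sketch
    for t in survivors:                   # final pass: same filter as in SFS
        if not any(dominates(s, t) for s in skyline):
            skyline.append(t)
    return skyline


hotels = [(0.1, 0.1), (0.9, 0.2), (0.4, 0.4), (0.5, 0.5), (0.2, 0.9), (0.9, 0.1)]
print(less_style_skyline(hotels))
```

The EF window only ever removes dominated tuples, so the survivors always contain the full skyline; the benefit is that fewer tuples reach the sort.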
23
Time Complexity Comparison
Can we do better than this in worst cases?
24
The Zoo of Algorithms
There are many algorithms, each with its own USPs:
DDC, LDC, FLET, SDC, BNL, BEST, SFS, LESS.
LESS is the best skyline algorithm to our
knowledge.
In [4], Godfrey et al. have done a comprehensive
study on their running times for the best, average, and
worst cases and shown that LESS performs the best.
25
Improving the worst-case
Current focus of our research.
Worst-case complexity is $O(kn^2)$, i.e. we are
comparing every tuple against every other tuple.
Can we avoid this?
26
Incomparable Tuples
Assume the worst case, where tuples are completely
anti-correlated. Still, tuples can be divided
into regions (Lee et al. [5]).
Key idea: do not compare tuples in Region II and
Region III.
Conclusion

Skyline computation is increasingly becoming a
hot research field in the database community.

Interesting theoretical problem: easy to
understand, hard to solve efficiently.

Has a wide range of applications.

Many open research problems, including:
- Efficient worst-case computation.
- Lower bounds on computation.
- Lifting the niceness assumptions. E.g. what
happens to complexity when the data is dense and
correlated?
- Skyline of tuples with small ranges. E.g. boolean
values.

28
References
Thank You!

[1] Torlone and Ciaccia. Which Are My Preferred
Items? Workshop on Recommendation and
Personalization in eCommerce (RPEC), pp. 1-9.
Malaga, Spain (2002).
[2] Godfrey. Skyline Cardinality for Relational
Processing. Proceedings of the 3rd International
Symposium on Foundations of Information and
Knowledge Systems, pp. 78-97. Springer,
Wilhelminenberg Castle, Austria (2004).
[3] Chomicki, Godfrey, Gryz, Liang. Skyline with
Presorting. Proceedings of the 19th International
Conference on Data Engineering, pp. 717-719
(2003).
[4] Godfrey, Shipley, Gryz. Maximal Vector
Computation in Large Data Sets. Proceedings of
the 31st International Conference on Very Large
Data Bases, pp. 229-240. ACM, Trondheim, Norway
(2005).
[5] Lee, Zheng, Li and Lee. Approaching the Skyline
in Z Order. 33rd International Conference on Very
Large Data Bases, pp. 280-290. University of
Vienna, Austria (2007).