Title: Sample Database
1Introduction to Skyline Query Processing
- Romil Jain
2Introduction
Query: Find hotels that have a good balance of
stars, distance, and price.
3Formal Definition
In simple words, a Skyline is the set of all
non-dominated tuples
T: a set of n tuples, with k dimensions (columns, or attributes) for each tuple.
Skyline S of T is
$S = \{\, s \in T \mid \nexists\, t \in T : (\forall i \in 1..k,\ t_i \geq s_i) \land (\exists i \in 1..k,\ t_i > s_i) \,\}$
where $t_i$ and $s_i$ are the values of the $i$th column of tuples
$t$ and $s$ respectively.
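The definition translates almost directly into code. The following is a minimal sketch, assuming larger values are better on every column; the names `dominates` and `skyline` and the hotel data are illustrative and not part of the talk.

```python
def dominates(t, s):
    """True if t is at least as good as s on every column and strictly better on one."""
    return all(ti >= si for ti, si in zip(t, s)) and any(ti > si for ti, si in zip(t, s))

def skyline(tuples):
    """All tuples that no other tuple dominates, straight from the definition."""
    return [s for s in tuples if not any(dominates(t, s) for t in tuples)]

# Hypothetical hotels scored as (stars, closeness, cheapness); larger is better.
hotels = [(5, 1, 1), (3, 3, 3), (2, 2, 2), (1, 5, 4)]
print(skyline(hotels))
```

Only (2, 2, 2) is dropped here, because (3, 3, 3) beats it on every column.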
4Example
How many of you compared every tuple with every
other tuple?
- Database: T1 = (1,2,3,4), T2 = (1,2,3,3), T3 = (1,2,3,4), T4 = (2,3,3,1)
T1 dominates T2. T1 does not dominate T3 (they are equal). T1
and T4 are incomparable (remember this for later!).
What is the Skyline?
Answer: T1 = (1,2,3,4), T3 = (1,2,3,4), T4 = (2,3,3,1)
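The three relations named on this slide can be checked mechanically. This is a small sketch, assuming larger values are better; `relation` is a hypothetical helper introduced only for illustration.

```python
def dominates(t, s):
    """t is no worse than s everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(t, s)) and any(a > b for a, b in zip(t, s))

def relation(t, s):
    """Name the relationship between two tuples."""
    if dominates(t, s):
        return "dominates"
    if dominates(s, t):
        return "is dominated by"
    if t == s:
        return "equals"
    return "is incomparable with"

T1, T2, T3, T4 = (1, 2, 3, 4), (1, 2, 3, 3), (1, 2, 3, 4), (2, 3, 3, 1)
for name, other in (("T2", T2), ("T3", T3), ("T4", T4)):
    print("T1", relation(T1, other), name)
```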
5Brute Force Algorithm
- Compare every tuple with all other tuples on all k columns
- Return tuples that are not dominated by any tuple
- Clearly $\Theta(n^2)$ tuple comparisons
Can we do better (at least in a few cases)?
One obvious way is to stop early, i.e. stop
comparing as soon as the current tuple gets
dominated. The worst case is still $\Theta(n^2)$.
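To see what early stopping buys, the sketch below counts tuple-to-tuple comparisons with and without it. It assumes larger values are better; the function name `brute_force_skyline` and the `stop_early` flag are illustrative.

```python
def dominates(t, s):
    return all(a >= b for a, b in zip(t, s)) and any(a > b for a, b in zip(t, s))

def brute_force_skyline(tuples, stop_early=True):
    """Return (skyline, number of tuple-to-tuple comparisons)."""
    result, comparisons = [], 0
    for i, s in enumerate(tuples):
        dominated = False
        for j, t in enumerate(tuples):
            if i == j:
                continue
            comparisons += 1
            if dominates(t, s):
                dominated = True
                if stop_early:
                    break  # s is out; no need to look further
        if not dominated:
            result.append(s)
    return result, comparisons

data = [(1, 2, 3, 4), (1, 2, 3, 3), (1, 2, 3, 4), (2, 3, 3, 1)]
print(brute_force_skyline(data, stop_early=False))
print(brute_force_skyline(data, stop_early=True))
```

On the four-tuple example database the saving is small, and when no tuple is dominated there is no saving at all, which is why the worst case does not improve.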
6Make Life Easier
$T_1 \succ T_2$ means $T_1$ dominates $T_2$. $T_1 \sim T_2$ means
$T_1$ and $T_2$ are incomparable.
Transitivity: if $T_1 \succ T_2$ and $T_2 \succ T_3$, then $T_1 \succ T_3$.
Use this property to reduce comparisons.
7BEST Algorithm (Torlone et al. [1])
6) Report $T_j$ as a skyline member
8BEST Algorithm (cont)
- After the first round we have fewer tuples left
- Keep repeating BEST until all remaining tuples
have been exhausted
We reduce comparisons by eliminating tuples.
Running time:
- Best case (only 1 maximal): $\Theta(kn)$
- Worst case (all tuples are equal): $\Theta(kn^2)$
- Average case: $\Theta(kmn)$,
where $m$ is the number of maximals
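The round-by-round elimination can be sketched as follows. The individual steps of BEST are not spelled out here, so this is one plausible reading rather than the algorithm as published: each round finds one maximal, reports it, and removes everything it dominates. The name `best_skyline` is hypothetical, and larger values are assumed to be better.

```python
def dominates(t, s):
    return all(a >= b for a, b in zip(t, s)) and any(a > b for a, b in zip(t, s))

def best_skyline(tuples):
    remaining = list(tuples)
    skyline = []
    while remaining:
        candidate, kept = remaining[0], []
        for t in remaining[1:]:
            if dominates(candidate, t):
                continue                # t is eliminated for good
            if dominates(t, candidate):
                candidate = t           # the old candidate is eliminated
            else:
                kept.append(t)          # equal or incomparable: decide in a later round
        skyline.append(candidate)       # report this round's maximal
        # Tuples kept before a candidate switch may still be dominated by it.
        remaining = [t for t in kept if not dominates(candidate, t)]
    return skyline

data = [(1, 2, 3, 4), (1, 2, 3, 3), (1, 2, 3, 4), (2, 3, 3, 1)]
print(best_skyline(data))
```

With a single maximal the loop runs once; with n equal tuples nothing is ever eliminated and it runs n times, which matches the best and worst cases listed above.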
9What is the number of maximals?
We estimate this value by assuming certain nice
properties of the input:
- Sparseness: no duplicate values over columns
- Independence: tuples are not correlated or
anti-correlated
10What is the number of maximals (cont)?
The estimated value $s_{k,n}$ for $k \geq 1$ and $n > 0$
obeys the following recurrence relation:
$s_{k,n} = \frac{1}{n}\, s_{k-1,n} + s_{k,n-1}$
11What is the number of maximals (cont)?
- The harmonic of $n$, for $n > 0$: $H_n = \sum_{i=1}^{n} \frac{1}{i}$
12What is the number of maximals (cont)?
It can be shown that $s_{k,n} = H_{k-1,n}$.
In [2], Godfrey has done a comprehensive study on
maximals.
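The recurrence is easy to evaluate exactly. The sketch below makes two assumptions that are not stated on the slides: the base cases $s_{1,n} = 1$ and $s_{k,1} = 1$, and the generalized harmonic defined by $H_{0,n} = 1$ and $H_{k,n} = \sum_{i=1}^{n} H_{k-1,i}/i$. Under those assumptions the two quantities agree.

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def s(k, n):
    """Estimated number of maximals from the recurrence (assumed base cases)."""
    if k == 1 or n == 1:
        return Fraction(1)
    return Fraction(1, n) * s(k - 1, n) + s(k, n - 1)

@lru_cache(maxsize=None)
def harmonic(k, n):
    """k-th order harmonic of n (assumed definition, see the text above)."""
    if k == 0:
        return Fraction(1)
    return sum(harmonic(k - 1, i) / i for i in range(1, n + 1))

# For k = 2 the estimate is the ordinary harmonic number H_n.
print(s(2, 4), harmonic(1, 4))
print(float(s(4, 100)))  # estimated maximals for 100 tuples over 4 columns
```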
13Double Divide and Conquer Algorithm
3) Maximals in B can never dominate maximals in
A, because the tuples were already sorted along
$d_0$.
4) Call DDC on maximals(A) $\cup$ maximals(B) over the
reduced dimensions ($d_1 \ldots d_k$).
14Double Divide and Conquer Algorithm
Running time:
- Best case (only 1 maximal): $\Theta(kn \log n)$
- Worst case (e.g. all tuples are equal): $\Theta(n \log^{k-2} n)$
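Step 3 is the part that saves work, and it can be illustrated with a deliberately simplified divide and conquer. This is not DDC itself: the merge below filters B against A pairwise instead of recursing over the reduced dimensions, and the name `dc_skyline` is hypothetical. Larger values are assumed to be better.

```python
def dominates(t, s):
    return all(a >= b for a, b in zip(t, s)) and any(a > b for a, b in zip(t, s))

def dc_skyline(tuples):
    # Lexicographic descending sort, so d0 is non-increasing along the list.
    ordered = sorted(tuples, reverse=True)

    def solve(part):
        if len(part) <= 1:
            return part
        mid = len(part) // 2
        a, b = solve(part[:mid]), solve(part[mid:])
        # Nothing in b can dominate anything in a, so only b needs filtering.
        b = [t for t in b if not any(dominates(s, t) for s in a)]
        return a + b

    return solve(ordered)

data = [(1, 2, 3, 4), (1, 2, 3, 3), (1, 2, 3, 4), (2, 3, 3, 1)]
print(dc_skyline(data))
```

The one-directional check is safe because a tuple that dominates another is also lexicographically larger, so it always lands in the earlier half.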
15Worst Cases
1) All tuples are equal. Number of maximals = n.
2) Anti-correlated data. Number of maximals = n.
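Both worst cases are easy to reproduce on toy data. The two data sets below are made up for illustration and assume larger values are better: in the first every tuple is identical, in the second one column rises exactly as the other falls.

```python
def dominates(t, s):
    return all(a >= b for a, b in zip(t, s)) and any(a > b for a, b in zip(t, s))

def skyline(tuples):
    return [s for s in tuples if not any(dominates(t, s) for t in tuples)]

n = 6
all_equal = [(1, 1)] * n
anti_correlated = [(i, n - 1 - i) for i in range(n)]  # (0,5), (1,4), ..., (5,0)

# In both cases no tuple dominates any other, so every tuple is a maximal.
print(len(skyline(all_equal)), len(skyline(anti_correlated)))
```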
16Lessons Learnt
1) From BEST: eliminating tuples helps to reduce
comparisons (in non-worst cases).
2) From DDC: sorting and reducing the
dimensionality help in reducing the number of
comparisons.
Further optimization: eliminate tuples as early
as possible.
17Can we sort cleverly?
Assume normalized values (i.e. in 0..1) for all
columns. Entropy of tuple t:
$E(t) = \prod_{i=1}^{k} t_i$
i.e. the entropy of tuple t is simply the product of
all its normalized column values.
18Can we sort cleverly (contd)?
Sort by entropy (decreasing).
Observations:
1) A tuple with higher entropy can never be
dominated by one with lower entropy.
2) Tuples with higher entropies have a greater
chance of eliminating tuples with lower entropies.
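Observation 1 can be checked on a small sample. The sketch assumes values normalized to 0..1 with larger being better; `entropy_order` is a hypothetical helper. Ties in entropy are broken by the tuple itself, an added assumption that keeps the ordering safe when a column value is 0.

```python
from math import prod

def dominates(t, s):
    return all(a >= b for a, b in zip(t, s)) and any(a > b for a, b in zip(t, s))

def entropy(t):
    """Product of the normalized column values."""
    return prod(t)

def entropy_order(tuples):
    """Sort by decreasing entropy; ties are broken by the tuple itself."""
    return sorted(tuples, key=lambda t: (entropy(t), t), reverse=True)

def never_dominated_by_later(ordered):
    """True if no tuple is dominated by one that comes after it."""
    return not any(
        dominates(ordered[j], ordered[i])
        for i in range(len(ordered))
        for j in range(i + 1, len(ordered))
    )

data = [(0.2, 0.9), (0.5, 0.5), (0.1, 0.1), (0.9, 0.2), (0.4, 0.4)]
ordered = entropy_order(data)
print(ordered, never_dominated_by_later(ordered))
```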
19Sort Filter Skyline (SFS) Algorithm (Godfrey et al. [4])
- Input: entropy-sorted tuple stream
- Window in main memory
- Drop any dominated tuples
- Write overflows to a file
After the 1st pass, use the overflow as the input to
SFS.
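The window-and-overflow loop can be sketched in memory as follows. This is a simplified sketch: the overflow file is modeled as a Python list, `window_size` is an illustrative parameter, and values are assumed normalized to 0..1 with larger being better.

```python
from math import prod

def dominates(t, s):
    return all(a >= b for a, b in zip(t, s)) and any(a > b for a, b in zip(t, s))

def sfs(tuples, window_size=2):
    # Entropy-sorted stream; ties are broken by the tuple itself.
    stream = sorted(tuples, key=lambda t: (prod(t), t), reverse=True)
    skyline = []
    while stream:
        window, overflow = [], []
        for t in stream:
            if any(dominates(w, t) for w in window):
                continue                 # drop the dominated tuple
            if len(window) < window_size:
                window.append(t)         # nothing later can dominate it
            else:
                overflow.append(t)       # window is full: defer to the next pass
        skyline.extend(window)
        stream = overflow                # the overflow is the next pass's input
    return skyline

data = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.1, 0.1)]
print(sfs(data))
```

Because the stream is entropy-sorted, a tuple that enters the window is already known to be in the skyline, so the window never has to be revised during a pass.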
20SFS Time Complexity
- Best case (only 1 maximal): $\Theta(n \log n + kn)$
- Worst case (e.g. all tuples are equal): $\Theta(n^2)$
- Average case: $\Theta(n \log n + kn)$
- How?
- Sorting: $\Theta(n \log n)$ (counted)
- Maximal-to-maximal comparisons: $m(m-1)/2$, i.e. $o(n)$ (negligible)
- Non-maximal-to-maximal comparisons: $\Theta(kn)$ (counted)
Can we do better than this (at least in non-worst
cases)?
21LESS Algorithm (Godfrey et al. [5])
Key idea is to eliminate while sorting. In the
first pass:
- Input: unsorted tuple stream
- EF (elimination-filter) window in main memory, next to the buffers for Quick Sort
- Drop any dominated tuples
- Write sorted runs to files
After the 1st pass, merge these sorted runs.
22LESS Algorithm (contd)
In the final pass:
- SF (skyline-filter) window in main memory, next to the buffer pool for sorting
- Drop any dominated tuples
- Write overflows to files
If there are overflows, the next step is the
same as in SFS.
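The two passes can be sketched in memory as below. This is a heavily simplified illustration and not the published algorithm: sorted runs, merging, and overflow files are replaced by a single in-memory sort, and `less_sketch` and `ef_size` are hypothetical names. Values are assumed normalized to 0..1 with larger being better.

```python
from math import prod

def dominates(t, s):
    return all(a >= b for a, b in zip(t, s)) and any(a > b for a, b in zip(t, s))

def less_sketch(stream, ef_size=2):
    ef = []          # EF window: the few highest-entropy tuples seen so far
    survivors = []   # tuples that still have to be sorted
    for t in stream:
        if any(dominates(e, t) for e in ef):
            continue                      # eliminated before it is ever sorted
        survivors.append(t)
        ef.append(t)
        ef.sort(key=prod, reverse=True)
        del ef[ef_size:]                  # keep only the ef_size best filters
    # Final pass: entropy sort, then filter against the skyline found so far.
    survivors.sort(key=lambda t: (prod(t), t), reverse=True)
    skyline = []
    for t in survivors:
        if not any(dominates(s, t) for s in skyline):
            skyline.append(t)
    return skyline

data = [(0.4, 0.4), (0.5, 0.5), (0.1, 0.1), (0.9, 0.2), (0.3, 0.3), (0.2, 0.9)]
print(less_sketch(data))
```

The EF window only ever discards tuples that some input tuple dominates, so whichever tuples it happens to hold, the final filter still returns the full skyline.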
23Time Complexity Comparison
Can we do better than this in worst cases?
24The Zoo of Algorithms
There are many algorithms, each with its own strengths:
DDC, LDC, FLET, SDC, BNL, BEST, SFS, LESS
LESS is the best skyline algorithm to our
knowledge.
In [3], Godfrey et al. have done a comprehensive
study of their running times for best, average, and
worst cases and shown that LESS performs the best.
25Improving the worst-case
Current focus of our research.
Worst-case complexity is $\Theta(kn^2)$, i.e. we are
comparing every tuple against every other tuple.
Can we avoid this?
26Incomparable Tuples
Assume the worst case, where tuples are completely
anti-correlated. Still, tuples can be divided
into regions (Lee et al. [5]).
Key idea: do not compare tuples in Region II with
tuples in Region III.
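The saving from skipping region pairs can be sketched as follows. The region numbering is not defined in this transcript, so the sketch assumes a 2-D layout in which Regions II and III are the two off-diagonal quadrants around a pivot; `region`, `may_dominate`, `pairs_to_compare`, and the pivot value are all illustrative. Larger values are assumed to be better.

```python
from itertools import combinations

def dominates(t, s):
    return all(a >= b for a, b in zip(t, s)) and any(a > b for a, b in zip(t, s))

def region(t, pivot):
    """One bit per column: 1 if the value is at or above the pivot."""
    return tuple(int(ti >= pi) for ti, pi in zip(t, pivot))

def may_dominate(ra, rb):
    """A tuple in region ra can dominate one in rb only if ra >= rb on every bit."""
    return all(a >= b for a, b in zip(ra, rb))

def pairs_to_compare(tuples, pivot):
    """Unordered pairs that still need a dominance test."""
    regions = [region(t, pivot) for t in tuples]
    return [
        (i, j)
        for i, j in combinations(range(len(tuples)), 2)
        if may_dominate(regions[i], regions[j]) or may_dominate(regions[j], regions[i])
    ]

data = [(0.1, 0.9), (0.3, 0.7), (0.6, 0.4), (0.8, 0.2)]  # anti-correlated
pairs = pairs_to_compare(data, pivot=(0.5, 0.5))
print(len(pairs), "of", len(data) * (len(data) - 1) // 2, "pairs need comparing")
```

Pairs that straddle the two off-diagonal quadrants are skipped without looking at the tuples themselves, since each side is strictly better on a different column.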
27Conclusion
- Skyline computation is increasingly becoming a
hot research field in the database community.
- Interesting theoretical problem: easy to
understand, hard to solve efficiently.
- Has a wide range of applications.
- Many open research problems, including:
- Efficient worst-case computation.
- Lower bounds on computation.
- Lifting the niceness assumptions, e.g. what
happens to the complexity when the data is dense and
correlated?
- Skyline of tuples with small ranges, e.g. boolean
values.
28References
- Torlone and Ciaccia. Which Are My Preferred
Items? Workshop on Recommendation and
Personalization in eCommerce (RPEC), pp. 1-9.
Malaga, Spain (2002).
- Godfrey. Skyline Cardinality for Relational
Processing. Proceedings of the 3rd International
Symposium on Foundations of Information and
Knowledge Systems, pp. 78-97. Springer,
Wilhelminenberg Castle, Austria (2004).
- Chomicki, Godfrey, Gryz, Liang. Skyline with
Presorting. Proceedings of the 19th International
Conference on Data Engineering, pp. 717-719
(2003).
- Godfrey, Shipley, Gryz. Maximal Vector
Computation in Large Data Sets. Proceedings of
the 31st International Conference on Very Large
Data Bases, pp. 229-240. ACM, Trondheim, Norway
(2005).
- Lee, Zheng, Li and Lee. Approaching the Skyline
in Z Order. 33rd International Conference on Very
Large Data Bases, pp. 280-290. University of
Vienna, Austria (2007).
Thank You!