Title: SUBSKY: Efficient Computation of Skylines in Subspaces
1SUBSKY: Efficient Computation of Skylines in Subspaces
- Authors: Yufei Tao, Xiaokui Xiao, and Jian Pei
- Conference: ICDE 2006
- Presenter: Kamiru
- Supervisor: Dr. Nikos Mamoulis
2Skyline Queries
- Given a set of d-dimensional points, a point p dominates another point p' if
  - $p_i \le p'_i$ for all $i \in \{1, \dots, d\}$, and
  - $p_j < p'_j$ for at least one $j \in \{1, \dots, d\}$
- Skyline queries aim to find the points that are not dominated by any other point
For the NBA database, a low turnover rate and a low foul rate are two important factors for a defensive player.
[Figure: players plotted by foul rate (x-axis, 0 to 1) and turnover rate (y-axis, 0 to 1), with the best point marked]
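The dominance test and the skyline definition on this slide can be sketched in a few lines of Python. This is a brute-force illustration only, not one of the algorithms discussed later; the function names and the sample (foul rate, turnover rate) values are made up for the example, and smaller values are assumed to be better on every dimension.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """True if p is no worse than q everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: List[Point]) -> List[Point]:
    """Brute-force skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (foul rate, turnover rate) pairs, not taken from the NBA data.
players = [(0.2, 0.8), (0.4, 0.4), (0.7, 0.3), (0.6, 0.6), (0.9, 0.9)]
result = skyline(players)
```

The quadratic scan is only meant to make the definition concrete; it compares every pair of points.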
3Applications of Skyline Queries
- Find a good hotel for me according to distance and price
Hotel D cannot be a good hotel for this user, since its price is higher and its distance is farther than those of other hotels.
[Figure: hotels A, B, C, and D plotted by distance and price; price labels 500, 1000, 1500, and 2000]
4Alternative applications of Skyline Queries - i
- Some top-k queries can be answered with skyline queries
  - A top-k query retrieves the k tuples in P with the highest scores according to a scoring function g
  - g must be a monotonic function, e.g. g(p) = p.x + p.y
5Alternative applications of Skyline Queries - i
- Please help me find the top-2 NBA players according to the sum of their points and assists in the 2007-2008 season
The values are shown at the top-right corner of each player photo.
Top-2 results must be in the 2-skyband.
[Figure: players plotted by points and assists; players outside the 2-skyband are marked PRUNED]
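The claim that top-k results lie in the k-skyband can be illustrated with a small sketch. It assumes that the k-skyband consists of the points dominated by fewer than k other points, that larger values are better here (points and assists), and that the score is the sum g(p) = p.x + p.y; the function names and the statistics are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def dominates_max(p: Point, q: Point) -> bool:
    """Dominance when larger values are better on every dimension."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def k_skyband(points: List[Point], k: int) -> List[Point]:
    """Points that are dominated by fewer than k other points."""
    return [p for p in points
            if sum(dominates_max(q, p) for q in points) < k]


def top_k_by_sum(points: List[Point], k: int) -> List[Point]:
    """Top-k under g(p) = p.x + p.y, evaluated only on the k-skyband."""
    candidates = k_skyband(points, k)
    return sorted(candidates, key=lambda p: p[0] + p[1], reverse=True)[:k]


# Hypothetical (points, assists) values; not real season statistics.
stats = [(28.0, 7.0), (30.0, 4.0), (21.0, 11.0), (19.0, 4.0), (25.0, 6.0)]
best_two = top_k_by_sum(stats, 2)
```

Because g is monotonic, a point dominated by k or more others can never reach the top k, which is why ranking only the skyband candidates gives the same answer as ranking all points.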
6Alternative applications of Skyline Queries - ii
- Another interesting measure is the dominating count (DC)
  - The DC of a point is the number of points it dominates
Ex: find the top-2 dominating players in the NBA database according to turnover rate and foul rate.
[Figure: players plotted by foul rate (x-axis, 0 to 1) and turnover rate (y-axis, 0 to 1), labeled with DC values 1, 1, 2, 4, and 0; the best point is marked]
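A dominating count query can be sketched as follows, assuming the DC of a point is the number of points it dominates and that smaller values are better on both dimensions. The function names and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """True if p is no worse than q everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def dominating_count(p: Point, points: List[Point]) -> int:
    """Number of points in the dataset that p dominates."""
    return sum(dominates(p, q) for q in points)


def top_k_dominating(points: List[Point], k: int) -> List[Tuple[Point, int]]:
    """The k points with the largest dominating counts, with their counts."""
    scored = [(p, dominating_count(p, points)) for p in points]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical (foul rate, turnover rate) pairs.
players = [(0.1, 0.9), (0.3, 0.3), (0.5, 0.6), (0.95, 0.2), (0.9, 0.8)]
top_two = top_k_dominating(players, 2)
```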
7Skyline Computations
- Two categories of skyline computation
  - Computing from scratch (no index)
  - Index-based
- Computing from scratch (no index)
  - Advantages
    - No pre-computation is needed
    - No index has to be updated when the data change
  - Drawbacks
    - Every query must be computed from scratch
    - The entire dataset must be scanned at least once
8Skyline Computations
- Index-based
  - Once the index is built, it can be used many times
  - Lower query cost, obtained by searching an appropriate structure
    - B-tree
    - R-tree
- Since all of us are database people, (I hope) we prefer method 2
9Related works
- Computing from scratch (no index)
  - Block nested loop
  - Sort filter skyline
  - Divide and conquer
  - Bitmap
  - Linear elimination sort for skyline
10Related works
- Index-based
  - B-tree approach
    - Index
  - R-tree approach
    - Nearest neighbor (NN)
    - Branch and bound skyline (BBS)
- BBS has been proven to be I/O optimal: no algorithm based on R-trees accesses fewer disk pages
11Related works
A point p is added to list i if its smallest coordinate is on dimension i.
- List x: p5 (0.1), p6 (0.25), p2 (0.3), p7 (0.6)
- List y: p4 (0.1), p1 (0.2), p3 (0.3), p8 (0.6)
1) Ssky = {p5}
2) Ssky = {p5, p4}
3) Ssky = {p5, p4, p1}
- All remaining elements in list x are pruned by p1, since both coordinates of p6 are larger than those of p1
- For the same reason, all remaining elements in list y are pruned by p1 too
[Figure: points p1 to p8 in the x-y plane, with the best point marked]
12Related works
[Figure: points p1 to p8 indexed by an R-tree with leaf nodes N1 to N4 and intermediate nodes M1 and M2; the dominant region of p1 is shaded and the best point is marked]
- HNN = {p1, p2, N2, M2}
- p1 is the first NN object from the best point
- The dominant region of p1 is shown in grey
- p2 is pruned by the dominant region
- Expand N2
13SUBSKY
- In the NBA database, each player has more than 10 different attributes
- A skyline query may be interested in only some of the attributes
14SUBSKY
- Build one R-tree and run BBS
  - BBS is an I/O optimal algorithm based on an R-tree index, but it is optimized for a fixed set of dimensions
- Build R-trees for all elements in the power set of dimensions
  - Huge storage space
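To give a feel for the storage cost of the second option, the sketch below enumerates the non-empty subsets of d dimensions. The value d = 10 is only an illustrative choice, and excluding the empty subset is an assumption of this example.

```python
from itertools import combinations
from typing import List, Tuple


def subspaces(d: int) -> List[Tuple[int, ...]]:
    """All non-empty subsets of the dimensions 0 .. d-1."""
    return [combo
            for size in range(1, d + 1)
            for combo in combinations(range(d), size)]


# Each subspace would need its own R-tree under the power-set approach.
count_10 = len(subspaces(10))
```

The count grows as 2^d - 1, so one extra attribute roughly doubles the number of indexes to build and maintain.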
15SUBSKY for uniform data
- Anchor point Ac: the maximal corner of the data space, having the maximum coordinate on all dimensions
$f(p) = \max_{i=1}^{d} (1 - p_i)$
$f_{sky}(p_{sky}) = \min_{i=1}^{d} (1 - p_{sky,i})$
No point p satisfying $f(p) < f_{sky}(p_{sky})$ can belong to the skyline.
A similar result exists for the skyline of any subspace.
[Figure: unit square with anchor Ac at the maximal corner and the best point marked; f(p1), f(p2), fsky(psky), and the pruning region of psky are shown for points p1, p2, and psky]
16SUBSKY for uniform data
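- Skyline queries apply only to the relevant dimensions SUB
  - $f_{sky}^{SUB}(p_{sky}) = \min_{i \in SUB} (1 - p_{sky,i})$
- The minimum over SUB is at least the minimum over all d dimensions, so $f(p) < f_{sky}(p_{sky})$ implies $f(p) < f_{sky}^{SUB}(p_{sky})$
- No point p satisfying $f(p) < f_{sky}^{SUB}(p_{sky})$ can belong to the skyline of SUB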
17SUBSKY for uniform data
- Assume that our skyline query is interested in dimensions x and y only (fsky is computed over these two dimensions)
- First, we sort the data by f(pi) in descending order
  - p3, p4, p1, p2, p5
- Ssky = {p3}, fsky(p3) = 0.5 = min(1 - 0.5, 1 - 0.3)
  - U = 0.5 (largest fsky value in Ssky)
- Ssky = {p3, p4}, fsky(p4) = 0.1
  - U = 0.5
- Ssky = {p1, p4}, fsky(p1) = 0.8
  - p3 is removed when p1 is added, since it is dominated by p1
  - U = 0.8
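The scan walked through on this slide can be sketched as follows. This is a minimal in-memory illustration under several assumptions: coordinates lie in [0, 1], smaller values are better, the anchor is (1, ..., 1), points are visited in descending order of f, and the scan stops once f(p) drops below U. The function names and the sample points are hypothetical, and a sorted list stands in for whatever index would hold the f values.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point, sub: Sequence[int]) -> bool:
    """p dominates q on the dimensions in sub (smaller is better)."""
    return all(p[i] <= q[i] for i in sub) and any(p[i] < q[i] for i in sub)


def f(p: Point) -> float:
    """1D key of p: largest coordinate difference to the anchor (1, ..., 1)."""
    return max(1.0 - x for x in p)


def f_sky(p: Point, sub: Sequence[int]) -> float:
    """Pruning threshold contributed by a skyline point, over sub only."""
    return min(1.0 - p[i] for i in sub)


def subsky_uniform(points: List[Point], sub: Sequence[int]) -> List[Point]:
    sky: List[Point] = []
    u = float("-inf")  # largest f_sky value among current skyline points
    for p in sorted(points, key=f, reverse=True):
        if f(p) < u:
            break  # p and every later point are dominated in sub
        if any(dominates(s, p, sub) for s in sky):
            continue
        sky = [s for s in sky if not dominates(p, s, sub)]
        sky.append(p)
        u = max(f_sky(s, sub) for s in sky)
    return sky


# Hypothetical 3D points; the query uses dimensions 0 and 1 only.
pts = [(0.5, 0.3, 0.9), (0.9, 0.1, 0.2), (0.2, 0.4, 0.7),
       (0.6, 0.7, 0.1), (0.8, 0.8, 0.95)]
sub_skyline = subsky_uniform(pts, sub=(0, 1))
```

Note that f is computed over all dimensions while f_sky uses only the queried ones, which is what lets a single sorted order serve every subspace.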
18Analysis
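- Assume that we have 15D uniformly distributed objects with cardinality 100k, and we want to retrieve the skyline in a subspace SUB containing any two dimensions.
The chance of finding an object in the area $(\lambda, \lambda)$, where $\lambda = 0.001$, is stated to be greater than 90%; the probability that the area contains no object is $(1 - \lambda^2)^{100k}$. Therefore, $f_{sky}(p) \ge 1 - \lambda = 0.999$. The volume evaluates to $0.999^{15} \approx 98.5\%$, that is, we only need to access 1.5% of the dataset.
[Figure: unit square with the area of side $\lambda$ marked]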
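The volume figure on this slide can be checked numerically. The sketch assumes uniform data in the unit cube, so the fraction of points with f(p) below a threshold t equals the volume of the region where every coordinate exceeds 1 - t, which is t to the power d. The function name is hypothetical.

```python
def pruned_fraction(threshold: float, d: int) -> float:
    """Fraction of a uniform d-dimensional unit cube where f(p) < threshold.

    f(p) < threshold means every coordinate is larger than 1 - threshold,
    so the region is a cube of side `threshold`.
    """
    return threshold ** d


fraction = pruned_fraction(0.999, 15)  # share of the data that can be skipped
accessed = 1.0 - fraction              # share that still has to be read
```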
19General SUBSKY
- In practice, data are usually clustered
- If the data are clustered, a single anchor point cannot be expected to give enough pruning power
[Figure: clustered data with anchors Ac and A1, a skyline point psky, and the best point marked]
20General SUBSKY
- Anchors for different clusters
[Figure: clusters s1 to s4 with anchors Ac, A1, and A2, and a skyline point psky]
- Two questions
- How to find the anchors?
- How to assign points to anchors?
21Finding the Anchors
- First, let us see what a perfect anchor of a point p is
  - If p is assigned to a perfect anchor A, then p can be pruned by any skyline point dominating p
Any point on the line drawn in the figure is a perfect anchor of point p.
[Figure: point p with its anti-dominant region, the major perpendicular plane, anchors A1, A2, and A3, and the line of perfect anchors of p]
22Finding the Anchors
Major perpendicular plane
- For each point, find its projection onto the plane
  - Ex: p1, p2
- Partition the projected points into m clusters using the k-means algorithm, and formulate an anchor for each cluster
[Figure: points p1 and p2 with their projections onto the major perpendicular plane, and a good anchor]
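The projection and clustering steps can be sketched as below. The sketch assumes that the major perpendicular plane is the plane through the maximal corner (1, ..., 1) perpendicular to the main diagonal, and it uses plain Lloyd iterations with a deterministic seeding chosen only to keep the example reproducible. All names and sample points are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def project(p: Sequence[float]) -> Point:
    """Orthogonal projection onto the plane sum(x) = d (assumed definition)."""
    shift = sum(p) / len(p) - 1.0
    return tuple(x - shift for x in p)


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], m: int, iters: int = 20) -> List[int]:
    """Plain Lloyd iterations; returns the cluster label of each point."""
    # Seed with evenly spaced input points (an arbitrary, deterministic choice).
    centers = [points[(j * len(points)) // m] for j in range(m)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(m), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        for j in range(m):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


data = [(0.1, 0.9), (0.2, 0.8), (0.9, 0.1), (0.8, 0.3)]
projected = [project(p) for p in data]
labels = kmeans(projected, m=2)
```

Points lying on the same line parallel to the diagonal project to the same location, so clustering in the projected space groups points that could share a good anchor.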
23Finding the Anchors
- How to decide an anchor for a cluster?
  - Blue points are assigned to cluster S. How can we decide the anchor A for S?
- Obtain point B, whose coordinate on each dimension equals the lowest coordinate of the points in S on this axis in their original space
- Then, the algorithm computes the smallest square with B as one corner that covers all points in S; the anchor A is the corner of this square opposite to B
[Figure: cluster S (blue points) with point B and anchor A]
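A minimal sketch of this construction follows, assuming that the anchor is the corner opposite to B of the smallest axis-parallel square (hyper-cube in higher dimensions) that has B as a corner and covers the cluster. The function name and the sample cluster are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def cluster_anchor(cluster: List[Point]) -> Point:
    """Anchor of a cluster: corner opposite to B of the covering square."""
    d = len(cluster[0])
    # B takes the lowest coordinate of the cluster on every axis.
    b = tuple(min(p[i] for p in cluster) for i in range(d))
    # The square must span the widest extent of the cluster on any axis.
    side = max(max(p[i] for p in cluster) - b[i] for i in range(d))
    return tuple(x + side for x in b)


# Hypothetical cluster in the original 2D space.
s = [(0.1, 0.2), (0.3, 0.5), (0.2, 0.3)]
anchor = cluster_anchor(s)
```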
24Assigning Points to Anchors
- A naïve way is to assign each point to its closest anchor point in the major perpendicular plane (projected space)
  - It does not directly quantify the benefit of an assignment
25Assigning Points to Anchors
- In order to assign a point to a good anchor, the paper introduces a new measure called the effective region (ER)
All points in the yellow region (ER) produce a pruning region with respect to Ac that covers p.
If the ER-volume of p is larger, then p has a higher chance of being pruned.
[Figure: point p with its ER (yellow), and the pruning regions of p1 and p2]
26Assigning Points to Anchors
[Figure: anchor A, point p, points p1 and p2, and the ER of p, shown in two panels]
27Assigning Points to Anchors
- The pruning volume size of a point p with respect to an anchor point Aj is
  - $\prod_{i=1}^{d} \max(0, A_{j,i} - L_\infty(p, A_j))$
- Therefore, assign a point p to the anchor Aj that produces the largest pruning volume size
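The assignment rule can be sketched as follows, reading the formula on this slide as a product over the d dimensions with the L-infinity distance between p and the anchor. The two anchors are the ones used in the query example; the data points and function names are hypothetical.

```python
from math import prod  # available from Python 3.8
from typing import List, Sequence


def l_inf(p: Sequence[float], a: Sequence[float]) -> float:
    """L-infinity distance between a point and an anchor."""
    return max(abs(x - y) for x, y in zip(p, a))


def er_volume(p: Sequence[float], anchor: Sequence[float]) -> float:
    """Pruning volume of p with respect to one anchor."""
    r = l_inf(p, anchor)
    return prod(max(0.0, a_i - r) for a_i in anchor)


def assign(p: Sequence[float], anchors: List[Sequence[float]]) -> int:
    """Index of the anchor that gives p the largest pruning volume."""
    return max(range(len(anchors)), key=lambda j: er_volume(p, anchors[j]))


anchors = [(1.0, 1.0, 1.0), (1.0, 1.0, 0.8)]
p = (0.5, 0.3, 0.7)  # hypothetical point
q = (0.9, 0.9, 0.6)  # hypothetical point
choice_p = assign(p, anchors)
choice_q = assign(q, anchors)
```

Here p is far from both anchors on its second coordinate, so the anchor with the larger coordinates wins, while q sits close to the second anchor and gets a larger volume there.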
28Query example
- We use the same example as in the previous slides
- Assume that we have two anchors: one is Ac, and the other, A, is found by k-means (m = 1)
  - Ac = (1, 1, 1) and A = (1, 1, 0.8)
- First, we calculate the ER volume of each data point with respect to Ac and A
Unit: $10^{-3}$
29Query example
- Sorted lists by f
  - Ac: p4, p1, p2, p5
  - A: p3
- Ssky = {p4, p1}, fsky(p1) = 0.8
- U = 0.8
30Experiments
- 3 real datasets: NBA, Household, and Color
- 2 synthetic datasets (10D)
  - Uniform
  - Clustered
    - 10 cluster centroids
    - For each centroid, N/10 points are generated, whose coordinate on each axis follows a Gaussian distribution with variance 0.05 and a mean equal to the corresponding coordinate of the centroid
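A generator for the clustered dataset described above might look like the sketch below. The slide does not say how the centroids are chosen or whether coordinates are clipped to the data space, so drawing centroids uniformly from the unit cube and leaving values unclipped are assumptions of this example, as are the function name and the seed parameter.

```python
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def clustered_dataset(n: int, d: int = 10, clusters: int = 10,
                      variance: float = 0.05, seed: int = 0) -> List[Point]:
    """n points around `clusters` centroids with Gaussian spread per axis."""
    rng = random.Random(seed)
    sigma = variance ** 0.5  # random.gauss expects a standard deviation
    # Assumption: centroids are drawn uniformly from the unit cube.
    centroids = [[rng.random() for _ in range(d)] for _ in range(clusters)]
    per_cluster = n // clusters
    return [tuple(rng.gauss(c_i, sigma) for c_i in centroid)
            for centroid in centroids
            for _ in range(per_cluster)]


sample = clustered_dataset(1000)
```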
31Experiments
32Experiments
33Experiments
3D subspaces, full-space dimensionality is 10
3D subspaces, 1 million cardinality
34Conclusion
- The core of SUBSKY is a transformation that converts multi-dimensional data into 1D values
- It shows better performance than an I/O optimal algorithm in the subspace skyline problem
- Some continuous monitoring cases are worth investigating
  - How to adapt the set of anchor points if the data are updated rapidly
  - The f values could be stored in another index structure to support fast updates
35Assigning Points to Anchors
- Therefore, we have two ways to assign points to the anchors
  - Assign each point to its closest anchor point in the major perpendicular plane (projected space)
  - Assign each point to the anchor with the largest ER-volume in the original space
- The second approach is better than the first, because the ER-volume directly quantifies the benefit of an assignment