SUBSKY: Efficient Computation of Skylines in Subspaces


1
SUBSKY: Efficient Computation of Skylines in Subspaces

Authors: Yufei Tao, Xiaokui Xiao, and Jian Pei
Conference: ICDE 2006
Presenter: Kamiru
Supervisor: Dr. Nikos Mamoulis

2
Skyline Queries

Given a set of d-dimensional points, a point p dominates another point p' if
$p_i \le p'_i$ for all $i \in \{1, \dots, d\}$, and
$p_j < p'_j$ for at least one $j \in \{1, \dots, d\}$.
A skyline query finds the points that are not dominated by any other point.
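
The definition above can be turned into a few lines of code. The following is a minimal sketch, assuming smaller values are better on every dimension; the function names and the sample (foul rate, turnover rate) pairs are hypothetical and not taken from the slides.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """True if p dominates q: no worse everywhere, strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: List[Point]) -> List[Point]:
    """Quadratic-time skyline: keep each point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (foul rate, turnover rate) pairs, not data from the slides.
players = [(0.2, 0.6), (0.4, 0.3), (0.7, 0.1), (0.5, 0.5), (0.8, 0.8)]
print(skyline(players))
```

This brute-force version compares every pair of points, so it only illustrates the definition; the algorithms discussed later exist precisely to avoid that cost.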

For the NBA database, a low turnover rate and a low foul rate are two important factors for a defensive player.
[Figure: players plotted by foul rate and turnover rate, both from 0 to 1, with the best point marked.]
3
Applications of Skyline Queries

Find a good hotel for me according to distance and price.

Hotel D cannot be a good hotel for this user, since its price is higher and its distance is farther than those of the other hotels.
[Figure: hotels A, B, C, and D, labeled with prices 500, 1000, 1500, and 2000.]
4
Alternative applications of Skyline Queries - i

Some top-k queries can be answered with skyline queries.
A top-k query retrieves the k tuples in P with the highest scores according to a scoring function g,
where g must be a monotonic function, e.g., g(p) = p.x + p.y

5
Alternative applications of Skyline Queries - i

Please help me find the top-2 NBA players according to the sum of their points and assists in the 2007-2008 season.

[Figure: player photos plotted by points and assists, with each player's values shown at the top-right corner of the photo; part of the plot is marked PRUNED.]
The top-2 results must be in the top-2 skyband.
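
The link between top-k queries and the skyband can be illustrated with a small sketch. It assumes that larger values are better here (points and assists), that the k-skyband is the set of points dominated by fewer than k other points, and that the score is the sum of the coordinates; the data and function names are hypothetical.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, ...]


def dominates_max(p: Point, q: Point) -> bool:
    """Dominance when larger values are better."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def k_skyband(points: List[Point], k: int) -> List[Point]:
    """Points dominated by fewer than k other points."""
    return [p for p in points
            if sum(dominates_max(q, p) for q in points) < k]


def top_k(points: List[Point], k: int, score: Callable[[Point], float]) -> List[Point]:
    """The k points with the highest score."""
    return sorted(points, key=score, reverse=True)[:k]


# Hypothetical (points, assists) pairs, not data from the slides.
stats = [(28, 5), (20, 11), (15, 9), (25, 7), (10, 3), (18, 10)]
band = k_skyband(stats, 2)
best = top_k(stats, 2, score=sum)
print("2-skyband:", band)
print("top-2 by sum:", best)
```

Because the score is monotonic, a point dominated by k or more others always has at least k points scoring no lower than it, which is why the search can be restricted to the k-skyband.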
6
Alternative applications of Skyline Queries - ii

Another interesting measure is the dominating count (DC).
The DC of a point is the number of points that it dominates.

Ex.: find the top-2 dominating players in the NBA database according to turnover rate and foul rate.
[Figure: players plotted by foul rate and turnover rate, both from 0 to 1, labeled with dominating counts 1, 1, 2, 4, and 0; the best point is marked.]
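
A dominating-count ranking can be sketched as follows, assuming the DC of a point is the number of other points it dominates and that smaller values are better. The sample (foul rate, turnover rate) pairs are hypothetical and do not reproduce the counts in the figure.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """Dominance when smaller values are better."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def dominating_counts(points: List[Point]) -> Dict[Point, int]:
    """Map each point to the number of points it dominates."""
    return {p: sum(dominates(p, q) for q in points) for p in points}


def top_k_dominating(points: List[Point], k: int) -> List[Point]:
    """The k points with the largest dominating count."""
    dc = dominating_counts(points)
    return sorted(points, key=lambda p: dc[p], reverse=True)[:k]


# Hypothetical (foul rate, turnover rate) pairs.
players = [(0.1, 0.9), (0.3, 0.4), (0.5, 0.2), (0.6, 0.6), (0.8, 0.3), (0.9, 0.95)]
print(dominating_counts(players))
print(top_k_dominating(players, 2))
```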
7
Skyline Computations

Two categories of skyline computation:
1. Computing from scratch (no index)
2. Relying on an index
Computing from scratch (no index)
Advantages:
No pre-computation is needed
No index has to be updated when the data change
Drawbacks:
Every query must be computed from scratch
The entire dataset must be scanned at least once

8
Skyline Computations

Relying on an index
Once the index is built, it can be used many times
Query cost is lower because the search is performed on an appropriate structure:
B-tree
R-tree
Since all of us are database people, (I hope) we prefer method 2.

9
Related works

Computing from scratch (no index)
Block nested loop
Sort filter skyline
Divide and conquer
Bitmap
Linear elimination sort for skyline

10
Related works

Relying on an index
B-tree approach
Index
R-tree approach
Nearest neighbor (NN)
Branch and bound skyline (BBS)
BBS has been proved to be I/O optimal: it accesses no more disk pages than any other algorithm based on R-trees.

11
Related works

Index

Point p is added to list i if its smallest coordinate is on dimension i.
List x: p5 (0.1), p6 (0.25), p2 (0.3), p7 (0.6)
List y: p4 (0.1), p1 (0.2), p3 (0.3), p8 (0.6)
[Figure: points p1 to p8 plotted on the x and y axes, with the best point marked.]
1) Ssky = {p5}
2) Ssky = {p5, p4}
3) Ssky = {p5, p4, p1}

All remaining elements in List x are pruned by p1, since both coordinates of p6 are larger than those of p1.
For the same reason, all remaining elements in List y are pruned by p1 too.
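
The list-based pruning above can be sketched in code. This is a simplified illustration in the spirit of the Index method, not the original algorithm: it merges the per-dimension lists in ascending order of the smallest coordinate instead of processing them in batches. The coordinates are hypothetical, chosen only to be consistent with the two lists on the slide.

```python
import heapq
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def index_skyline(points: List[Point]) -> List[Point]:
    d = len(points[0])
    # List i holds the points whose smallest coordinate is on dimension i,
    # keyed and sorted by that smallest coordinate.
    lists = [[] for _ in range(d)]
    for p in points:
        i = min(range(d), key=lambda k: p[k])
        lists[i].append((p[i], p))
    for lst in lists:
        lst.sort()

    sky: List[Point] = []
    for m, p in heapq.merge(*lists):
        # If a skyline point is below m on every dimension, it dominates
        # p and every point still waiting in the lists, so we can stop.
        if any(max(s) < m for s in sky):
            break
        if any(dominates(s, p) for s in sky):
            continue
        sky = [s for s in sky if not dominates(p, s)]
        sky.append(p)
    return sky


# Hypothetical coordinates for p1..p8 (assumed, not read from the figure).
pts = {
    "p1": (0.22, 0.2), "p2": (0.3, 0.5), "p3": (0.6, 0.3), "p4": (0.9, 0.1),
    "p5": (0.1, 0.9), "p6": (0.25, 0.7), "p7": (0.6, 0.8), "p8": (0.9, 0.6),
}
print(index_skyline(list(pts.values())))
```

With these assumed coordinates the scan stops at p6, because p1 is smaller than 0.25 on both dimensions.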

12
Related works

[Figure: R-tree over points p1 to p8 with nodes N1 to N4, M1, and M2; the dominant region of p1 is shown in grey.]

HNN: p1, p2, N2, M2
p1 is the first nearest-neighbor (NN) object from the best point.
The dominant region of p1 is shown in grey.

p2 is pruned because it lies in the dominant region.
Expand N2.

13
SUBSKY

In the NBA database, each player has more than 10 different attributes.
A skyline query may be interested in only some of these attributes.

14
SUBSKY

Build one R-tree and run BBS
BBS is an I/O optimal algorithm based on the R-tree index, but it is optimized for a fixed set of dimensions
Build R-trees for all elements in the power set of dimensions
Huge storage space

15
SUBSKY for uniform data

Anchor point Ac: the maximal corner of the data space, which has the maximum coordinate value on all dimensions.

$f(p) = \max_{i=1}^{d} (1 - p_i)$
$f_{sky}(p_{sky}) = \min_{i=1}^{d} (1 - p_{sky,i})$
[Figure: points p1, p2, and psky on the x and y axes (0 to 1), showing Ac, f(p1), f(p2), fsky(psky), the pruning region of psky, and the best point.]
No point p satisfying $f(p) < f_{sky}(p_{sky})$ can belong to the skyline.
A similar result exists for the skyline of any subspace.
16
SUBSKY for uniform data

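Skyline queries apply only to the relevant dimensions SUB:
$f_{sky}(p_{sky}) = \min_{i \in SUB} (1 - p_{sky,i})$
Then, with $f'_{sky}(p_{sky})$ denoting the minimum over all d dimensions,
$f(p) < f'_{sky}(p_{sky}) \le f_{sky}(p_{sky})$
No point p satisfying the above inequality can belong to the skyline of SUB.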
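
The reason the pruning rule holds can be written out in one chain of inequalities. This is a sketch of the argument, assuming (as on the slides) that smaller coordinates are better, that the data lie in the unit space, and that $f_{sky}$ is taken over the dimensions in SUB:

```latex
\begin{align*}
\forall i \in SUB:\quad
1 - p_i \;\le\; \max_{k=1}^{d} (1 - p_k) \;=\; f(p)
\;<\; f_{sky}(p_{sky}) \;=\; \min_{k \in SUB} (1 - p_{sky,k})
\;\le\; 1 - p_{sky,i}
\end{align*}
```

Hence $p_{sky,i} < p_i$ on every dimension of SUB, so $p_{sky}$ dominates $p$ in SUB.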

17
SUBSKY for uniform data

Assume that our skyline query is interested in dimensions x and y only.
First, we sort the data by $f(p_i)$:
p3, p4, p1, p2, p5
$S_{sky} = \{p_3\}$, $f_{sky}(p_3) = 0.5 = \min(1 - 0.5, 1 - 0.3)$
$U = 0.5$ (largest $f_{sky}$ value in $S_{sky}$)
$S_{sky} = \{p_3, p_4\}$, $f_{sky}(p_4) = 0.1$
$U = 0.5$
$S_{sky} = \{p_1, p_4\}$, $f_{sky}(p_1) = 0.8$
p3 is removed when p1 is added, since it is dominated by p1
$U = 0.8$
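
The scan illustrated above can be sketched as a short program. This is a minimal in-memory illustration under stated assumptions, not the paper's disk-based implementation: points lie in the unit space, smaller values are better, the points are visited in descending order of f, and the scan stops once f drops below U. The function names and the random test data are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def f(p: Point) -> float:
    """1D key: L-infinity distance from p to the maximal corner (1, ..., 1)."""
    return max(1 - v for v in p)


def f_sky(p: Point, sub: Sequence[int]) -> float:
    """Pruning threshold contributed by a skyline candidate in subspace sub."""
    return min(1 - p[i] for i in sub)


def dominates(p: Point, q: Point, sub: Sequence[int]) -> bool:
    return (all(p[i] <= q[i] for i in sub)
            and any(p[i] < q[i] for i in sub))


def subsky(points: List[Point], sub: Sequence[int]) -> Tuple[List[Point], int]:
    """Return the skyline in subspace sub and the number of points scanned."""
    sky: List[Point] = []
    u = float("-inf")  # largest f_sky value among current skyline candidates
    scanned = 0
    for p in sorted(points, key=f, reverse=True):
        if f(p) < u:
            break  # p and all later points are dominated in sub
        scanned += 1
        if any(dominates(s, p, sub) for s in sky):
            continue
        sky = [s for s in sky if not dominates(p, s, sub)]
        sky.append(p)
        u = max(f_sky(s, sub) for s in sky)
    return sky, scanned


rng = random.Random(0)
data = [tuple(rng.random() for _ in range(3)) for _ in range(1000)]
sky, scanned = subsky(data, sub=(0, 1))
print(len(sky), "skyline points after scanning", scanned, "of", len(data))
```

In the paper's setting the sorted order would come from an index on the f values rather than from an in-memory sort; the sort here only stands in for that access path.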

18
Analysis

Assume that we have 15D uniformly distributed objects with cardinality 100k, and we want to retrieve the skyline in a subspace SUB containing any two dimensions.

[Figure: the unit square of SUB with λ marked on both axes.]
There is a greater than 90% chance of finding an object in the area (λ, λ), where λ = 0.001, based on the term $(1 - \lambda^2)^{100k}$. Therefore, $f_{sky}(p) \ge 1 - \lambda = 0.999$. The volume evaluates to $0.999^{15} = 98.5\%$; that is, we only need to access 1.5% of the dataset.
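
The last step of this estimate is simple arithmetic and can be checked directly. The sketch below assumes uniform data in the unit space, so that the fraction of points with f(p) below a threshold t equals t raised to the power d; the variable names are hypothetical.

```python
d = 15          # full-space dimensionality
lam = 0.001     # side of the corner region in SUB

# A skyline point inside the corner region has f_sky of at least 1 - lam.
threshold = 1 - lam

# Under uniformity, a point has f(p) < threshold exactly when all d
# coordinates exceed lam, which happens with probability threshold ** d.
pruned_fraction = threshold ** d
accessed_fraction = 1 - pruned_fraction

print(f"pruned:   {pruned_fraction:.3%}")
print(f"accessed: {accessed_fraction:.3%}")
```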
19
General SUBSKY

In practice, data are usually clustered.
If the data are clustered, we should expect that one anchor point cannot give us enough pruning power.

[Figure: clustered data with anchors Ac and A1, a skyline point psky, and the best point.]
20
General SUBSKY

Anchors for different clusters

[Figure: clusters s1 to s4 with anchors Ac, A1, and A2, and a skyline point psky.]

Two questions
How to find the anchors?
How to assign points to anchors?

21
Finding the Anchors

First, let us see what a perfect anchor of a point p is.
If p is assigned to a perfect anchor A, then p can be pruned by any skyline point dominating p.

[Figure: point p with its anti-dominant region, the major perpendicular plane, and anchors A1, A2, and A3.]
Any point on the line shown in the figure is a perfect anchor of point p.
22
Finding the Anchors
For each point, find its projection onto the major perpendicular plane (e.g., p1 and p2).
Partition the projected points into m clusters using the k-means algorithm, and formulate an anchor for each cluster.

[Figure: points p1 and p2 and their projections onto the major perpendicular plane; a good anchor is marked.]
23
Finding the Anchors

How do we decide the anchor for a cluster?

The blue points are assigned to cluster S. How can we decide the anchor A for S?

Obtain point B, whose coordinate on each dimension equals the lowest coordinate, on this axis, of the points in S in their original space.

Then, the algorithm computes the smallest square opposite to B that covers all points in S.
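
The two steps above can be sketched as follows. The slide does not state which corner of the square becomes the anchor; this sketch assumes it is the corner diagonally opposite to B, and the function name and sample cluster are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def anchor_for_cluster(cluster: List[Point]) -> Tuple[Point, Point]:
    """Return (B, A) for a cluster of points in their original space."""
    d = len(cluster[0])
    # B takes the lowest coordinate of the cluster on every dimension.
    b = tuple(min(p[i] for p in cluster) for i in range(d))
    # Side of the smallest square (hyper-cube) with corner B covering S.
    side = max(p[i] - b[i] for p in cluster for i in range(d))
    # Assumption: the anchor is the corner of that square opposite to B.
    a = tuple(b[i] + side for i in range(d))
    return b, a


# Hypothetical 2D cluster.
S = [(0.2, 0.5), (0.4, 0.3), (0.3, 0.6)]
B, A = anchor_for_cluster(S)
print("B =", B, "A =", A)
```

Choosing the opposite corner guarantees that the anchor is at least as large as every point of S on every dimension, which the distance-based transformation relies on.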

24
Assigning Points to Anchors

A naive way is to assign each point to its closest anchor in the major perpendicular plane (projected space).
However, this does not directly quantify the benefit of an assignment.

25
Assigning Points to Anchors

To assign a point to a good anchor, the paper introduces a new measure called the effective region (ER).

[Figure: point p, its ER (yellow region), and points p1 and p2 with their pruning regions.]
Any point in the yellow region (the ER of p) has a pruning region with respect to Ac that covers p.
The larger the ER volume of p, the higher the chance that p is pruned.
26
Assigning Points to Anchors
[Figure: two panels showing the ER of p, with points p1 and p2 and anchor A.]
27
Assigning Points to Anchors

The pruning volume of a point p with respect to an anchor point $A_j$ is
$\prod_{i=1}^{d} \max(0, A_{j,i} - L_\infty(p, A_j))$
Therefore, assign a point p to the anchor $A_j$ that produces the largest pruning volume.
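
The formula and the assignment rule can be sketched in a few lines. The anchors below are the two from the query example, Ac = (1, 1, 1) and A = (1, 1, 0.8); the sample points and function names are hypothetical. The sketch also assumes that a point may only be assigned to an anchor that is at least as large as the point on every dimension, which the slide does not state explicitly.

```python
import math
from typing import List, Tuple

Point = Tuple[float, ...]


def l_inf(p: Point, a: Point) -> float:
    """L-infinity distance between p and anchor a."""
    return max(abs(x - y) for x, y in zip(p, a))


def pruning_volume(p: Point, anchor: Point) -> float:
    """Product over dimensions of max(0, A[i] - L_inf(p, A))."""
    r = l_inf(p, anchor)
    return math.prod(max(0.0, a - r) for a in anchor)


def assign(p: Point, anchors: List[Point]) -> Point:
    """Pick the eligible anchor with the largest pruning volume."""
    # Assumption: an anchor is eligible only if it is >= p on every dimension.
    eligible = [a for a in anchors if all(x <= y for x, y in zip(p, a))]
    return max(eligible, key=lambda a: pruning_volume(p, a))


anchors = [(1.0, 1.0, 1.0), (1.0, 1.0, 0.8)]
for p in [(0.6, 0.7, 0.7), (0.9, 0.9, 0.5)]:
    print(p, "->", assign(p, anchors))
```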

28
Query example

We use the same example as in the previous slides.
Assume that we have two anchors: one is Ac and the other, A, is found by k-means (m = 1).
Ac = (1, 1, 1) and A = (1, 1, 0.8)
First, we calculate the ER volume of each data point with respect to Ac and A.

[Table: ER volume of each data point with respect to Ac and A; unit: 10^-3.]
29
Query example

Sorted lists by f:
Ac: p4, p1, p2, p5
A: p3

$S_{sky} = \{p_4\}$, $f_{sky}(p_4) = 0.5$
$U = 0.5$

$S_{sky} = \{p_4, p_1\}$, $f_{sky}(p_1) = 0.8$
$U = 0.8$

30
Experiments

3 real datasets: NBA, Household, and Color
2 synthetic datasets (10D)
Uniform
Clustered
10 cluster centroids
For each centroid, N/10 points are generated whose coordinate on each axis follows a Gaussian distribution with variance 0.05 and a mean equal to the corresponding coordinate of the centroid
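
A generator for the clustered dataset described above might look like the sketch below. The slide does not say how the centroids are placed or whether coordinates are clipped to the unit space; this sketch assumes uniformly random centroids and no clipping, and all names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def clustered_dataset(n: int, d: int = 10, clusters: int = 10,
                      variance: float = 0.05, seed: int = 0) -> List[Point]:
    """Generate n points in d dimensions around `clusters` centroids."""
    rng = random.Random(seed)
    sigma = math.sqrt(variance)  # random.gauss expects a standard deviation
    # Assumption: centroids are drawn uniformly from the unit space.
    centroids = [[rng.random() for _ in range(d)] for _ in range(clusters)]
    data: List[Point] = []
    for centroid in centroids:
        for _ in range(n // clusters):
            data.append(tuple(rng.gauss(mu, sigma) for mu in centroid))
    return data


points = clustered_dataset(1000)
print(len(points), "points of dimensionality", len(points[0]))
```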

31
Experiments
32
Experiments
33
Experiments
3D subspaces, full-space dimensionality is 10
3D subspaces, 1 million cardinality
34
Conclusion

The core of SUBSKY is a transformation that converts multi-dimensional data into 1D values
It shows better performance than an I/O optimal algorithm on the subspace skyline problem
Some continuous monitoring cases are worth investigating
How to adapt the set of anchor points if the data are updated rapidly
The f values could be stored in another index structure to support fast updates

35
Assigning Points to Anchors

Therefore, we have two ways to assign points to the anchors:
Assign each point to its closest anchor in the major perpendicular plane (projected space)
Assign each point to the anchor that gives it the largest ER volume in the original space
The second approach is better than the first, because the ER volume directly quantifies the benefit of an assignment