Title: Probabilistic Skyline Operator over Sliding Windows
1 Probabilistic Skyline Operator over Sliding Windows
Wenjie Zhang, University of New South Wales & NICTA, Australia
Joint work with Xuemin Lin, Ying Zhang, Wei Wang (UNSW & NICTA) and Jeffrey Xu Yu (CUHK)
2 Outline
- Background
- Framework
- Algorithms
- Experiment
- Conclusion
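3 Background
[Figure: a stream of elements in 2-d space, each labelled with its identifier and occurrence probability, e.g. element 4 with probability 0.8 and element 3 with probability 0.4.]
- Elements continuously arrive with occurrence probabilities.
- Problem: how to continuously compute skylines in a sliding window of size N (elements)?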
4 Background
- Multi-criteria decision making regarding uncertain data:
- Online auction
- Financial market
5 Related work
Probabilistic skyline computation:
- Probabilistic skyline (VLDB07)
- Probabilistic reverse skyline (SIGMOD08)
Uncertain stream processing:
- Probabilistic aggregates and sketches over uncertain streams (SIGMOD07, SODA07, PODS07)
- Frequent items on uncertain streams (SIGMOD08)
- Top-k queries over uncertain sliding window (VLDB08)
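6 Models and Problem Definition
- Model: DS is a stream of elements; each element a is a point in a d-dimensional space and has an occurrence probability $P(a) \in (0, 1)$.
- The skyline probability of an element a is $P_{sky}(a) = P(a) \cdot \prod_{a' \in DS_N,\ a' \prec a} (1 - P(a'))$, where $DS_N$ denotes the most recent N elements and $a' \prec a$ means that a' dominates a.
- Problem definition: retrieve, from the most recent N elements, those whose skyline probability is no less than a given threshold q.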
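The problem definition can be illustrated with a brute-force sketch that recomputes every skyline probability from scratch. It is only a baseline for intuition, not the incremental method of the talk. The tuple layout, the function names and the "smaller is better" dominance convention are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# (coordinates, occurrence probability); this layout is an assumption of the sketch.
Element = Tuple[Sequence[float], float]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b: no worse in every dimension, strictly better in one (smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_probability(window: List[Element], i: int) -> float:
    """P_sky of window[i]: it occurs and no element that dominates it occurs."""
    point, prob = window[i]
    result = prob
    for j, (other, other_prob) in enumerate(window):
        if j != i and dominates(other, point):
            result *= 1.0 - other_prob
    return result


def probabilistic_skyline(stream: List[Element], n: int, q: float) -> List[int]:
    """Indices (into stream) of the most recent n elements with P_sky >= q."""
    start = max(0, len(stream) - n)
    window = stream[start:]
    return [start + i for i in range(len(window)) if skyline_probability(window, i) >= q]


# Hypothetical 2-d stream, oldest element first.
stream = [((1.0, 4.0), 0.3), ((2.0, 5.0), 0.9), ((3.0, 1.0), 0.8), ((4.0, 2.0), 0.5)]
print(probabilistic_skyline(stream, n=3, q=0.5))
```

Each call costs O(N^2) dominance tests and keeps all N elements, which is the cost that the candidate set and the R-tree based algorithms aim to avoid.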
7 Challenges and Contributions
- Space efficiency
- Contribution: space reduction from $O(N)$ to $O(\ln^{d-1} N)$
- Time efficiency
- Contribution: R-tree based efficient incremental algorithms
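9 Framework: what to keep?
[Figure: elements 1 (0.1), 2 (0.1), 3 (0.4), 4 (0.8) and 5 (0.1) in 2-d space, labelled with their occurrence probabilities.]
$P_{old}(2) = 1 - P(1)$
$P_{new}(2) = (1 - P(3))(1 - P(4))$
$P_{new}(2) = 0.12 < q$, so element 2 will never become a skyline point in the window: elements 3 and 4 arrived after it and will outlive it.
- Window size N = 5, probability threshold q = 0.5.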
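The split of the skyline probability into an "old" and a "new" factor can be sketched as follows. The occurrence probabilities are the ones shown on the slide, but the coordinates are hypothetical, chosen only so that elements 1, 3 and 4 dominate element 2. The class and function names are also assumptions of this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Element:
    seq: int                  # arrival order, larger means newer
    point: Tuple[float, ...]  # coordinates, smaller is better (assumed convention)
    prob: float               # occurrence probability


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def p_old(window: List[Element], e: Element) -> float:
    """Probability that no OLDER element dominating e occurs."""
    result = 1.0
    for o in window:
        if o.seq < e.seq and dominates(o.point, e.point):
            result *= 1.0 - o.prob
    return result


def p_new(window: List[Element], e: Element) -> float:
    """Probability that no NEWER element dominating e occurs."""
    result = 1.0
    for o in window:
        if o.seq > e.seq and dominates(o.point, e.point):
            result *= 1.0 - o.prob
    return result


# Probabilities from the slide; coordinates are made up for illustration.
window = [
    Element(1, (4.0, 4.0), 0.1),
    Element(2, (5.0, 5.0), 0.1),
    Element(3, (3.0, 5.0), 0.4),
    Element(4, (5.0, 2.0), 0.8),
    Element(5, (6.0, 1.0), 0.1),
]
q = 0.5
e2 = window[1]
print(round(p_old(window, e2), 4), round(p_new(window, e2), 4))
print("element 2 can be discarded:", p_new(window, e2) < q)
```

Because newer dominators stay in the window for as long as element 2 does, the "new" factor can only shrink over its lifetime, which is why falling below q is a safe reason to discard it.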
10 Framework: what to keep?
- Candidate set $S_{N,q}$: the elements of the window whose $P_{new}$ is at least q
- Correctness
- (1) no missing skyline points
- (2) no false hits in determining $S_{N,q}$
- (3) no false positives in determining skyline results
- (4) no false negatives in determining skyline results
- The probability computed from $S_{N,q}$ may not be accurate, but it satisfies the threshold requirement.
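The correctness claims can be checked on a small example. The sketch below assumes that the candidate set consists of the window elements whose "new" factor is at least q, and compares the answer computed from candidates only with the answer computed from the whole window. The data and all names are hypothetical.

```python
from typing import Iterable, List, Tuple

Element = Tuple[Tuple[float, ...], float]  # (point, occurrence probability)


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def p_new(window: List[Element], i: int) -> float:
    """Product of (1 - P) over elements newer than window[i] that dominate it."""
    point = window[i][0]
    result = 1.0
    for other, prob in window[i + 1:]:
        if dominates(other, point):
            result *= 1.0 - prob
    return result


def candidate_set(window: List[Element], q: float) -> List[int]:
    return [i for i in range(len(window)) if p_new(window, i) >= q]


def p_sky_within(window: List[Element], members: Iterable[int], i: int) -> float:
    """P_sky of window[i], counting only dominators whose index is in members."""
    point, prob = window[i]
    result = prob
    for j in members:
        if j != i and dominates(window[j][0], point):
            result *= 1.0 - window[j][1]
    return result


def skyline_from_candidates(window: List[Element], q: float) -> List[int]:
    cands = candidate_set(window, q)
    return [i for i in cands if p_sky_within(window, cands, i) >= q]


def skyline_exact(window: List[Element], q: float) -> List[int]:
    everyone = range(len(window))
    return [i for i in everyone if p_sky_within(window, everyone, i) >= q]


# Hypothetical window, oldest element first.
window = [((2, 2), 0.3), ((5, 5), 0.9), ((3, 3), 0.6), ((1, 6), 0.7), ((6, 1), 0.4), ((6, 6), 0.9)]
q = 0.5
cands = candidate_set(window, q)
print(cands, skyline_from_candidates(window, q), skyline_exact(window, q))
```

In this example the element at index 1 is pruned, so the probability computed for the last element from candidates alone is larger than its exact value, yet both stay below q and the reported skyline is the same.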
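11 Framework
- Space required for $S_{N,q}$
- $S_{N,q}$ is the minimum information that must be maintained to get a correct answer.
[Figure: elements 1 to 4 in 2-d space with their occurrence probabilities; element 3 has probability 0.9 and element 4 has probability 0.8.]
While the two older elements that dominate element 3 (occurrence probabilities 0.4 and 0.3) are in the window: $P_{sky}(3) = 0.9 (1 - 0.4)(1 - 0.3) = 0.378 < q$
After they expire: $P_{sky}(3) = 0.9 > q$
- Window size N = 4, probability threshold q = 0.5.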
12 Space of Candidate Set
- Theorem: the candidate set requires poly-logarithmic space in the average case under uniform distributions, $O(f(q) \ln^{d-1} N)$.
14 Algorithms
- We maintain two R-trees:
- R1: $SKY_{N,q}$, the skylines
- R2: $S_{N,q} - SKY_{N,q}$, the candidate skylines
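15 Algorithms
[Figure: elements shown as id(occurrence probability): 1(.1), 2(.1), 3(.4), 4(.1), 5(.8), 6(.8), 7(.6), 8(.2), 9(.5), 10(.2), 11(.6), 12(.1) and 13(.1), grouped into R1 ($SKY_{N,q}$) and R2 ($S_{N,q} - SKY_{N,q}$); one region is labelled "not in $S_{N,q}$".]
- Window size N = 13, probability threshold q = 0.2.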
16 Algorithms
- New element arrives
- Check $P_{sky}$ and $P_{new}$ on R1
- Check $P_{new}$ on R2
- Handle elements with $P_{new} < q$
- Old element expires
- Update $P_{old}$
- Check $P_{sky}$ on R2
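The arrival steps above can be sketched with plain Python lists standing in for the two R-trees, so every update is a linear scan rather than an index traversal. The `Cand` class, `on_arrival` and `split` are hypothetical names. The sketch also assumes, following the worked slides, that an element pruned from the candidate set stops contributing to the "old" factor of the candidates it dominates, and that every occurrence probability is strictly below 1.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Cand:
    seq: int                  # arrival order
    point: Tuple[float, ...]  # smaller is better (assumed convention)
    prob: float               # occurrence probability, assumed < 1
    p_old: float = 1.0        # factor from older dominators still kept as candidates
    p_new: float = 1.0        # factor from newer dominators

    @property
    def p_sky(self) -> float:
        return self.prob * self.p_old * self.p_new


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def on_arrival(cands: List[Cand], new: Cand, q: float) -> List[Cand]:
    """Return the candidate list after `new` arrives (expiry is handled separately)."""
    # 1. The new element lowers P_new of every candidate it dominates.
    for c in cands:
        if dominates(new.point, c.point):
            c.p_new *= 1.0 - new.prob
    # 2. Candidates whose P_new fell below q are dropped, and they stop counting
    #    in the P_old of the newer candidates they dominate.
    dropped = [c for c in cands if c.p_new < q]
    kept = [c for c in cands if c.p_new >= q]
    for d in dropped:
        for c in kept:
            if d.seq < c.seq and dominates(d.point, c.point):
                c.p_old /= 1.0 - d.prob
    # 3. Insert the new element: P_new = 1, P_old from the kept dominators.
    new.p_new = 1.0
    new.p_old = 1.0
    for c in kept:
        if dominates(c.point, new.point):
            new.p_old *= 1.0 - c.prob
    kept.append(new)
    return kept


def split(cands: List[Cand], q: float) -> Tuple[List[Cand], List[Cand]]:
    """R1 holds current skylines, R2 the remaining candidates."""
    r1 = [c for c in cands if c.p_sky >= q]
    r2 = [c for c in cands if c.p_sky < q]
    return r1, r2


q = 0.2
cands: List[Cand] = []
for seq, point, prob in [(1, (5, 5), 0.3), (2, (4, 4), 0.5), (3, (7, 7), 0.8), (4, (3, 3), 0.7)]:
    cands = on_arrival(cands, Cand(seq, point, prob), q)
r1, r2 = split(cands, q)
print([c.seq for c in r1], [c.seq for c in r2])
```

In this made-up stream the fourth arrival pushes the "new" factor of element 1 to 0.15, so element 1 is pruned and the "old" factor of element 3 is divided by 1 - 0.3.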
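17 Algorithms: new element arrives
Delete an entry from R1 (value pairs are (min, max) over the entry)
Before update: $P_{new} = (1, 1)$, $P_{sky} = (0.8, 0.8)$, global $P_{new} = 1 - 0.2$
After update: global $P_{new}$ is multiplied by $1 - 0.8$, so the entry is deleted from R1
[Figure: R1 ($SKY_{N,q}$) with elements 6(.8), 8(.2), 5(.8), 10(.2), 7(.6), 3(.4), 9(.5), 11(.6) and R2 ($S_{N,q} - SKY_{N,q}$) with elements 13(.1), 12(.1), 2(.1), 4(.1), as far as the grouping can be recovered; the new element is 14(0.8).]
- Window size N = 13, probability threshold q = 0.2.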
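18 Algorithms: new element arrives
Move an entry from R1 to R2 (value pairs are (min, max) over the entry)
Before update: $P_{new} = (1, 1)$, $P_{sky} = (0.24, 0.6)$, global $P_{new} = 1$
After update: global $P_{new} = 1 - 0.8$, $\min P_{new} = 0.2 \ge q$, $\max P_{sky} = 0.12 < q$, so the entry is moved from R1 to R2
[Figure: R1 ($SKY_{N,q}$) and R2 ($S_{N,q} - SKY_{N,q}$) with elements 8(.2), 10(.2), 7(.6), 3(.4), 9(.5), 11(.6), 13(.1), 12(.1), 2(.1), 4(.1) and the new element 14(0.8).]
- Window size N = 13, probability threshold q = 0.2.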
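19 Algorithms: new element arrives
Before update: $P_{new} = (0.9, 1)$, global $P_{new} = 1$
After update: global $P_{new} = 1 - 0.8$, $\min P_{new} < q$, $\max P_{new} \ge q$, so drill down into the entry and delete element 2
[Figure: R1 ($SKY_{N,q}$) with element 8(.2); R2 ($S_{N,q} - SKY_{N,q}$) with elements 10(.2), 7(.6), 3(.4), 9(.5), 11(.6), 13(.1), 12(.1), 2(.1), 4(.1); the new element is 14(0.8).]
- Window size N = 13, probability threshold q = 0.2.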
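The three walkthrough cases above (delete an entry, move an entry to R2, drill down) can be summarised as one decision on the aggregates stored with an R-tree entry that the new element fully dominates. This is a sketch under assumptions: the field names are hypothetical, the stored (min, max) values are taken to exclude the lazily recorded global factor, and the "keep" outcomes are added only to make the function total. Exact fractions are used because 1 - 0.8 is not exactly 0.2 in floating point, which would flip the comparison with q.

```python
from fractions import Fraction
from typing import NamedTuple


class Entry(NamedTuple):
    """Aggregates kept on an intermediate entry (hypothetical field names)."""
    min_p_new: Fraction
    max_p_new: Fraction
    global_p_new: Fraction             # lazy factor applying to every element below
    min_p_sky: Fraction = Fraction(0)  # only consulted for R1 entries
    max_p_sky: Fraction = Fraction(0)


def classify(entry: Entry, new_prob: Fraction, q: Fraction, in_r1: bool = True) -> str:
    """Decide what to do with an entry fully dominated by a new element."""
    g = entry.global_p_new * (1 - new_prob)  # updated lazy factor
    if entry.max_p_new * g < q:
        return "delete"       # no element below can remain a candidate
    if entry.min_p_new * g < q:
        return "drill down"   # some elements fall below q, others do not
    if not in_r1:
        return "keep"         # all stay candidates; they were not skylines anyway
    if entry.max_p_sky * g < q:
        return "move to R2"   # all stay candidates, none is a skyline any more
    if entry.min_p_sky * g >= q:
        return "keep"         # all remain skylines
    return "drill down"       # mixed skyline status


q = Fraction("0.2")
new_prob = Fraction("0.8")
delete_case = Entry(min_p_new=Fraction(1), max_p_new=Fraction(1), global_p_new=Fraction("0.8"),
                    min_p_sky=Fraction("0.8"), max_p_sky=Fraction("0.8"))
move_case = Entry(min_p_new=Fraction(1), max_p_new=Fraction(1), global_p_new=Fraction(1),
                  min_p_sky=Fraction("0.24"), max_p_sky=Fraction("0.6"))
drill_case = Entry(min_p_new=Fraction("0.9"), max_p_new=Fraction(1), global_p_new=Fraction(1))
print(classify(delete_case, new_prob, q))
print(classify(move_case, new_prob, q))
print(classify(drill_case, new_prob, q, in_r1=False))
```

The three sample entries reuse the numbers shown in the walkthrough; the global factor of 0.8 in the first one reflects one reading of the garbled slide text rather than a confirmed value.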
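20 Algorithms: new element arrives
Update $P_{old}$
Update $P_{old}$ of elements 12 and 13: global $P_{old}$ is divided by $(1 - 0.1)$
[Figure: R1 ($SKY_{N,q}$) with element 8(.2); R2 ($S_{N,q} - SKY_{N,q}$) with elements 10(.2), 7(.6), 3(.4), 9(.5), 11(.6), 13(.1), 12(.1), 2(.1), 4(.1); the new element is 14(0.8).]
- Window size N = 13, probability threshold q = 0.2.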
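21 Algorithms: new element arrives
Insert the new element: set $P_{new} = 1$ and compute $P_{sky}$
[Figure: R1 ($SKY_{N,q}$) with element 8(.2); R2 ($S_{N,q} - SKY_{N,q}$) with elements 10(.2), 7(.6), 3(.4), 9(.5), 11(.6), 13(.1), 12(.1), 4(.1); the new element is 14(0.8).]
- Window size N = 13, probability threshold q = 0.2.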
22 Algorithms: old element expires
- Delete it from R1 or R2.
- Update $P_{old}$ of the remaining elements.
- Record a global $P_{old}$ on intermediate entries that it fully dominates.
- Check $P_{sky}$ after the update.
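The expiry steps can be sketched on a flat list, without the R-tree and without the lazy global factor on intermediate entries. The names are hypothetical, and the division assumes every occurrence probability is strictly below 1. The probabilities 0.3, 0.4 and 0.9 echo the earlier example with threshold 0.5; the coordinates are made up so that the two older elements dominate the third.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Cand:
    seq: int
    point: Tuple[float, ...]  # smaller is better (assumed convention)
    prob: float               # occurrence probability, assumed < 1
    p_old: float = 1.0
    p_new: float = 1.0

    @property
    def p_sky(self) -> float:
        return self.prob * self.p_old * self.p_new


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def on_expiry(cands: List[Cand], expired_seq: int) -> List[Cand]:
    """Remove the expired element if it is a candidate and restore P_old of the rest."""
    expired = next((c for c in cands if c.seq == expired_seq), None)
    if expired is None:  # it was pruned earlier, nothing to undo
        return cands
    remaining = [c for c in cands if c.seq != expired_seq]
    for c in remaining:
        if dominates(expired.point, c.point):
            c.p_old /= 1.0 - expired.prob  # undo the expired element's factor
    return remaining


q = 0.5
cands = [
    Cand(1, (1, 1), 0.3),
    Cand(2, (2, 2), 0.4, p_old=0.7),
    Cand(3, (3, 3), 0.9, p_old=0.7 * 0.6),
]
before = cands[2].p_sky
cands = on_expiry(cands, 1)
cands = on_expiry(cands, 2)
after = cands[0].p_sky
print(round(before, 3), round(after, 3), after >= q)
```

After each expiry, elements whose skyline probability has risen to q or above would be promoted from R2 to R1, which is the final check in the list above.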
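23 Algorithms: old element expires
$P_{old}(7)$ is divided by $1 - P(3)$
Global $P_{old}$ is divided by $1 - P(4)$
[Figure: R1 ($SKY_{N,q}$) with element 8(.2); R2 ($S_{N,q} - SKY_{N,q}$) with elements 10(.2), 7(.6), 3(.4), 9(.5), 11(.6), 13(.1), 12(.1), 4(.1), 14(0.8).]
- Window size N = 13, probability threshold q = 0.2.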
24 Algorithms: handling multiple thresholds
- Continuous queries
- Users specify k probability thresholds $q_1, \ldots, q_k$ ($q_i < q_{i-1}$).
- Solution: instead of maintaining only R1, we maintain $R_1, \ldots, R_k$, each corresponding to a confidence value.
- Ad-hoc queries
- Users issue a query: retrieve skylines with probability at least q ($q \ge q_k$).
- Solution: find an $R_i$ with $q_i \le q < q_{i-1}$. Then all elements in $R_j$ ($j < i - 1$) are results. We search $R_{i-1}$ to output the qualified skylines.
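The ad-hoc query idea can be sketched with one plain list per threshold in place of one R-tree per threshold. The bucket layout and its 0-based indexing are this example's own assumptions and are not meant to reproduce the index arithmetic on the slide: bucket i holds the elements whose skyline probability is at least `thresholds[i]` and below `thresholds[i-1]`. Buckets above the boundary are returned whole, and only the boundary bucket is filtered.

```python
from typing import Dict, List, Sequence, Tuple

Bucket = List[Tuple[str, float]]  # (element name, skyline probability)


def build_buckets(p_sky: Dict[str, float], thresholds: Sequence[float]) -> List[Bucket]:
    """thresholds are descending; elements below the last threshold are not stored."""
    buckets: List[Bucket] = [[] for _ in thresholds]
    for name, p in p_sky.items():
        for i, t in enumerate(thresholds):
            if p >= t:
                buckets[i].append((name, p))
                break
    return buckets


def adhoc_query(buckets: List[Bucket], thresholds: Sequence[float], q: float) -> List[str]:
    """Names of elements with skyline probability >= q, for q >= thresholds[-1]."""
    if q < thresholds[-1]:
        raise ValueError("q must be at least the smallest maintained threshold")
    # Boundary bucket: the first one whose lower bound does not exceed q.
    i = next(idx for idx, t in enumerate(thresholds) if t <= q)
    result = [name for bucket in buckets[:i] for name, _ in bucket]  # qualify as a whole
    result += [name for name, p in buckets[i] if p >= q]            # needs filtering
    return result


# Hypothetical skyline probabilities and thresholds.
thresholds = [0.8, 0.5, 0.2]
p_sky = {"a": 0.9, "b": 0.6, "c": 0.55, "d": 0.3, "e": 0.1}
buckets = build_buckets(p_sky, thresholds)
print(adhoc_query(buckets, thresholds, 0.58))
```

Only one bucket is ever scanned element by element, which is the point of keeping one structure per continuous-query threshold.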
25 Experiment
- Data sets
- Real: stock transactions; 2-d; probabilities assigned randomly; size 2 million.
- Synthetic: spatial location (independent or anti-correlated); probability (uniform or normal); 2-d to 5-d; size 2 million.
- Default values: p = 0.3, d = 3, N = 1M, spatial distribution anti-correlated, probability distribution uniform.
26 Experiment: space
- Space usage is 0.1% of the sliding window size for 2-d data; around 89% of the space is saved even for 5-d data.
27 Experiment: space
- The size of $S_{N,q}$ decreases as $P_\mu$ increases, while the size of $SKY_{N,q}$ increases with it.
28 Experiment: space
29 Experiment: time
30 Experiment: time
- Maintenance time increases with the probability threshold; query time decreases with it.
31 Conclusion
- We characterize a candidate set with minimum size and propose time-efficient techniques.
- We extend the framework to handle multiple thresholds.