Title: Placing Skips Optimally in Expectation
- Flavio Chierichetti,
- Silvio Lattanzi,
- Federico Mari,
- Alessandro Panconesi
- Supported by
Problem Statement
Answering conjunctive queries
query: latte macchiato
Latte: 3, 7, 9, 14, 19, 23, 41, 47
Macchiato: 2, 4, 7, 11, 19, 41, 57, 62
Compute the intersection of 2 sorted lists
Merging
Latte: 3, 7, 9, 14, 19, 23, 41, 47
Macchiato: 2, 4, 7, 9, 19, 41, 57, 62
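The merge on this slide can be sketched as a standard two-pointer intersection of sorted lists. This is a minimal illustration using the doc ids shown above; the function name `intersect` is only a label chosen here, not something from the talk.

```python
def intersect(a, b):
    """Intersect two sorted lists of doc ids with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Doc id occurs in both lists: it belongs to the answer.
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind, advance it
        else:
            j += 1  # b is behind, advance it
    return out


latte = [3, 7, 9, 14, 19, 23, 41, 47]
macchiato = [2, 4, 7, 9, 19, 41, 57, 62]
result = intersect(latte, macchiato)
```

Every posting of both lists may be read by this loop, which is the cost that skips are meant to reduce.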
Skips
Latte: 1, 2, 3, 4, 5, 6, 7, 8
Macchiato: 7, 8, 17, 18, 19, 41, 57, 62
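A rough sketch of how a merge might use skips on the lists above. It assumes the common convention that a skip is followed only when it lands on a doc id no larger than the one being sought. The skip table layout (a dict from start position to target position) and the particular skip `{0: 5}` are made up for illustration.

```python
def advance(lst, pos, skips, target):
    """Move from pos to the first position whose doc id is >= target.

    A skip is followed only if it lands on a doc id <= target, so no
    candidate match is jumped over. Returns (new position, moves made).
    """
    moves = 0
    while pos < len(lst) and lst[pos] < target:
        nxt = skips.get(pos)
        if nxt is not None and lst[nxt] <= target:
            pos = nxt   # follow the skip
        else:
            pos += 1    # ordinary step to the next posting
        moves += 1
    return pos, moves


def intersect_with_skips(a, b, skips_a, skips_b):
    """Intersect sorted lists a and b; also count pointer moves."""
    i = j = 0
    out, moves = [], 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
            moves += 2
        elif a[i] < b[j]:
            i, m = advance(a, i, skips_a, b[j])
            moves += m
        else:
            j, m = advance(b, j, skips_b, a[i])
            moves += m
    return out, moves


latte = [1, 2, 3, 4, 5, 6, 7, 8]
macchiato = [7, 8, 17, 18, 19, 41, 57, 62]
plain = intersect_with_skips(latte, macchiato, {}, {})
skipped = intersect_with_skips(latte, macchiato, {0: 5}, {})  # hypothetical skip: doc 1 -> doc 6
```

Both calls return the same intersection; the second needs fewer pointer moves because the skip jumps over the leading postings of Latte that cannot match.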
Conventional WSDM
Skips are placed every $\sqrt{N}$ positions
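For reference, the conventional rule can be written down in a few lines. This is a sketch under the assumption that skips are stored as a mapping from start position to target position; that layout and the name `sqrt_skip_positions` are illustrative choices.

```python
import math


def sqrt_skip_positions(n):
    """Return {start: target} skips spaced floor(sqrt(n)) apart over positions 0..n-1."""
    step = math.isqrt(n)
    if step < 2:
        return {}  # list too short for a skip to save anything
    # Only keep skips whose target is still inside the list.
    return {s: s + step for s in range(0, n - step, step)}


skips_16 = sqrt_skip_positions(16)
```

This placement looks only at the length of the list and ignores how the list is actually queried.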
Question
- If we know the query distribution, can we place skips better?
Problem statement
- If we know the query distribution, can we place skips in order to minimize the expected time of a merge?
Is the assumption realistic?
The Power of the Law
The query distribution contains a lot of information. Can we provably take advantage of it?
The algorithms that follow work with any query distribution whatsoever...
...and can be extended to deal with soft conjunctions
Outline
- Skip placement policies
- A matter of definitions
- Algorithms
- Experiments
Skip Placement Policies
Spaghetti Skips
Simple Skips
This is the most interesting case
A Matter of Definitions
Useful Documents 1
q = world cup
world: 3, 7, 9, 14, 19, 23, 41, 47
cup: 2, 4, 7, 9, 19, 41, 57, 62
Relevant docs are useful
But usefulness does not coincide with relevance
Useful Documents 2
q = world cup
world: 14, 15, ?, 18, 47, ?, ?, ?
cup: 13, 19, 41, 43, 62, ?, ?, ?
Is the skip useful for q?
The skip is useful
Useful Documents 2
q = world cup
world: 14, 15, 16, 18, 47, ?, ?, ?
cup: 13, 19, 41, 43, 62, ?, ?, ?
18 cannot be skipped
The skip is useless
Useful Documents 2
Useful documents are those that cannot be avoided during a merge
Induced Distributions
- The query distribution induces another distribution on the postings
platypus: 1, 2, 3, ..., i, ..., j, ..., k, ..., n, with probabilities $p_1, p_2, \dots, p_i, \dots, p_n$
$p_i = \Pr(i \text{ is useful for } q \mid \text{platypus} \in q)$
We will assume this distribution to be known
- In practice these probabilities can be approximated using a small sample of the query universe
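One way the sampling idea might look in code. This is a sketch only: `estimate_usefulness` and the `useful_docs` oracle are names introduced here, the "tea" list and the query sample are made-up toy data, and the stand-in oracle treats only relevant documents (those in the intersection) as useful, which undercounts usefulness since usefulness does not coincide with relevance.

```python
from collections import Counter


def estimate_usefulness(postings, term, query_sample, useful_docs):
    """Estimate p_i = Pr(i useful for q | term in q) from sampled queries.

    useful_docs(term, query) is a caller-supplied oracle that returns the
    docs of term's list that are useful for that query.
    """
    hits = Counter()
    total = 0
    for q in query_sample:
        if term not in q:
            continue  # condition on the term occurring in the query
        total += 1
        hits.update(set(useful_docs(term, q)) & set(postings))
    if total == 0:
        return {d: 0.0 for d in postings}
    return {d: hits[d] / total for d in postings}


# Toy index: "world" and "cup" reuse doc ids from the slides, "tea" is invented.
index = {
    "world": [3, 7, 9, 14, 19, 23, 41, 47],
    "cup": [2, 4, 7, 9, 19, 41, 57, 62],
    "tea": [9, 14, 62],
}


def relevant_docs(term, query):
    """Stand-in oracle: docs of term that appear in every list of the query."""
    docs = set(index[term])
    for other in query:
        docs &= set(index[other])
    return docs


sample = [("world", "cup"), ("world", "tea"), ("cup", "tea")]
p = estimate_usefulness(index["world"], "world", sample, relevant_docs)
```

A real estimate would replace `relevant_docs` with an oracle that reports every document that cannot be avoided during the merge.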
Making Life Simple
- Events like "a is useful" and "b is useful" are not independent
- ...but from now on we will assume that they are
This simplifying assumption will be vindicated by our experiments
Algorithms
- Input: a list with, for each doc, the probability that it is useful
- Output: a skip placement that minimizes the expected time to merge
cost of a merge = number of elements read in the posting list
Algorithms
- O(nt) algorithm for spaghetti skips, where t is the average length of a skip
- O(n log n) algorithm for simple skips
The O(n log n) algorithm for simple skips is by far the most interesting
Simple Skips
- t: $d_1, d_2, \dots, d_i, \dots, d_n$ with probabilities $p_1, p_2, \dots, p_n$
- Build the solution from left to right
- M(i) is the value of the best placement for the prefix $d_1, \dots, d_i$
55Computing M(i)
In computing M(i) we have two choices. We either
place a skip landing at position i or we do not
56Computing M(i)
M(i-1)
i
If we place no skip to i then M(i) M(i-1)
57Computing M(i)
M(j)
G(j,i)
j
i
58Computing M(i)
M(j)
G(j,i)
j
i
Want this in O(log n)
maxj M(j) G(j, i)
M(i) max M(i-1), maxj M(j) G(j, i)
59Computing M(i)
M(T(i))
G(T(i),i)
T(i)
i
maxj M(j) G(j, i) M(T(i)) G(T(i),j)
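To make the recurrence concrete, here is a direct quadratic-time evaluation of it. It is a sketch under two assumptions: that M(j) and G(j,i) combine by addition, and that G is available as an opaque function `gain(j, i)`. The slides do not spell out G, so the gain table below is invented purely to exercise the code, and positions are numbered from 0.

```python
def best_placement(n, gain):
    """Evaluate M(i) = max(M(i-1), max_{j<i} M(j) + gain(j, i)) for i = 0..n-1.

    Returns (M(n-1), chosen skips as (start, landing) pairs).
    """
    M = [0.0] * n
    choice = [None] * n  # start of the skip landing at i, or None if no skip lands there
    for i in range(1, n):
        M[i] = M[i - 1]  # option 1: no skip lands at i
        for j in range(i):  # option 2: try every start j for a skip landing at i
            cand = M[j] + gain(j, i)
            if cand > M[i]:
                M[i] = cand
                choice[i] = j
    # Walk back from the last position to recover the chosen skips.
    skips = []
    i = n - 1
    while i > 0:
        if choice[i] is None:
            i -= 1
        else:
            skips.append((choice[i], i))
            i = choice[i]
    skips.reverse()
    return M[n - 1], skips


# Invented gains: every skip not listed here is assumed to lose 1.0.
table = {(0, 3): 2.0, (3, 5): 2.0, (0, 5): 3.5}
value, skips = best_placement(6, lambda j, i: table.get((j, i), -1.0))
```

The inner loop over j is the part that the slides aim to bring down to O(log n) per position by tracking the best start T(i).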
Monotonicity of T(i)
$T(i) \le T(i+1)$
Monotonicity of T(i,k)
T(i,k) is the best jump to i under the additional constraint that it must start no later than k
Key lemma: $T(i,k) \le T(i+1,k)$
- Let $\hat{\imath}$ be the smallest index i such that $T(i,k) = k$. Then,
$T(j,k) = \begin{cases} k & \text{if } j \ge \hat{\imath} \\ T(j,k-1) & \text{if } j < \hat{\imath} \end{cases}$
Updating T(i,k)
T(i,k-1): 1, 1, 1, 1, 1, i1, i1, i1, i1, i2, i2, i2, i2, i3, i3, i3, i4, i4, i4, i4
$1 < i_1 < i_2 < i_3 < i_4 < k-1$
Example: the entry for j is i1, so the best skip to reach j starts at position i1
T(i,n) gives the optimal placement
$T(\cdot,1) \to T(\cdot,2) \to \dots \to T(\cdot,k) \to \dots \to T(\cdot,n)$
Find $\min\{\, i : T(i,k) = k \,\}$
T(i,k): 1, 1, 1, 1, 1, i1, i1, i1, i1, i2, i2, k, k, k, k, k, k, k, k, k
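The update from T(·,k-1) to T(·,k) can be sketched as a binary search for the first landing position where starting at k wins, followed by overwriting the rest of the row with k. This assumes the suffix property implied by the key lemma, namely that once k wins for some i it wins for all larger i. The predicate `k_wins` and the numeric stand-ins for i1..i4 are illustrative, and row indices are offsets within the row, not absolute positions.

```python
def update_row(prev_row, k, k_wins):
    """Build the row T(., k) from the row T(., k-1).

    prev_row[i] holds T(i, k-1). k_wins(i) tells whether a skip starting
    at k beats prev_row[i] for landing position i; it is assumed to be
    False on a prefix of the row and True on the remainder.
    Returns (new row, index of the first position taken over by k).
    """
    lo, hi = 0, len(prev_row)
    while lo < hi:  # binary search for the smallest i with k_wins(i)
        mid = (lo + hi) // 2
        if k_wins(mid):
            hi = mid
        else:
            lo = mid + 1
    return prev_row[:lo] + [k] * (len(prev_row) - lo), lo


# Stand-ins for the slide's row: i1=6, i2=10, i3=14, i4=17, k=20.
prev = [1] * 5 + [6] * 4 + [10] * 4 + [14] * 3 + [17] * 4
row, first = update_row(prev, 20, lambda i: i >= 11)
```

Copying the row costs time linear in its length per update; storing the row as runs of equal values, as the picture suggests, would avoid that copying.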
The resulting algorithm takes O(N log N) time, where N is the length of the list
Experiments
Space
Time to merge
Build up time
Size of query sample for spaghetti skips
Size of query sample for simple skips
1/256
The Bottom Line
- Simple skips are the solution of choice (for power law distributions)
- They merge as fast as spaghetti skips (the general case)
- They occupy less space
- They are much faster to build
- They need a smaller sample to collect statistics on document usefulness
Summing up
- First attempt to exploit knowledge of the distribution in a rigorous way
- Much work remains to be done, but the results are encouraging
Extensions
- Taking the cache into account
- Taking dependencies into account
- Comparing against skip lists and other data structures
Thanks for your attention