Title: Aggregation Algorithms and Instance Optimality
1Aggregation Algorithms and Instance Optimality
Weizmann Institute
Joint work with Ron Fagin and Amnon Lotem
2Aggregating information from several lists/sources
- Define the problem
- Ways to evaluate algorithms
- New algorithms
- Further Research
3The problem
- Database D of N objects
- An object R has m fields - (x1, x2, …, xm)
- Each xi ∈ [0,1]
- The objects are given in m lists L1, L2, …, Lm
- List Li: all objects sorted by xi value (highest first).
- An aggregation function t(x1, x2, …, xm)
- t(x1, x2, …, xm) - a monotone increasing function
- Wanted: the top k objects according to t
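As a point of reference, the problem statement above can be sketched in Python. This is only a naive baseline that touches every object, which is what the algorithms in this talk try to avoid. The names `sorted_lists` and `naive_top_k` and the three-object database are illustrative assumptions, not part of the talk.

```python
from typing import Callable, Dict, List, Tuple

Database = Dict[str, Tuple[float, ...]]


def sorted_lists(db: Database) -> List[List[Tuple[str, float]]]:
    """Build L1..Lm: list i holds (object, xi) pairs, highest xi first."""
    m = len(next(iter(db.values())))
    return [
        sorted(((obj, fields[i]) for obj, fields in db.items()),
               key=lambda pair: pair[1], reverse=True)
        for i in range(m)
    ]


def naive_top_k(db: Database, t: Callable, k: int) -> List[Tuple[str, float]]:
    """Baseline: score every object with t and keep the k best."""
    scored = sorted(db.items(), key=lambda item: t(item[1]), reverse=True)
    return [(obj, t(fields)) for obj, fields in scored[:k]]


# Hypothetical database with m = 2 fields per object.
db = {"a": (0.4, 0.84), "r": (0.5, 0.75), "s": (0.65, 0.85)}
lists = sorted_lists(db)
top = naive_top_k(db, min, 1)  # t = min is monotone
```

Here `t` receives the whole tuple of fields, so the built-in `min`, `max` and `sum` can be passed directly as aggregation functions.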
4Goal
- Touch as few objects as possible
- Access to object?
[Figure: two sorted lists]
List L1: s1 0.65, r1 0.5, a1 0.4
List L2: s2 0.85, a2 0.84, r2 0.75, b2 0.3, c2 0.2
5Where?
- Problem arises when combining information from several sources/criteria
- Concentrate on middleware complexity without changing subsystems
6Example Combining Fuzzy Information
- Lists are results of a query: find objects with
- color = "red" and shape = "round"
- Subsystems for color and for shape.
- Each returns a score in [0,1] for each object
- Aggregating function t is how the middleware system should combine the two criteria
- Example: t(R) = t(x1, x2) could be min(x1, x2)
7Example scheduling pages
- Each object - page in a data broadcast system
- 1st field - # of users requesting the page
- 2nd field - longest time a user has been waiting
- Combining function t - product of the two fields (geometric mean)
- Goal: find the page with the largest product
8Example Information Retrieval
[Figure: matrix of term weights for documents D1, D2, …, Dk, with entries such as W12]
- Query T1, T2, T3: find documents with the largest sum of entries
- Aggregation function t is Σ xi
9Modes of Access to the Lists
- Sequential/sorted access: obtain next object in list Li - cost cS
- Random access: for object R and 1 ≤ i ≤ m, obtain xi - cost cR
- Cost of an execution:
- cS · (# of seq. accesses) + cR · (# of random accesses)
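The cost measure above is a weighted count of the two kinds of access. A minimal sketch of such bookkeeping follows; the class name `AccessCounter` and the numbers are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class AccessCounter:
    """Counts accesses made by the middleware and prices them."""
    c_s: float                 # cost of one sequential/sorted access
    c_r: float                 # cost of one random access
    sorted_accesses: int = 0
    random_accesses: int = 0

    def cost(self) -> float:
        # cS * (# of seq. accesses) + cR * (# of random accesses)
        return self.c_s * self.sorted_accesses + self.c_r * self.random_accesses


# Hypothetical run: 8 sorted accesses, 8 random accesses, cR = 5 * cS.
counter = AccessCounter(c_s=1.0, c_r=5.0)
counter.sorted_accesses += 8
counter.random_accesses += 8
total = counter.cost()
```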
10Interesting Cases
- cR /cS is small
- cS ≈ cR or
- cR >> cS
- Number of lists m - small
11Fagin's Algorithm - FA
- For all lists L1, L2, …, Lm get next object in sorted order.
- Stop when there is a set of k objects that appeared in all lists.
- For every object R encountered
- retrieve all fields x1, x2, …, xm.
- Compute t(x1, x2, …, xm)
- Return top k objects
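The steps of FA can be sketched as follows. This is an illustrative in-memory version, not the authors' implementation: the function name `fagin_algorithm` is an assumption, and the database reuses the eight objects shown on the TA example slide later in the talk, under the assumption that they form the whole database.

```python
from typing import Callable, Dict, List, Set, Tuple

# Objects from the TA example slide, assumed to be the entire database.
EXAMPLE_DB = {
    "c": (0.9, 1 / 12), "b": (0.7, 1 / 11), "r": (0.4, 1 / 8), "a": (0.1, 1 / 13),
    "s": (0.05, 3 / 4), "w": (0.07, 2 / 3), "z": (0.09, 1 / 2), "q": (0.08, 1 / 4),
}


def fagin_algorithm(db: Dict[str, Tuple[float, ...]], t: Callable, k: int):
    """Return (top k as (object, score) pairs, depth reached in the lists)."""
    m = len(next(iter(db.values())))
    # List i: all objects sorted by field i, highest first.
    lists = [sorted(db, key=lambda obj, i=i: db[obj][i], reverse=True)
             for i in range(m)]
    seen_in: Dict[str, Set[int]] = {}   # object -> lists it has appeared in
    complete = 0                        # objects that appeared in all m lists
    depth = 0
    while complete < k and depth < len(db):
        for i in range(m):              # one sorted access per list
            obj = lists[i][depth]
            lists_seen = seen_in.setdefault(obj, set())
            lists_seen.add(i)
            if len(lists_seen) == m:
                complete += 1
        depth += 1
    # Random access: fetch all fields of every object encountered, then score.
    scored = sorted(((obj, t(db[obj])) for obj in seen_in),
                    key=lambda pair: pair[1], reverse=True)
    return scored[:k], depth


result, depth = fagin_algorithm(EXAMPLE_DB, min, 1)
```

On this database with t = min and k = 1, the sketch descends five levels before some object has appeared in both lists.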
12Correctness of FA...
- For any monotone t and any database D of objects, FA finds the top k objects.
- Proof: any object in the real top k is better in at least one field than the objects in the intersection, so it was seen under sorted access in that list.
13Performance of FA
- Performance, assuming that the fields are independent: Θ(N^((m-1)/m)).
- Better performance - correlation between fields
- Worse performance - negative correlation
- Bad aggregating function: max
14Goals of this work
- Improve complexity and analysis - worst case not meaningful
- Instead consider Instance Optimality
- Expand the range of functions - want to handle all monotone aggregating functions
- Simplify implementation
15Instance Optimality
- A: a class of algorithms
- D: a class of legal inputs
- For each algorithm A in class A and input D in class D, measure cost(A, D) ≥ 0.
- An algorithm A in class A is instance optimal over A and D if there are constants c1 and c2 s.t.
- for every algorithm A' in class A and every input D in class D
- cost(A, D) ≤ c1 · cost(A', D) + c2.
- c1 is called the optimality ratio
16Instance Optimality
- Common in competitive online analysis
- Compare an online decision making algorithm to the best offline one.
- Approximation Algorithms
- Compare the size that the best algorithm can find to the one the approx. algorithm finds
- In our case
- Offline corresponds to nondeterminism
17Instance Optimality
- We show algorithms that are instance optimal for a variety of
- Classes of algorithms
- Deterministic, probabilistic, approximate
- Databases
- Access cost functions
18Guidelines for Design of Algorithms
- Format: do sequential/sorted access (with random access on other fields) until you know that you have seen the top k.
- In general: greedy gathering of information. If a query might allow you to know the top k objects, do it.
- Works in all considered scenarios
19The Threshold Algorithm - TA
- For all lists L1, L2, …, Lm get next object in sorted order.
- For each object R returned
- Retrieve all fields x1, x2, …, xm.
- Compute t(x1, x2, …, xm)
- If one of top k answers so far - remember it.
- For each 1 ≤ i ≤ m, let xi be the bottom value seen in Li (so far)
- Define the threshold value τ to be t(x1, x2, …, xm)
- Stop when found k objects with t value ≥ τ.
- Return top k objects
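The steps of TA can be sketched in Python as below. This is a minimal in-memory illustration rather than the authors' implementation: `threshold_algorithm` is an assumed name, random access is modelled by a dictionary lookup, and the database is the set of eight objects from the example slide that follows, assumed to be complete.

```python
from typing import Callable, Dict, Tuple

# Objects from the example slide, assumed to be the entire database.
EXAMPLE_DB = {
    "c": (0.9, 1 / 12), "b": (0.7, 1 / 11), "r": (0.4, 1 / 8), "a": (0.1, 1 / 13),
    "s": (0.05, 3 / 4), "w": (0.07, 2 / 3), "z": (0.09, 1 / 2), "q": (0.08, 1 / 4),
}


def threshold_algorithm(db: Dict[str, Tuple[float, ...]], t: Callable, k: int):
    """Return (top k as (object, score) pairs, depth reached). Assumes db is non-empty."""
    m = len(next(iter(db.values())))
    lists = [sorted(db, key=lambda obj, i=i: db[obj][i], reverse=True)
             for i in range(m)]
    top: Dict[str, float] = {}          # buffer of at most k best objects so far
    depth = 0
    for depth in range(1, len(db) + 1):
        bottom = []                     # last value seen in each list
        for i in range(m):
            obj = lists[i][depth - 1]   # sorted access
            bottom.append(db[obj][i])
            top[obj] = t(db[obj])       # random access to the other fields
            if len(top) > k:            # keep the buffer bounded
                del top[min(top, key=top.get)]
        tau = t(tuple(bottom))          # threshold value
        if len(top) == k and min(top.values()) >= tau:
            break
    ranked = sorted(top.items(), key=lambda pair: pair[1], reverse=True)
    return ranked, depth


result, depth = threshold_algorithm(EXAMPLE_DB, min, 1)
```

Only the top-k buffer and the m bottom values are kept between rounds. On this database the sketch stops at depth 4, one level earlier than the intersection-based stopping rule of FA would.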
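20Example: m=2, k=1, t is min
Top of L1 (sorted by x1): c (0.9, 1/12), b (0.7, 1/11), r (0.4, 1/8), a (0.1, 1/13)
Top of L2 (sorted by x2): s (0.05, 3/4), w (0.07, 2/3), z (0.09, 1/2), q (0.08, 1/4)
Maintained Information
- Top object (so far)
- Bottom values x1, x2
- Threshold τ = t(x1, x2)
After depth 1: bottom values 0.9, 3/4; threshold 3/4; top object c, t(c) = 1/12
After depth 2: bottom values 0.7, 2/3; threshold 2/3; top object b, t(b) = 1/11
After depth 3: bottom values 0.4, 1/2; threshold 0.4; top object r, t(r) = 1/8
After depth 4: bottom values 0.1, 1/4; threshold 0.1; top object r, t(r) = 1/8 ≥ 0.1, so stop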
21Correctness of TA
- For any monotone t and any database D of objects, TA finds the top k objects.
- Proof: If object z was not seen
- ⇒ for each 1 ≤ i ≤ m, zi ≤ xi
- ⇒ t(z1, z2, …, zm) ≤ t(x1, x2, …, xm) = τ
22Implementation of TA
- Requires only bounded buffers
- Top k objects
- Bottom m values x1, x2, …, xm
23Robustness of TA
- Approximation: Suppose we want a (1+ε) approx.
- for any R returned and R' not returned
- t(R') ≤ (1+ε) t(R)
- Modified stopping condition:
- Stop when found k objects with t value at least τ/(1+ε).
- Early stopping: can modify TA so that at any point the user is
- Given current view of top k list
- Given a guarantee about the ε approximation
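The modified stopping condition can be sketched as a one-line change to the TA loop. The function name `approximate_ta` and the choice ε = 9 are assumptions made for illustration; the database is again the eight objects from the TA example slide, assumed to be complete.

```python
from typing import Callable, Dict, Tuple

# Objects from the TA example slide, assumed to be the entire database.
EXAMPLE_DB = {
    "c": (0.9, 1 / 12), "b": (0.7, 1 / 11), "r": (0.4, 1 / 8), "a": (0.1, 1 / 13),
    "s": (0.05, 3 / 4), "w": (0.07, 2 / 3), "z": (0.09, 1 / 2), "q": (0.08, 1 / 4),
}


def approximate_ta(db: Dict[str, Tuple[float, ...]], t: Callable, k: int, eps: float):
    """TA with the relaxed stopping rule: k objects with t value >= tau / (1 + eps)."""
    m = len(next(iter(db.values())))
    lists = [sorted(db, key=lambda obj, i=i: db[obj][i], reverse=True)
             for i in range(m)]
    top: Dict[str, float] = {}
    depth = 0
    for depth in range(1, len(db) + 1):
        bottom = []
        for i in range(m):
            obj = lists[i][depth - 1]   # sorted access
            bottom.append(db[obj][i])
            top[obj] = t(db[obj])       # random access
            if len(top) > k:
                del top[min(top, key=top.get)]
        tau = t(tuple(bottom))
        # Relaxed test: eps = 0 gives back the exact stopping condition.
        if len(top) == k and min(top.values()) >= tau / (1 + eps):
            break
    ranked = sorted(top.items(), key=lambda pair: pair[1], reverse=True)
    return ranked, depth


loose, loose_depth = approximate_ta(EXAMPLE_DB, min, 1, eps=9.0)
exact, exact_depth = approximate_ta(EXAMPLE_DB, min, 1, eps=0.0)
```

With ε = 9 the sketch stops after the first round and returns c with t(c) = 1/12; the true best object r has t(r) = 1/8, which is within the promised factor of 1+ε = 10.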
24Instance Optimality
- Intuition: Cannot stop any sooner, since the next object to be explored might have the threshold value.
- But, life is a bit more delicate...
25Wild Guesses
- Wild guess: random access for a field i of an object R that has not been sequentially accessed before
- Neither FA nor TA use wild guesses
- Subsystem might not allow wild guesses
- More exotic queries: jth position in ith list...
26Instance Optimality- No Wild Guesses
- Theorem: For any monotone t let
- A be the class of algorithms that
- correctly find top k answers for every database with aggregation function t.
- do not make wild guesses
- D be the class of all databases.
- Then TA is instance optimal over A and D
- Optimality ratio is m + m² cR/cS - best possible!
27Proof of Optimality
- Claim: If TA gets to iteration d, then any (correct) algorithm A' must get to depth d-1
- Proof: let Rmax be the top object returned by TA
- τ(d) ≤ t(Rmax) < τ(d-1)
- ⇒ There exists D' with R' at level d-1
- R' = (x1(d-1), x2(d-1), …, xm(d-1))
- Where A' fails
28Do wild guesses help?
- Aggregation function - min, k=1
- Database of 2n+1 objects:
- object: 1 2 … n n+1 n+2 … 2n+1
- x1: 1 1 … 1 1 0 … 0
- x2: 0 0 … 0 1 1 … 1
- L1: 1 2 … n n+1 … 2n+1
- L2: 2n+1 … n+1 n … 1
- Wild guess: access object n+1 and top elements
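To make the gap concrete, the database above can be built for a given n and the depth at which sorted access first meets the winning object can be measured. This is an illustrative sketch; the function names are assumptions, and the field values follow the reading that object n+1 is the only one with both fields equal to 1.

```python
from typing import Dict, List, Tuple


def wild_guess_database(n: int) -> Tuple[Dict[int, Tuple[int, int]], List[int], List[int]]:
    """Objects 1..2n+1: x1 = 1 for objects 1..n+1, x2 = 1 for objects n+1..2n+1."""
    db = {obj: (1 if obj <= n + 1 else 0, 1 if obj >= n + 1 else 0)
          for obj in range(1, 2 * n + 2)}
    l1 = list(range(1, 2 * n + 2))       # 1, 2, ..., 2n+1 (x1 non-increasing)
    l2 = list(range(2 * n + 1, 0, -1))   # 2n+1, ..., 1 (x2 non-increasing)
    return db, l1, l2


def sorted_depth_to_winner(db, l1, l2) -> int:
    """Depth at which parallel sorted access first sees an object with min = 1."""
    for depth, (a, b) in enumerate(zip(l1, l2), start=1):
        if min(db[a]) == 1 or min(db[b]) == 1:
            return depth
    return len(l1)


db, l1, l2 = wild_guess_database(5)
depth = sorted_depth_to_winner(db, l1, l2)   # n + 1 levels in each list
winner = [obj for obj, fields in db.items() if min(fields) == 1]
```

An algorithm restricted to sorted access (plus random access on objects already seen) needs n+1 levels here, whereas a wild guess at object n+1 together with the top element of each list settles the answer with a constant number of accesses.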
29Strict Monotonicity
- An aggregation function t is strictly monotone if
- when for each 1 ≤ i ≤ m, xi < x'i
- Then
- t(x1, x2, …, xm) < t(x'1, x'2, …, x'm)
- Examples: min, max, avg...
30Instance Optimality - Wild Guesses
- Theorem: For any strictly monotone t let
- A be the class of algorithms that
- correctly find top k answers for every database.
- D be the class of all databases with distinct values in each field.
- Then TA is instance optimal over A and D
- Optimality ratio is c m where c = max{cR/cS, cS/cR}
31Related Work
- An algorithm similar to TA was discovered independently by two other groups
- Nepal and Ramakrishna
- Güntzer, Balke and Kiessling
- No instance optimality analysis
- Hence proposed modifications that are not instance optimal algorithms
Power of Abstraction?
32Dealing with the Cost of Random Access
- In some scenarios random access may be impossible
- Cannot ask a major search engine for its internal score on some document
- In some scenarios random access may be expensive
- Cost corresponds to disk access (seq. vs. random)
- Need algorithms to deal with these scenarios
- NRA - No Random Access
- CA - Combined Algorithm
33No Random Access - NRA
- March down the lists getting the next object
- Maintain
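- For any object R with discovered fields S ⊆ {1, …, m}
- W(R) = t(x1, x2, …, x|S|, 0, …, 0): undiscovered fields replaced by 0
- Worst (smallest) value t(R) can obtain
- B(R) = t(x1, x2, …, x|S|, x|S|+1, …, xm): undiscovered fields replaced by the bottom values seen in their lists
- Best (largest) value t(R) can obtain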
34maintained information (NRA)
- Top k list, based on k largest W(R) seen so far
- Ties broken according to B values
- Define Mk to be the kth largest W(R) in top k list
- An object R is viable if B(R) > Mk
- Stop when there are no viable elements left, i.e. B(R) ≤ Mk for all R ∉ top list
- Return the top k list
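The bookkeeping for NRA can be sketched as follows. This is a simple, unoptimized illustration (it recomputes W and B for every seen object each round), not the authors' implementation. The name `nra` is an assumption, unseen objects are treated as having B equal to t of the bottom values, and the database is the eight objects from the TA example slide, assumed to be complete.

```python
from typing import Callable, Dict, List, Tuple

# Objects from the TA example slide, assumed to be the entire database.
EXAMPLE_DB = {
    "c": (0.9, 1 / 12), "b": (0.7, 1 / 11), "r": (0.4, 1 / 8), "a": (0.1, 1 / 13),
    "s": (0.05, 3 / 4), "w": (0.07, 2 / 3), "z": (0.09, 1 / 2), "q": (0.08, 1 / 4),
}


def nra(db: Dict[str, Tuple[float, ...]], t: Callable, k: int):
    """Sorted access only. Returns (top k objects, depth reached)."""
    m = len(next(iter(db.values())))
    lists = [sorted(db, key=lambda obj, i=i: db[obj][i], reverse=True)
             for i in range(m)]
    known: Dict[str, Dict[int, float]] = {}   # object -> discovered fields
    bottom = [0.0] * m                        # last value seen in each list
    top_k: List[str] = []
    depth = 0

    def worst(obj):   # W(R): undiscovered fields replaced by 0
        return t(tuple(known[obj].get(i, 0.0) for i in range(m)))

    def best(obj):    # B(R): undiscovered fields replaced by bottom values
        return t(tuple(known[obj].get(i, bottom[i]) for i in range(m)))

    for depth in range(1, len(db) + 1):
        for i in range(m):
            obj = lists[i][depth - 1]         # sorted access
            bottom[i] = db[obj][i]
            known.setdefault(obj, {})[i] = bottom[i]
        # Top k by W, ties broken by B.
        ranked = sorted(known, key=lambda o: (worst(o), best(o)), reverse=True)
        top_k, rest = ranked[:k], ranked[k:]
        if len(top_k) < k:
            continue
        m_k = worst(top_k[-1])
        # Stop when no object outside the top list (seen or unseen) is viable.
        if t(tuple(bottom)) <= m_k and all(best(o) <= m_k for o in rest):
            break
    return top_k, depth


result, depth = nra(EXAMPLE_DB, min, 1)
```

The sketch returns objects without their exact t values, since without random access those may never be fully known. On this database with t = min it needs depth 5, because W(R) stays at 0 until both fields of some object have been seen.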
35Correctness
- For any monotone t and any database D of objects, NRA finds the top k objects.
- Proof: At any point, for all objects t(R) ≤ B(R)
- Once B(R) ≤ Mk for all but top list
- ⇒ no other objects with t(R) > Mk
36Optimality
- Theorem: For any monotone t let
- A be the class of algorithms that
- correctly find top k answers for every database.
- make only sequential access
- D be the class of all databases.
- Then NRA is instance optimal over A and D
- Optimality Ratio is m !
37Implementation of NRA
- Not so simple - need to update B(R) for all existing R when x1, x2, …, xm changes
- For specific aggregation functions (min) good data structures
- Open Problem: Which aggregation functions have good data structures?
38Combined Algorithm CA
- Can combine TA and NRA
- Let h = cR/cS
- Maintain information as in NRA
- For every h sequential accesses
- Do m random accesses on an object, one from each list. Choose the top viable object for which not all fields are known
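One way to read the combination is sketched below: run the NRA bookkeeping, and after every h rounds of sorted access fetch the missing fields of the viable object with the largest B. This is a rough illustration under several assumptions (h is rounded down to an integer of at least 1, "every h sequential accesses" is counted in rounds, and the name `combined_algorithm` is made up); it should not be taken as the exact algorithm analysed in the talk. The database is the eight objects from the TA example slide, assumed to be complete.

```python
from typing import Callable, Dict, List, Tuple

# Objects from the TA example slide, assumed to be the entire database.
EXAMPLE_DB = {
    "c": (0.9, 1 / 12), "b": (0.7, 1 / 11), "r": (0.4, 1 / 8), "a": (0.1, 1 / 13),
    "s": (0.05, 3 / 4), "w": (0.07, 2 / 3), "z": (0.09, 1 / 2), "q": (0.08, 1 / 4),
}


def combined_algorithm(db, t: Callable, k: int, c_s: float = 1.0, c_r: float = 1.0):
    """Returns (top k objects, total access cost)."""
    m = len(next(iter(db.values())))
    lists = [sorted(db, key=lambda obj, i=i: db[obj][i], reverse=True)
             for i in range(m)]
    h = max(1, int(c_r // c_s))               # assumed rounding of h = cR/cS
    known: Dict[str, Dict[int, float]] = {}
    bottom = [0.0] * m
    sorted_acc = random_acc = 0
    top_k: List[str] = []

    def worst(obj):
        return t(tuple(known[obj].get(i, 0.0) for i in range(m)))

    def best(obj):
        return t(tuple(known[obj].get(i, bottom[i]) for i in range(m)))

    def rank():
        ranked = sorted(known, key=lambda o: (worst(o), best(o)), reverse=True)
        top = ranked[:k]
        m_k = worst(top[-1]) if len(top) == k else float("-inf")
        return top, ranked[k:], m_k

    for depth in range(1, len(db) + 1):
        for i in range(m):                    # sorted access, as in NRA
            obj = lists[i][depth - 1]
            bottom[i] = db[obj][i]
            known.setdefault(obj, {})[i] = bottom[i]
            sorted_acc += 1
        top_k, rest, m_k = rank()
        if depth % h == 0:                    # random-access step
            viable = [o for o in known if len(known[o]) < m and best(o) > m_k]
            if viable:
                target = max(viable, key=best)
                random_acc += m - len(known[target])
                known[target] = dict(enumerate(db[target]))
                top_k, rest, m_k = rank()
        if (len(top_k) == k and t(tuple(bottom)) <= m_k
                and all(best(o) <= m_k for o in rest)):
            break
    return top_k, c_s * sorted_acc + c_r * random_acc


cheap_random = combined_algorithm(EXAMPLE_DB, min, 1, c_s=1.0, c_r=1.0)
costly_random = combined_algorithm(EXAMPLE_DB, min, 1, c_s=1.0, c_r=10.0)
```

When random access is as cheap as sorted access the sketch behaves much like TA; when h exceeds the number of rounds it never makes a random access and behaves like NRA.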
39Instance Optimality
- Instance optimality statement a bit more complex
- Under certain assumptions (including t = min, sum) CA is instance optimal with optimality ratio 2m
40Further Research
- Middleware Scenario
- Better implementations of NRA
- Is large storage essential?
- Additional useful information in each list?
- How widely applicable is instance optimality?
- String Matching, Stable Marriage...
- Aggregation functions and methods in other scenarios
- Rank Aggregation of Search Engines
- P=NP?
41More Details
- See
- www.wisdom.weizmann.ac.il/naor/PAPERS/middle_agg.html