Aggregation Algorithms and Instance Optimality

1
Aggregation Algorithms and Instance Optimality
  • Moni Naor

Weizmann Institute
Joint work with Ron Fagin and Amnon Lotem
2
Aggregating information from several lists/sources
  • Define the problem
  • Ways to evaluate algorithms
  • New algorithms
  • Further Research

3
The problem
  • Database D of N objects
  • An object R has m fields - (x1, x2, ..., xm)
  • Each xi ∈ [0,1]
  • The objects are given in m lists L1, L2, ..., Lm
  • List Li: all objects sorted by xi value.
  • An aggregation function t(x1, x2, ..., xm)
  • t(x1, x2, ..., xm) - a monotone increasing function
  • Wanted: top k objects according to t

4
Goal
  • Touch as few objects as possible
  • Access to object?

[Figure: two sorted lists with the entries shown on the slide]
List L1: s1 0.65, r1 0.5, a1 0.4
List L2: s2 0.85, a2 0.84, r2 0.75, b2 0.3, c2 0.2
5
Where?
  • Problem arises when combining information from
    several sources/criteria
  • Concentrate on middleware complexity without
    changing subsystems

6
Example: Combining Fuzzy Information
  • Lists are results of a query: find object with
  • color = "red" and shape = "round"
  • Subsystems for color and for shape.
  • Each returns a score in [0,1] for each object
  • Aggregating function t is how the middleware
    system should combine the two criteria
  • Example: t(R = (x1, x2)) could be min(x1, x2)

7
Example: scheduling pages
  • Each object - page in a data broadcast system
  • 1st field - # of users requesting the page
  • 2nd field - longest time a user is waiting
  • Combining function t - product of the two fields
    (geometric mean)
  • Goal: find the page with the largest product

8
Example: Information Retrieval
[Figure: table of documents D1, D2, ..., Dk with weight entries such as W12]
Query T1, T2, T3: find documents with largest sum
of entries
Aggregation function t is Σ xi
9
Modes of Access to the Lists
  • Sequential/sorted access: obtain next object in
    list Li
  • cost cS
  • Random access: for object R and 1 ≤ i ≤ m obtain
    xi
  • cost cR
  • Cost of an execution:
  • cS · (# of seq. accesses) + cR · (# of random
    accesses)
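
The cost measure on this slide can be written as a one-line function. This is only a sketch; the names execution_cost, c_s and c_r are assumptions chosen for illustration and stand for cS and cR.

```python
def execution_cost(num_sorted, num_random, c_s=1.0, c_r=1.0):
    """Middleware cost: cS * (# sorted accesses) + cR * (# random accesses)."""
    return c_s * num_sorted + c_r * num_random


# Example: 8 sorted and 8 random accesses when a random access costs 10x more.
print(execution_cost(8, 8, c_s=1.0, c_r=10.0))
```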

10
Interesting Cases
  • cR /cS is small
  • cS ≈ cR, or
  • cR >> cS
  • Number of lists m - small

11
Fagin's Algorithm - FA
  • For all lists L1, L2, ..., Lm get next object in
    sorted order.
  • Stop when there is a set of k objects that appeared
    in all lists.
  • For every object R encountered
  • retrieve all fields x1, x2, ..., xm.
  • Compute t(x1, x2, ..., xm)
  • Return top k objects
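
The steps of FA can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the in-memory dictionary db, the function name fagin_algorithm and the access counters are assumptions, and sorted and random access are simulated by list and dictionary lookups. The sample data are the eight objects of the m = 2, k = 1 example that the slides use for TA.

```python
def fagin_algorithm(db, t, k):
    """db maps an object name to its m field values; t is a monotone aggregation."""
    m = len(next(iter(db.values())))
    # List Li: object names sorted by field i, highest value first.
    lists = [sorted(db, key=lambda r, i=i: -db[r][i]) for i in range(m)]
    seen_in = {}          # object name -> set of list indices where it appeared
    sorted_accesses = 0
    # Phase 1: sorted access in parallel until k objects appeared in all m lists.
    for depth in range(len(db)):
        for i in range(m):
            seen_in.setdefault(lists[i][depth], set()).add(i)
            sorted_accesses += 1
        if sum(len(s) == m for s in seen_in.values()) >= k:
            break
    # Phase 2: random access for every field still unknown of every object seen.
    random_accesses = sum(m - len(s) for s in seen_in.values())
    ranked = sorted(seen_in, key=lambda r: -t(db[r]))
    return ranked[:k], sorted_accesses, random_accesses


DB = {"s": (0.05, 3/4), "c": (0.9, 1/12), "w": (0.07, 2/3), "b": (0.7, 1/11),
      "z": (0.09, 1/2), "r": (0.4, 1/8), "q": (0.08, 1/4), "a": (0.1, 1/13)}
print(fagin_algorithm(DB, min, 1))
```

On this data the sketch has to go five entries deep before one object (r) has appeared in both lists, which is what drives FA's cost.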

12
Correctness of FA...
  • For any monotone t and any database D of objects,
    FA finds the top k objects.
  • Proof: any object in the real top k is better in
    at least one field than the objects in the
    intersection, so it was seen under sorted access.

13
Performance of FA
  • Performance assuming that the fields are
    independent: Θ(N^((m-1)/m)).
  • Better performance - correlation between fields
  • Worse performance - negative correlation
  • Bad aggregating function: max

14
Goals of this work
  • Improve complexity and analysis - worst case
    not meaningful
  • Instead consider Instance Optimality
  • Expand the range of functions: want to handle
    all monotone aggregating functions
  • Simplify implementation

15
Instance Optimality
  • A: a class of algorithms
  • D: a class of legal inputs
  • For an algorithm B in class A and an input D in
    class D measure cost(B,D) ≥ 0.
  • An algorithm B in class A is instance optimal over
    A and D if there are constants c1 and c2 s.t.
  • for every algorithm B' in class A and every input
    D in class D
  • cost(B,D) ≤ c1 · cost(B',D) + c2.
  • c1 is called the optimality ratio

16
Instance Optimality
  • Common in competitive online analysis
  • Compare an online decision making algorithm to
    the best offline one.
  • Approximation Algorithms
  • Compare the size that the best algorithm can find
    to the one the approx. algorithm finds
  • In our case
  • Offline corresponds to nondeterminism

17
Instance Optimality
  • We show algorithms that are instance optimal for
    a variety of
  • Classes of algorithms
  • deterministic, probabilistic, approximate
  • Databases
  • access cost functions

18
Guidelines for Design of Algorithms
  • Format: do sequential/sorted access (with
    random access on other fields) until you know
    that you have seen the top k.
  • In general: greedy gathering of information. If a
    query might allow you to know the top k objects,
    do it.
  • Works in all considered scenarios

19
The Threshold Algorithm - TA
  • For all lists L1, L2, ..., Lm get next object in
    sorted order.
  • For each object R returned
  • Retrieve all fields x1, x2, ..., xm.
  • Compute t(x1, x2, ..., xm)
  • If one of top k answers so far - remember it.
  • ∀ 1 ≤ i ≤ m let xi be bottom value seen in Li (so far)
  • Define the threshold value τ to be
    t(x1, x2, ..., xm)
  • Stop when found k objects with t value ≥ τ.
  • Return top k objects
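
A minimal Python sketch of the TA loop described above. The dictionary db, the function name threshold_algorithm and the simulation of sorted and random access by in-memory lookups are assumptions for illustration. The data are the eight objects of the m = 2, k = 1, t = min example on the next slide.

```python
def threshold_algorithm(db, t, k):
    """db maps an object name to its m field values; t is a monotone aggregation."""
    m = len(next(iter(db.values())))
    # List Li: object names sorted by field i, highest value first.
    lists = [sorted(db, key=lambda r, i=i: -db[r][i]) for i in range(m)]
    top = {}                                   # bounded buffer: at most k scores
    for depth in range(len(db)):
        bottom = []                            # bottom value seen in each list
        for i in range(m):
            r = lists[i][depth]                # sorted access on list Li
            bottom.append(db[r][i])
            top[r] = t(db[r])                  # random access for other fields
            if len(top) > k:                   # keep only the k best so far
                del top[min(top, key=top.get)]
        tau = t(tuple(bottom))                 # threshold value
        if len(top) == k and min(top.values()) >= tau:
            break
    return sorted(top, key=lambda r: -top[r]), depth + 1, tau


DB = {"s": (0.05, 3/4), "c": (0.9, 1/12), "w": (0.07, 2/3), "b": (0.7, 1/11),
      "z": (0.09, 1/2), "r": (0.4, 1/8), "q": (0.08, 1/4), "a": (0.1, 1/13)}
print(threshold_algorithm(DB, min, 1))
```

The sketch keeps only the current top k scores and the m bottom values, which is the bounded-buffer property the slides point out later.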

20
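Example: m=2, k=1, t is min
Database, object (x1, x2):
  s (0.05, 3/4)    c (0.9, 1/12)
  w (0.07, 2/3)    b (0.7, 1/11)
  z (0.09, 1/2)    r (0.4, 1/8)
  q (0.08, 1/4)    a (0.1, 1/13)
Maintained Information
  • Top object (so far)
  • Bottom values x1, x2
  • Threshold τ

Step  Bottom x1  Bottom x2  Threshold τ  Top object (so far)
1     0.9 (c)    3/4 (s)    3/4          c, t(c) = 1/12
2     0.7 (b)    2/3 (w)    2/3          b, t(b) = 1/11
3     0.4 (r)    1/2 (z)    0.4          r, t(r) = 1/8
4     0.1 (a)    1/4 (q)    0.1          r, t(r) = 1/8 ≥ τ, stop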
21
Correctness of TA
  • For any monotone t and any database D of objects,
    TA finds the top k objects.
  • Proof: If object z was not seen
  • ⇒ ∀ 1 ≤ i ≤ m zi ≤ xi
  • ⇒ t(z1, z2, ..., zm) ≤ t(x1, x2, ..., xm) = τ

22
Implementation of TA
  • Requires only bounded buffers
  • Top k objects
  • Bottom m values x1, x2, ..., xm

23
Robustness of TA
  • Approximation: suppose we want a (1+ε) approx.
  • - for any R returned and R' not returned
  • t(R') ≤ (1+ε) t(R)
  • Modified stopping condition
  • Stop when found k objects with t value at least
    τ/(1+ε).
  • Early stopping: can modify TA so that at any
    point the user is
  • given the current view of the top k list,
  • given a guarantee about the ε approximation
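
The modified stopping condition can be illustrated by changing a single comparison in a TA loop. The sketch below is an assumption-laden illustration (names approx_ta, db and eps are hypothetical, access is simulated in memory), run on the eight objects of the earlier m = 2, t = min example.

```python
def approx_ta(db, t, k, eps):
    """TA with the relaxed rule: stop once k objects score at least tau / (1 + eps)."""
    m = len(next(iter(db.values())))
    lists = [sorted(db, key=lambda r, i=i: -db[r][i]) for i in range(m)]
    top = {}
    for depth in range(len(db)):
        row = [lists[i][depth] for i in range(m)]          # one sorted access per list
        for r in row:
            top[r] = t(db[r])                              # random access for other fields
            if len(top) > k:
                del top[min(top, key=top.get)]
        tau = t(tuple(db[r][i] for i, r in enumerate(row)))  # threshold from bottom values
        if len(top) == k and min(top.values()) >= tau / (1 + eps):
            break
    return sorted(top, key=lambda r: -top[r]), depth + 1


DB = {"s": (0.05, 3/4), "c": (0.9, 1/12), "w": (0.07, 2/3), "b": (0.7, 1/11),
      "z": (0.09, 1/2), "r": (0.4, 1/8), "q": (0.08, 1/4), "a": (0.1, 1/13)}
print(approx_ta(DB, min, 1, eps=0.0))   # exact TA
print(approx_ta(DB, min, 1, eps=2.5))   # looser guarantee, may stop earlier
```

With eps = 0 the rule is the exact TA rule; a larger eps trades accuracy of the guarantee for an earlier stop.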

24
Instance Optimality
  • Intuition: cannot stop any sooner, since the next
    object to be explored might have the threshold
    value.
  • But, life is a bit more delicate...

25
Wild Guesses
  • Wild guesses: random access for a field i of
    object R that has not been sequentially accessed
    before
  • Neither FA nor TA use wild guesses
  • Subsystem might not allow wild guesses
  • More exotic queries: jth position in ith list...

26
Instance Optimality - No Wild Guesses
  • Theorem: For any monotone t let
  • A be the class of algorithms that
  • correctly find top k answers for every database
    with aggregation function t,
  • do not make wild guesses.
  • D be the class of all databases.
  • Then TA is instance optimal over A and D
  • Optimality ratio is m + m² cR/cS - best possible!

27
Proof of Optimality
  • Claim: If TA gets to iteration d, then any
    (correct) algorithm A must get to depth d-1
  • Proof: let Rmax be top object returned by TA
  • τ(d) ≤ t(Rmax) ≤ τ(d-1)
  • ⇒ There exists D' with R' at level d-1
  • R' = (x1(d-1), x2(d-1), ..., xm(d-1))
  • where A fails

28
Do wild guesses help?
  • Aggregation function - min, k=1
  • Database - 1  2  ...  n  n+1  ...  2n+1
  • x1:        1  1  ...  1  1    0 ... 0
  • x2:        0  0  ...  0  1    1 ... 1
  • L1: 1  2  ...  n  n+1  ...  2n+1
  • L2: 2n+1  ...  n+1  n  ...  1
  • Wild guess: access object n+1 and top elements
29
Strict Monotonicity
  • An aggregation function t is strictly monotone if
  • when ∀ 1 ≤ i ≤ m xi < x'i
  • then
  • t(x1, x2, ..., xm) < t(x'1, x'2, ..., x'm)
  • Examples: min, max, avg...

30
Instance Optimality - Wild Guesses
  • Theorem: For any strictly monotone t let
  • A be the class of algorithms that
  • correctly find top k answers for every database.
  • D be the class of all databases with distinct
    values in each field.
  • Then TA is instance optimal over A and D
  • Optimality ratio is c · m where
    c = max{cR/cS, cS/cR}

31
Related Work
  • An algorithm similar to TA was discovered
    independently by two other groups
  • Nepal and Ramakrishna
  • Güntzer, Balke and Kiessling
  • No instance optimality analysis
  • Hence they proposed modifications that are not
    instance optimal

Power of Abstraction?
32
Dealing with the Cost of Random Access
  • In some scenarios random access may be impossible
  • Cannot ask a major search engine for its internal
    score on some document
  • In some scenarios random access may be expensive
  • Cost corresponds to disk access (seq. vs. random)
  • Need algorithms to deal with these scenarios
  • NRA - No Random Access
  • CA - Combined Algorithm

33
No Random Access - NRA
  • March down the lists getting the next object
  • Maintain
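  • For any object R with discovered fields
    S ⊆ {1, ..., m}
  • W(R) = t(x1, x2, ..., x|S|, 0, ..., 0)
  • Worst (smallest) value t(R) can obtain
    (unknown fields replaced by 0)
  • B(R) = t(x1, x2, ..., x|S|, x|S|+1, ..., xm)
  • Best (largest) value t(R) can obtain
    (unknown fields replaced by the bottom values)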

34
Maintained information (NRA)
  • Top k list, based on k largest W(R) seen so far
  • Ties broken according to B values
  • Define Mk to be the kth largest W(R) in top k
    list
  • An object R is viable if B(R) > Mk
  • Stop when there are no viable elements left, i.e.
    B(R) ≤ Mk for all R ∉ top list
  • Return the top k list
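
The NRA bookkeeping can be sketched as follows. This is an illustrative simplification, not the authors' implementation: it recomputes W(R) and B(R) for every seen object at each depth (the slides note that doing this efficiently is the hard part), and it also treats a never-seen object as having best value t(bottom values), which is an assumption added here to make the stopping test safe. The names nra, db and known are hypothetical.

```python
def nra(db, t, k):
    """No Random Access: only sorted access, with worst/best bounds per object."""
    m = len(next(iter(db.values())))
    lists = [sorted(db, key=lambda r, i=i: -db[r][i]) for i in range(m)]
    known = {}                                  # name -> {field index: value}
    top = []
    for depth in range(len(db)):
        bottom = []
        for i in range(m):
            r = lists[i][depth]                 # sorted access only
            known.setdefault(r, {})[i] = db[r][i]
            bottom.append(db[r][i])
        # W(R): unknown fields set to 0.  B(R): unknown fields set to bottom values.
        w = {r: t(tuple(f.get(i, 0.0) for i in range(m))) for r, f in known.items()}
        b = {r: t(tuple(f.get(i, bottom[i]) for i in range(m))) for r, f in known.items()}
        ranked = sorted(known, key=lambda r: (-w[r], -b[r]))
        top, rest = ranked[:k], ranked[k:]
        if len(top) < k:
            continue
        mk = w[top[-1]]                         # kth largest W(R)
        # An object never seen can score at most t(bottom values).
        unseen_ok = len(known) == len(db) or t(tuple(bottom)) <= mk
        if unseen_ok and all(b[r] <= mk for r in rest):
            break                               # no viable object outside the top list
    return top, depth + 1


DB = {"s": (0.05, 3/4), "c": (0.9, 1/12), "w": (0.07, 2/3), "b": (0.7, 1/11),
      "z": (0.09, 1/2), "r": (0.4, 1/8), "q": (0.08, 1/4), "a": (0.1, 1/13)}
print(nra(DB, min, 1))
```

With t = min, W(R) stays 0 until all fields of R are known, so on this data the sketch needs one more level of sorted access than TA but performs no random access at all.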

35
Correctness
  • For any monotone t and any database D of objects,
    NRA finds the top k objects.
  • Proof: At any point, for all objects t(R) ≤ B(R)
  • Once B(R) ≤ Mk for all but the top list
  • ⇒ no other objects with t(R) > Mk

36
Optimality
  • Theorem: For any monotone t let
  • A be the class of algorithms that
  • correctly find top k answers for every database.
  • make only sequential access
  • D be the class of all databases.
  • Then NRA is instance optimal over A and D
  • Optimality ratio is m.

37
Implementation of NRA
  • Not so simple - need to update B(R) for all
    existing R when the bottom values x1, x2, ..., xm
    change
  • For specific aggregation functions (min) good
    data structures
  • Open Problem: Which aggregation functions have
    good data structures?

38
Combined Algorithm CA
  • Can combine TA and NRA
  • Let h = cR/cS
  • Maintain information as in NRA
  • For every h sequential accesses
  • Do m random accesses on an object from each list.
    Choose the top viable object for which not all
    fields are known
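
One way to read the CA outline is sketched below. It is a simplified interpretation under stated assumptions, not the algorithm as analysed by the authors: h is taken to be a positive integer, one random-access phase is run every h levels of sorted access, and the phase resolves the single viable object with the largest B(R) whose fields are not all known. The names combined_algorithm, known and bounds are hypothetical.

```python
def combined_algorithm(db, t, k, h):
    """NRA-style bounds plus one random-access phase every h levels of sorted access."""
    m = len(next(iter(db.values())))
    lists = [sorted(db, key=lambda r, i=i: -db[r][i]) for i in range(m)]
    known, random_accesses, top = {}, 0, []

    def bounds(bottom):
        # W(R): unknown fields are 0.  B(R): unknown fields are the bottom values.
        w = {r: t(tuple(f.get(i, 0.0) for i in range(m))) for r, f in known.items()}
        b = {r: t(tuple(f.get(i, bottom[i]) for i in range(m))) for r, f in known.items()}
        ranked = sorted(known, key=lambda r: (-w[r], -b[r]))
        mk = w[ranked[k - 1]] if len(ranked) >= k else float("-inf")
        return b, ranked, mk

    for depth in range(1, len(db) + 1):
        bottom = []
        for i in range(m):
            r = lists[i][depth - 1]             # sorted access
            known.setdefault(r, {})[i] = db[r][i]
            bottom.append(db[r][i])
        if depth % h == 0:
            b, ranked, mk = bounds(bottom)
            viable = [r for r in known if len(known[r]) < m and b[r] > mk]
            if viable:
                r = max(viable, key=b.get)      # top viable, fields incomplete
                random_accesses += m - len(known[r])
                known[r] = dict(enumerate(db[r]))   # random access for the rest
        b, ranked, mk = bounds(bottom)
        top, rest = ranked[:k], ranked[k:]
        unseen_ok = len(known) == len(db) or t(tuple(bottom)) <= mk
        if len(top) == k and unseen_ok and all(b[r] <= mk for r in rest):
            break
    return top, depth, random_accesses


DB = {"s": (0.05, 3/4), "c": (0.9, 1/12), "w": (0.07, 2/3), "b": (0.7, 1/11),
      "z": (0.09, 1/2), "r": (0.4, 1/8), "q": (0.08, 1/4), "a": (0.1, 1/13)}
print(combined_algorithm(DB, min, 1, h=2))
```

The design intent is that random accesses are rationed to roughly one phase per h sorted accesses, so their total cost stays comparable to the sorted-access cost.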

39
Instance Optimality
  • Instance optimality statement is a bit more complex
  • Under certain assumptions (including t = min,
    sum) CA is instance optimal with optimality ratio
    2m

40
Further Research
  • Middleware Scenario
  • Better implementations of NRA
  • Is large storage essential?
  • Additional useful information in each list?
  • How widely applicable is instance optimality?
  • String Matching, Stable Marriage...
  • Aggregation functions and methods in other
    scenarios
  • Rank Aggregation of Search Engines
  • P=NP?

41
More Details
  • See
  • www.wisdom.weizmann.ac.il/naor/PAPERS/middle_agg.html