1
Reverse Nearest Neighbor Aggregates Over Data Streams
Flip Korn, S. Muthukrishnan, and Divesh Srivastava
VLDB 2002
Alexander Izbinsky
2
Background
  • RNN(q) returns the set of data points that have
    the query point q as their nearest neighbor.
  • Advanced database applications
  • Fixed wireless telephone access application,
    load detection problem: count how many users
    are currently using a specific base station q.
    If q's load is too heavy, activate an inactive
    base station to lighten the load of the
    overloaded base station.
  • Asymmetric property
  • The nearest neighbor relation is not symmetric:
    the set of points that are closest to a query
    point (i.e., the nearest neighbors) differs from
    the set of points that have the query point as
    their nearest neighbor (called the reverse
    nearest neighbors).

3
Nonsymmetrical Property of RNN Queries
  • NN(q) = p does not imply NN(p) = q
  • If p is the nearest neighbor of q, then q need
    not be the nearest neighbor of p (in this case
    the nearest neighbor of p is r).
  • Efficient NN algorithms therefore cannot be
    applied directly to solve RNN problems; dedicated
    RNN algorithms are needed.
  • A straightforward solution is to check, for each
    point, whether it has q as its nearest neighbor.
    This is not suitable for large data sets.
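
The straightforward per-point check can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names `nearest_neighbor` and `rnn_brute_force` and the three sample points (standing in for q, p and r) are hypothetical and do not come from the slides.

```python
from math import dist

def nearest_neighbor(points, idx):
    """Index of the point closest to points[idx], excluding the point itself."""
    return min((j for j in range(len(points)) if j != idx),
               key=lambda j: dist(points[idx], points[j]))

def rnn_brute_force(points, q_idx):
    """Indices of all points whose nearest neighbor is points[q_idx]."""
    return [i for i in range(len(points))
            if i != q_idx and nearest_neighbor(points, i) == q_idx]

# Three collinear points: p is nearest to q, but r is nearest to p.
points = [(0.0, 0.0), (3.0, 0.0), (5.0, 0.0)]  # q, p, r
Q, P, R = 0, 1, 2

nn_of_q = nearest_neighbor(points, Q)   # p
nn_of_p = nearest_neighbor(points, P)   # r, not q
rnn_of_q = rnn_brute_force(points, Q)   # empty: nobody has q as NN
```

Every query recomputes a nearest neighbor for each point, so one RNN query costs on the order of n^2 distance computations, which is why this check does not scale.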

4
Two Versions of RNN Problem
  • Bichromatic version
  • The data points are of two categories, say red
    and blue. The RNN query point q is in one of the
    categories, say blue, so RNN(q) must determine
    the red points that have the query point q as
    their closest blue point.
  • E.g., in the fixed wireless telephone access
    application: clients are red (e.g., call
    initiation or termination)
  • and servers are blue (e.g., fixed wireless base
    stations)
  • Monochromatic version
  • All points are of the same color.
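
A minimal sketch of the bichromatic case on a line, assuming one-dimensional locations. The function names and the sample server and client positions are hypothetical and chosen only for illustration.

```python
def nearest_blue(red_point, blue_points):
    """Index of the blue point closest to red_point (ties go to the lower index)."""
    return min(range(len(blue_points)),
               key=lambda b: abs(blue_points[b] - red_point))

def bichromatic_rnn(red_points, blue_points, q_idx):
    """Red points whose closest blue point is blue_points[q_idx]."""
    return [r for r in red_points if nearest_blue(r, blue_points) == q_idx]

servers = [10.0, 20.0, 40.0]             # blue points (base stations)
clients = [8.0, 12.0, 14.0, 19.0, 33.0]  # red points (calls)

rnn_of_server_0 = bichromatic_rnn(clients, servers, 0)
rnn_of_server_1 = bichromatic_rnn(clients, servers, 1)
rnn_of_server_2 = bichromatic_rnn(clients, servers, 2)
```

Only red points are ever returned, and only blue points compete to be the nearest neighbor, which is the difference from the monochromatic version.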

5
Introduction
  • RNN queries have been studied for finite, stored
    data sets
  • RNN can identify "influence" of a data point on
    the database
  • F. Korn and S. Muthukrishnan, Influence Sets
    Based on Reverse Nearest Neighbor Queries
  • I. Stanoi, M. Riedewald, D. Agrawal, A. El
    Abbadi, Discovery of Influence Sets in Frequently
    Updated Databases
  • C. Yang and K.-I. Lin, An Index Structure for
    Efficient Reverse Nearest Neighbor Queries

6
Determining the Influence Set
  • Finding the set of customers affected by the
    opening of a new store outlet location
  • Notifying the subset of subscribers to a digital
    library who will find a newly added document most
    relevant
  • Finding the set of users whose profiles are more
    similar to the new service offering than to any
    other service

The interest is not in the exact RNN set but in
aggregates over this set: RNNA
7
RNNA Application 1
Fixed Wireless Telephony Access
  • Fixed Physical Position
  • Defined Coverage Area
  • Calls Arrive in Streams
  • Worst-Case Signal Strength: RNN-MAXDIST
  • Load on Base Station: RNN-COUNT
  • Optimization: RNNA Problems

8
RNNA Application 2
Highway Traffic Monitoring
  • Fixed Physical Position
  • Detect vehicles, estimate speed and length
  • User Queries Arrive in Streams
  • Periodic Updates of Closest Sensor
  • Load on Sensor: RNN-COUNT
  • Accuracy of Information: RNN-MAXDIST
  • Optimization: RNNA Problems

9
RNNA Computations
  • Max-RNNA: given K servers, return the maximum
    RNNA over all clients to any of the servers
  • List-RNNA: given K servers, return the RNNA
    over all clients to each of the servers
  • Opt-RNNA: find a set of at most K servers for
    which their RNNAs are below a given threshold

Exact computation is not possible
10
RNNA Approximations
  • Max-RNN-COUNT
  • Insertions and deletions: 3-approximation
  • Insertions only: (1+ε)-approximation
  • Max-RNN-MAXDIST
  • (1+ε)-approximation
  • List-RNN-COUNT and List-RNN-MAXDIST
  • Lower and upper bounds as a function of the true
    counts
  • Opt-RNN-COUNT
  • 8-approximation
  • Opt-RNN-MAXDIST
  • (1+ε)-approximation

Space is near-linear in the number of available
servers
11
Related Works
  • No previous work on RNNA over data streams
  • Algorithms over data streams
  • Algorithms for computing RNN over a
    conventional DB

12
Algorithms over Data Streams
  • Space requirements of Selection and Sorting as a
    function of the number of passes over the data
  • J. I. Munro and M. S. Paterson. Selection and
    Sorting with Limited Storage
  • Formalization of the data stream model
  • A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J.
    Strauss. Surfing Wavelets on Streams: One-Pass
    Summaries for Approximate Aggregate Queries
  • M. R. Henzinger, P. Raghavan, S. Rajagopalan.
    Computing on Data Streams

13
Algorithms over Data Streams
  • Computing the approximate median and other
    quantiles in a single pass over a data set
  • R. Agrawal, A. Swami, A One-Pass Space-Efficient
    Algorithm for Finding Quantiles
  • G.S. Manku, S. Rajagopalan, B.G. Lindsay.
    Approximate Medians and other Quantiles in One
    Pass and with Limited Memory
  • G.S. Manku, S. Rajagopalan, B.G. Lindsay. Random
    Sampling Techniques for Space Efficient Online
    Computation of Order Statistics of Large
    Datasets
  • M. Greenwald and S. Khanna. Space-Efficient
    Online Computation of Quantile Summaries

14
Algorithms over Data Streams
  • Computing approximate online quantiles with
    probabilistic guarantees over a data stream
  • A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J.
    Strauss. How to Summarize the Universe: Dynamic
    Maintenance of Quantiles
  • Histogram construction over a data stream
  • A.C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S.
    Muthukrishnan, M.J. Strauss. Fast, Small-Space
    Algorithms for Approximate Histogram Maintenance

15
Algorithms over Data Streams
  • Maintaining summary structures for approximate
    aggregates over a data stream
  • A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J.
    Strauss. Surfing Wavelets on Streams: One-Pass
    Summaries for Approximate Aggregate Queries
  • M. R. Henzinger, P. Raghavan, S. Rajagopalan.
    Computing on Data Streams
  • J. Gehrke, F. Korn, and D. Srivastava. On
    computing correlated aggregates over continual
    data streams

16
Algorithms over Data Streams: Mining Data Streams
  • Construction of decision trees
  • P. Domingos, G. Hulten. Mining High-Speed Data
    Streams
  • J. Gehrke, V. Ganti, R. Ramakrishnan, W.-Y. Loh.
    BOAT: Optimistic Decision Tree Construction
  • Association rules
  • C. Hidber. Online Association Rule Mining
  • Similarity matching
  • G. Cormode, M. Datar, P. Indyk, S.
    Muthukrishnan. Comparing Data Streams Using
    Hamming Norms

17
Algorithms over Data Streams: Mining Data Streams
  • Clustering algorithms (k-median clustering
    problem)
  • M. Charikar, C. Chekuri, T. Feder, R. Motwani.
    Incremental Clustering and Dynamic Information
    Retrieval
  • S. Guha, N. Mishra, R. Motwani, L. O'Callaghan.
    Clustering Data Streams

18
Algorithms over Data Streams: Dynamic Maintenance
  • Lp norms
  • P. Indyk. Stable Distributions, Pseudorandom
    Generators, Embeddings and Data Stream
    Computation
  • Hamming norms
  • G. Cormode, M. Datar, P. Indyk, S.
    Muthukrishnan. Comparing Data Streams Using
    Hamming Norms
  • Quantiles
  • A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J.
    Strauss. How to Summarize the Universe: Dynamic
    Maintenance of Quantiles
  • Sliding window
  • M. Datar. Maintaining Stream Statistics over
    Sliding Windows

19
Algorithms for computing RNN over a conventional
DB
  • Study of RNN in databases
  • F. Korn and S. Muthukrishnan, Influence Sets
    Based on Reverse Nearest Neighbor Queries
  • Efficient access methods for indexing RNN
  • I. Stanoi, M. Riedewald, D. Agrawal, A. El
    Abbadi, Discovery of Influence Sets in Frequently
    Updated Databases
  • C. Yang and K.-I. Lin, An Index Structure for
    Efficient Reverse Nearest Neighbor Queries

20
Problem Definition
Collection of n available servers (not necessarily
active)
l_i: location of server i
Clients arrive and depart
L_j: location of client j
RNN of server i is the set of all clients that have
i as their NN server
21
Instances of Aggregates
  • RNN-COUNT(i): number of clients currently in the
    system for which i is the NN (LOAD for active
    servers)
  • RNN-MAXDIST(i): largest distance to a client
    that has i as its NN (QUALITY for active
    servers)
  • Streams of clients are large and can't be stored
    in memory, so approximate RNNA values are
    computed
22
Focus of the Problem
  • Max-RNNA: given K active servers, return the
    maximum RNNA over all clients to their closest
    active server (worst-case load or quality)
  • List-RNNA: given K active servers, return a
    list of the RNNA over all clients to each of the
    K active servers (maximum load or worst-case
    quality)
  • Opt-RNNA: find a set of at most K servers from
    the available ones to be active, for which their
    RNNAs are below a given threshold (optimization)

23
Algorithm
Assumption: servers are on a straight line
Counters for servers i, j and client k:
CL_{i,j} counts clients with L_k ∈ [l_i, (l_i + l_j)/2)
CR_{i,j} counts clients with L_k ∈ ((l_i + l_j)/2, l_j]
24
Algorithm for RNN-COUNT ( i )
The algorithm: let l be the closest active server
to the left of i and r the closest active server
to the right.
RNN-COUNT(i) = CL_{i,l} + CR_{i,r}
Requires O(n^2) space and O(n^2) updates
We want near-linear space and fewer updates, so
approximation is needed
25
Data Structure
Definitions: s_1, ..., s_K are the K servers
designated to be active
Assumption: servers are sorted, l_1 ≤ ... ≤ l_n
Counter (number of clients) for server i: C(i)
counts clients with L_k ∈ [l_i, l_{i+1}), at the
right side of server i
C(0): clients at the left side of server 1
Requires O(n) space and O(log n) time per update
(to look up the relevant server)
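
A minimal sketch of this counter structure, assuming one-dimensional locations and using binary search to find the interval of a client. The class name `IntervalCounters` and its methods are hypothetical; `counts[0]` plays the role of C(0), and the last counter covers everything to the right of server n.

```python
from bisect import bisect_right

class IntervalCounters:
    """One counter per gap between consecutive servers (O(n) space)."""

    def __init__(self, server_locs):
        self.locs = sorted(server_locs)
        # counts[0] is C(0); counts[i] is C(i) for servers numbered 1..n.
        self.counts = [0] * (len(self.locs) + 1)

    def _interval(self, client_loc):
        # Number of servers at or to the left of the client, i.e. the
        # index i with l_i <= L < l_{i+1}; found in O(log n).
        return bisect_right(self.locs, client_loc)

    def insert(self, client_loc):
        self.counts[self._interval(client_loc)] += 1

    def delete(self, client_loc):
        self.counts[self._interval(client_loc)] -= 1

counters = IntervalCounters([10.0, 20.0, 40.0])
for loc in [8.0, 12.0, 14.0, 19.0, 33.0, 45.0]:
    counters.insert(loc)
after_inserts = list(counters.counts)

counters.delete(14.0)          # a client departs
after_delete = list(counters.counts)
```

Client positions are never stored, only the per-interval totals, so memory stays proportional to the number of servers regardless of stream length.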
26
Answering Queries
Max-RNNA(s_1, ..., s_K)
Max-RNNA(s_1, ..., s_K) = max_i RNN-COUNT(s_i)
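
One possible way to answer this query from per-interval counters alone is sketched below; it is an illustration and not necessarily the construction used in the paper. For each active server it sums the counters between its two active neighbors. Every client served by s_j lies in that range, and every client in that range is served by s_j or one of its two neighbors, so the largest such sum is at least the true maximum and at most three times it. The function name and sample data are hypothetical.

```python
from itertools import accumulate

def max_rnn_count_estimate(counts, active):
    """Upper estimate of max_i RNN-COUNT(s_i), within a factor of 3.

    counts: counts[0] = clients left of server 1, counts[i] = clients in
            [l_i, l_{i+1}) for servers numbered 1..n
    active: sorted 1-based indices of the active servers
    """
    n = len(counts) - 1
    prefix = [0] + list(accumulate(counts))   # prefix[b] - prefix[a] = sum(counts[a:b])
    best = 0
    for j in range(len(active)):
        lo = active[j - 1] if j > 0 else 0                    # previous active server
        hi = active[j + 1] if j + 1 < len(active) else n + 1  # next active server
        best = max(best, prefix[hi] - prefix[lo])
    return best

# Servers at 10, 20, 40; clients at 8, 12, 14, 19, 33, 45.
counts = [1, 3, 1, 1]
active = [1, 3]                 # servers at 10 and 40 are active
estimate = max_rnn_count_estimate(counts, active)
true_max = 4                    # clients 8, 12, 14, 19 are nearest to server 1
```

The prefix sums make each query linear in the number of active servers once the counters are available.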
27
Example: Max-RNNA(s_1, ..., s_K)
28
Max-RNNA(s_1, ..., s_K)
29
Answering Queries
List-RNNA(s_1, ..., s_K)
M_i for each s_i
The proof is similar to that of the previous theorem
30
Answering Queries
Opt-RNNA
  • Greedy algorithm finds
  • the minimal number of active servers K such that
  • max_i RNN-COUNT(s_i) ≤ C

31
Answering Queries
Opt-RNNA
32
Opt-RNNA
33
Opt-RNNA "Dual" Problem
Minimize max_i RNN-COUNT(s_i)
Given an upper bound K on the number of servers
  • Algorithm
  • Choose different values of C
  • Run the greedy algorithm of Opt-RNNA
  • Repeat until it is solved with a number of
    servers K' ≤ K
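
The slides do not say how the values of C are chosen. One natural option, shown here as an assumption, is binary search over integer thresholds. The greedy routine is passed in as a callable (`greedy_min_servers`, a hypothetical name) that returns the number of servers it needs for a given C, and the sketch assumes that this number never increases as C grows.

```python
def min_max_load(greedy_min_servers, k_max, c_low, c_high):
    """Smallest integer C in [c_low, c_high] for which the greedy routine
    needs at most k_max servers, or None if even c_high is not enough."""
    if greedy_min_servers(c_high) > k_max:
        return None
    while c_low < c_high:
        mid = (c_low + c_high) // 2
        if greedy_min_servers(mid) <= k_max:
            c_high = mid          # feasible: try a tighter threshold
        else:
            c_low = mid + 1       # infeasible: threshold must grow
    return c_low

def toy_greedy(c):
    """Stand-in for the greedy Opt-RNNA routine: 100 clients that can be
    split evenly, so ceil(100 / c) servers are needed. Hypothetical."""
    return -(-100 // c)

best_c = min_max_load(toy_greedy, k_max=8, c_low=1, c_high=100)
```

With this toy stand-in, 13 is the smallest threshold that 8 servers can meet, since a threshold of 12 would need 9 servers.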

34
Insert-Only Clients
Data Structure
Assumption: servers are sorted, l_1 ≤ ... ≤ l_n
Counter (number of clients) for server i: C(i)
counts clients with L_k ∈ [l_i, l_{i+1}), at the
right side of server i
C(0): clients at the left side of server 1
Count Partitioning
Maintain l-quantiles (Greenwald, Khanna) c_{i,1}, ..., c_{i,l}
The number of clients lying in [l_i, L_{c_{i,k}}] is
within (1 ± ε) k C(i)/l, where 1 ≤ k ≤ l
Requires O(log C(i) / ε) space
35
Answering Queries
Max-RNNA(s_1, ..., s_K)
Max-RNNA(s_1, ..., s_K) = max_i RNN-COUNT(s_i)
36
Max-RNNA(s_1, ..., s_K)
37
Insert-Only Clients
List-RNNA(s_1, ..., s_K)
Implemented in the same way
Opt-RNNA
Maintenance of the data structure under deletions?
38
Algorithm for RNN-MAXDIST ( i )
The algorithm: histogram based on space
partitioning
Assumption: servers are sorted, l_1 ≤ ... ≤ l_n
Exponentially sized buckets
Domain size U, such that
U = [min(L_j, l_i), max(L_j, l_i)]
Dividers between servers i and (i+1): g_{i,j} at
distance (1+ε)^j from l_i
Number of dividers is O(log_{1+ε}(l_{i+1} − l_i))
39
Data Structure
Counter: the number of clients between g_{i,k} and
g_{i,k+1} is stored at g_{i,k}
  • For updates of client j:
  • Find i such that L_j ∈ [l_i, l_{i+1})
  • Find k such that L_j ∈ [g_{i,k}, g_{i,k+1})
  • Update the value at g_{i,k}

Requires O(n log_{1+ε} U) space and O(log_{1+ε} U)
time per update
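
The update steps can be sketched as follows, under several assumptions that are not in the slides: locations are one-dimensional, only the gap to the right of each server is tracked, distances below 1 share bucket 0, and buckets are stored sparsely in dictionaries. The class name `ExpBucketHistogram` and its methods are hypothetical.

```python
import math
from bisect import bisect_right

class ExpBucketHistogram:
    """Client counts in buckets whose dividers sit at (1+eps)^k from l_i."""

    def __init__(self, server_locs, eps):
        self.locs = sorted(server_locs)
        self.base = 1.0 + eps
        self.rows = [{} for _ in self.locs]   # per server: bucket k -> count

    def _locate(self, client_loc):
        i = bisect_right(self.locs, client_loc) - 1   # l_i <= L < l_{i+1}
        if i < 0:
            raise ValueError("client lies left of the first server")
        d = client_loc - self.locs[i]
        # Bucket k covers distances in [base^k, base^(k+1)); bucket 0 also
        # absorbs distances below 1. Floating-point rounding may shift a
        # distance that sits exactly on a divider into a neighboring bucket.
        k = int(math.log(d, self.base)) if d >= 1.0 else 0
        return i, k

    def update(self, client_loc, delta=1):
        """delta=+1 for an arriving client, -1 for a departing one."""
        i, k = self._locate(client_loc)
        row = self.rows[i]
        row[k] = row.get(k, 0) + delta
        if row[k] == 0:
            del row[k]

    def max_right_distance_bound(self, i):
        """Upper bound, within a factor (1+eps) for distances >= 1, on the
        largest distance from server i to a client in its right gap."""
        row = self.rows[i]
        return self.base ** (max(row) + 1) if row else 0.0

hist = ExpBucketHistogram([0.0, 100.0], eps=1.0)   # base 2 for readability
for loc in [5.0, 20.0, 150.0]:
    hist.update(loc)
bound_before = hist.max_right_distance_bound(0)    # farthest client is at 20
hist.update(20.0, delta=-1)                        # that client departs
bound_after = hist.max_right_distance_bound(0)     # farthest client is at 5
```

Because each bucket keeps a count rather than a single maximum, the bound can shrink again when the farthest client departs, which a plain running maximum could not do.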
40
Answering Queries
Max-RNNA(s_1, ..., s_K)
Max-RNNA(s_1, ..., s_K) = max_i RNN-MAXDIST(s_i)
41
Max-RNNA(s_1, ..., s_K)
Details of the proof will be given in a future
paper.
42
Answering Queries
List-RNNA(s_1, ..., s_K)
D_i = max(RD_i, LD_i) for each s_i
The proof is similar to that of the previous theorem
43
Answering Queries
Opt-RNNA
  • Greedy algorithm with limited backtracking finds
  • the minimal number of active servers K such that
  • max_i RNN-MAXDIST(s_i) ≤ D

44
Opt-RNNA
The proof will be given in a future paper.
45
Opt-RNNA "Dual" Problem
Minimize max_i RNN-MAXDIST(s_i)
Given an upper bound K on the number of servers
  • Algorithm
  • Choose different values of D
  • Run the greedy algorithm of Opt-RNNA
  • Repeat until it is solved with a number of
    servers K' ≤ K

46
Extensions
Nearest Neighbor and Reverse Nearest Neighbor
Queries for Moving Objects: R. Benetis,
C. S. Jensen, G. Karciauskas, S. Saltenis
Reverse Nearest Neighbor Queries for Dynamic
Databases: SHOU Yu Tao
Assumption: the clients are on the same axis as
the servers
47
Summary of Results
48
Experiments
The following aspects were tested
Experimental data: CALIFORNIA (latitudes of 63k
buildings in California), uniform and binomial
distributions
49
Average Error of List-RNN-COUNT Test: AVG_i(Ĉ_i / C_i),
with Ĉ_i the estimated and C_i the true count
50
Average Error of List-RNN-MAXDIST Test: AVG_i(D̂_i / D_i),
with D̂_i the estimated and D_i the true distance
51
Maximum Error of Max-RNN-COUNT Test: (max Ĉ_i / max C_i)
52
Maximum Error of List-RNN-MAXDIST Test: (max D̂_i / max D_i)
53
Conclusions
  • RNNA supports computations based on geographical
    distances or vector-space similarity between
    servers and clients
  • Applications of RNNA
  • Classical: facility location
  • Emerging: fixed wireless telephony access and
    sensor-based traffic monitoring
  • Data for RNNA arrives in streams
  • RNNA performs online computations

54
Conclusions
  • We study three problems
  • Max-RNNA
  • List-RNNA
  • Opt-RNNA
  • Two aggregates
  • COUNT
  • MAXDIST
  • Approximate algorithms with near-linear space
    usage

55
Questions and Answers
Any Questions?