Title: Keyword Search on Spatial Databases
- Ian De Felipe
- Vagelis Hristidis
- Naphtali Rishe
- School of Computing and Information Sciences, Florida International University, Miami, FL
Roadmap
- Motivation and Problem Definition
- Baseline Methods
- IR2-Tree and Search Algorithms
- Experiments
- Related Work
- Conclusions
Motivation
- Applications require a combination of spatial and keyword search.
- E.g., online yellow pages let users specify an address and a set of keywords.
- Efficient algorithms exist to tackle each problem separately:
  - Spatial search: Nearest Neighbor (NN)
  - Keyword search
Problem Definition
- A spatial keyword query consists of a query area and a set of keywords.
- The answer is a list of objects ranked by a combination of their distance to the query area and their relevance to the query keywords.
- A variant is the distance-first spatial keyword query, where objects are ranked by distance and the keywords are applied as a conjunctive filter.
- A distance-first top-k spatial keyword query returns only the top k objects.
- This presentation focuses on that variant; the generalization is presented in the paper.
Example: Distance-First Spatial Keyword Query
- Find the nearest hotels to point (30.5, 100.0) that contain the keywords "internet" and "pool".
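As a reference point, the semantics of a distance-first top-k spatial keyword query can be sketched as a plain linear scan: filter by the conjunction of keywords, then rank by distance. This is only an illustration of the query definition, not one of the algorithms in the deck. The function name `distance_first_top_k` and the record layout are assumptions; the four hotels use coordinates and keywords from the deck's running example.

```python
import math

# Hypothetical record layout: (id, (x, y), keywords).
# Coordinates and keywords are taken from the deck's running example.
hotels = [
    ("H1", (25.4, -80.1), {"internet"}),
    ("H2", (47.3, -122.2), {"internet", "pool"}),
    ("H4", (39.5, 116.2), {"pool"}),
    ("H7", (-33.2, -70.4), {"internet", "pool"}),
]

def distance_first_top_k(objects, point, keywords, k):
    """Return the k nearest objects that contain every keyword (linear scan)."""
    required = set(keywords)
    matches = [
        (math.dist(point, loc), oid)
        for oid, loc, words in objects
        if required <= words            # conjunctive keyword filter
    ]
    matches.sort()                      # rank by distance only
    return [(oid, round(d, 1)) for d, oid in matches[:k]]

print(distance_first_top_k(hotels, (30.5, 100.0), ["internet", "pool"], k=2))
# [('H7', 181.9), ('H2', 222.8)]
```

H4 is by far the closest hotel, but it lacks "internet", so the conjunctive filter removes it before ranking.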
Nearest Neighbor Queries: First Baseline Algorithm
- Many algorithms have been proposed.
- Hjaltason and Samet '99: Incremental NN
  - Navigates the R-Tree so that objects are returned in order of increasing distance
- R-Tree Baseline: execute Incremental NN and, for each output object, check whether it contains the query keywords.
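The R-Tree baseline can be sketched with a best-first traversal driven by a priority queue. This is a minimal in-memory sketch, not the authors' implementation: nodes are plain dicts rather than disk pages, and the helper names (`mindist`, `obj`, `node`, `incremental_nn`, `rtree_baseline`) are hypothetical. The tree is built from the eight hotels of the deck's running example.

```python
import heapq
import math
from itertools import count, islice

def mindist(point, mbr):
    """Smallest Euclidean distance from a point to a rectangle (lo, hi)."""
    lo, hi = mbr
    return math.hypot(*(max(l - p, 0.0, p - h) for p, l, h in zip(point, lo, hi)))

def obj(oid, loc, words):
    """Leaf entry: a point object stored as a degenerate rectangle."""
    return {"id": oid, "mbr": (loc, loc), "words": set(words)}

def node(*children):
    """Inner node whose rectangle bounds all of its children."""
    lo = tuple(min(c["mbr"][0][i] for c in children) for i in range(2))
    hi = tuple(max(c["mbr"][1][i] for c in children) for i in range(2))
    return {"mbr": (lo, hi), "children": children}

def incremental_nn(root, point):
    """Yield (object, distance) pairs in order of increasing distance."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(mindist(point, root["mbr"]), next(tie), root)]
    while heap:
        dist, _, entry = heapq.heappop(heap)
        if "children" in entry:
            for child in entry["children"]:
                heapq.heappush(heap, (mindist(point, child["mbr"]), next(tie), child))
        else:
            yield entry, dist

def rtree_baseline(root, point, keywords, k):
    """Run incremental NN and keep the first k objects that pass the keyword check."""
    required = set(keywords)
    matches = ((e["id"], round(d, 1))
               for e, d in incremental_nn(root, point)
               if required <= e["words"])
    return list(islice(matches, k))

n4 = node(obj("H2", (47.3, -122.2), {"internet", "pool"}), obj("H6", (40.4, -73.5), {"internet"}))
n5 = node(obj("H7", (-33.2, -70.4), {"internet", "pool"}), obj("H1", (25.4, -80.1), {"internet"}))
n6 = node(obj("H8", (-41.1, 174.4), {"pool"}), obj("H3", (35.5, 139.4), {"pool"}))
n7 = node(obj("H5", (51.3, -0.5), set()), obj("H4", (39.5, 116.2), {"pool"}))
root = node(node(n4, n5), node(n6, n7))

print(rtree_baseline(root, (30.5, 100.0), ["internet", "pool"], k=2))
# [('H7', 181.9), ('H2', 222.8)]
```

In this example the baseline has to retrieve all eight objects before it finds two that match, which is the cost the keyword-aware index is meant to avoid.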
Example: Execution of the R-Tree Baseline Algorithm on the Distance-First Top-2 Spatial Keyword Query at (30.5, 100.0) with keywords "internet" and "pool"
R-Tree (each entry lists its bounding rectangle as two corner points; leaf entries are single points)
- Root Node N1
  - (-33.2, -122.2) to (47.3, -70.4): Node N2
  - (-41.1, -0.5) to (51.3, 174.4): Node N3
- Node N2
  - (40.4, -122.2) to (47.3, -73.5): Node N4
  - (-33.2, -80.1) to (25.4, -70.4): Node N5
- Node N3
  - (-41.1, 139.4) to (35.5, 174.4): Node N6
  - (39.5, -0.5) to (51.3, 116.2): Node N7
- Node N4
  - (47.3, -122.2): pointer to H2
  - (40.4, -73.5): pointer to H6
- Node N5
  - (-33.2, -70.4): pointer to H7
  - (25.4, -80.1): pointer to H1
- Node N6
  - (-41.1, 174.4): pointer to H8
  - (35.5, 139.4): pointer to H3
- Node N7
  - (51.3, -0.5): pointer to H5
  - (39.5, 116.2): pointer to H4
Execution (priority queue shown as entry, distance)
- Enqueue N1. Queue: (N1, 0.0)
- Dequeue N1; enqueue N2 and N3. Queue: (N3, 0.0), (N2, 170.4)
- Dequeue N3; enqueue N6 and N7. Queue: (N7, 9.0), (N6, 39.4), (N2, 170.4)
- Dequeue N7; enqueue H5 and H4. Queue: (H4, 18.5), (N6, 39.4), (H5, 102.6), (N2, 170.4)
- Dequeue H4; H4 does not satisfy the keywords, hence it is discarded.
- If we continue, objects H3, H5, H8, H6, H1, H7, H2 are retrieved in that order. Only H7 and H2 are output, since they contain both "internet" and "pool".
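The priorities in the queue are minimum distances from the query point to each entry's rectangle. A short sketch, assuming plain Euclidean distance on the raw coordinates (which is what the slide's numbers are consistent with), reproduces them; `mindist` is a hypothetical helper name.

```python
import math

def mindist(point, lo, hi):
    """Minimum distance from a point to the rectangle with corners lo and hi."""
    gaps = [max(l - p, 0.0, p - h) for p, l, h in zip(point, lo, hi)]
    return math.hypot(*gaps)

query = (30.5, 100.0)
entries = {
    "N2": ((-33.2, -122.2), (47.3, -70.4)),
    "N3": ((-41.1, -0.5), (51.3, 174.4)),
    "N6": ((-41.1, 139.4), (35.5, 174.4)),
    "N7": ((39.5, -0.5), (51.3, 116.2)),
    "H5": ((51.3, -0.5), (51.3, -0.5)),   # a point is a degenerate rectangle
    "H4": ((39.5, 116.2), (39.5, 116.2)),
}
keys = {name: round(mindist(query, lo, hi), 1) for name, (lo, hi) in entries.items()}
print(keys)
# {'N2': 170.4, 'N3': 0.0, 'N6': 39.4, 'N7': 9.0, 'H5': 102.6, 'H4': 18.5}
```

N3 has distance 0.0 because its rectangle contains the query point, which is why it is expanded before N2.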
Keyword Search Queries: Second Baseline Algorithm
- Keyword search on documents is well studied in IR. Two major methods:
  - Inverted index
  - Signature files (Faloutsos and Christodoulakis '84)
- Inverted Index Only (IIO) Baseline:
  - For each keyword, find the spatial objects that contain it.
  - Intersect the resulting object lists.
  - For each remaining object, compute its distance to the query point.
  - Sort by distance and return the results to the user.
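The IIO steps above can be sketched in a few lines. This is an in-memory illustration under assumed names (`build_inverted_index`, `iio_top_k`) and an assumed record layout; a real inverted index would live on disk. The data is the eight-hotel running example from the deck.

```python
import math

# Hypothetical layout: object id -> (location, keywords), from the running example.
objects = {
    "H1": ((25.4, -80.1), {"internet"}),
    "H2": ((47.3, -122.2), {"internet", "pool"}),
    "H3": ((35.5, 139.4), {"pool"}),
    "H4": ((39.5, 116.2), {"pool"}),
    "H5": ((51.3, -0.5), set()),
    "H6": ((40.4, -73.5), {"internet"}),
    "H7": ((-33.2, -70.4), {"internet", "pool"}),
    "H8": ((-41.1, 174.4), {"pool"}),
}

def build_inverted_index(objs):
    """Map each keyword to the set of object ids that contain it."""
    index = {}
    for oid, (_, words) in objs.items():
        for word in words:
            index.setdefault(word, set()).add(oid)
    return index

def iio_top_k(objs, index, point, keywords, k):
    """Intersect the posting sets, then compute and sort distances."""
    postings = [index.get(word, set()) for word in keywords]
    candidates = set.intersection(*postings) if postings else set(objs)
    ranked = sorted((math.dist(point, objs[oid][0]), oid) for oid in candidates)
    return [(oid, round(d, 1)) for d, oid in ranked[:k]]

index = build_inverted_index(objects)
print(iio_top_k(objects, index, (30.5, 100.0), ["internet", "pool"], k=2))
# [('H7', 181.9), ('H2', 222.8)]
```

Note that every object in the intersection has its distance computed before sorting, regardless of k.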
Example: Execution of the IIO Baseline Algorithm on the Distance-First Top-2 Spatial Keyword Query at (30.5, 100.0) with keywords "internet" and "pool"
- Execute the inverted index for keyword "internet". Results: H2, H6, H1, H7
- Execute the inverted index for keyword "pool". Results: H3, H4, H2, H7, H8
- Intersect the two result sets. Intersection: H2, H7
- Get the coordinates for H2, calculate its distance, and add it to the result list: (H2, 222.8)
- Get the coordinates for H7, calculate its distance, and add it to the result list: (H7, 181.9)
- Sort by distance; the top-2 results are (H7, 181.9) and (H2, 222.8).
Information Retrieval R-Tree (IR2-Tree)
- Combination of an R-Tree and signature files.
- Each node contains a rectangle and a signature.
- The signature of a node is the superimposition (OR-ing) of the signatures of all its entries.
- Bottom-up construction.
- Multi-level IR2-Tree (MIR2-Tree)
  - Uses different signature lengths for different levels
  - More complex update operations
  - Fewer false positives
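Superimposed signatures can be sketched as follows. The deck does not say how word signatures are generated, so the hashing scheme here is an assumption: each word sets `BITS_PER_WORD` bit positions chosen by SHA-256 in a `SIG_BITS`-bit integer. All function names and both parameters are hypothetical.

```python
import hashlib

SIG_BITS = 64        # assumed signature length in bits
BITS_PER_WORD = 3    # assumed number of bit positions set per word

def word_signature(word):
    """Set a few hash-chosen bits for one word (assumed scheme)."""
    sig = 0
    for i in range(BITS_PER_WORD):
        digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % SIG_BITS)
    return sig

def superimpose(signatures):
    """OR signatures together, as done for the entries of a node."""
    out = 0
    for sig in signatures:
        out |= sig
    return out

def object_signature(words):
    return superimpose(word_signature(w) for w in words)

def may_contain(sig, query_sig):
    """True if every query bit is set; False means the words are surely absent."""
    return (sig & query_sig) == query_sig

hotel_a = object_signature({"internet", "pool", "gym"})
hotel_b = object_signature({"parking"})
node_sig = superimpose([hotel_a, hotel_b])        # bottom-up: node covers both entries
query_sig = object_signature({"internet", "pool"})

print(may_contain(hotel_a, query_sig))   # True: hotel_a really has both words
print(may_contain(node_sig, query_sig))  # True: a node never hides a matching entry
```

The test has no false negatives, but unrelated words can happen to set the same bits, so a positive answer still has to be verified against the actual keywords. Longer signatures make such collisions rarer at the cost of larger entries.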
IR2-Tree Search Algorithm
- Calculate the query signature.
- Navigate the IR2-Tree as in the Incremental NN algorithm.
- Discard nodes whose signatures do not match the query signature.
- Check returned objects for false positives.
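The four steps above can be sketched as a best-first traversal that skips any entry whose signature lacks a query bit. This is a hedged in-memory sketch with hypothetical helper names (`mindist`, `obj`, `node`, `ir2_top_k`), not the authors' code. The coordinates, keywords, and 16-bit leaf signatures are those of the eight-hotel running example in this deck; node signatures are built bottom-up by OR-ing.

```python
import heapq
import math
from itertools import count

def mindist(point, mbr):
    lo, hi = mbr
    return math.hypot(*(max(l - p, 0.0, p - h) for p, l, h in zip(point, lo, hi)))

def obj(oid, loc, words, bits):
    """Leaf entry with a signature given as a bit string."""
    return {"id": oid, "mbr": (loc, loc), "words": set(words),
            "sig": int(bits.replace(" ", ""), 2)}

def node(name, *children):
    """Inner node: bounding rectangle plus OR of the child signatures."""
    lo = tuple(min(c["mbr"][0][i] for c in children) for i in range(2))
    hi = tuple(max(c["mbr"][1][i] for c in children) for i in range(2))
    sig = 0
    for c in children:
        sig |= c["sig"]
    return {"id": name, "mbr": (lo, hi), "sig": sig, "children": children}

def ir2_top_k(root, point, keywords, query_sig, k):
    """Return (results, pruned ids) for a distance-first top-k query."""
    required, results, pruned = set(keywords), [], []
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(mindist(point, root["mbr"]), next(tie), root)]
    while heap and len(results) < k:
        dist, _, entry = heapq.heappop(heap)
        if "children" not in entry:
            if required <= entry["words"]:          # false-positive check
                results.append((entry["id"], round(dist, 1)))
            continue
        for child in entry["children"]:
            if (child["sig"] & query_sig) == query_sig:
                heapq.heappush(heap, (mindist(point, child["mbr"]), next(tie), child))
            else:
                pruned.append(child["id"])          # signature rules it out
    return results, pruned

n4 = node("N4", obj("H2", (47.3, -122.2), {"internet", "pool"}, "10001011 00000010"),
                obj("H6", (40.4, -73.5), {"internet"}, "00001110 00100011"))
n5 = node("N5", obj("H7", (-33.2, -70.4), {"internet", "pool"}, "10000011 00010110"),
                obj("H1", (25.4, -80.1), {"internet"}, "01111110 10000010"))
n6 = node("N6", obj("H8", (-41.1, 174.4), {"pool"}, "00011001 01001011"),
                obj("H3", (35.5, 139.4), {"pool"}, "10011001 00001010"))
n7 = node("N7", obj("H5", (51.3, -0.5), set(), "01100101 10000011"),
                obj("H4", (39.5, 116.2), {"pool"}, "00001001 10010010"))
root = node("N1", node("N2", n4, n5), node("N3", n6, n7))

query_sig = int("0000001100000000", 2)  # "internet" OR "pool" in the example
print(ir2_top_k(root, (30.5, 100.0), ["internet", "pool"], query_sig, k=2))
# ([('H7', 181.9), ('H2', 222.8)], ['N3', 'H1', 'H6'])
```

Pruning N3 at the root removes four of the eight hotels from consideration, even though N3 is the subtree closest to the query point.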
Example: Execution of the IR2-Tree Algorithm on the Distance-First Top-2 Spatial Keyword Query at (30.5, 100.0) with keywords "internet" and "pool"
IR2-Tree (each entry lists its signature, then its bounding rectangle as two corner points; leaf entries are single points)
- Root Node N1
  - 11111111 10110111, (-33.2, -122.2) to (47.3, -70.4): Node N2
  - 11111101 11011011, (-41.1, -0.5) to (51.3, 174.4): Node N3
- Node N2
  - 10001111 00100011, (40.4, -122.2) to (47.3, -73.5): Node N4
  - 11111111 10010110, (-33.2, -80.1) to (25.4, -70.4): Node N5
- Node N3
  - 10011001 01001011, (-41.1, 139.4) to (35.5, 174.4): Node N6
  - 01101101 10010011, (39.5, -0.5) to (51.3, 116.2): Node N7
- Node N4
  - 10001011 00000010, (47.3, -122.2): pointer to H2
  - 00001110 00100011, (40.4, -73.5): pointer to H6
- Node N5
  - 10000011 00010110, (-33.2, -70.4): pointer to H7
  - 01111110 10000010, (25.4, -80.1): pointer to H1
- Node N6
  - 00011001 01001011, (-41.1, 174.4): pointer to H8
  - 10011001 00001010, (35.5, 139.4): pointer to H3
- Node N7
  - 01100101 10000011, (51.3, -0.5): pointer to H5
  - 00001001 10010010, (39.5, 116.2): pointer to H4
Query signature
- First we note that the signature for "internet" is 00000010 00000000.
- And the signature for "pool" is 00000001 00000000.
- Therefore the query signature is 00000011 00000000.
Execution (priority queue shown as entry, distance)
- Enqueue N1. Queue: (N1, 0.0)
- Dequeue N1; enqueue N2 (note that N3 is pruned). Queue: (N2, 170.4)
- Dequeue N2; enqueue N4 and N5. Queue: (N5, 170.5), (N4, 173.8)
- Dequeue N5; enqueue H7 (note that H1 is pruned). Queue: (N4, 173.8), (H7, 181.9)
- Dequeue N4; enqueue H2 (note that H6 is pruned). Queue: (H7, 181.9), (H2, 222.8)
- Dequeue H7 and check whether it is a false positive; it is our first result.
- Dequeue H2 and check whether it is a false positive; it is our second result.
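The signatures in this example can be checked mechanically: OR-ing the leaf signatures bottom-up should reproduce the node signatures, and the pruning of N3 should follow from a bitwise AND with the query signature. The helper names `bits` and `fmt` below are hypothetical; the bit strings are copied from the slide.

```python
def bits(text):
    """Parse a signature written as two 8-bit groups."""
    return int(text.replace(" ", ""), 2)

def fmt(sig):
    """Format a 16-bit signature back into two 8-bit groups."""
    s = format(sig, "016b")
    return s[:8] + " " + s[8:]

leaf = {
    "H2": bits("10001011 00000010"), "H6": bits("00001110 00100011"),
    "H7": bits("10000011 00010110"), "H1": bits("01111110 10000010"),
    "H8": bits("00011001 01001011"), "H3": bits("10011001 00001010"),
    "H5": bits("01100101 10000011"), "H4": bits("00001001 10010010"),
}

# Bottom-up superimposition: each node signature is the OR of its entries.
n4 = leaf["H2"] | leaf["H6"]
n5 = leaf["H7"] | leaf["H1"]
n6 = leaf["H8"] | leaf["H3"]
n7 = leaf["H5"] | leaf["H4"]
n2, n3 = n4 | n5, n6 | n7

query = bits("00000010 00000000") | bits("00000001 00000000")  # internet OR pool

print(fmt(n2), "|", fmt(n3))        # 11111111 10110111 | 11111101 11011011
print(fmt(query))                   # 00000011 00000000
print((n2 & query) == query)        # True: N2 may contain a match
print((n3 & query) == query)        # False: N3 lacks the "internet" bit, so it is pruned
```

The same AND test explains the leaf-level pruning: H1 and H6 have the "internet" bit set but not the "pool" bit.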
Experiments
- Athlon 64 3400 (NewCastle) with 2GB of RAM and a 74GB 10,000RPM drive
- Block size is 4,096 KB
- Two real datasets provided by the High Performance Database Research Center (http://hpdrc.fiu.edu/)
- Only results on the Hotels dataset are presented
Varying k
- 2 keywords
- Signature length: 189 bytes (longer at the top levels of the MIR2-Tree)
Varying number of keywords
- k = 10
- Signature length: 189 bytes (longer at the top levels of the MIR2-Tree)
Varying signature length
- k = 10
- 2 keywords
- Tradeoff: nodes in tree (based on entries per block) vs. false positives
Index Size (MB)
Related Work
- Nearest Neighbor Queries
  - N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. SIGMOD, 1995.
  - G.R. Hjaltason and H. Samet. Distance browsing in spatial databases. TODS, Vol. 24, No. 2, 1999.
- Combination of spatial and keyword queries
  - D. Park and H. Kim. An Enhanced Technique for k-Nearest Neighbor Queries with Non-Spatial Selection Predicates. Multimedia Tools and Applications, Vol. 19, No. 1, January 2003, pp. 79-103.
  - Y. Zhou, X. Xie, C. Wang, Y. Gong, and W. Ma. Hybrid index structures for location-based web search. ACM CIKM, 2005.
- Signature Files
  - C. Faloutsos and S. Christodoulakis. Signature Files: An Access Method for Documents and Its Analytical Performance Evaluation. ACM Trans. Inf. Syst., 2(4): 267-288, 1984.
  - D.L. Lee, Y.M. Kim, and G. Patel. Efficient Signature File Methods for Text Retrieval. TKDE, Vol. 7, No. 3, June 1995, pp. 423-435.
Conclusions
- Framework for top-k spatial keyword search queries and variants.
- Proposed an index combining the R-Tree with signature files.
- Algorithm for top-k spatial keyword search.
- Comprehensive study and experimentation.
Thank You!