Title: Full-Text Search in P2P Networks
1Full-Text Search in P2P Networks
- Christof Leng
- Databases and Distributed Systems Group
- TU Darmstadt
2Content
- Short Intro to full-text search
- Full-Text search on DHTs
- Performance Comparison
- Conclusion / Outlook
3What is full-text search?
- Searching for documents containing all of a list of specified words
- Search for QuaP2P AND Darmstadt AND Research
- Very common operation
- Google
- Filesharing
- Wikis
- Source Code
- Document / Knowledge Management
- Can be extended to phrase search
- Search for "TU Darmstadt" AND "Christof Leng"
4Inverted Index
- Full-text search is normally solved with inverted indexes
- Query result is the intersection of the entries of all searched words
- Stemming can reduce the number of word entries
doc1: New P2P system could provide speed increase.
doc2: Similarity searches accelerate P2P downloads by 30-70 percent.
doc3: I fail to see how this will make downloads faster.
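The three example documents can be turned into an inverted index in a few lines of Python. This is a minimal sketch, assuming case-insensitive tokenization on runs of letters and digits and no stemming; the names `tokenize`, `build_index`, and `search` are illustrative and not taken from the talk.

```python
import re
from collections import defaultdict

DOCS = {
    "doc1": "New P2P system could provide speed increase.",
    "doc2": "Similarity searches accelerate P2P downloads by 30-70 percent.",
    "doc3": "I fail to see how this will make downloads faster.",
}

def tokenize(text):
    # Assumption: a word is a run of letters or digits, compared in lowercase.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # Map every word to the set of documents that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, words):
    # The query result is the intersection of the entries of all searched words.
    entries = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*entries) if entries else set()

index = build_index(DOCS)
print(sorted(search(index, ["P2P", "downloads"])))  # ['doc2']
print(sorted(search(index, ["downloads"])))         # ['doc2', 'doc3']
```

Stemming would plug into `tokenize`, for example by mapping "downloads" and "download" to the same entry.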
5Overlay Types and Full-Text Search
- Inverted index on central server
- Inverted index on each (super-)node
- Distributed inverted index → Challenge
6Naïve Approach
- Map inverted index to DHT
- Key Lookup for every word
- Intersect result lists at client
- Pro
- Simple
- Short latency
- Con
- Result lists may be extremely large!
- Result list sizes may vary extremely!
[Figure: Search for QuaP2P AND Darmstadt AND Research; labels: Darmstadt, QuaP2P, Research]
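The naïve approach can be sketched with a toy in-memory DHT. Everything here is an assumption made for illustration: the `ToyDHT` class, the hash-modulo node assignment, the network size of 8, and the synthetic document IDs. The point is only to show how many list entries reach the client before the intersection happens.

```python
import hashlib

NUM_NODES = 8  # hypothetical network size

def responsible_node(word):
    # Assumption: a key is mapped to a node by hashing the word.
    digest = hashlib.sha1(word.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES

class ToyDHT:
    def __init__(self):
        # Each node stores the index entries for the words it is responsible for.
        self.nodes = [dict() for _ in range(NUM_NODES)]

    def publish(self, word, doc_id):
        self.nodes[responsible_node(word)].setdefault(word, set()).add(doc_id)

    def lookup(self, word):
        # One key lookup returns the complete result list for that word.
        return set(self.nodes[responsible_node(word)].get(word, set()))

def naive_search(dht, words):
    lists = [dht.lookup(word) for word in words]   # one lookup per word
    transferred = sum(len(entries) for entries in lists)
    result = set.intersection(*lists) if lists else set()  # intersect at client
    return result, transferred

dht = ToyDHT()
for doc_id in range(1000):
    dht.publish("research", doc_id)        # very common word
    if doc_id % 10 == 0:
        dht.publish("darmstadt", doc_id)   # less common word
    if doc_id % 250 == 0:
        dht.publish("quap2p", doc_id)      # rare word

result, transferred = naive_search(dht, ["quap2p", "darmstadt", "research"])
print(sorted(result), transferred)  # [0, 250, 500, 750] 1104
```

Only 4 documents match, yet 1104 list entries are shipped to the client, almost all of them from the most common word.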
7Zipf Distributions in Natural Text
- Some words are extremely common
- Most words are extremely uncommon
- Largest word frequency is proportional to number of distinct words
- → Avoid transferring result lists before intersection!
[Figure: word occurrences plotted against rank]
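The relation between the largest word frequency and the number of distinct words can be made concrete with an idealised model. The sketch below assumes a pure Zipf law with exponent 1, scaled so that the rarest word occurs exactly once; real text only approximates this, and `zipf_counts` is an illustrative helper.

```python
def zipf_counts(num_distinct_words, exponent=1.0):
    # Idealised occurrence count per rank: count(rank) = scale / rank**exponent,
    # with the scale chosen so that the last rank occurs exactly once.
    scale = num_distinct_words ** exponent
    return [scale / rank ** exponent for rank in range(1, num_distinct_words + 1)]

counts = zipf_counts(10_000)
print(counts[0], counts[-1])  # 10000.0 1.0

# Share of all occurrences that belongs to the 100 most common words.
top_share = sum(counts[:100]) / sum(counts)
print(f"top 100 of 10,000 words: {top_share:.0%} of all occurrences")
```

Under these assumptions the most common word occurs as often as there are distinct words, so its result list grows with the size of the vocabulary.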
8Intersecting on the way
- Query least common word first
- Forward result list to next word
- Intersect on the way
- Pro
- Reduces traffic
- Con
- High latency
- Knowledge about word frequencies required
- Search for "the" and "who" (7.2 and 2.4 billion hits on Google, respectively)
[Figure: Search for QuaP2P AND Darmstadt AND Research; labels: Darmstadt, QuaP2P, Research]
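Intersecting on the way can be sketched as a chain over the word entries. The sketch assumes that list sizes are known up front (the knowledge about word frequencies listed under Con) and models each responsible node as an entry in a plain dictionary; `chained_search` and the synthetic document IDs are illustrative.

```python
def chained_search(index, words):
    if not words:
        return set(), 0, []
    # Query the least common word first (requires known list sizes).
    order = sorted(words, key=lambda word: len(index.get(word, set())))
    current = set(index.get(order[0], set()))
    transferred = 0
    for word in order[1:]:
        transferred += len(current)           # forward running result to next node
        current &= index.get(word, set())     # intersect at that node
    transferred += len(current)               # final result returns to the client
    return current, transferred, order

index = {
    "research": set(range(1000)),
    "darmstadt": set(range(0, 1000, 10)),
    "quap2p": set(range(0, 1000, 250)),
}

result, transferred, order = chained_search(index, ["research", "darmstadt", "quap2p"])
print(order)                         # ['quap2p', 'darmstadt', 'research']
print(sorted(result), transferred)   # [0, 250, 500, 750] 12
```

Only 12 list entries travel over the network in this toy example, but the hops are sequential, which is where the high latency comes from.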
9Using Bloom Filters
- Bloom Filters reduce result list size
- Forward Bloom Filters and return result list recursively
- Pro
- Reduces traffic even more (by up to a factor of 50)
- Con
- Even higher latency
- Getting complicated
[Figure: Search for QuaP2P AND Darmstadt AND Research; labels: Darmstadt, QuaP2P, Research]
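For two words, the Bloom filter exchange can be sketched as follows: the first node sends a filter of its list instead of the list itself, the second node returns only the entries that pass the filter, and the first node removes the false positives. This is a simplified sketch, not the protocol of any particular paper; the filter size, the number of hash functions, and the SHA-256-based hashing are arbitrary assumptions.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits |= 1 << position

    def __contains__(self, item):
        return all((self.bits >> position) & 1 for position in self._positions(item))

def bloom_intersect(list_a, list_b, num_bits=1024, num_hashes=4):
    bloom = BloomFilter(num_bits, num_hashes)
    for doc_id in list_a:                  # node A builds and forwards the filter
        bloom.add(doc_id)
    candidates = {doc_id for doc_id in list_b if doc_id in bloom}  # node B filters
    result = candidates & list_a           # node A drops the false positives
    return result, candidates

list_a = set(range(0, 1000, 10))   # 100 entries
list_b = set(range(0, 1000, 3))    # 334 entries
result, candidates = bloom_intersect(list_a, list_b)
print(len(result))                 # 34
print(len(candidates) >= len(result))  # True: candidates may include false positives
```

A Bloom filter never produces false negatives, so the final result is exact; the price is the extra round trip that raises latency.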
10Zipf Distributions in Query Terms, too
- Query popularity obeys Zipf's law (déjà vu!)
- This puts high load on nodes with the most popular keys
- Even worse, this load scales linearly with the network size and user activity
- The responsible nodes are randomly assigned (could be a modem user)
- → Hotspots will occur
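How unevenly Zipf-distributed queries load the nodes can be estimated without any randomness. The sketch below assumes query popularity proportional to 1/rank, hash-based assignment of keys to nodes, and hypothetical sizes of 10,000 keys and 100 nodes; it computes the expected share of queries each node has to answer.

```python
import hashlib

def node_for(key, num_nodes):
    # Assumption: keys are assigned to nodes by hashing, i.e. effectively at random.
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes

def load_shares(num_keys, num_nodes):
    # Expected fraction of all queries that each node has to answer.
    weights = [1.0 / rank for rank in range(1, num_keys + 1)]
    total = sum(weights)
    shares = [0.0] * num_nodes
    for rank, weight in enumerate(weights, start=1):
        shares[node_for(f"key{rank}", num_nodes)] += weight / total
    return shares

shares = load_shares(num_keys=10_000, num_nodes=100)
print(f"fair share: {1 / 100:.3f}, busiest node: {max(shares):.3f}")
```

In this model the most popular key alone attracts about a tenth of all queries, so whichever node it lands on carries at least ten times the fair share.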
11Caching and Precomputation
- Caching
- Keep lists received for intersection
- Keep answers to popular queries
- Traffic reduction 38%
- But how to ensure coherence?
- Precomputation
- Inverted index for pairs or tuples of words
- Only feasible for the most popular words
- (but most effective there anyway)
- Traffic reduction 50%
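Precomputation for word pairs can be sketched as a second index keyed by pairs of the most popular words. The cutoff `top_k`, the helper names, and the sample data are assumptions for illustration; here popularity is approximated by list length, whereas a real system might rank by query frequency.

```python
from itertools import combinations

def build_pair_index(index, top_k):
    # Precompute intersections only for pairs of the top_k most frequent words.
    popular = sorted(index, key=lambda word: len(index[word]), reverse=True)[:top_k]
    return {
        frozenset(pair): index[pair[0]] & index[pair[1]]
        for pair in combinations(popular, 2)
    }

def search_pair(index, pair_index, word_a, word_b):
    key = frozenset((word_a, word_b))
    if key in pair_index:
        return pair_index[key], "precomputed"   # no list transfer needed
    # Fall back to a regular intersection for less popular words.
    return index.get(word_a, set()) & index.get(word_b, set()), "intersected"

index = {
    "the": set(range(1000)),
    "who": set(range(0, 1000, 2)),
    "darmstadt": {4, 8},
    "quap2p": {8},
}
pair_index = build_pair_index(index, top_k=2)

hits, how = search_pair(index, pair_index, "who", "the")
print(len(hits), how)   # 500 precomputed
hits, how = search_pair(index, pair_index, "quap2p", "darmstadt")
print(sorted(hits), how)  # [8] intersected
```

The number of pairs grows quadratically with `top_k`, which is why this is only feasible for the most popular words.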
12Further Optimizations
- Compression of result lists
- Adaptive Set Intersection
- Gap Compression
- Clustering of keys
- Incremental Results
- Do not return all results at once
- Should be used in conjunction with a ranking algorithm
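Gap compression of a result list can be sketched as follows: sort the document IDs, store only the differences between neighbours, and write each difference with a variable-length byte encoding. The sketch assumes distinct, non-negative integer document IDs and a 7-bits-per-byte encoding; the talk does not specify a concrete scheme.

```python
def encode_gaps(doc_ids):
    out = bytearray()
    previous = 0
    for doc_id in sorted(doc_ids):
        gap = doc_id - previous      # small when IDs are close together
        previous = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            gap >>= 7
        out.append(gap)                      # last byte, continuation bit clear
    return bytes(out)

def decode_gaps(data):
    doc_ids, current, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7               # more bytes of this gap follow
        else:
            current += gap           # gap complete: rebuild the absolute ID
            doc_ids.append(current)
            gap, shift = 0, 0
    return doc_ids

ids = [1_000_000, 1_000_003, 1_000_010, 1_000_500]
encoded = encode_gaps(ids)
print(len(encoded))                 # 7 bytes instead of 16 with 4-byte IDs
print(decode_gaps(encoded) == ids)  # True
```

Dense lists, which are exactly the large lists of common words, have small gaps and therefore compress best.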
13Comparison of different approaches
- Yang et al. compared
- DHT with Bloom Filters
- Supernode with exhaustive flooding
- Unstructured Random Walk w/o replication
- Network size: 1000
- Random data set from WWW
- All approaches have strengths
14Feasibility of P2P Web Search Engine
- Li et al. calculated the bandwidth usage of a P2P-based web search engine
- 3 billion documents (10KB each)
- 60,000 peers
- Basic DHT was 100x worse than basic Gnutella
- DHT optimizations (e.g. Bloom Filters) made it competitive
- No index creation or maintenance cost included (60TB)
- No replica maintenance cost included
15Conclusion
- Distributed Inverted Indexes are challenging
- Implementation requires a lot of tricks
- Performance is not outstanding
- No comparison to state-of-the-art unstructured systems available
- Maybe even more tricks from information retrieval research will help
- Modeling the correct workload is really important for system design
16Outlook
- Examine robustness of full-text search under Zipf query workloads
- Implement DHT full-text search in simulator
- Compare state-of-the-art unstructured and structured full-text search overlays
- Improve consistency and coherence in DHT full-text search systems
- Implement wiki and source code management with full-text search for Scenario B
- Phrase search is even more challenging
17Recommended Reading
- Performance Comparison
- Li et al. On the Feasibility of Peer-to-Peer Web Indexing and Search. IPTPS 2003.
- Yang et al. Performance of Full Text Search in Structured and Unstructured Peer-to-Peer Systems. INFOCOM 2006.
- DHT Full-Text Search
- P. Reynolds and A. Vahdat. Efficient Peer-to-Peer Keyword Searching. IMC 2003.
- O. Gnawali. A Keyword Set Search System for Peer-to-Peer Networks. MSc thesis, MIT, 2002.
- Workload Modeling
- Breslau et al. Web Caching and Zipf-like Distributions: Evidence and Implications. INFOCOM 1999.
- Gummadi et al. Measurement, Modeling and Analysis of a Peer-to-Peer File-Sharing Workload. SOSP 2003.