1
Full-Text Search in P2P Networks

Christof Leng
Databases and Distributed Systems Group
TU Darmstadt

2
Content

Short intro to full-text search
Full-text search on DHTs
Performance Comparison
Conclusion / Outlook

3
What is full-text search?

Searching for documents containing all of a list of specified words
Search for QuaP2P ∧ Darmstadt ∧ Research
Very common operation
Google
Filesharing
Wikis
Source Code
Document / Knowledge Management
Can be extended to phrase search
Search for "TU Darmstadt" ∧ "Christof Leng"

4
Inverted Index

Full-text search is normally solved with inverted indexes
Query result is the intersection of the index entries of all searched words
Stemming can reduce the number of word entries

doc1: New P2P system could provide speed increase.
doc2: Similarity searches accelerate P2P downloads by 30-70 percent.
doc3: I fail to see how this will make downloads faster.
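
The inverted index for the three example documents can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the helper names (tokenize, build_index, search) are hypothetical, tokenization is a simple lowercase split, and no stemming is applied.

```python
import re
from collections import defaultdict

# The three example documents from the slide.
DOCS = {
    "doc1": "New P2P system could provide speed increase.",
    "doc2": "Similarity searches accelerate P2P downloads by 30-70 percent.",
    "doc3": "I fail to see how this will make downloads faster.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing all query words."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())  # intersect the word entries
    return result

index = build_index(DOCS)
print(sorted(search(index, "P2P downloads")))  # ['doc2']
print(sorted(search(index, "downloads")))      # ['doc2', 'doc3']
```

With stemming, "downloads" and "download" would share one entry, which is how stemming reduces the number of word entries.
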
5
Overlay Types and Full-Text Search
Inverted index on central server
Inverted index on each (super-)node
Distributed inverted index
→ Challenge
6
Naïve Approach

Map inverted index to DHT
Key Lookup for every word
Intersect result lists at client
Pro
Simple
Short latency
Con
Result lists may be extremely large!
Result list sizes may vary extremely!

[Figure: search for QuaP2P ∧ Darmstadt ∧ Research, with the index entries Darmstadt, QuaP2P, and Research]
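
The naïve approach can be sketched with a toy in-memory stand-in for a DHT. Everything here is an assumption for illustration: ToyDHT is a hypothetical class that places a key on node hash(key) mod n, and the result list sizes are made up. The sketch counts how many list entries travel to the client.

```python
import hashlib

class ToyDHT:
    """Hypothetical stand-in for a DHT: a key lives on node hash(key) mod n."""

    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % len(self.nodes)

    def put(self, word, doc_id):
        self.nodes[self._node_for(word)].setdefault(word, set()).add(doc_id)

    def get(self, word):
        return self.nodes[self._node_for(word)].get(word, set())

def naive_search(dht, words):
    """One key lookup per word; every full result list travels to the client."""
    lists = [dht.get(word) for word in words]
    transferred = sum(len(postings) for postings in lists)
    result = set.intersection(*lists) if lists else set()
    return result, transferred

# Made-up list sizes: a common, a medium, and a rare word.
dht = ToyDHT(num_nodes=16)
for doc_id in range(1000):
    dht.put("research", doc_id)
for doc_id in range(0, 1000, 10):
    dht.put("darmstadt", doc_id)
for doc_id in (0, 10, 500):
    dht.put("quap2p", doc_id)

result, transferred = naive_search(dht, ["quap2p", "darmstadt", "research"])
print(sorted(result), transferred)  # [0, 10, 500] 1103
```

In this toy data set 1103 entries are shipped to obtain three hits, which illustrates the "Con" points above.
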
7
Zipf Distributions in Natural Text

Some words are extremely common
Most words are extremely uncommon
Largest word frequency is proportional to the number of distinct words
→ Avoid transferring result lists before intersection!

[Figure: word occurrences plotted against rank]
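
A small numeric sketch of an idealized Zipf distribution may help here. It assumes the textbook form f(r) = C / r with exponent 1 and a rarest word that occurs exactly once; real text only follows this approximately, and the function name is hypothetical.

```python
def zipf_frequencies(num_words, exponent=1.0):
    """Idealized counts f(r) = C / r**exponent, scaled so the rarest word occurs once."""
    scale = num_words ** exponent
    return [scale / rank ** exponent for rank in range(1, num_words + 1)]

freqs = zipf_frequencies(10_000)

# The most common word occurs as often as there are distinct words.
print(freqs[0])  # 10000.0

# Share of all word occurrences caused by the 100 most common words.
top_share = sum(freqs[:100]) / sum(freqs)

# Number of words that occur fewer than 10 times.
rare = sum(1 for f in freqs if f < 10)
print(rare)  # 9000
```

Under these assumptions the 100 most common words account for roughly half of all occurrences, while 90 percent of the vocabulary occurs fewer than 10 times. The result lists for the few common words are therefore the ones that dominate traffic.
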
8
Intersecting on the way

Query least common word first
Forward result list to next word
Intersect on the way
Pro
Reduces traffic
Con
High latency
Knowledge about word frequencies required
Search for "the" and "who" (7.2 and 2.4 billion hits on Google, respectively)

[Figure: search for QuaP2P ∧ Darmstadt ∧ Research, with the index entries Darmstadt, QuaP2P, and Research]
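
The forwarding scheme can be sketched as follows. This is an illustration under stated assumptions: each entry of the plain dict stands for a result list held by a different node, the list sizes are made up, and the word frequencies are simply read from the list lengths, which a real system would have to learn some other way.

```python
def chained_search(index, words):
    """Visit words from rarest to most common, forwarding the shrinking candidate set."""
    if not words:
        return set(), 0, []
    order = sorted(words, key=lambda w: len(index.get(w, ())))
    candidates = set(index.get(order[0], ()))
    forwarded = 0
    for word in order[1:]:
        forwarded += len(candidates)          # candidates travel to the next node
        candidates &= index.get(word, set())  # that node intersects locally
    forwarded += len(candidates)              # final result returns to the client
    return candidates, forwarded, order

# Made-up list sizes: a common, a medium, and a rare word.
index = {
    "research": set(range(1000)),
    "darmstadt": set(range(0, 1000, 10)),
    "quap2p": {0, 10, 500},
}

result, forwarded, order = chained_search(index, ["research", "quap2p", "darmstadt"])
print(order)                      # ['quap2p', 'darmstadt', 'research']
print(sorted(result), forwarded)  # [0, 10, 500] 9
```

Only 9 entries are forwarded here, but the three nodes are visited one after another, which is the latency cost listed under "Con".
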
9
Using Bloom Filters

Bloom Filters reduce result list size
Forward Bloom Filters and return result list recursively
Pro
Reduces traffic even more (by up to a factor of 50)
Con
Even higher latency
Getting complicated

[Figure: search for QuaP2P ∧ Darmstadt ∧ Research, with the index entries Darmstadt, QuaP2P, and Research]
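
A two-node version of the Bloom filter exchange can be sketched like this. It is a simplified illustration, not the exact protocol from the cited papers: the BloomFilter class, its parameters (1024 bits, 4 hash functions), and the list contents are all assumptions chosen for the example.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def bloom_intersect(list_a, list_b, num_bits=1024, num_hashes=4):
    """Node A sends a filter of its list; node B answers with the matching candidates."""
    bloom = BloomFilter(num_bits, num_hashes)
    for doc_id in list_a:
        bloom.add(doc_id)
    # Computed on node B: a superset of the intersection (false positives possible).
    candidates = {doc_id for doc_id in list_b if doc_id in bloom}
    # Back on node A: drop the false positives to get the exact intersection.
    return candidates & set(list_a), len(candidates)

darmstadt = set(range(0, 1000, 10))  # 100 made-up document ids
research = set(range(1000))          # 1000 made-up document ids

result, sent_back = bloom_intersect(darmstadt, research)
print(result == darmstadt & research)  # True
```

Instead of shipping the whole list, node A sends a 1024-bit filter (128 bytes) and node B returns only the candidates that pass it. A Bloom filter never produces false negatives, so the final intersection on node A is exact.
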
10
Zipf Distributions in Query Terms, too

Query popularity obeys Zipf's law (déjà vu!)
This puts high load on the nodes with the most popular keys
Even worse, this load scales linearly with the network size and user activity
The responsible nodes are randomly assigned (could be a modem user)
→ Hotspots will occur
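
The linear scaling can be checked with a short calculation. The sketch assumes an idealized Zipf law with exponent 1 over a fixed set of keys and a single node responsible for the most popular key; the function names and numbers are hypothetical.

```python
def top_key_share(num_keys, exponent=1.0):
    """Fraction of all lookups that hit the most popular key under a Zipf-like law."""
    weights = [1.0 / rank ** exponent for rank in range(1, num_keys + 1)]
    return weights[0] / sum(weights)

def hottest_node_load(num_users, queries_per_user, num_keys):
    """Lookups per time unit arriving at the node responsible for the top key."""
    return num_users * queries_per_user * top_key_share(num_keys)

small = hottest_node_load(num_users=1_000, queries_per_user=10, num_keys=10_000)
large = hottest_node_load(num_users=2_000, queries_per_user=10, num_keys=10_000)
print(round(large / small, 6))  # 2.0
```

With 10,000 keys the top key alone attracts about 10 percent of all lookups in this model. Doubling the number of users doubles the load on that single node, no matter how many other nodes join the network.
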

11
Caching and Precomputation

Caching
Keep lists received for intersection
Keep answers to popular queries
Traffic reduction: 38%
But how to ensure coherence?
Precomputation
Inverted index for pairs or tuples of words
Only feasible for the most popular words (but most effective there anyway)
Traffic reduction: 50%
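
Both ideas can be sketched for two-word queries. This is a minimal illustration with hypothetical names and made-up data: the pair index is precomputed only for a chosen list of popular words, and the cache is a plain dict that is never invalidated, so the coherence question raised above stays open.

```python
from itertools import combinations

def precompute_pairs(index, popular_words):
    """Store the intersection for every pair of popular words under a sorted-pair key."""
    return {
        tuple(sorted(pair)): index[pair[0]] & index[pair[1]]
        for pair in combinations(popular_words, 2)
    }

def search_two_words(index, pair_index, cache, w1, w2):
    """Answer from the pair index, then from the cache, and only then intersect."""
    key = tuple(sorted((w1, w2)))
    if key in pair_index:
        return pair_index[key], "precomputed"
    if key in cache:
        return cache[key], "cached"
    result = index.get(w1, set()) & index.get(w2, set())
    cache[key] = result  # never invalidated in this sketch
    return result, "computed"

index = {"p2p": {1, 2, 3, 4}, "search": {2, 3, 5}, "darmstadt": {3, 6}}
pair_index = precompute_pairs(index, ["p2p", "search"])
cache = {}

print(search_two_words(index, pair_index, cache, "search", "p2p"))     # ({2, 3}, 'precomputed')
print(search_two_words(index, pair_index, cache, "p2p", "darmstadt"))  # ({3}, 'computed')
print(search_two_words(index, pair_index, cache, "darmstadt", "p2p"))  # ({3}, 'cached')
```

The number of pairs grows quadratically with the number of words, which is why precomputation is restricted to the most popular words.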

12
Further Optimizations

Compression of result lists
Adaptive Set Intersection
Gap Compression
Clustering of keys
Incremental Results
Do not return all results at once
Should be used in conjunction with a ranking algorithm
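
Gap compression can be sketched as follows: instead of the document ids themselves, the sorted list stores the differences between neighbors, and each difference is packed into as few bytes as possible. The variable-length byte encoding used here is one common choice and an assumption of this sketch, as is the restriction to non-negative integer ids.

```python
def gap_encode(doc_ids):
    """Sort the ids, take the gaps, and pack each gap into 7-bit groups."""
    out = bytearray()
    previous = 0
    for doc_id in sorted(doc_ids):
        gap = doc_id - previous
        previous = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation flag set
            gap >>= 7
        out.append(gap)                      # last group, continuation flag clear
    return bytes(out)

def gap_decode(data):
    """Rebuild the sorted id list from the packed gaps."""
    doc_ids, current, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap, shift = 0, 0
    return doc_ids

ids = [1000, 1003, 1010, 5000]
encoded = gap_encode(ids)
print(len(encoded))         # 6
print(gap_decode(encoded))  # [1000, 1003, 1010, 5000]
```

Four ids fit into 6 bytes here, compared with 16 bytes for fixed 32-bit integers. The denser a result list is, the smaller its gaps and the better it compresses.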

13
Comparison of different approaches

Yang et al. compared
DHT with Bloom Filters
Supernode with exhaustive flooding
Unstructured Random Walk w/o replication
Network size 1000
Random data set from WWW
All approaches have strengths

14
Feasibility of P2P Web Search Engine

Li et al. calculated the bandwidth usage of a P2P-based web search engine
3 billion documents (10 KB each)
60,000 peers
Basic DHT was 100x worse than basic Gnutella
DHT optimizations (e.g. Bloom Filters) made it competitive
No index creation or maintenance cost included (60 TB)
No replica maintenance cost included

15
Conclusion

Distributed Inverted Indexes are challenging
Implementation requires a lot of tricks
Performance is not outstanding
No comparison to state-of-the-art unstructured systems available
Maybe even more tricks from information retrieval research will help
Modeling the correct workload is really important for system design

16
Outlook

Examine robustness of full-text search under Zipf query workloads
Implement DHT full-text search in simulator
Compare state-of-the-art unstructured and structured full-text search overlays
Improve consistency and coherence in DHT full-text search systems
Implement wiki and source code management with full-text search for Scenario B
Phrase search is even more challenging

17
Recommended Reading

Performance Comparison
Li et al. On the Feasibility of Peer-to-Peer Web Indexing and Search. IPTPS 2003.
Yang et al. Performance of Full Text Search in Structured and Unstructured Peer-to-Peer Systems. INFOCOM 2006.
DHT Full-Text Search
P. Reynolds and A. Vahdat. Efficient Peer-to-Peer Keyword Searching. Middleware 2003.
O. Gnawali. A Keyword Set Search System for Peer-to-Peer Networks. M.Sc. thesis, MIT, 2002.
Workload Modeling
Breslau et al. Web Caching and Zipf-like Distributions: Evidence and Implications. INFOCOM 1999.
Gummadi et al. Measurement, Modeling and Analysis of a Peer-to-Peer File-Sharing Workload. SOSP 2003.