SmartSeer: Using a DHT to Process Continuous Queries Over Peer-to-Peer Networks

1
SmartSeer: Using a DHT to Process Continuous
Queries Over Peer-to-Peer Networks
  • Authors
  • Jayanthkumar Kannan
  • Beverly Yang
  • Scott Shenker
  • Puneet Sharma
  • Sujata Banerjee
  • Sujoy Basu
  • Sung-Ju Lee
  • Presentation
  • Colin Jiang

2
Motivation
  • The academic world is moving away from physical
    journals to online document repositories
  • The ability to efficiently locate work of
    interest among the torrent of newly generated
    papers will become increasingly important

3
SmartSeer
  • Distributed preprint repository
  • Supports rich DHT-based continuous (and
    instantaneous) queries
  • Users register personalized continuous queries
    over the CiteSeer database and are alerted
    whenever papers that match their queries are put
    online

4
SmartSeer's Nonstandard Design Requirements
  • SmartSeer supports a family of continuous queries
    that is far richer than simple keyword
    searches
  • SmartSeer is capable of running on a loosely
    maintained group of unreliable machines spread
    across multiple organizations

5
Continuous vs Instantaneous Queries
  • Latency is a prime design requirement for
    instantaneous queries but not for continuous
    queries
  • In both types, queries are Boolean conjunctions
    or disjunctions of search terms
  • An inverted index is used in both cases for
    processing the queries
  • Inverted Index
  • → Inverted list (one per term in the document/
    query)
  • → IDs of all documents containing a given
    term
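The inverted index described above can be sketched in a few lines. This is a minimal, single-machine illustration, not SmartSeer's code: the function names and the whitespace tokenizer are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: terms are lower-cased, whitespace-separated tokens.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def match_all(index, terms):
    """Conjunction: IDs of documents containing every term."""
    lists = [index.get(t, set()) for t in terms]
    return set.intersection(*lists) if lists else set()

def match_any(index, terms):
    """Disjunction: IDs of documents containing at least one term."""
    return set().union(*(index.get(t, set()) for t in terms))

docs = {
    1: "peer to peer continuous queries",
    2: "distributed hash table routing",
    3: "continuous queries over a distributed hash table",
}
index = build_inverted_index(docs)
print(sorted(match_all(index, ["continuous", "distributed"])))  # AND query
print(sorted(match_any(index, ["peer", "routing"])))            # OR query
```

For continuous queries the same structure is used with the roles swapped: the inverted list of a term holds the queries registered on that term rather than document IDs.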

6
SmartSeer Architecture
  • Push-based approach selected to minimize
    cost
  • The partition-by-keyword method was chosen
    among mirroring, partition-by-ID and
    partition-by-keyword
  • Mirroring: all documents and continuous
    queries are stored on all nodes → not
    scalable to SmartSeer's requirements
  • Partition-by-ID: continuous queries
    (documents) are partitioned among the nodes
    and a new document (query) is sent to all
    nodes → not scalable
  • Partition-by-keyword: a distributed index of
    the keywords in the continuous queries or
    documents is built using a DHT, and this
    index is partitioned among the nodes based
    on the keyword
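As a rough illustration of partition-by-keyword, the sketch below assigns each keyword's inverted list to exactly one node by hashing the keyword. The modulo placement is a simplifying assumption standing in for DHT key-space routing, and `node_for` and `partition_index` are hypothetical names.

```python
import hashlib

def node_for(keyword, num_nodes):
    """Pick the node responsible for a keyword from a hash of the keyword.

    Assumption: hash modulo node count stands in for a real DHT lookup.
    """
    digest = hashlib.sha1(keyword.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes

def partition_index(inverted_index, num_nodes):
    """Split a keyword -> entries index into one dictionary per node."""
    partitions = [dict() for _ in range(num_nodes)]
    for keyword, entries in inverted_index.items():
        partitions[node_for(keyword, num_nodes)][keyword] = entries
    return partitions

inverted_index = {
    "dht": ["q1", "q7"],
    "continuous": ["q2"],
    "bloom": ["q3", "q4"],
}
for node_id, part in enumerate(partition_index(inverted_index, 4)):
    print(node_id, sorted(part))
```

Unlike mirroring or partition-by-ID, a new document (or query) only has to reach the nodes responsible for the keywords it contains.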

7
Insertion Functions
  • Query Insertion
  • A conjunctive query is stored at the node
    responsible for its most selective term t_s
    (maximizes load balancing, minimizes wasted
    processing)
  • The key of the query is defined to be the hash of
    t_s
  • Document Insertion (Send Query Method)
  • The document is parsed into tokens (by the
    document insertion node, DN)
  • Tokens are transformed to terms (by the DN)
  • All nodes responsible for at least one term in
    the document (query nodes, QNs) are contacted via
    a document notification message
  • QNs return the inverted list of all queries
    registered on that keyword
  • The DN does the matching and notifies the users
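The two insertion paths above can be sketched together. This is a simplified, single-process model under stated assumptions: the `QueryStore` class is hypothetical, "most selective" is taken to mean the term with the lowest known frequency, and a dictionary lookup stands in for contacting the QN responsible for a term.

```python
from collections import defaultdict

class QueryStore:
    """Stands in for the distributed index: term -> queries registered on it."""

    def __init__(self, term_frequency):
        # Assumption: selectivity is estimated from a term-frequency table.
        self.term_frequency = term_frequency
        self.registered = defaultdict(list)

    def insert_query(self, user, terms):
        """Register a conjunctive query on its most selective (rarest) term."""
        terms = frozenset(terms)
        t_s = min(sorted(terms), key=lambda t: self.term_frequency.get(t, 0))
        self.registered[t_s].append((user, terms))
        return t_s

    def insert_document(self, text):
        """Send Query method: fetch the queries per term, match at the DN."""
        doc_terms = set(text.lower().split())
        notified = []
        for term in sorted(doc_terms):
            # Each lookup models one QN returning its inverted list of queries.
            for user, query_terms in self.registered.get(term, []):
                if query_terms <= doc_terms:  # every query term is present
                    notified.append(user)
        return notified

store = QueryStore({"dht": 50, "continuous": 5, "queries": 200})
print(store.insert_query("alice", ["dht", "continuous"]))
print(store.insert_query("bob", ["queries", "dht"]))
print(store.insert_document("continuous queries over a dht"))
```

Because each conjunctive query is registered on a single term, a matching document triggers it at most once, which is the "minimizes wasted processing" side of choosing t_s.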

8
Send Query Bandwidth Consumption
  • Instantaneous queries: required bandwidth
  • Bandwidth ∝ N · I · Wq · Wd
  • Continuous queries: required bandwidth
  • Size Q(r) of the inverted list of queries
  • Q(r) = C · F · Z(r, a, Wq)
  • Required bandwidth for the retrieval of Wd
    inverted lists: Wd · (Σ_r Z(r, a) · Q(r))
  • Bandwidth required to handle C continuous queries
    with a document arrival rate of R per second
  • Bandwidth ∝ R · C · F · Wd

9
Send Query Alternatives
  • Send Document
  • When a new document is inserted, the entire
    document is sent to all the QNs storing the
    inverted lists for keywords in the document
  • Each QN does the matching and notifies the users
  • Term Dialogue
  • The QN asks the DN about the presence of a set of
    keywords
  • The DN responds with a bit vector specifying the
    presence of each term
  • Shipping keywords can be much cheaper than
    shipping entire queries
  • Bloom Filter
  • The DN sends a Bloom filter over all terms in the
    document to each of the QNs
  • QNs can discard queries that have a term
    corresponding to a 0 in the Bloom filter
  • QNs may prune the set of queries stored in the
    inverted list to a smaller set

10
Alternatives Bandwidth Consumption
  • Send Document
  • SD ∝ d, where d is the number of unique terms in
    the document
  • Term Dialogue
  • TD ∝ q, where q is the number of unique terms
    across the queries
  • Bloom Filter
  • BF f(d) d Q
  • Q Q0 S

11
Batch Notification
  • Useful for medium-scale systems
  • Overlap in the nodes to which terms hash
  • Use a clustered approach
  • The DN will first find the QN responsible for
    each term
  • The DN will send one document notification
    message to each unique node, along with the terms
    in the document that each node is responsible for
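The batching step can be sketched as a grouping of document terms by responsible node, so that each unique node receives one message. The hash-modulo lookup below is an assumption standing in for the DHT lookup of the QN, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict

def node_for(term, num_nodes):
    """Assumption: hash modulo node count stands in for the DHT lookup."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes

def batch_notifications(doc_terms, num_nodes):
    """Group a document's terms by responsible node: one message per node."""
    batches = defaultdict(list)
    for term in sorted(set(doc_terms)):
        batches[node_for(term, num_nodes)].append(term)
    return dict(batches)

terms = ["dht", "continuous", "queries", "peer", "network", "index"]
for node_id, node_terms in sorted(batch_notifications(terms, 4).items()):
    print(f"node {node_id}: {node_terms}")
```

With six terms and four nodes, at most four notification messages are sent instead of six. The saving grows as the number of terms per document exceeds the number of nodes, which is why the technique suits systems of moderate size.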

12
Evaluation (1/5)
  • Written in Java
  • Uses the libraries exported by Bamboo and
    OpenDHT
  • SmartSeer runs over two document sets: CiteSeer
    and TREC
  • Workload of synthetically generated queries

13
Evaluation (2/5)
  • Basic Comparison

14
Evaluation (3/5)
  • Query Selectivity

15
Evaluation (4/5)
  • Distribution of Terms

16
Evaluation (5/5)
  • Batch Notification

17
Complex Queries
  • Boolean Queries
  • Negated predicates: SmartSeer registers queries
    with negated predicates on their non-negated
    predicates
  • OR queries: these queries are registered on every
    term → possibility of duplicated processing →
    rewrite the OR query as multiple queries such
    that only one of them is matched
  • Boolean queries: decomposed into their disjunctive
    normal form → sub-queries are joined via ORs and
    each sub-query is an AND query
  • Nested Queries
  • Nested Instantaneous Queries
  • First execute the sub-query
  • → retrieve a list of documents that satisfy this
    sub-query
  • → extract the appropriate attributes
  • → create an OR query which represents
    the translated version of the original
    sub-query
  • → execute this version
  • → retrieve a list of documents and use it in
    processing the original query
  • Nested Continuous Queries
  • The sub-query is translated as above
  • → both the original and the translated
    versions are registered (the translated
    version for efficiency and the original for
    correctness)
  • Negated terms are not allowed
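One way to rewrite an OR query so that at most one sub-query fires is to make the sub-queries mutually exclusive using negated predicates. The slides do not spell out SmartSeer's rewriting rule, so the scheme below is an assumption chosen for illustration: t1 OR t2 OR t3 becomes t1, (t2 AND NOT t1), (t3 AND NOT t1 AND NOT t2), with each sub-query registered on its non-negated term.

```python
def rewrite_or_query(terms):
    """Rewrite t1 OR t2 OR ... into sub-queries of which at most one matches."""
    sub_queries = []
    for i, term in enumerate(terms):
        sub_queries.append({
            "register_on": term,            # the non-negated predicate
            "required": term,
            "forbidden": tuple(terms[:i]),  # negated earlier terms
        })
    return sub_queries

def matches(sub_query, doc_terms):
    """True if the required term is present and no forbidden term is."""
    return (sub_query["required"] in doc_terms
            and not any(t in doc_terms for t in sub_query["forbidden"]))

subs = rewrite_or_query(["dht", "bloom", "pastry"])
doc_terms = {"bloom", "pastry", "filter"}
fired = [s["register_on"] for s in subs if matches(s, doc_terms)]
print(fired)
```

A document containing both "bloom" and "pastry" triggers only the sub-query registered on "bloom", so the user is notified once even though the original OR query would have been found under two terms.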

18
Limitations on Expressiveness
  • Similarity queries are not supported
  • Instead, a query is registered on some keywords
    and when a new document is inserted, containing
    these keywords, we can check whether the new
    document is above a certain threshold of
    similarity with the query.
  • Restrictions on the structure of the queries
  • Queries with difficult (negation, range)
    predicates are not supported

19
SmartSeer's Deployment
  • 20 machines on PlanetLab
  • Average TCP throughput of 10-15 Mbps
  • A query corpus of about one hundred thousand
    queries and a document corpus of 1000 documents
  • Most inverted lists have very few queries
  • The Send Document, Term Dialogue and Bloom Filter
    approaches can support 66,886, 78,728 and 81,430
    document insertions per day, respectively
  • The throughput of the system was about 80,000 new
    documents per day
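To put the daily figures in perspective, the short calculation below converts them to documents per second. It reads the three reported values as 66,886, 78,728 and 81,430 insertions per day; the helper name `per_second` is made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def per_second(per_day):
    """Convert a per-day rate into a per-second rate."""
    return per_day / SECONDS_PER_DAY

rates_per_day = {
    "Send Document": 66_886,
    "Term Dialogue": 78_728,
    "Bloom Filter": 81_430,
}
for name, per_day in rates_per_day.items():
    print(f"{name}: {per_second(per_day):.2f} documents/s")
```

All three approaches therefore sustain a little under one document insertion per second on this 20-machine deployment.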

20
Discussion

21
Bamboo DHT
  • Bamboo DHT - A distributed hash table, or DHT, is
    a building block for peer-to-peer applications.
    At the most basic level, it allows a group of
    distributed hosts to collectively manage a
    mapping from keys to data values, without any
    fixed hierarchy, and with very little human
    assistance. This building block can then be used
    to ease the implementation of a diverse variety
    of peer-to-peer applications such as file sharing
    services, DNS replacements, web caches, etc.
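The key-to-value mapping a DHT provides can be illustrated with a toy, in-process model. This is not Bamboo's API: `ToyDHT`, `hash_id` and the 32-bit identifier space are assumptions, and a key is simply assigned to the host whose identifier is numerically closest, with no routing, replication or churn handling.

```python
import hashlib

ID_SPACE = 2 ** 32  # assumption: a small circular identifier space

def hash_id(name):
    """Hash a host name or key into the identifier space."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def ring_distance(a, b):
    """Distance between two identifiers on the circular space."""
    d = abs(a - b)
    return min(d, ID_SPACE - d)

class ToyDHT:
    """Hosts collectively store key -> value pairs; no fixed hierarchy."""

    def __init__(self, host_names):
        self.stores = {hash_id(h): {} for h in host_names}

    def _owner(self, key):
        key_id = hash_id(key)
        return min(sorted(self.stores), key=lambda n: ring_distance(n, key_id))

    def put(self, key, value):
        self.stores[self._owner(key)][key] = value

    def get(self, key):
        return self.stores[self._owner(key)].get(key)

dht = ToyDHT(["host-a", "host-b", "host-c"])
dht.put("continuous", ["q1", "q2"])
dht.put("bloom", ["q3"])
print(dht.get("continuous"), dht.get("bloom"), dht.get("missing"))
```

Applications such as the ones listed above build on exactly this put/get interface. The DHT's job is to keep the mapping available as hosts join and leave.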

22
Bamboo DHT (cont.)
  • Bamboo is either based on Pastry, a
    re-engineering of the Pastry protocols, or an
    entirely new DHT, depending on how you want to
    look at it. The term geometry is used to refer to
    the pattern of neighbor links in a DHT,
    independent of the routing algorithms or neighbor
    management algorithms used. Bamboo uses the
    Pastry geometry. It does not, however, use the
    same joining or neighbor management algorithms.
    Compared to Pastry, the algorithms are more
    incremental, a difference that allows Bamboo to
    better withstand large membership changes in the
    DHT as well as continuous churn in membership,
    especially in bandwidth-limited environments.